* [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
@ 2022-06-03  5:27 Arseniy Krasnov
  2022-06-03  5:31 ` [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:27 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel, Krasnov Arseniy

                              INTRODUCTION

	Hello, this is an experimental implementation of virtio vsock zerocopy
receive, inspired by Eric Dumazet's TCP zerocopy receive. The API uses the same
idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()' fills
the provided vma area with pages of virtio RX buffers. After the received data
has been processed by the user, the pages must be freed by an 'madvise()' call
with the MADV_DONTNEED flag set (if the user doesn't call 'madvise()', the next
'getsockopt()' will fail).
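
A minimal sketch of the call sequence from userspace (error handling omitted;
the mapping size of 4 pages is just an example):

	/* Map the rx area over the socket. */
	void *rx_va = mmap(NULL, 4 * PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);

	/* Ask the kernel to fill the area with pages of received data. */
	uintptr_t addr = (uintptr_t)rx_va;
	socklen_t len = sizeof(addr);
	getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, &addr, &len);

	/* ... walk headers and data (see DETAILS below) ... */

	/* Return the pages before the next 'getsockopt()'. */
	madvise(rx_va, 4 * PAGE_SIZE, MADV_DONTNEED);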

                                 DETAILS

	Here is how the mapping looks exactly: the first page of the mapping
contains an array of trimmed virtio vsock packet headers (each one contains
only the length of the data on the corresponding page(s), a 'flags' field and
a 'copy_len' field):

	struct virtio_vsock_usr_hdr {
		uint32_t flags;
		uint32_t len;
		uint32_t copy_len;
	};

Field 'len' tells the user the exact size of the payload within each sequence
of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such as
message bounds or record bounds). Field 'copy_len' is described below in the
'v1 -> v2' part. All other pages are data pages from the RX queue.

             Page 0        Page 1      Page N

        [ hdr1 .. hdrN ][ data ] .. [ data ]
            |       |       ^           ^
            |       |       |           |
            |       *-------+-----------*
            |               |
            *---------------*

	Of course, a single header could represent an array of pages (when the
packet's buffer is bigger than one page). So here is an example of the detailed
mapping layout for some set of packets. Let's consider that we have the
following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages
0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is large
enough).

	Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
	Page 1: [ 56 ]
	Page 2: [ 4096 ]
	Page 3: [ 4096 ]
	Page 4: [ 4096 ]
	Page 5: [ 8 ]

	Page 0 contains only the array of headers:
	'hdr0' has 56 in its 'len' field.
	'hdr1' has 4096 in its 'len' field.
	'hdr2' has 8200 in its 'len' field.
	'hdr3' has 0 in its 'len' field (this is the end-of-data marker).

	Page 1 corresponds to 'hdr0' and carries only 56 bytes of data.
	Page 2 corresponds to 'hdr1' and is fully filled with data.
	Page 3 corresponds to 'hdr2' and is fully filled with data.
	Page 4 corresponds to 'hdr2' and is fully filled with data.
	Page 5 corresponds to 'hdr2' and carries only 8 bytes of data.
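
A sketch of how a userspace consumer could walk such a layout (assuming the
mapping from the example above; 'process()' is a hypothetical consumer):

	struct virtio_vsock_usr_hdr *hdr = rx_va;
	unsigned char *data = (unsigned char *)rx_va + PAGE_SIZE;

	while (hdr->len) {
		/* Each packet occupies 'len' bytes rounded up to whole
		 * pages: 56 -> 1 page, 4096 -> 1 page, 8200 -> 3 pages.
		 */
		size_t npages = (hdr->len + PAGE_SIZE - 1) / PAGE_SIZE;

		process(data, hdr->len);
		data += npages * PAGE_SIZE;
		hdr++;
	}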

	This patchset also changes the way packets are allocated: the current
implementation uses only 'kmalloc()' to create the data buffer. A problem
arises when we try to map such buffers into the user's vma - the kernel forbids
mapping slab pages into a user's vma (pages of "not large" 'kmalloc()'
allocations are marked with the PageSlab flag, and "not large" could still be
bigger than one page). To avoid this, data buffers are now allocated with
'alloc_pages()'.
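
The difference, in short (a sketch, not the exact patch code):

	/* Slab memory: may be backed by slab pages, which the kernel
	 * refuses to insert into a user vma.
	 */
	pkt->buf = kmalloc(buf_len, GFP_KERNEL);

	/* Buddy allocator: plain pages which can be mapped to the user. */
	buf_page = alloc_pages(GFP_KERNEL, get_order(buf_len));
	pkt->buf = page_to_virt(buf_page);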

                                   TESTS

	This patchset updates the 'vsock_test' utility: two tests for the new
feature were added. The first test covers invalid cases, the second checks a
valid transmission case.

                                BENCHMARKING

	For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
client/server mode. When the client connects to the server, the server starts
sending an exact amount of data to the client (the amount is set as an input
argument). The client reads the data and waits for the next portion. The client
works in two modes: copy and zero-copy. In copy mode the client uses the
'read()' call, while in zerocopy mode the 'mmap()'/'getsockopt()'/'madvise()'
sequence is used. A smaller transmission time is better. For the server we can
set the size of the tx buffer, and for the client we can set the size of the
rx buffer or rx mapping (in zerocopy mode). Usage of this utility is quite
simple:

For client mode:

./rx_zerocopy --mode client [--zerocopy] [--rx]

For server mode:

./rx_zerocopy --mode server [--mb] [--tx]

[--mb] sets number of megabytes to transfer.
[--rx] sets size of receive buffer/mapping in pages.
[--tx] sets size of transmit buffer in pages.
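
For example, to transfer 4000 megabytes with 32-page buffers ('--cid' and
'--port' options also exist; the values below are illustrative):

./rx_zerocopy --mode server --mb 4000 --tx 32
./rx_zerocopy --mode client --cid 2 --zerocopy --rx 32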

I measured the transmission of 4000 MB of data. Here are some results:

                         size of rx/tx buffers in pages
               *--------*----------*---------*----------*----------*
               |    8   |    32    |    64   |   256    |   512    |
*--------------*--------*----------*---------*----------*----------*
|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
*--------------*--------*----------*---------*----------*----------* process
| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
*--------------*--------*----------*---------*----------*----------*

The result in the first column (where non-zerocopy works better than zerocopy)
happens because the time spent in the 'read()' system call is smaller than the
time spent in 'getsockopt()' + 'madvise()'. I've verified that.

I think these results are not so impressive, but at least zerocopy is not worse
than copy mode, and there is no need to allocate memory for processing the data.

                                 PROBLEMS

	The updated packet allocation logic creates a problem: when the host
gets data from the guest (in vhost-vsock), it allocates at least one page for
each packet (even if the packet has a 1 byte payload). I think this could be
resolved in several ways:
	1) Make zerocopy rx mode disabled by default, so if the user didn't
enable it, the current 'kmalloc()' path is used. <<<<<<< (IMPLEMENTED IN V2)
	2) Use 'kmalloc()' for "small" packets, otherwise call the page
allocator. But in this case we have a mix of packets allocated in two different
ways, so during zerocopying to the user (e.g. when mapping pages into the vma),
such small packets have to be handled in a clumsy way: we need to allocate one
page for the user, copy the data to it and then insert the page into the
user's vma.

v1 -> v2:
 1) Zerocopy receive mode can be enabled/disabled (disabled by default). I
    didn't use the generic SO_ZEROCOPY flag, because in the virtio-vsock case
    this feature depends on transport support. Instead of SO_ZEROCOPY, an
    AF_VSOCK layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while the previous
    meaning of SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into the user's
    vm area) is now renamed to SO_VM_SOCKETS_MAP_RX.
 2) The packet header which is exported to the user gets a new field:
    'copy_len'. This field handles a special case: the user reads data from
    the socket in a non-zerocopy way (with zerocopy disabled) and then enables
    the zerocopy feature. In this case the vhost part switches the data buffer
    allocation logic from 'kmalloc()' to direct calls to the buddy allocator.
    But there could still be some pending 'kmalloc()'-allocated packets in the
    socket's rx list, and when the user then tries to read such packets in a
    zerocopy way, the dequeue will fail, because slab pages cannot be inserted
    into the user's vm area. So when such a packet is found during a zerocopy
    dequeue, the dequeue loop breaks and 'copy_len' reports the size of this
    "bad" packet. After the user detects this case, it must use 'read()' or
    'recv()' calls to dequeue such a packet (see the sketch after this list).
 3) Also, maybe move this feature under a config option?
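
A sketch of the fallback mentioned in 2) above, from the receiver's point of
view (assuming 'rx_va' is the mapping and 'fd' is the socket):

	struct virtio_vsock_usr_hdr *hdr = rx_va;

	if (hdr->copy_len) {
		/* "Bad" 'kmalloc()' packet: drain it with a plain read(). */
		char buf[hdr->copy_len];

		if (read(fd, buf, hdr->copy_len) < 0)
			/* handle the error */;
	}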

Arseniy Krasnov (8):
 virtio/vsock: rework packet allocation logic
 vhost/vsock: rework packet allocation logic
 af_vsock: add zerocopy receive logic
 virtio/vsock: add transport zerocopy callback
 vhost/vsock: enable zerocopy callback
 virtio/vsock: enable zerocopy callback
 test/vsock: add receive zerocopy tests
 test/vsock: vsock rx zerocopy utility

 drivers/vhost/vsock.c                   | 121 +++++++++--
 include/linux/virtio_vsock.h            |   5 +
 include/net/af_vsock.h                  |   7 +
 include/uapi/linux/virtio_vsock.h       |   6 +
 include/uapi/linux/vm_sockets.h         |   3 +
 net/vmw_vsock/af_vsock.c                | 100 +++++++++
 net/vmw_vsock/virtio_transport.c        |  51 ++++-
 net/vmw_vsock/virtio_transport_common.c | 211 ++++++++++++++++++-
 tools/include/uapi/linux/virtio_vsock.h |  11 +
 tools/include/uapi/linux/vm_sockets.h   |   8 +
 tools/testing/vsock/Makefile            |   1 +
 tools/testing/vsock/control.c           |  34 +++
 tools/testing/vsock/control.h           |   2 +
 tools/testing/vsock/rx_zerocopy.c       | 356 ++++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test.c        | 295 ++++++++++++++++++++++++++
 15 files changed, 1196 insertions(+), 15 deletions(-)

-- 
2.25.1


* [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
@ 2022-06-03  5:31 ` Arseniy Krasnov
  2022-06-09  8:37     ` Stefano Garzarella
  2022-06-03  5:33 ` [RFC PATCH v2 2/8] vhost/vsock: " Arseniy Krasnov
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:31 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	David S. Miller, Jason Wang, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel, Krasnov Arseniy

To support zerocopy receive, the packet buffer allocation
is changed: for buffers which could be mapped to the user's
vma we can't use 'kmalloc()' (as the kernel restricts mapping
slab pages to a user's vma), so the raw buddy allocator is
now called. For tx packets (such packets won't be mapped to
the user), the previous 'kmalloc()' way is kept, but with a
special flag in the packet's structure which allows us to
distinguish between 'kmalloc()' and raw page buffers.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 include/linux/virtio_vsock.h            | 1 +
 net/vmw_vsock/virtio_transport.c        | 8 ++++++--
 net/vmw_vsock/virtio_transport_common.c | 9 ++++++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 35d7eedb5e8e..d02cb7aa922f 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
 	u32 off;
 	bool reply;
 	bool tap_delivered;
+	bool slab_buf;
 };
 
 struct virtio_vsock_pkt_info {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index ad64f403536a..19909c1e9ba3 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
 	vq = vsock->vqs[VSOCK_VQ_RX];
 
 	do {
+		struct page *buf_page;
+
 		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
 		if (!pkt)
 			break;
 
-		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
-		if (!pkt->buf) {
+		buf_page = alloc_page(GFP_KERNEL);
+
+		if (!buf_page) {
 			virtio_transport_free_pkt(pkt);
 			break;
 		}
 
+		pkt->buf = page_to_virt(buf_page);
 		pkt->buf_len = buf_len;
 		pkt->len = buf_len;
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index ec2c2afbf0d0..278567f748f2 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
 		if (!pkt->buf)
 			goto out_pkt;
 
+		pkt->slab_buf = true;
 		pkt->buf_len = len;
 
 		err = memcpy_from_msg(pkt->buf, info->msg, len);
@@ -1342,7 +1343,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
 
 void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
 {
-	kfree(pkt->buf);
+	if (pkt->buf_len) {
+		if (pkt->slab_buf)
+			kfree(pkt->buf);
+		else
+			free_pages((unsigned long)pkt->buf, get_order(pkt->buf_len));
+	}
+
 	kfree(pkt);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
-- 
2.25.1


* [RFC PATCH v2 2/8] vhost/vsock: rework packet allocation logic
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
  2022-06-03  5:31 ` [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
@ 2022-06-03  5:33 ` Arseniy Krasnov
  2022-06-09  8:38     ` Stefano Garzarella
  2022-06-03  5:35 ` [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:33 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel, Krasnov Arseniy

For packets received from the virtio RX queue, use the buddy
allocator instead of 'kmalloc()' to be able to insert such
pages into a user provided vma. The single call to
'copy_from_iter()' is replaced with a per-page loop.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 drivers/vhost/vsock.c | 81 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 69 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e6c9d41db1de..0dc2229f18f7 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -58,6 +58,7 @@ struct vhost_vsock {
 
 	u32 guest_cid;
 	bool seqpacket_allow;
+	bool zerocopy_rx_on;
 };
 
 static u32 vhost_transport_get_local_cid(void)
@@ -357,6 +358,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
 		      unsigned int out, unsigned int in)
 {
 	struct virtio_vsock_pkt *pkt;
+	struct vhost_vsock *vsock;
 	struct iov_iter iov_iter;
 	size_t nbytes;
 	size_t len;
@@ -393,20 +395,75 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
 		return NULL;
 	}
 
-	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
-	if (!pkt->buf) {
-		kfree(pkt);
-		return NULL;
-	}
-
 	pkt->buf_len = pkt->len;
+	vsock = container_of(vq->dev, struct vhost_vsock, dev);
 
-	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
-	if (nbytes != pkt->len) {
-		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
-		       pkt->len, nbytes);
-		virtio_transport_free_pkt(pkt);
-		return NULL;
+	if (!vsock->zerocopy_rx_on) {
+		pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
+
+		if (!pkt->buf) {
+			kfree(pkt);
+			return NULL;
+		}
+
+		pkt->slab_buf = true;
+		nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
+		if (nbytes != pkt->len) {
+			vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
+				pkt->len, nbytes);
+			virtio_transport_free_pkt(pkt);
+			return NULL;
+		}
+	} else {
+		struct page *buf_page;
+		ssize_t pkt_len;
+		int page_idx;
+
+		/* This creates memory overhead, as we allocate
+		 * at least one page for each packet.
+		 */
+		buf_page = alloc_pages(GFP_KERNEL, get_order(pkt->len));
+
+		if (buf_page == NULL) {
+			kfree(pkt);
+			return NULL;
+		}
+
+		pkt->buf = page_to_virt(buf_page);
+
+		page_idx = 0;
+		pkt_len = pkt->len;
+
+		/* As allocated pages are not mapped, process
+		 * pages one by one.
+		 */
+		while (pkt_len > 0) {
+			void *mapped;
+			size_t to_copy;
+
+			mapped = kmap(buf_page + page_idx);
+
+			if (mapped == NULL) {
+				virtio_transport_free_pkt(pkt);
+				return NULL;
+			}
+
+			to_copy = min(pkt_len, ((ssize_t)PAGE_SIZE));
+
+			nbytes = copy_from_iter(mapped, to_copy, &iov_iter);
+			if (nbytes != to_copy) {
+				vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
+				       to_copy, nbytes);
+				kunmap(buf_page + page_idx);
+				virtio_transport_free_pkt(pkt);
+				return NULL;
+			}
+
+			kunmap(buf_page + page_idx);
+
+			pkt_len -= to_copy;
+			page_idx++;
+		}
 	}
 
 	return pkt;
-- 
2.25.1


* [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
  2022-06-03  5:31 ` [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
  2022-06-03  5:33 ` [RFC PATCH v2 2/8] vhost/vsock: " Arseniy Krasnov
@ 2022-06-03  5:35 ` Arseniy Krasnov
  2022-06-09  8:39     ` Stefano Garzarella
  2022-06-03  5:37 ` [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:35 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This:
1) Adds a callback for the 'mmap()' call on a socket. It checks
   the vm area flags and sets the vm area ops.
2) Adds a special 'getsockopt()' case which calls the transport's
   zerocopy callback. The input argument is the vm area address.
3) Adds 'getsockopt()'/'setsockopt()' for switching rx zerocopy
   mode on/off.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 include/net/af_vsock.h          |   7 +++
 include/uapi/linux/vm_sockets.h |   3 +
 net/vmw_vsock/af_vsock.c        | 100 ++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f742e50207fb..f15f84c648ff 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -135,6 +135,13 @@ struct vsock_transport {
 	bool (*stream_is_active)(struct vsock_sock *);
 	bool (*stream_allow)(u32 cid, u32 port);
 
+	int (*rx_zerocopy_set)(struct vsock_sock *vsk,
+			       bool enable);
+	int (*rx_zerocopy_get)(struct vsock_sock *vsk);
+	int (*zerocopy_dequeue)(struct vsock_sock *vsk,
+				struct vm_area_struct *vma,
+				unsigned long addr);
+
 	/* SEQ_PACKET. */
 	ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
 				     int flags);
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index c60ca33eac59..d1f792bed1a7 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -83,6 +83,9 @@
 
 #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
 
+#define SO_VM_SOCKETS_MAP_RX 9
+#define SO_VM_SOCKETS_ZEROCOPY 10
+
 #if !defined(__KERNEL__)
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
 #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f04abf662ec6..10061ef21730 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1644,6 +1644,17 @@ static int vsock_connectible_setsockopt(struct socket *sock,
 		}
 		break;
 	}
+	case SO_VM_SOCKETS_ZEROCOPY: {
+		if (!transport || !transport->rx_zerocopy_set) {
+			err = -EOPNOTSUPP;
+		} else {
+			COPY_IN(val);
+
+			if (transport->rx_zerocopy_set(vsk, val))
+				err = -EINVAL;
+		}
+		break;
+	}
 
 	default:
 		err = -ENOPROTOOPT;
@@ -1657,6 +1668,48 @@ static int vsock_connectible_setsockopt(struct socket *sock,
 	return err;
 }
 
+static const struct vm_operations_struct afvsock_vm_ops = {
+};
+
+static int vsock_recv_zerocopy(struct socket *sock,
+			       unsigned long address)
+{
+	struct sock *sk = sock->sk;
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vm_area_struct *vma;
+	const struct vsock_transport *transport;
+	int res;
+
+	transport = vsk->transport;
+
+	if (!transport || !transport->rx_zerocopy_get)
+		return -EOPNOTSUPP;
+
+	if (!transport->rx_zerocopy_get(vsk))
+		return -EOPNOTSUPP;
+
+	if (!transport->zerocopy_dequeue)
+		return -EOPNOTSUPP;
+
+	lock_sock(sk);
+	mmap_write_lock(current->mm);
+
+	vma = vma_lookup(current->mm, address);
+
+	if (!vma || vma->vm_ops != &afvsock_vm_ops) {
+		mmap_write_unlock(current->mm);
+		release_sock(sk);
+		return -EINVAL;
+	}
+
+	res = transport->zerocopy_dequeue(vsk, vma, address);
+
+	mmap_write_unlock(current->mm);
+	release_sock(sk);
+
+	return res;
+}
+
 static int vsock_connectible_getsockopt(struct socket *sock,
 					int level, int optname,
 					char __user *optval,
@@ -1701,6 +1754,39 @@ static int vsock_connectible_getsockopt(struct socket *sock,
 		lv = sock_get_timeout(vsk->connect_timeout, &v,
 				      optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
 		break;
+	case SO_VM_SOCKETS_ZEROCOPY: {
+		const struct vsock_transport *transport;
+		int res;
+
+		transport = vsk->transport;
+
+		if (!transport || !transport->rx_zerocopy_get)
+			return -EOPNOTSUPP;
+
+		lock_sock(sk);
+
+		res = transport->rx_zerocopy_get(vsk);
+
+		release_sock(sk);
+
+		if (res < 0)
+			return -EINVAL;
+
+		v.val64 = res;
+
+		break;
+	}
+	case SO_VM_SOCKETS_MAP_RX: {
+		unsigned long vma_addr;
+
+		if (len < sizeof(vma_addr))
+			return -EINVAL;
+
+		if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
+			return -EFAULT;
+
+		return vsock_recv_zerocopy(sock, vma_addr);
+	}
 
 	default:
 		return -ENOPROTOOPT;
@@ -2129,6 +2215,19 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	return err;
 }
 
+static int afvsock_mmap(struct file *file, struct socket *sock,
+			struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+		return -EPERM;
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= (VM_MIXEDMAP);
+	vma->vm_ops = &afvsock_vm_ops;
+
+	return 0;
+}
+
 static const struct proto_ops vsock_stream_ops = {
 	.family = PF_VSOCK,
 	.owner = THIS_MODULE,
@@ -2148,6 +2247,6 @@ static const struct proto_ops vsock_stream_ops = {
 	.recvmsg = vsock_connectible_recvmsg,
-	.mmap = sock_no_mmap,
+	.mmap = afvsock_mmap,
 	.sendpage = sock_no_sendpage,
 };
 
 static const struct proto_ops vsock_seqpacket_ops = {
-- 
2.25.1


* [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (2 preceding siblings ...)
  2022-06-03  5:35 ` [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
@ 2022-06-03  5:37 ` Arseniy Krasnov
  2022-06-09  8:41     ` Stefano Garzarella
  2022-06-03  5:39 ` [RFC PATCH v2 5/8] vhost/vsock: enable " Arseniy Krasnov
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:37 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds a transport callback which processes the rx
queue of a socket and, instead of copying data to the
user provided buffer, inserts the data pages of each
packet into the user's vm area.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 include/linux/virtio_vsock.h            |   4 +
 include/uapi/linux/virtio_vsock.h       |   6 +
 net/vmw_vsock/virtio_transport_common.c | 208 +++++++++++++++++++++++-
 3 files changed, 215 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index d02cb7aa922f..47a68a2ea838 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -51,6 +51,7 @@ struct virtio_vsock_pkt {
 	bool reply;
 	bool tap_delivered;
 	bool slab_buf;
+	bool split;
 };
 
 struct virtio_vsock_pkt_info {
@@ -131,6 +132,9 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
 				struct sockaddr_vm *addr);
 bool virtio_transport_dgram_allow(u32 cid, u32 port);
 
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+				      struct vm_area_struct *vma,
+				      unsigned long addr);
 int virtio_transport_connect(struct vsock_sock *vsk);
 
 int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..6775c6c44b5b 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -66,6 +66,12 @@ struct virtio_vsock_hdr {
 	__le32	fwd_cnt;
 } __attribute__((packed));
 
+struct virtio_vsock_usr_hdr {
+	__u32 flags;
+	__u32 len;
+	__u32 copy_len;
+} __attribute__((packed));
+
 enum virtio_vsock_type {
 	VIRTIO_VSOCK_TYPE_STREAM = 1,
 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 278567f748f2..3a3e84176c75 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -12,6 +12,7 @@
 #include <linux/ctype.h>
 #include <linux/list.h>
 #include <linux/virtio_vsock.h>
+#include <linux/mm.h>
 #include <uapi/linux/vsockmon.h>
 
 #include <net/sock.h>
@@ -347,6 +348,196 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
 	return err;
 }
 
+#define MAX_PAGES_TO_MAP 256
+
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+				      struct vm_area_struct *vma,
+				      unsigned long addr)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct virtio_vsock_usr_hdr *usr_hdr_buffer;
+	unsigned long max_pages_to_insert;
+	unsigned long tmp_pages_inserted;
+	unsigned long pages_to_insert = 0;
+	struct page *usr_hdr_page;
+	unsigned long vma_size;
+	struct page **pages;
+	int max_vma_pages;
+	int max_usr_hdrs;
+	int res;
+	int err;
+	int i;
+
+	/* Only use VMA from first page. */
+	if (vma->vm_start != addr)
+		return -EFAULT;
+
+	vma_size = vma->vm_end - vma->vm_start;
+
+	/* The vma is too small (we need at least one page
+	 * for headers and one page for data).
+	 */
+	if (vma_size < 2 * PAGE_SIZE)
+		return -EFAULT;
+
+	/* Page for meta data. */
+	usr_hdr_page = alloc_page(GFP_KERNEL);
+
+	if (!usr_hdr_page)
+		return -EFAULT;
+
+	pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(pages[0]), GFP_KERNEL);
+
+	if (!pages)
+		return -EFAULT;
+
+	pages[pages_to_insert++] = usr_hdr_page;
+
+	usr_hdr_buffer = page_to_virt(usr_hdr_page);
+
+	err = 0;
+
+	/* As the first page is used for headers, the number of pages
+	 * for the user is the min between the number of headers in
+	 * the first page and the vma size in pages, minus the first.
+	 */
+	max_usr_hdrs = PAGE_SIZE / sizeof(*usr_hdr_buffer);
+	max_vma_pages = (vma_size / PAGE_SIZE) - 1;
+	max_pages_to_insert = min(max_usr_hdrs, max_vma_pages);
+
+	if (max_pages_to_insert > MAX_PAGES_TO_MAP)
+		max_pages_to_insert = MAX_PAGES_TO_MAP;
+
+	spin_lock_bh(&vvs->rx_lock);
+
+	while (!list_empty(&vvs->rx_queue) &&
+	       pages_to_insert < max_pages_to_insert) {
+		struct virtio_vsock_pkt *pkt;
+		ssize_t rest_data_bytes;
+		size_t moved_data_bytes;
+		unsigned long pg_offs;
+
+		pkt = list_first_entry(&vvs->rx_queue,
+				       struct virtio_vsock_pkt, list);
+
+		/* The buffer was allocated by 'kmalloc()'. This can
+		 * happen when zerocopy was enabled, but we still have
+		 * a pending packet which was created before that.
+		 */
+		if (pkt->slab_buf) {
+			usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
+			usr_hdr_buffer->len = 0;
+			usr_hdr_buffer->copy_len = le32_to_cpu(pkt->hdr.len);
+			/* Tell the user to read this packet with a copy. */
+			break;
+		}
+
+		/* This can happen when the packet was partially dequeued
+		 * before by an ordinary 'read()' call. We can't handle
+		 * such a packet. Drop it.
+		 */
+		if (pkt->off % PAGE_SIZE) {
+			list_del(&pkt->list);
+			virtio_transport_dec_rx_pkt(vvs, pkt);
+			virtio_transport_free_pkt(pkt);
+			continue;
+		}
+
+		rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
+
+		/* For packets bigger than one page, split the
+		 * high-order allocated buffer into 0-order pages.
+		 * Otherwise 'vm_insert_pages()' will fail for
+		 * all pages except the first.
+		 */
+		if (rest_data_bytes > PAGE_SIZE) {
+			/* High order buffer not split yet. */
+			if (!pkt->split) {
+				split_page(virt_to_page(pkt->buf),
+					   get_order(le32_to_cpu(pkt->hdr.len)));
+				pkt->split = true;
+			}
+		}
+
+		pg_offs = pkt->off;
+		moved_data_bytes = 0;
+
+		while (rest_data_bytes &&
+		       pages_to_insert < max_pages_to_insert) {
+			struct page *buf_page;
+
+			buf_page = virt_to_page(pkt->buf + pg_offs);
+
+			pages[pages_to_insert++] = buf_page;
+			/* Get a reference to prevent this page from being
+			 * returned to the page allocator when the packet
+			 * is freed. The ref count will then be 2.
+			 */
+			get_page(buf_page);
+			pg_offs += PAGE_SIZE;
+
+			if (rest_data_bytes >= PAGE_SIZE) {
+				moved_data_bytes += PAGE_SIZE;
+				rest_data_bytes -= PAGE_SIZE;
+			} else {
+				moved_data_bytes += rest_data_bytes;
+				rest_data_bytes = 0;
+			}
+		}
+
+		usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
+		usr_hdr_buffer->len = moved_data_bytes;
+		usr_hdr_buffer->copy_len = 0;
+		usr_hdr_buffer++;
+
+		pkt->off = pg_offs;
+
+		if (rest_data_bytes == 0) {
+			list_del(&pkt->list);
+			virtio_transport_dec_rx_pkt(vvs, pkt);
+			virtio_transport_free_pkt(pkt);
+		}
+
+		/* Now ref count for all pages of packet is 1. */
+	}
+
+	/* Set the last header empty (if we have one). */
+	if (pages_to_insert - 1 < max_usr_hdrs)
+		usr_hdr_buffer->len = 0;
+
+	spin_unlock_bh(&vvs->rx_lock);
+
+	tmp_pages_inserted = pages_to_insert;
+
+	res = vm_insert_pages(vma, addr, pages, &tmp_pages_inserted);
+
+	if (res || tmp_pages_inserted) {
+		/* Failed to insert some pages, so we have a "partially"
+		 * mapped vma. Do not return yet, just set the error code.
+		 * This code will be returned to the user. The user needs
+		 * to call 'madvise()/mmap()' to clear this vma. Anyway,
+		 * references to all pages will be dropped below.
+		 */
+		err = -EFAULT;
+	}
+
+	/* Put reference for every page. */
+	for (i = 0; i < pages_to_insert; i++) {
+		/* Ref count is 2 ('get_page()' + 'vm_insert_pages()' above).
+		 * Put the reference once; the page will be returned to the
+		 * allocator after the user's 'madvise()/munmap()' call (or
+		 * it was never mapped if 'vm_insert_pages()' failed).
+		 */
+		put_page(pages[i]);
+	}
+
+	virtio_transport_send_credit_update(vsk);
+	kfree(pages);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
+
 static ssize_t
 virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 				   struct msghdr *msg,
@@ -1344,10 +1535,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
 void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
 {
 	if (pkt->buf_len) {
-		if (pkt->slab_buf)
+		if (pkt->slab_buf) {
 			kfree(pkt->buf);
-		else
-			free_pages((unsigned long)pkt->buf, get_order(pkt->buf_len));
+		} else {
+			unsigned int order = get_order(pkt->buf_len);
+			unsigned long buf = (unsigned long)pkt->buf;
+
+			if (pkt->split) {
+				int i;
+
+				for (i = 0; i < (1 << order); i++)
+					free_page(buf + i * PAGE_SIZE);
+			} else {
+				free_pages(buf, order);
+			}
+		}
 	}
 
 	kfree(pkt);
-- 
2.25.1


* [RFC PATCH v2 5/8] vhost/vsock: enable zerocopy callback
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (3 preceding siblings ...)
  2022-06-03  5:37 ` [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
@ 2022-06-03  5:39 ` Arseniy Krasnov
  2022-06-03  5:41 ` [RFC PATCH v2 6/8] virtio/vsock: " Arseniy Krasnov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:39 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds the zerocopy callbacks to the vhost transport.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 drivers/vhost/vsock.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 0dc2229f18f7..dcb8182f5ac9 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -481,6 +481,43 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
 	return val < vq->num;
 }
 
+static int vhost_transport_zerocopy_set(struct vsock_sock *vsk, bool enable)
+{
+	struct vhost_vsock *vsock;
+
+	rcu_read_lock();
+	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+
+	if (!vsock) {
+		rcu_read_unlock();
+		return -ENODEV;
+	}
+
+	vsock->zerocopy_rx_on = enable;
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int vhost_transport_zerocopy_get(struct vsock_sock *vsk)
+{
+	struct vhost_vsock *vsock;
+	bool res;
+
+	rcu_read_lock();
+	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+
+	if (!vsock) {
+		rcu_read_unlock();
+		return -ENODEV;
+	}
+
+	res = vsock->zerocopy_rx_on;
+	rcu_read_unlock();
+
+	return res;
+}
+
 static bool vhost_transport_seqpacket_allow(u32 remote_cid);
 
 static struct virtio_transport vhost_transport = {
@@ -508,6 +545,9 @@ static struct virtio_transport vhost_transport = {
 		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
 		.stream_is_active         = virtio_transport_stream_is_active,
 		.stream_allow             = virtio_transport_stream_allow,
+		.zerocopy_dequeue	  = virtio_transport_zerocopy_dequeue,
+		.rx_zerocopy_set	  = vhost_transport_zerocopy_set,
+		.rx_zerocopy_get	  = vhost_transport_zerocopy_get,
 
 		.seqpacket_dequeue        = virtio_transport_seqpacket_dequeue,
 		.seqpacket_enqueue        = virtio_transport_seqpacket_enqueue,
-- 
2.25.1


* [RFC PATCH v2 6/8] virtio/vsock: enable zerocopy callback
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (4 preceding siblings ...)
  2022-06-03  5:39 ` [RFC PATCH v2 5/8] vhost/vsock: enable " Arseniy Krasnov
@ 2022-06-03  5:41 ` Arseniy Krasnov
  2022-06-03  5:43 ` [RFC PATCH v2 7/8] test/vsock: add receive zerocopy tests Arseniy Krasnov
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:41 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Paolo Abeni, Jakub Kicinski
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds the zerocopy callbacks for the virtio transport.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 net/vmw_vsock/virtio_transport.c | 43 ++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 19909c1e9ba3..2e05b01caa94 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -64,6 +64,7 @@ struct virtio_vsock {
 
 	u32 guest_cid;
 	bool seqpacket_allow;
+	bool zerocopy_rx_on;
 };
 
 static u32 virtio_transport_get_local_cid(void)
@@ -455,6 +456,45 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
 
 static bool virtio_transport_seqpacket_allow(u32 remote_cid);
 
+static int
+virtio_transport_zerocopy_set(struct vsock_sock *vsk, bool enable)
+{
+	struct virtio_vsock *vsock;
+
+	rcu_read_lock();
+	vsock = rcu_dereference(the_virtio_vsock);
+
+	if (!vsock) {
+		rcu_read_unlock();
+		return -ENODEV;
+	}
+
+	vsock->zerocopy_rx_on = enable;
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int
+virtio_transport_zerocopy_get(struct vsock_sock *vsk)
+{
+	struct virtio_vsock *vsock;
+	bool res;
+
+	rcu_read_lock();
+	vsock = rcu_dereference(the_virtio_vsock);
+
+	if (!vsock) {
+		rcu_read_unlock();
+		return -ENODEV;
+	}
+
+	res = vsock->zerocopy_rx_on;
+	rcu_read_unlock();
+
+	return res;
+}
+
 static struct virtio_transport virtio_transport = {
 	.transport = {
 		.module                   = THIS_MODULE,
@@ -480,6 +520,9 @@ static struct virtio_transport virtio_transport = {
 		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
 		.stream_is_active         = virtio_transport_stream_is_active,
 		.stream_allow             = virtio_transport_stream_allow,
+		.zerocopy_dequeue         = virtio_transport_zerocopy_dequeue,
+		.rx_zerocopy_set	  = virtio_transport_zerocopy_set,
+		.rx_zerocopy_get	  = virtio_transport_zerocopy_get,
 
 		.seqpacket_dequeue        = virtio_transport_seqpacket_dequeue,
 		.seqpacket_enqueue        = virtio_transport_seqpacket_enqueue,
-- 
2.25.1


* [RFC PATCH v2 7/8] test/vsock: add receive zerocopy tests
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (5 preceding siblings ...)
  2022-06-03  5:41 ` [RFC PATCH v2 6/8] virtio/vsock: " Arseniy Krasnov
@ 2022-06-03  5:43 ` Arseniy Krasnov
  2022-06-03  5:44 ` [RFC PATCH v2 8/8] test/vsock: vsock rx zerocopy utility Arseniy Krasnov
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:43 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds tests for the zerocopy feature: one test
checks data transmission with simple integrity
control; the second test covers the 'error' branches
in the zerocopy logic (invalid argument handling).

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 tools/include/uapi/linux/virtio_vsock.h |  11 +
 tools/include/uapi/linux/vm_sockets.h   |   8 +
 tools/testing/vsock/control.c           |  34 +++
 tools/testing/vsock/control.h           |   2 +
 tools/testing/vsock/vsock_test.c        | 295 ++++++++++++++++++++++++
 5 files changed, 350 insertions(+)
 create mode 100644 tools/include/uapi/linux/virtio_vsock.h
 create mode 100644 tools/include/uapi/linux/vm_sockets.h

diff --git a/tools/include/uapi/linux/virtio_vsock.h b/tools/include/uapi/linux/virtio_vsock.h
new file mode 100644
index 000000000000..c23d85e73d04
--- /dev/null
+++ b/tools/include/uapi/linux/virtio_vsock.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
+#define _UAPI_LINUX_VIRTIO_VSOCK_H
+#include <linux/types.h>
+
+struct virtio_vsock_usr_hdr {
+	__u32 flags;
+	__u32 len;
+	__u32 copy_len;
+} __attribute__((packed));
+#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
diff --git a/tools/include/uapi/linux/vm_sockets.h b/tools/include/uapi/linux/vm_sockets.h
new file mode 100644
index 000000000000..cac0bc3a7041
--- /dev/null
+++ b/tools/include/uapi/linux/vm_sockets.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _UAPI_LINUX_VM_SOCKETS_H
+#define _UAPI_LINUX_VM_SOCKETS_H
+
+#define SO_VM_SOCKETS_MAP_RX 9
+#define SO_VM_SOCKETS_ZEROCOPY 10
+
+#endif /* _UAPI_LINUX_VM_SOCKETS_H */
diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
index 4874872fc5a3..00a654e8f137 100644
--- a/tools/testing/vsock/control.c
+++ b/tools/testing/vsock/control.c
@@ -141,6 +141,40 @@ void control_writeln(const char *str)
 	timeout_end();
 }
 
+void control_writelong(long value)
+{
+	char str[32];
+
+	if (snprintf(str, sizeof(str), "%li", value) >= sizeof(str)) {
+		perror("snprintf");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln(str);
+}
+
+long control_readlong(bool *ok)
+{
+	long value = -1;
+	char *str;
+
+	if (ok)
+		*ok = false;
+
+	str = control_readln();
+
+	if (str == NULL)
+		return value;
+
+	value = strtol(str, NULL, 10);
+	free(str);
+
+	if (ok)
+		*ok = true;
+
+	return value;
+}
+
 /* Return the next line from the control socket (without the trailing newline).
  *
  * The program terminates if a timeout occurs.
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
index 51814b4f9ac1..5272ad20e850 100644
--- a/tools/testing/vsock/control.h
+++ b/tools/testing/vsock/control.h
@@ -9,7 +9,9 @@ void control_init(const char *control_host, const char *control_port,
 void control_cleanup(void);
 void control_writeln(const char *str);
 char *control_readln(void);
+long control_readlong(bool *ok);
 void control_expectln(const char *str);
 bool control_cmpln(char *line, const char *str, bool fail);
+void control_writelong(long value);
 
 #endif /* CONTROL_H */
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index dc577461afc2..1b8c40bab33e 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -18,11 +18,16 @@
 #include <sys/socket.h>
 #include <time.h>
 #include <sys/mman.h>
+#include <poll.h>
+#include <uapi/linux/virtio_vsock.h>
+#include <uapi/linux/vm_sockets.h>
 
 #include "timeout.h"
 #include "control.h"
 #include "util.h"
 
+#define PAGE_SIZE 4096
+
 static void test_stream_connection_reset(const struct test_opts *opts)
 {
 	union {
@@ -596,6 +601,285 @@ static void test_seqpacket_invalid_rec_buffer_server(const struct test_opts *opt
 	close(fd);
 }
 
+static void test_stream_zerocopy_rx_client(const struct test_opts *opts)
+{
+	unsigned long total_sum;
+	unsigned long zc_on = 1;
+	size_t rx_map_len;
+	long rec_value;
+	void *rx_va;
+	int fd;
+
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+				(void *)&zc_on, sizeof(zc_on))) {
+		perror("setsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	rx_map_len = PAGE_SIZE * 3;
+
+	rx_va = mmap(NULL, rx_map_len, PROT_READ, MAP_SHARED, fd, 0);
+	if (rx_va == MAP_FAILED) {
+		perror("mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	total_sum = 0;
+
+	while (1) {
+		struct pollfd fds = { 0 };
+		int hungup = 0;
+		int res;
+
+		fds.fd = fd;
+		fds.events = POLLIN | POLLERR | POLLHUP |
+			     POLLRDHUP | POLLNVAL;
+
+		res = poll(&fds, 1, -1);
+
+		if (res < 0) {
+			perror("poll");
+			exit(EXIT_FAILURE);
+		}
+
+		if (fds.revents & POLLERR) {
+			perror("poll error");
+			exit(EXIT_FAILURE);
+		}
+
+		if (fds.revents & POLLIN) {
+			struct virtio_vsock_usr_hdr *hdr;
+			uintptr_t tmp_rx_va = (uintptr_t)rx_va;
+			unsigned char *data_va;
+			unsigned char *end_va;
+			socklen_t len = sizeof(tmp_rx_va);
+
+			if (getsockopt(fd, AF_VSOCK,
+					SO_VM_SOCKETS_MAP_RX,
+					&tmp_rx_va, &len) < 0) {
+				perror("getsockopt");
+				exit(EXIT_FAILURE);
+			}
+
+			hdr = (struct virtio_vsock_usr_hdr *)rx_va;
+			/* Skip headers page for data. */
+			data_va = rx_va + PAGE_SIZE;
+			end_va = (unsigned char *)(tmp_rx_va + rx_map_len);
+
+			while (data_va != end_va) {
+				int data_len = hdr->len;
+
+				if (!hdr->len) {
+					if (fds.revents & (POLLHUP | POLLRDHUP)) {
+						if (hdr == rx_va)
+							hungup = 1;
+					}
+
+					break;
+				}
+
+				while (data_len > 0) {
+					int i;
+					int to_read = (data_len < PAGE_SIZE) ?
+							data_len : PAGE_SIZE;
+
+					for (i = 0; i < to_read; i++)
+						total_sum += data_va[i];
+
+					data_va += PAGE_SIZE;
+					data_len -= PAGE_SIZE;
+				}
+
+				hdr++;
+			}
+
+			if (madvise((void *)rx_va, rx_map_len,
+					MADV_DONTNEED)) {
+				perror("madvise");
+				exit(EXIT_FAILURE);
+			}
+
+			if (hungup)
+				break;
+		}
+	}
+
+	if (munmap(rx_va, rx_map_len)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	rec_value = control_readlong(NULL);
+
+	if (total_sum != rec_value) {
+		fprintf(stderr, "sum mismatch %lu != %ld\n",
+				total_sum, rec_value);
+		exit(EXIT_FAILURE);
+	}
+
+	close(fd);
+}
+
+static void test_stream_zerocopy_rx_server(const struct test_opts *opts)
+{
+	size_t max_buf_size = 40000;
+	long total_sum = 0;
+	int n = 10;
+	int fd;
+
+	fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+	if (fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	while (n) {
+		unsigned char *data;
+		size_t buf_size;
+		int i;
+
+		buf_size = 1 + rand() % max_buf_size;
+
+		data = malloc(buf_size);
+
+		if (!data) {
+			perror("malloc");
+			exit(EXIT_FAILURE);
+		}
+
+		for (i = 0; i < buf_size; i++) {
+			data[i] = rand() & 0xff;
+			total_sum += data[i];
+		}
+
+		if (write(fd, data, buf_size) != buf_size) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+
+		free(data);
+		n--;
+	}
+
+	control_writelong(total_sum);
+
+	close(fd);
+}
+
+static void test_stream_zerocopy_rx_inv_client(const struct test_opts *opts)
+{
+	size_t map_size = PAGE_SIZE * 5;
+	unsigned long zc_on = 1;
+	socklen_t len;
+	void *map_va;
+	int fd;
+
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	len = sizeof(map_va);
+	map_va = 0;
+
+	if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+				(void *)&zc_on, sizeof(zc_on))) {
+		perror("setsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with invalid mapping address. */
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with valid, but not socket mapping. */
+	map_va = mmap(NULL, map_size, PROT_READ,
+		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (map_va == MAP_FAILED) {
+		perror("anon mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (munmap(map_va, map_size)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with valid, but too small mapping. */
+	map_va = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+	if (map_va == MAP_FAILED) {
+		perror("socket mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (munmap(map_va, PAGE_SIZE)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with valid mapping, but not from first byte. */
+	map_va = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (map_va == MAP_FAILED) {
+		perror("socket mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	map_va += PAGE_SIZE;
+
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (munmap(map_va - PAGE_SIZE, map_size)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln("DONE");
+
+	close(fd);
+}
+
+static void test_stream_zerocopy_rx_inv_server(const struct test_opts *opts)
+{
+	int fd;
+
+	fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+
+	if (fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	control_expectln("DONE");
+
+	close(fd);
+}
+
 static struct test_case test_cases[] = {
 	{
 		.name = "SOCK_STREAM connection reset",
@@ -646,6 +930,16 @@ static struct test_case test_cases[] = {
 		.run_client = test_seqpacket_invalid_rec_buffer_client,
 		.run_server = test_seqpacket_invalid_rec_buffer_server,
 	},
+	{
+		.name = "SOCK_STREAM zerocopy receive",
+		.run_client = test_stream_zerocopy_rx_client,
+		.run_server = test_stream_zerocopy_rx_server,
+	},
+	{
+		.name = "SOCK_STREAM zerocopy invalid",
+		.run_client = test_stream_zerocopy_rx_inv_client,
+		.run_server = test_stream_zerocopy_rx_inv_server,
+	},
 	{},
 };
 
@@ -729,6 +1023,7 @@ int main(int argc, char **argv)
 		.peer_cid = VMADDR_CID_ANY,
 	};
 
+	srand(time(NULL));
 	init_signals();
 
 	for (;;) {
-- 
2.25.1


* [RFC PATCH v2 8/8] test/vsock: vsock rx zerocopy utility
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (6 preceding siblings ...)
  2022-06-03  5:43 ` [RFC PATCH v2 7/8] test/vsock: add receive zerocopy tests Arseniy Krasnov
@ 2022-06-03  5:44 ` Arseniy Krasnov
  2022-06-08 15:59 ` [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Zhao Liu
  2022-06-09  8:54   ` Stefano Garzarella
  9 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-03  5:44 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds a simple utility for zerocopy receive benchmarking.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 tools/testing/vsock/Makefile      |   1 +
 tools/testing/vsock/rx_zerocopy.c | 356 ++++++++++++++++++++++++++++++
 2 files changed, 357 insertions(+)
 create mode 100644 tools/testing/vsock/rx_zerocopy.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index f8293c6910c9..2cb5820ca2f3 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -3,6 +3,7 @@ all: test
 test: vsock_test vsock_diag_test
 vsock_test: vsock_test.o timeout.o control.o util.o
 vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
+rx_zerocopy: rx_zerocopy.o timeout.o control.o util.o
 
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
diff --git a/tools/testing/vsock/rx_zerocopy.c b/tools/testing/vsock/rx_zerocopy.c
new file mode 100644
index 000000000000..55deaa665752
--- /dev/null
+++ b/tools/testing/vsock/rx_zerocopy.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * rx_zerocopy - benchmark utility for zerocopy
+ * receive.
+ *
+ * Copyright (C) 2022 SberDevices.
+ *
+ * Author: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
+ */
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <time.h>
+#include <sys/mman.h>
+#include <stdint.h>
+#include <poll.h>
+#include <uapi/linux/virtio_vsock.h>
+#include <uapi/linux/vm_sockets.h>
+
+#include "util.h"
+
+#define PAGE_SIZE		4096
+
+#define DEFAULT_TX_SIZE		128
+#define DEFAULT_RX_SIZE		128
+#define DEFAULT_PORT		1234
+
+static int client_mode = 1;
+static int peer_cid = -1;
+static int port = DEFAULT_PORT;
+static unsigned long tx_buf_size;
+static unsigned long rx_buf_size;
+static unsigned long mb_to_send = 40;
+
+static time_t current_nsec(void)
+{
+	struct timespec ts;
+
+	if (clock_gettime(CLOCK_REALTIME, &ts)) {
+		perror("clock_gettime");
+		exit(EXIT_FAILURE);
+	}
+
+	return (ts.tv_sec * 1000000000ULL) + ts.tv_nsec;
+}
+
+/* The server accepts a connection and sends data to the client. */
+static void run_server(void)
+{
+	int fd;
+	char *data;
+	int client_fd;
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} clientaddr;
+
+	socklen_t clientaddr_len = sizeof(clientaddr.svm);
+	time_t tx_begin_ns;
+	ssize_t total_send = 0;
+
+	fprintf(stderr, "Running server, listen %i, mb %lu tx buf %lu\n",
+			port, mb_to_send, tx_buf_size);
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
+
+	if (client_fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	data = malloc(tx_buf_size);
+
+	if (data == NULL) {
+		fprintf(stderr, "malloc failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	memset(data, 0, tx_buf_size);
+	tx_begin_ns = current_nsec();
+
+	while (1) {
+		ssize_t sent;
+
+		if (total_send > mb_to_send * 1024 * 1024ULL)
+			break;
+
+		sent = write(client_fd, data, tx_buf_size);
+
+		if (sent <= 0) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+
+		total_send += sent;
+	}
+
+	free(data);
+
+	fprintf(stderr, "Total %zi MB, time %f\n", mb_to_send,
+			(float)(current_nsec() - tx_begin_ns)/1000.0/1000.0/1000.0);
+
+	close(fd);
+	close(client_fd);
+}
+
+static void run_client(int zerocopy)
+{
+	int fd;
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = peer_cid,
+		},
+	};
+	unsigned long sum = 0;
+	void *rx_va = NULL;
+	unsigned long zc_on = 1;
+
+	printf("Running client, %s mode, peer %i:%i, rx buf %lu\n",
+		zerocopy ? "zerocopy" : "copy", peer_cid, port,
+		rx_buf_size);
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (connect(fd, &addr.sa, sizeof(addr.svm))) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+				(void *)&zc_on, sizeof(zc_on))) {
+		perror("setsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (zerocopy) {
+		rx_va = mmap(NULL, rx_buf_size,
+				PROT_READ, MAP_SHARED, fd, 0);
+
+		if (rx_va == MAP_FAILED) {
+			perror("mmap");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	while (1) {
+		struct pollfd fds = { 0 };
+		int done = 0;
+
+		fds.fd = fd;
+		fds.events = POLLIN | POLLERR | POLLHUP |
+			     POLLRDHUP | POLLNVAL;
+
+		if (poll(&fds, 1, -1) < 0) {
+			perror("poll");
+			exit(EXIT_FAILURE);
+		}
+
+		if (fds.revents & (POLLHUP | POLLRDHUP))
+			done = 1;
+
+		if (fds.revents & POLLERR) {
+			fprintf(stderr, "Done error\n");
+			break;
+		}
+
+		if (fds.revents & POLLIN) {
+			if (zerocopy) {
+				struct virtio_vsock_usr_hdr *hdr;
+				uintptr_t tmp_rx_va = (uintptr_t)rx_va;
+				socklen_t len = sizeof(tmp_rx_va);
+
+				if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+						&tmp_rx_va, &len) < 0) {
+					perror("getsockopt");
+					exit(EXIT_FAILURE);
+				}
+
+				hdr = (struct virtio_vsock_usr_hdr *)tmp_rx_va;
+
+				if (!hdr->len) {
+					if (done) {
+						fprintf(stderr, "Done, sum %lu\n", sum);
+						break;
+					}
+				}
+
+				tmp_rx_va += PAGE_SIZE;
+
+				if (madvise((void *)rx_va, rx_buf_size,
+						MADV_DONTNEED)) {
+					perror("madvise");
+					exit(EXIT_FAILURE);
+				}
+			} else {
+				char data[rx_buf_size - PAGE_SIZE];
+				ssize_t bytes_read;
+
+				bytes_read = read(fd, data, sizeof(data));
+
+				if (bytes_read <= 0)
+					break;
+			}
+		}
+	}
+}
+
+static const char optstring[] = "";
+static const struct option longopts[] = {
+	{
+		.name = "mode",
+		.has_arg = required_argument,
+		.val = 'm',
+	},
+	{
+		.name = "zerocopy",
+		.has_arg = no_argument,
+		.val = 'z',
+	},
+	{
+		.name = "cid",
+		.has_arg = required_argument,
+		.val = 'c',
+	},
+	{
+		.name = "port",
+		.has_arg = required_argument,
+		.val = 'p',
+	},
+	{
+		.name = "mb",
+		.has_arg = required_argument,
+		.val = 's',
+	},
+	{
+		.name = "tx",
+		.has_arg = required_argument,
+		.val = 't',
+	},
+	{
+		.name = "rx",
+		.has_arg = required_argument,
+		.val = 'r',
+	},
+	{
+		.name = "help",
+		.has_arg = no_argument,
+		.val = '?',
+	},
+	{},
+};
+
+int main(int argc, char **argv)
+{
+	int zerocopy = 0;
+
+	for (;;) {
+		int opt = getopt_long(argc, argv, optstring, longopts, NULL);
+
+		if (opt == -1)
+			break;
+
+		switch (opt) {
+		case 's':
+			mb_to_send = atoi(optarg);
+			break;
+		case 'c':
+			peer_cid = atoi(optarg);
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		case 'r':
+			rx_buf_size = atoi(optarg);
+			break;
+		case 't':
+			tx_buf_size = atoi(optarg);
+			break;
+		case 'm':
+			if (strcmp(optarg, "client") == 0)
+				client_mode = 1;
+			else if (strcmp(optarg, "server") == 0)
+				client_mode = 0;
+			else {
+				fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
+				return EXIT_FAILURE;
+			}
+			break;
+		case 'z':
+			zerocopy = 1;
+			break;
+		default:
+			break;
+		}
+
+	}
+
+	if (!tx_buf_size)
+		tx_buf_size = DEFAULT_TX_SIZE;
+
+	if (!rx_buf_size)
+		rx_buf_size = DEFAULT_RX_SIZE;
+
+	tx_buf_size *= PAGE_SIZE;
+	rx_buf_size *= PAGE_SIZE;
+
+	srand(time(NULL));
+
+	if (client_mode)
+		run_client(zerocopy);
+	else
+		run_server();
+
+	return 0;
+}
-- 
2.25.1


* Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (7 preceding siblings ...)
  2022-06-03  5:44 ` [RFC PATCH v2 8/8] test/vsock: vsock rx zerocopy utility Arseniy Krasnov
@ 2022-06-08 15:59 ` Zhao Liu
  2022-06-09  8:54   ` Stefano Garzarella
  9 siblings, 0 replies; 29+ messages in thread
From: Zhao Liu @ 2022-06-08 15:59 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: linux-kernel-AT-vger.kernel.org, kvm-AT-vger.kernel.org,
	virtualization-AT-lists.linux-foundation.org,
	netdev-AT-vger.kernel.org, kernel, Krasnov Arseniy,
	Arseniy Krasnov, Zhao Liu

On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
> Date:   Fri, 3 Jun 2022 05:27:56 +0000
> From: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> Subject: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
> 
>                               INTRODUCTION
> 
> 	Hello, this is experimental implementation of virtio vsock zerocopy
> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
> fill provided vma area with pages of virtio RX buffers. After received data was
> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
> 
>                                  DETAILS
> 
> 	Here is how mapping with mapped pages looks exactly: first page mapping
> contains array of trimmed virtio vsock packet headers (in contains only length
> of data on the corresponding page and 'flags' field):
> 
> 	struct virtio_vsock_usr_hdr {
> 		uint32_t length;
> 		uint32_t flags;
> 		uint32_t copy_len;
> 	};
> 
> Field  'length' allows user to know exact size of payload within each sequence
> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
> All other pages are data pages from RX queue.
> 
>              Page 0      Page 1      Page N
> 
> 	[ hdr1 .. hdrN ][ data ] .. [ data ]
>            |        |       ^           ^
>            |        |       |           |
>            |        *-------------------*
>            |                |
>            |                |
>            *----------------*
> 
> 	Of course, single header could represent array of pages (when packet's
> buffer is bigger than one page).So here is example of detailed mapping layout
> for some set of packages. Lets consider that we have the following sequence  of
> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
> be inserted to user's vma(vma is large enough).
> 
> 	Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]

Hi Arseniy, what about adding a `header` for the `virtio_vsock_usr_hdr` array in Page 0?

Page 0 can be like this:

	Page 0: [[ header for hdrs ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]

We can store the number of headers/pages in this first general header:

	struct virtio_vsock_general_hdr {
		uint32_t usr_hdr_num;
	};

This usr_hdr_num represents how many user headers are valid for the mapping.
At most 256 pages will be used per call, and this kind of summary information
is useful.
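
For example, userspace could then walk Page 0 of the mapping like this
(rough sketch; struct virtio_vsock_general_hdr is from my proposal above,
page0 is the start of the rx mapping, and process() stands in for the
application's own data handling):

	struct virtio_vsock_general_hdr *ghdr = page0;
	struct virtio_vsock_usr_hdr *hdr =
		(struct virtio_vsock_usr_hdr *)(ghdr + 1);
	uint32_t i;

	/* No need to scan for the zero-length end marker: the general
	 * header says exactly how many entries are valid.
	 */
	for (i = 0; i < ghdr->usr_hdr_num; i++)
		process(hdr[i].len, hdr[i].flags);
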
> 	Page 1: [ 56 ]
> 	Page 2: [ 4096 ]
> 	Page 3: [ 4096 ]
> 	Page 4: [ 4096 ]
> 	Page 5: [ 8 ]
> 
> 	Page 0 contains only array of headers:
> 	'hdr0' has 56 in length field.
> 	'hdr1' has 4096 in length field.
> 	'hdr2' has 8200 in length field.
> 	'hdr3' has 0 in length field(this is end of data marker).
> 
> 	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> 	Page 2 corresponds to 'hdr1' and filled with data.
> 	Page 3 corresponds to 'hdr2' and filled with data.
> 	Page 4 corresponds to 'hdr2' and filled with data.
> 	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
> 
> 	This patchset also changes packets allocation way: today implementation
> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
> "not large" could be bigger than one page). So to avoid this, data buffers now
> allocated using 'alloc_pages()' call.
> 
>                                    TESTS
> 
> 	This patchset updates 'vsock_test' utility: two tests for new feature
> were added. The first test covers invalid cases. The second checks the valid
> transmission case.
> 
>                                 BENCHMARKING
> 
> 	For benchmarking I've added a small utility 'rx_zerocopy'. It works in
>client/server mode. When the client connects to the server, the server starts
>sending an exact amount of data to the client (the amount is set as an input
>argument). The client reads the data and waits for the next portion of it. The
>client works in two modes: copy and zero-copy. In copy mode the client uses the
>'read()' call, while in zerocopy mode the sequence of 'mmap()'/'getsockopt()'/
>'madvise()' calls is used. A smaller transmission time is better. For the
>server, we can set the size of the tx buffer, and for the client we can set the
>size of the rx buffer or the rx mapping size (in zerocopy mode). Usage of this
>utility is quite simple:
> 
> For client mode:
> 
> ./rx_zerocopy --mode client [--zerocopy] [--rx]
> 
> For server mode:
> 
> ./rx_zerocopy --mode server [--mb] [--tx]
> 
> [--mb] sets number of megabytes to transfer.
> [--rx] sets size of receive buffer/mapping in pages.
> [--tx] sets size of transmit buffer in pages.
> 
> I checked the transmission of 4000 MB of data. Here are some results:
> 
>                            size of rx/tx buffers in pages
>                *---------------------------------------------------*
>                |    8   |    32    |    64   |   256    |   512    |
> *--------------*--------*----------*---------*----------*----------*
> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
> *--------------*--------*----------*---------*----------*----------* process
> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
> *--------------*--------*----------*---------*----------*----------*
> 
> The result in the first column (where non-zerocopy works better than zerocopy)
> happens because the time spent in the 'read()' system call is smaller than the
> time spent in 'getsockopt()' + 'madvise()'. I've checked that.
> 
> I think the results are not so impressive, but at least zerocopy is not worse
> than copy mode, and there is no need to allocate memory for processing the data.
> 
>                                  PROBLEMS
> 
> 	The updated packet allocation logic creates a problem: when the host gets
> data from the guest (in vhost-vsock), it allocates at least one page for each
> packet (even if the packet has a 1 byte payload). I think this could be
> resolved in several ways:
> 	1) Make zerocopy rx mode disabled by default, so if the user didn't enable
> it, the current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)
> 	2) Use 'kmalloc()' for "small" packets, else call the page allocator. But
> in this case we have a mix of packets allocated in two different ways, so
> during zerocopying to the user (e.g. mapping pages into the vma), such small
> packets have to be handled in a clumsy way: we need to allocate one page for
> the user, copy the data to it and then insert the page into the user's vma.
> 
> v1 -> v2:
>  1) Zerocopy receive mode can be enabled/disabled (disabled by default). I
>     didn't use the generic SO_ZEROCOPY flag, because in the virtio-vsock case
>     this feature depends on transport support. Instead of SO_ZEROCOPY, an
>     AF_VSOCK layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while the previous
>     meaning of SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into the user's
>     vm area) is now renamed to SO_VM_SOCKETS_MAP_RX.
>  2) The packet header which is exported to the user now gets a new field:
>     'copy_len'. This field handles a special case: the user reads data from
>     the socket in a non-zerocopy way (with zerocopy disabled) and then enables
>     the zerocopy feature. In this case the vhost part will switch the data
>     buffer allocation logic from 'kmalloc()' to direct calls to the buddy
>     allocator. But there could be some pending 'kmalloc()'-allocated packets
>     in the socket's rx list, and when the user then tries to read such packets
>     in a zerocopy way, the dequeue will fail, because SLAB pages cannot be
>     inserted into the user's vm area. So when such a packet is found during a
>     zerocopy dequeue, the dequeue loop will break and 'copy_len' will show the
>     size of such a "bad" packet. After the user detects this case, it must use
>     'read()/recv()' calls to dequeue that packet.
>  3) Also, maybe move this feature under a config option?
> 
> Arseniy Krasnov(8)
>  virtio/vsock: rework packet allocation logic
>  vhost/vsock: rework packet allocation logic
>  af_vsock: add zerocopy receive logic
>  virtio/vsock: add transport zerocopy callback
>  vhost/vsock: enable zerocopy callback
>  virtio/vsock: enable zerocopy callback
>  test/vsock: add receive zerocopy tests
>  test/vsock: vsock rx zerocopy utility
> 
>  drivers/vhost/vsock.c                   | 121 +++++++++--
>  include/linux/virtio_vsock.h            |   5 +
>  include/net/af_vsock.h                  |   7 +
>  include/uapi/linux/virtio_vsock.h       |   6 +
>  include/uapi/linux/vm_sockets.h         |   3 +
>  net/vmw_vsock/af_vsock.c                | 100 +++++++++
>  net/vmw_vsock/virtio_transport.c        |  51 ++++-
>  net/vmw_vsock/virtio_transport_common.c | 211 ++++++++++++++++++-
>  tools/include/uapi/linux/virtio_vsock.h |  11 +
>  tools/include/uapi/linux/vm_sockets.h   |   8 +
>  tools/testing/vsock/Makefile            |   1 +
>  tools/testing/vsock/control.c           |  34 +++
>  tools/testing/vsock/control.h           |   2 +
>  tools/testing/vsock/rx_zerocopy.c       | 356 ++++++++++++++++++++++++++++++++
>  tools/testing/vsock/vsock_test.c        | 295 ++++++++++++++++++++++++++
>  15 files changed, 1196 insertions(+), 15 deletions(-)
> 
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic
  2022-06-03  5:31 ` [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
@ 2022-06-09  8:37     ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-09  8:37 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, David S. Miller, Jason Wang,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On Fri, Jun 03, 2022 at 05:31:00AM +0000, Arseniy Krasnov wrote:
>To support zerocopy receive, packet's buffer allocation
>is changed: for buffers which could be mapped to user's
>vma we can't use 'kmalloc()'(as kernel restricts to map
>slab pages to user's vma) and raw buddy allocator now
>called. But, for tx packets(such packets won't be mapped
>to user), previous 'kmalloc()' way is used, but with special
>flag in packet's structure which allows to distinguish
>between 'kmalloc()' and raw pages buffers.
>
>Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>---
> include/linux/virtio_vsock.h            | 1 +
> net/vmw_vsock/virtio_transport.c        | 8 ++++++--
> net/vmw_vsock/virtio_transport_common.c | 9 ++++++++-
> 3 files changed, 15 insertions(+), 3 deletions(-)

Each patch should, as much as possible, avoid breaking bisectability, and
here we are not touching vhost_vsock_alloc_pkt(), which uses kmalloc() to
allocate the buffer.

I see you updated it in the next patch, and that should be fine, but here
you should set slab_buf in vhost_vsock_alloc_pkt(), or you can merge the
two patches.
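
Something like this in vhost_vsock_alloc_pkt() in this patch (rough
sketch, not compile-tested) would keep virtio_transport_free_pkt()
consistent until the next patch reworks the vhost allocation:

	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
	if (!pkt->buf) {
		kfree(pkt);
		return NULL;
	}
	/* the buffer still comes from the slab in this patch */
	pkt->slab_buf = true;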

>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index 35d7eedb5e8e..d02cb7aa922f 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
> 	u32 off;
> 	bool reply;
> 	bool tap_delivered;
>+	bool slab_buf;
> };
>
> struct virtio_vsock_pkt_info {
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index ad64f403536a..19909c1e9ba3 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
> 	vq = vsock->vqs[VSOCK_VQ_RX];
>
> 	do {
>+		struct page *buf_page;
>+
> 		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> 		if (!pkt)
> 			break;
>
>-		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>-		if (!pkt->buf) {
>+		buf_page = alloc_page(GFP_KERNEL);
>+
>+		if (!buf_page) {
> 			virtio_transport_free_pkt(pkt);
> 			break;
> 		}
>
>+		pkt->buf = page_to_virt(buf_page);
> 		pkt->buf_len = buf_len;
> 		pkt->len = buf_len;
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index ec2c2afbf0d0..278567f748f2 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
> 		if (!pkt->buf)
> 			goto out_pkt;
>
>+		pkt->slab_buf = true;
> 		pkt->buf_len = len;
>
> 		err = memcpy_from_msg(pkt->buf, info->msg, len);
>@@ -1342,7 +1343,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
>-	kfree(pkt->buf);
>+	if (pkt->buf_len) {
>+		if (pkt->slab_buf)
>+			kfree(pkt->buf);
>+		else
>+			free_pages(buf, get_order(pkt->buf_len));
>+	}
>+
> 	kfree(pkt);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>-- 
>2.25.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 2/8] vhost/vsock: rework packet allocation logic
  2022-06-03  5:33 ` [RFC PATCH v2 2/8] vhost/vsock: " Arseniy Krasnov
@ 2022-06-09  8:38     ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-09  8:38 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On Fri, Jun 03, 2022 at 05:33:04AM +0000, Arseniy Krasnov wrote:
>For packets received from virtio RX queue, use buddy
>allocator instead of 'kmalloc()' to be able to insert
>such pages to user provided vma. Single call to
>'copy_from_iter()' replaced with per-page loop.
>
>Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>---
> drivers/vhost/vsock.c | 81 ++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 69 insertions(+), 12 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index e6c9d41db1de..0dc2229f18f7 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -58,6 +58,7 @@ struct vhost_vsock {
>
> 	u32 guest_cid;
> 	bool seqpacket_allow;
>+	bool zerocopy_rx_on;

This is per-device, so a single socket can change the behaviour of all 
the sockets of this device.

Can we do something better?

Maybe we can allocate the header, copy it, find the socket and check if 
zero-copy is enabled or not for that socket.

Of course we should change or extend virtio_transport_recv_pkt() to
avoid finding the socket again.
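
Roughly what I mean (untested sketch; 'rx_zerocopy' here is a
hypothetical per-socket flag stored in vsock_sock):

	struct sockaddr_vm src, dst;
	struct sock *sk;
	bool zerocopy = false;

	/* the header is already copied before the payload */
	vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
			le32_to_cpu(pkt->hdr.src_port));
	vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
			le32_to_cpu(pkt->hdr.dst_port));

	sk = vsock_find_connected_socket(&src, &dst);
	if (sk) {
		zerocopy = vsock_sk(sk)->rx_zerocopy;
		sock_put(sk);
	}

	/* ... then pick kmalloc() vs alloc_pages() based on 'zerocopy',
	 * and pass 'sk' down so that virtio_transport_recv_pkt() does
	 * not have to look the socket up again.
	 */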


> };
>
> static u32 vhost_transport_get_local_cid(void)
>@@ -357,6 +358,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> 		      unsigned int out, unsigned int in)
> {
> 	struct virtio_vsock_pkt *pkt;
>+	struct vhost_vsock *vsock;
> 	struct iov_iter iov_iter;
> 	size_t nbytes;
> 	size_t len;
>@@ -393,20 +395,75 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> 		return NULL;
> 	}
>
>-	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
>-	if (!pkt->buf) {
>-		kfree(pkt);
>-		return NULL;
>-	}
>-
> 	pkt->buf_len = pkt->len;
>+	vsock = container_of(vq->dev, struct vhost_vsock, dev);
>
>-	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>-	if (nbytes != pkt->len) {
>-		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>-		       pkt->len, nbytes);
>-		virtio_transport_free_pkt(pkt);
>-		return NULL;
>+	if (!vsock->zerocopy_rx_on) {
>+		pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
>+
>+		if (!pkt->buf) {
>+			kfree(pkt);
>+			return NULL;
>+		}
>+
>+		pkt->slab_buf = true;
>+		nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>+		if (nbytes != pkt->len) {
>+			vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>+				pkt->len, nbytes);
>+			virtio_transport_free_pkt(pkt);
>+			return NULL;
>+		}
>+	} else {
>+		struct page *buf_page;
>+		ssize_t pkt_len;
>+		int page_idx;
>+
>+		/* This creates memory overrun, as we allocate
>+		 * at least one page for each packet.
>+		 */
>+		buf_page = alloc_pages(GFP_KERNEL, get_order(pkt->len));
>+
>+		if (buf_page == NULL) {
>+			kfree(pkt);
>+			return NULL;
>+		}
>+
>+		pkt->buf = page_to_virt(buf_page);
>+
>+		page_idx = 0;
>+		pkt_len = pkt->len;
>+
>+		/* As allocated pages are not mapped, process
>+		 * pages one by one.
>+		 */
>+		while (pkt_len > 0) {
>+			void *mapped;
>+			size_t to_copy;
>+
>+			mapped = kmap(buf_page + page_idx);
>+
>+			if (mapped == NULL) {
>+				virtio_transport_free_pkt(pkt);
>+				return NULL;
>+			}
>+
>+			to_copy = min(pkt_len, ((ssize_t)PAGE_SIZE));
>+
>+			nbytes = copy_from_iter(mapped, to_copy, &iov_iter);
>+			if (nbytes != to_copy) {
>+				vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
>+				       to_copy, nbytes);
>+				kunmap(mapped);
>+				virtio_transport_free_pkt(pkt);
>+				return NULL;
>+			}
>+
>+			kunmap(mapped);
>+
>+			pkt_len -= to_copy;
>+			page_idx++;
>+		}
> 	}
>
> 	return pkt;
>-- 
>2.25.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic
  2022-06-03  5:35 ` [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
@ 2022-06-09  8:39     ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-09  8:39 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Krasnov Arseniy, kvm, Michael S. Tsirkin, netdev, linux-kernel,
	virtualization, Stefan Hajnoczi, kernel, Jakub Kicinski,
	Paolo Abeni, David S. Miller

On Fri, Jun 03, 2022 at 05:35:48AM +0000, Arseniy Krasnov wrote:
>This:
>1) Adds callback for 'mmap()' call on socket. It checks vm
>   area flags and sets vm area ops.
>2) Adds special 'getsockopt()' case which calls transport
>   zerocopy callback. Input argument is vm area address.
>3) Adds 'getsockopt()/setsockopt()' for switching on/off rx
>   zerocopy mode.
>
>Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>---
> include/net/af_vsock.h          |   7 +++
> include/uapi/linux/vm_sockets.h |   3 +
> net/vmw_vsock/af_vsock.c        | 100 ++++++++++++++++++++++++++++++++
> 3 files changed, 110 insertions(+)
>
>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>index f742e50207fb..f15f84c648ff 100644
>--- a/include/net/af_vsock.h
>+++ b/include/net/af_vsock.h
>@@ -135,6 +135,13 @@ struct vsock_transport {
> 	bool (*stream_is_active)(struct vsock_sock *);
> 	bool (*stream_allow)(u32 cid, u32 port);
>
>+	int (*rx_zerocopy_set)(struct vsock_sock *vsk,
>+			       bool enable);
>+	int (*rx_zerocopy_get)(struct vsock_sock *vsk);
>+	int (*zerocopy_dequeue)(struct vsock_sock *vsk,
>+				struct vm_area_struct *vma,
>+				unsigned long addr);
>+
> 	/* SEQ_PACKET. */
> 	ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
> 				     int flags);
>diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
>index c60ca33eac59..d1f792bed1a7 100644
>--- a/include/uapi/linux/vm_sockets.h
>+++ b/include/uapi/linux/vm_sockets.h
>@@ -83,6 +83,9 @@
>
> #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
>
>+#define SO_VM_SOCKETS_MAP_RX 9
>+#define SO_VM_SOCKETS_ZEROCOPY 10
>+
> #if !defined(__KERNEL__)
> #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
> #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index f04abf662ec6..10061ef21730 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1644,6 +1644,17 @@ static int vsock_connectible_setsockopt(struct socket *sock,
> 		}
> 		break;
> 	}
>+	case SO_VM_SOCKETS_ZEROCOPY: {
>+		if (!transport || !transport->rx_zerocopy_set) {
>+			err = -EOPNOTSUPP;
>+		} else {
>+			COPY_IN(val);
>+
>+			if (transport->rx_zerocopy_set(vsk, val))
>+				err = -EINVAL;
>+		}
>+		break;
>+	}
>
> 	default:
> 		err = -ENOPROTOOPT;
>@@ -1657,6 +1668,48 @@ static int vsock_connectible_setsockopt(struct socket *sock,
> 	return err;
> }
>
>+static const struct vm_operations_struct afvsock_vm_ops = {
>+};
>+
>+static int vsock_recv_zerocopy(struct socket *sock,
>+			       unsigned long address)
>+{
>+	struct sock *sk = sock->sk;
>+	struct vsock_sock *vsk = vsock_sk(sk);
>+	struct vm_area_struct *vma;
>+	const struct vsock_transport *transport;
>+	int res;
>+
>+	transport = vsk->transport;
>+
>+	if (!transport->rx_zerocopy_get)
>+		return -EOPNOTSUPP;
>+
>+	if (!transport->rx_zerocopy_get(vsk))
>+		return -EOPNOTSUPP;

Maybe we can merge them into:
         if (!transport->rx_zerocopy_get ||
             !transport->rx_zerocopy_get(vsk))
                 return -EOPNOTSUPP;

>+
>+	if (!transport->zerocopy_dequeue)
>+		return -EOPNOTSUPP;
>+
>+	lock_sock(sk);
>+	mmap_write_lock(current->mm);

So, multiple threads using different sockets are serialized if they use 
zero-copy?

IIUC this is necessary because the callback calls vm_insert_pages().

At this point I think it's better not to do this in every transport, but
to have the callback return an array of pages to map, and map them here,
trying to keep the critical section protected by mmap_write_lock() as
small as possible.
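
For example (just a sketch of the idea, the exact signature is to be
discussed):

	int (*zerocopy_dequeue)(struct vsock_sock *vsk,
				struct page **pages,
				unsigned long *nr_pages);

The transport would only fill 'pages' under its own rx_lock, and
af_vsock would do the vma lookup and the vm_insert_pages() call, holding
the mm lock as briefly as possible.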

>+
>+	vma = vma_lookup(current->mm, address);
>+
>+	if (!vma || vma->vm_ops != &afvsock_vm_ops) {
>+		mmap_write_unlock(current->mm);
>+		release_sock(sk);
>+		return -EINVAL;
>+	}
>+
>+	res = transport->zerocopy_dequeue(vsk, vma, address);
>+
>+	mmap_write_unlock(current->mm);
>+	release_sock(sk);
>+
>+	return res;
>+}
>+
> static int vsock_connectible_getsockopt(struct socket *sock,
> 					int level, int optname,
> 					char __user *optval,
>@@ -1701,6 +1754,39 @@ static int vsock_connectible_getsockopt(struct socket *sock,
> 		lv = sock_get_timeout(vsk->connect_timeout, &v,
> 				      optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
> 		break;
>+	case SO_VM_SOCKETS_ZEROCOPY: {
>+		const struct vsock_transport *transport;
>+		int res;
>+
>+		transport = vsk->transport;
>+
>+		if (!transport->rx_zerocopy_get)
>+			return -EOPNOTSUPP;
>+
>+		lock_sock(sk);

I think we should call lock_sock() before reading the transport to avoid 
races and we should check if it is assigned.

At that point I think it is better to store this info in vsock_sock and
not in the transport.

And maybe we should allow changing it only if the socket state is
SS_UNCONNECTED, inheriting the setting from the parent for sockets that
have one.
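
I.e. something like (sketch):

	lock_sock(sk);

	transport = vsk->transport;
	if (!transport || !transport->rx_zerocopy_get) {
		release_sock(sk);
		return -EOPNOTSUPP;
	}

	res = transport->rx_zerocopy_get(vsk);
	release_sock(sk);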

>+
>+		res = transport->rx_zerocopy_get(vsk);
>+
>+		release_sock(sk);
>+
>+		if (res < 0)
>+			return -EINVAL;
>+
>+		v.val64 = res;
>+
>+		break;
>+	}
>+	case SO_VM_SOCKETS_MAP_RX: {
>+		unsigned long vma_addr;
>+
>+		if (len < sizeof(vma_addr))
>+			return -EINVAL;
>+
>+		if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
>+			return -EFAULT;
>+
>+		return vsock_recv_zerocopy(sock, vma_addr);
>+	}
>
> 	default:
> 		return -ENOPROTOOPT;
>@@ -2129,6 +2215,19 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> 	return err;
> }
>
>+static int afvsock_mmap(struct file *file, struct socket *sock,
>+			struct vm_area_struct *vma)
>+{
>+	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
>+		return -EPERM;
>+
>+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
>+	vma->vm_flags |= (VM_MIXEDMAP);
>+	vma->vm_ops = &afvsock_vm_ops;
>+
>+	return 0;
>+}
>+
> static const struct proto_ops vsock_stream_ops = {
> 	.family = PF_VSOCK,
> 	.owner = THIS_MODULE,
>@@ -2148,6 +2247,7 @@ static const struct proto_ops vsock_stream_ops = {
> 	.recvmsg = vsock_connectible_recvmsg,
> 	.mmap = sock_no_mmap,
> 	.sendpage = sock_no_sendpage,
>+	.mmap = afvsock_mmap,
> };
>
> static const struct proto_ops vsock_seqpacket_ops = {
>-- 
>2.25.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback
  2022-06-03  5:37 ` [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
@ 2022-06-09  8:41     ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-09  8:41 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Krasnov Arseniy, kvm, Michael S. Tsirkin, netdev, linux-kernel,
	virtualization, Stefan Hajnoczi, kernel, Jakub Kicinski,
	Paolo Abeni, David S. Miller

On Fri, Jun 03, 2022 at 05:37:48AM +0000, Arseniy Krasnov wrote:
>This adds transport callback which processes rx
>queue of socket and instead of copying data to
>user provided buffer, it inserts data pages of
>each packet to user's vm area.
>
>Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>---
> include/linux/virtio_vsock.h            |   4 +
> include/uapi/linux/virtio_vsock.h       |   6 +
> net/vmw_vsock/virtio_transport_common.c | 208 +++++++++++++++++++++++-
> 3 files changed, 215 insertions(+), 3 deletions(-)
>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index d02cb7aa922f..47a68a2ea838 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -51,6 +51,7 @@ struct virtio_vsock_pkt {
> 	bool reply;
> 	bool tap_delivered;
> 	bool slab_buf;
>+	bool split;
> };
>
> struct virtio_vsock_pkt_info {
>@@ -131,6 +132,9 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> 				struct sockaddr_vm *addr);
> bool virtio_transport_dgram_allow(u32 cid, u32 port);
>
>+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
>+				      struct vm_area_struct *vma,
>+				      unsigned long addr);
> int virtio_transport_connect(struct vsock_sock *vsk);
>
> int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
>diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>index 64738838bee5..6775c6c44b5b 100644
>--- a/include/uapi/linux/virtio_vsock.h
>+++ b/include/uapi/linux/virtio_vsock.h
>@@ -66,6 +66,12 @@ struct virtio_vsock_hdr {
> 	__le32	fwd_cnt;
> } __attribute__((packed));
>
>+struct virtio_vsock_usr_hdr {
>+	u32 flags;
>+	u32 len;
>+	u32 copy_len;
>+} __attribute__((packed));
>+
> enum virtio_vsock_type {
> 	VIRTIO_VSOCK_TYPE_STREAM = 1,
> 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 278567f748f2..3a3e84176c75 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -12,6 +12,7 @@
> #include <linux/ctype.h>
> #include <linux/list.h>
> #include <linux/virtio_vsock.h>
>+#include <linux/mm.h>
> #include <uapi/linux/vsockmon.h>
>
> #include <net/sock.h>
>@@ -347,6 +348,196 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> 	return err;
> }
>
>+#define MAX_PAGES_TO_MAP 256
>+
>+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
>+				      struct vm_area_struct *vma,
>+				      unsigned long addr)
>+{
>+	struct virtio_vsock_sock *vvs = vsk->trans;
>+	struct virtio_vsock_usr_hdr *usr_hdr_buffer;
>+	unsigned long max_pages_to_insert;
>+	unsigned long tmp_pages_inserted;
>+	unsigned long pages_to_insert;
>+	struct page *usr_hdr_page;
>+	unsigned long vma_size;
>+	struct page **pages;
>+	int max_vma_pages;
>+	int max_usr_hdrs;
>+	int res;
>+	int err;
>+	int i;
>+
>+	/* Only use VMA from first page. */
>+	if (vma->vm_start != addr)
>+		return -EFAULT;
>+
>+	vma_size = vma->vm_end - vma->vm_start;
>+
>+	/* Too small vma(at least one page for headers
>+	 * and one page for data).
>+	 */
>+	if (vma_size < 2 * PAGE_SIZE)
>+		return -EFAULT;
>+
>+	/* Page for meta data. */
>+	usr_hdr_page = alloc_page(GFP_KERNEL);

I think all these checks should be done in af_vsock.c.

It would be nice to avoid having every transport reimplement the same
thing, and especially to ensure that all transports behave the same way.

If you can, it would be nice to have the transports return an array of
pages to map, and have af_vsock handle the mapping and the usr_hdr_page.

Do you think it's doable?
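
E.g. roughly (sketch only; the callback shape is hypothetical and
MAX_PAGES_TO_MAP would have to move to the core):

	/* in vsock_recv_zerocopy(), net/vmw_vsock/af_vsock.c */
	pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	nr_pages = MAX_PAGES_TO_MAP;

	/* the transport fills 'pages' (headers page first) under rx_lock */
	err = transport->zerocopy_dequeue(vsk, pages, &nr_pages);
	if (!err)
		err = vm_insert_pages(vma, addr, pages, &nr_pages);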

>+
>+	if (!usr_hdr_page)
>+		return -EFAULT;
>+
>+	pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(pages[0]), GFP_KERNEL);
>+
>+	if (!pages)
>+		return -EFAULT;
>+
>+	pages[pages_to_insert++] = usr_hdr_page;
>+
>+	usr_hdr_buffer = page_to_virt(usr_hdr_page);
>+
>+	err = 0;
>+
>+	/* As we use first page for headers, so total number of
>+	 * pages for user is min between number of headers in
>+	 * first page and size of vma(in pages, except first page).
>+	 */
>+	max_usr_hdrs = PAGE_SIZE / sizeof(*usr_hdr_buffer);
>+	max_vma_pages = (vma_size / PAGE_SIZE) - 1;
>+	max_pages_to_insert = min(max_usr_hdrs, max_vma_pages);
>+
>+	if (max_pages_to_insert > MAX_PAGES_TO_MAP)
>+		max_pages_to_insert = MAX_PAGES_TO_MAP;
>+
>+	spin_lock_bh(&vvs->rx_lock);
>+
>+	while (!list_empty(&vvs->rx_queue) &&
>+	       pages_to_insert < max_pages_to_insert) {
>+		struct virtio_vsock_pkt *pkt;
>+		ssize_t rest_data_bytes;
>+		size_t moved_data_bytes;
>+		unsigned long pg_offs;
>+
>+		pkt = list_first_entry(&vvs->rx_queue,
>+				       struct virtio_vsock_pkt, list);
>+
>+		/* Buffer was allocated by 'kmalloc()'. This could
>+		 * happen, when zerocopy was enabled, but we still
>+		 * have pending packet which was created before it.
>+		 */
>+		if (pkt->slab_buf) {
>+			usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
>+			usr_hdr_buffer->len = 0;
>+			usr_hdr_buffer->copy_len = le32_to_cpu(pkt->hdr.len);
>+			/* Report user to read it using copy. */

Is it a "to do"?

>+			break;
>+		}
>+
>+		/* This could happen, when packet was dequeued before
>+		 * by an ordinary 'read()' call. We can't handle such
>+		 * packet. Drop it.

We can't drop packets.
I think we should only allow enabling/disabling this new feature before
the connection is established.
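
E.g. (sketch, with 'rx_zerocopy' stored in vsock_sock as suggested for
patch 3/8):

	case SO_VM_SOCKETS_ZEROCOPY:
		/* reject the switch once connected */
		if (sock->state != SS_UNCONNECTED) {
			err = -EBUSY;
			break;
		}
		COPY_IN(val);
		vsk->rx_zerocopy = val;
		break;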

>+		 */
>+		if (pkt->off % PAGE_SIZE) {
>+			list_del(&pkt->list);
>+			virtio_transport_dec_rx_pkt(vvs, pkt);
>+			virtio_transport_free_pkt(pkt);
>+			continue;
>+		}
>+
>+		rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
>+
>+		/* For packets, bigger than one page, split it's
>+		 * high order allocated buffer to 0 order pages.
>+		 * Otherwise 'vm_insert_pages()' will fail, for
>+		 * all pages except first.
>+		 */
>+		if (rest_data_bytes > PAGE_SIZE) {
>+			/* High order buffer not split yet. */
>+			if (!pkt->split) {
>+				split_page(virt_to_page(pkt->buf),
>+					   get_order(le32_to_cpu(pkt->hdr.len)));
>+				pkt->split = true;
>+			}
>+		}
>+
>+		pg_offs = pkt->off;
>+		moved_data_bytes = 0;
>+
>+		while (rest_data_bytes &&
>+		       pages_to_insert < max_pages_to_insert) {
>+			struct page *buf_page;
>+
>+			buf_page = virt_to_page(pkt->buf + pg_offs);
>+
>+			pages[pages_to_insert++] = buf_page;
>+			/* Get reference to prevent this page being
>+			 * returned to page allocator when packet will
>+			 * be freed. Ref count will be 2.
>+			 */
>+			get_page(buf_page);
>+			pg_offs += PAGE_SIZE;
>+
>+			if (rest_data_bytes >= PAGE_SIZE) {
>+				moved_data_bytes += PAGE_SIZE;
>+				rest_data_bytes -= PAGE_SIZE;
>+			} else {
>+				moved_data_bytes += rest_data_bytes;
>+				rest_data_bytes = 0;
>+			}
>+		}
>+
>+		usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
>+		usr_hdr_buffer->len = moved_data_bytes;
>+		usr_hdr_buffer->copy_len = 0;
>+		usr_hdr_buffer++;
>+
>+		pkt->off = pg_offs;
>+
>+		if (rest_data_bytes == 0) {
>+			list_del(&pkt->list);
>+			virtio_transport_dec_rx_pkt(vvs, pkt);
>+			virtio_transport_free_pkt(pkt);
>+		}
>+
>+		/* Now ref count for all pages of packet is 1. */
>+	}
>+
>+	/* Set last buffer empty(if we have one). */
>+	if (pages_to_insert - 1 < max_usr_hdrs)
>+		usr_hdr_buffer->len = 0;
>+
>+	spin_unlock_bh(&vvs->rx_lock);
>+
>+	tmp_pages_inserted = pages_to_insert;
>+
>+	res = vm_insert_pages(vma, addr, pages, &tmp_pages_inserted);
>+
>+	if (res || tmp_pages_inserted) {
>+		/* Failed to insert some pages, we have "partially"
>+		 * mapped vma. Do not return, set error code. This
>+		 * code will be returned to user. User needs to call
>+		 * 'madvise()/mmap()' to clear this vma. Anyway,
>+		 * references to all pages will to be dropped below.
>+		 */
>+		err = -EFAULT;
>+	}
>+
>+	/* Put reference for every page. */
>+	for (i = 0; i < pages_to_insert; i++) {
>+		/* Ref count is 2 ('get_page()' + 'vm_insert_pages()' above).
>+		 * Put reference once, page will be returned to allocator
>+		 * after user's 'madvice()/munmap()' call(or it wasn't mapped
>+		 * if 'vm_insert_pages()' failed).
>+		 */
>+		put_page(pages[i]);
>+	}
>+
>+	virtio_transport_send_credit_update(vsk);
>+	kfree(pages);
>+
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
>+
> static ssize_t
> virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> 				   struct msghdr *msg,
>@@ -1344,10 +1535,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
> 	if (pkt->buf_len) {
>-		if (pkt->slab_buf)
>+		if (pkt->slab_buf) {
> 			kfree(pkt->buf);
>-		else
>-			free_pages(buf, get_order(pkt->buf_len));
>+		} else {
>+			unsigned int order = get_order(pkt->buf_len);
>+			unsigned long buf = (unsigned long)pkt->buf;
>+
>+			if (pkt->split) {
>+				int i;
>+
>+				for (i = 0; i < (1 << order); i++)
>+					free_page(buf + i * PAGE_SIZE);
>+			} else {
>+				free_pages(buf, order);
>+			}
>+		}
> 	}
>
> 	kfree(pkt);
>-- 
>2.25.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback
@ 2022-06-09  8:41     ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-09  8:41 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On Fri, Jun 03, 2022 at 05:37:48AM +0000, Arseniy Krasnov wrote:
>This adds transport callback which processes rx
>queue of socket and instead of copying data to
>user provided buffer, it inserts data pages of
>each packet to user's vm area.
>
>Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>---
> include/linux/virtio_vsock.h            |   4 +
> include/uapi/linux/virtio_vsock.h       |   6 +
> net/vmw_vsock/virtio_transport_common.c | 208 +++++++++++++++++++++++-
> 3 files changed, 215 insertions(+), 3 deletions(-)
>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index d02cb7aa922f..47a68a2ea838 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -51,6 +51,7 @@ struct virtio_vsock_pkt {
> 	bool reply;
> 	bool tap_delivered;
> 	bool slab_buf;
>+	bool split;
> };
>
> struct virtio_vsock_pkt_info {
>@@ -131,6 +132,9 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> 				struct sockaddr_vm *addr);
> bool virtio_transport_dgram_allow(u32 cid, u32 port);
>
>+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
>+				      struct vm_area_struct *vma,
>+				      unsigned long addr);
> int virtio_transport_connect(struct vsock_sock *vsk);
>
> int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
>diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>index 64738838bee5..6775c6c44b5b 100644
>--- a/include/uapi/linux/virtio_vsock.h
>+++ b/include/uapi/linux/virtio_vsock.h
>@@ -66,6 +66,12 @@ struct virtio_vsock_hdr {
> 	__le32	fwd_cnt;
> } __attribute__((packed));
>
>+struct virtio_vsock_usr_hdr {
>+	u32 flags;
>+	u32 len;
>+	u32 copy_len;
>+} __attribute__((packed));
>+
> enum virtio_vsock_type {
> 	VIRTIO_VSOCK_TYPE_STREAM = 1,
> 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 278567f748f2..3a3e84176c75 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -12,6 +12,7 @@
> #include <linux/ctype.h>
> #include <linux/list.h>
> #include <linux/virtio_vsock.h>
>+#include <linux/mm.h>
> #include <uapi/linux/vsockmon.h>
>
> #include <net/sock.h>
>@@ -347,6 +348,196 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> 	return err;
> }
>
>+#define MAX_PAGES_TO_MAP 256
>+
>+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
>+				      struct vm_area_struct *vma,
>+				      unsigned long addr)
>+{
>+	struct virtio_vsock_sock *vvs = vsk->trans;
>+	struct virtio_vsock_usr_hdr *usr_hdr_buffer;
>+	unsigned long max_pages_to_insert;
>+	unsigned long tmp_pages_inserted;
>+	unsigned long pages_to_insert;
>+	struct page *usr_hdr_page;
>+	unsigned long vma_size;
>+	struct page **pages;
>+	int max_vma_pages;
>+	int max_usr_hdrs;
>+	int res;
>+	int err;
>+	int i;
>+
>+	/* Only use VMA from first page. */
>+	if (vma->vm_start != addr)
>+		return -EFAULT;
>+
>+	vma_size = vma->vm_end - vma->vm_start;
>+
>+	/* Too small vma(at least one page for headers
>+	 * and one page for data).
>+	 */
>+	if (vma_size < 2 * PAGE_SIZE)
>+		return -EFAULT;
>+
>+	/* Page for meta data. */
>+	usr_hdr_page = alloc_page(GFP_KERNEL);

I think all these checks should be done in af_vsock.c.

It would be nice to avoid that every transport reimplements the same 
thing and especially that all transports have the same behavior.

If you can would be nice to have the transports to return an array of 
pages to map, and af_vsock will handle it and the usr_hdr_page.

Do you think it's doable?

>+
>+	if (!usr_hdr_page)
>+		return -EFAULT;
>+
>+	pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(pages[0]), GFP_KERNEL);
>+
>+	if (!pages)
>+		return -EFAULT;
>+
>+	pages[pages_to_insert++] = usr_hdr_page;
>+
>+	usr_hdr_buffer = page_to_virt(usr_hdr_page);
>+
>+	err = 0;
>+
>+	/* As we use first page for headers, so total number of
>+	 * pages for user is min between number of headers in
>+	 * first page and size of vma(in pages, except first page).
>+	 */
>+	max_usr_hdrs = PAGE_SIZE / sizeof(*usr_hdr_buffer);
>+	max_vma_pages = (vma_size / PAGE_SIZE) - 1;
>+	max_pages_to_insert = min(max_usr_hdrs, max_vma_pages);
>+
>+	if (max_pages_to_insert > MAX_PAGES_TO_MAP)
>+		max_pages_to_insert = MAX_PAGES_TO_MAP;
>+
>+	spin_lock_bh(&vvs->rx_lock);
>+
>+	while (!list_empty(&vvs->rx_queue) &&
>+	       pages_to_insert < max_pages_to_insert) {
>+		struct virtio_vsock_pkt *pkt;
>+		ssize_t rest_data_bytes;
>+		size_t moved_data_bytes;
>+		unsigned long pg_offs;
>+
>+		pkt = list_first_entry(&vvs->rx_queue,
>+				       struct virtio_vsock_pkt, list);
>+
>+		/* Buffer was allocated by 'kmalloc()'. This could
>+		 * happen, when zerocopy was enabled, but we still
>+		 * have pending packet which was created before it.
>+		 */
>+		if (pkt->slab_buf) {
>+			usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
>+			usr_hdr_buffer->len = 0;
>+			usr_hdr_buffer->copy_len = le32_to_cpu(pkt->hdr.len);
>+			/* Report user to read it using copy. */

Is it a "to do"?

>+			break;
>+		}
>+
>+		/* This could happen, when packet was dequeued before
>+		 * by an ordinary 'read()' call. We can't handle such
>+		 * packet. Drop it.

We can't drop packets.
I think we should allow to enable/disable this new feature before the 
connection.

>+		 */
>+		if (pkt->off % PAGE_SIZE) {
>+			list_del(&pkt->list);
>+			virtio_transport_dec_rx_pkt(vvs, pkt);
>+			virtio_transport_free_pkt(pkt);
>+			continue;
>+		}
>+
>+		rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
>+
>+		/* For packets, bigger than one page, split it's
>+		 * high order allocated buffer to 0 order pages.
>+		 * Otherwise 'vm_insert_pages()' will fail, for
>+		 * all pages except first.
>+		 */
>+		if (rest_data_bytes > PAGE_SIZE) {
>+			/* High order buffer not split yet. */
>+			if (!pkt->split) {
>+				split_page(virt_to_page(pkt->buf),
>+					   get_order(le32_to_cpu(pkt->hdr.len)));
>+				pkt->split = true;
>+			}
>+		}
>+
>+		pg_offs = pkt->off;
>+		moved_data_bytes = 0;
>+
>+		while (rest_data_bytes &&
>+		       pages_to_insert < max_pages_to_insert) {
>+			struct page *buf_page;
>+
>+			buf_page = virt_to_page(pkt->buf + pg_offs);
>+
>+			pages[pages_to_insert++] = buf_page;
>+			/* Get reference to prevent this page being
>+			 * returned to page allocator when packet will
>+			 * be freed. Ref count will be 2.
>+			 */
>+			get_page(buf_page);
>+			pg_offs += PAGE_SIZE;
>+
>+			if (rest_data_bytes >= PAGE_SIZE) {
>+				moved_data_bytes += PAGE_SIZE;
>+				rest_data_bytes -= PAGE_SIZE;
>+			} else {
>+				moved_data_bytes += rest_data_bytes;
>+				rest_data_bytes = 0;
>+			}
>+		}
>+
>+		usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
>+		usr_hdr_buffer->len = moved_data_bytes;
>+		usr_hdr_buffer->copy_len = 0;
>+		usr_hdr_buffer++;
>+
>+		pkt->off = pg_offs;
>+
>+		if (rest_data_bytes == 0) {
>+			list_del(&pkt->list);
>+			virtio_transport_dec_rx_pkt(vvs, pkt);
>+			virtio_transport_free_pkt(pkt);
>+		}
>+
>+		/* Now ref count for all pages of packet is 1. */
>+	}
>+
>+	/* Set last buffer empty(if we have one). */
>+	if (pages_to_insert - 1 < max_usr_hdrs)
>+		usr_hdr_buffer->len = 0;
>+
>+	spin_unlock_bh(&vvs->rx_lock);
>+
>+	tmp_pages_inserted = pages_to_insert;
>+
>+	res = vm_insert_pages(vma, addr, pages, &tmp_pages_inserted);
>+
>+	if (res || tmp_pages_inserted) {
>+		/* Failed to insert some pages, we have "partially"
>+		 * mapped vma. Do not return, set error code. This
>+		 * code will be returned to user. User needs to call
>+		 * 'madvise()/mmap()' to clear this vma. Anyway,
>+		 * references to all pages will to be dropped below.
>+		 */
>+		err = -EFAULT;
>+	}
>+
>+	/* Put reference for every page. */
>+	for (i = 0; i < pages_to_insert; i++) {
>+		/* Ref count is 2 ('get_page()' + 'vm_insert_pages()' above).
>+		 * Put reference once, page will be returned to allocator
>+		 * after user's 'madvice()/munmap()' call(or it wasn't mapped
>+		 * if 'vm_insert_pages()' failed).
>+		 */
>+		put_page(pages[i]);
>+	}
>+
>+	virtio_transport_send_credit_update(vsk);
>+	kfree(pages);
>+
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
>+
> static ssize_t
> virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> 				   struct msghdr *msg,
>@@ -1344,10 +1535,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
> 	if (pkt->buf_len) {
>-		if (pkt->slab_buf)
>+		if (pkt->slab_buf) {
> 			kfree(pkt->buf);
>-		else
>-			free_pages(buf, get_order(pkt->buf_len));
>+		} else {
>+			unsigned int order = get_order(pkt->buf_len);
>+			unsigned long buf = (unsigned long)pkt->buf;
>+
>+			if (pkt->split) {
>+				int i;
>+
>+				for (i = 0; i < (1 << order); i++)
>+					free_page(buf + i * PAGE_SIZE);
>+			} else {
>+				free_pages(buf, order);
>+			}
>+		}
> 	}
>
> 	kfree(pkt);
>-- 
>2.25.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
  2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
@ 2022-06-09  8:54   ` Stefano Garzarella
  2022-06-03  5:33 ` [RFC PATCH v2 2/8] vhost/vsock: " Arseniy Krasnov
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-09  8:54 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Krasnov Arseniy, kvm, Michael S. Tsirkin, netdev, linux-kernel,
	virtualization, Stefan Hajnoczi, kernel, Jakub Kicinski,
	Paolo Abeni, David S. Miller

Hi Arseniy,
I left some comments in the patches, and I'm adding something also here:

On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>                              INTRODUCTION
>
>	Hello, this is experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>fill provided vma area with pages of virtio RX buffers. After received data was
>processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>flag set(if user won't call 'madvise()', next 'getsockopt()' will 
>fail).

If it is not too time-consuming, can we have a table/list to compare this 
and the TCP zerocopy?

>
>                                 DETAILS
>
>	Here is how mapping with mapped pages looks exactly: first page mapping
>contains array of trimmed virtio vsock packet headers (in contains only length
>of data on the corresponding page and 'flags' field):
>
>	struct virtio_vsock_usr_hdr {
>		uint32_t length;
>		uint32_t flags;
>		uint32_t copy_len;
>	};
>
>Field  'length' allows user to know exact size of payload within each sequence
>of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>All other pages are data pages from RX queue.
>
>             Page 0      Page 1      Page N
>
>	[ hdr1 .. hdrN ][ data ] .. [ data ]
>           |        |       ^           ^
>           |        |       |           |
>           |        *-------------------*
>           |                |
>           |                |
>           *----------------*
>
>	Of course, single header could represent array of pages (when packet's
>buffer is bigger than one page).So here is example of detailed mapping layout
>for some set of packages. Lets consider that we have the following sequence  of
>packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>be inserted to user's vma(vma is large enough).

In order to have a "userspace polling-friendly approach" and reduce the
number of syscalls, can we allow, for example, the userspace to mmap at
least the first header page before packets arrive?
Then the userspace can poll a flag or other fields in the header to
understand that there are new packets.

That would be cool, but in the meantime it would be nice to behave 
similarly to TCP, which is why the comparison table I mentioned earlier 
would be useful.
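
For example (userspace sketch: premapping the header page before data
arrives is exactly the proposal here, not something this series already
supports; 'map_base' is the address returned by 'mmap()' on the socket
and the fields come from 'struct virtio_vsock_usr_hdr'):

	volatile struct virtio_vsock_usr_hdr *hdr = map_base;

	/* New data is available once the kernel fills the first header. */
	while (hdr->len == 0 && hdr->copy_len == 0)
		;	/* bounded spin here, then fall back to a syscall */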

>
>	Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>	Page 1: [ 56 ]
>	Page 2: [ 4096 ]
>	Page 3: [ 4096 ]
>	Page 4: [ 4096 ]
>	Page 5: [ 8 ]
>
>	Page 0 contains only array of headers:
>	'hdr0' has 56 in length field.
>	'hdr1' has 4096 in length field.
>	'hdr2' has 8200 in length field.
>	'hdr3' has 0 in length field(this is end of data marker).
>
>	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>	Page 2 corresponds to 'hdr1' and filled with data.
>	Page 3 corresponds to 'hdr2' and filled with data.
>	Page 4 corresponds to 'hdr2' and filled with data.
>	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
>	This patchset also changes packets allocation way: today implementation
>uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>"not large" could be bigger than one page). So to avoid this, data buffers now
>allocated using 'alloc_pages()' call.
>
>                                   TESTS
>
>	This patchset updates 'vsock_test' utility: two tests for new feature
>were added. First test covers invalid cases. Second checks valid transmission
>case.
>
>                                BENCHMARKING
>
>	For benchmakring I've added small utility 'rx_zerocopy'. It works in
>client/server mode. When client connects to server, server starts sending exact
>amount of data to client(amount is set as input argument).Client reads data and
>waits for next portion of it. Client works in two modes: copy and zero-copy. In
>copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>/'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>is better. For server, we can set size of tx buffer and for client we can set
>size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>is quiet simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets number of megabytes to transfer.
>[--rx] sets size of receive buffer/mapping in pages.
>[--tx] sets size of transmit buffer in pages.
>
>I checked for transmission of 4000mb of data. Here are some results:
>
>                           size of rx/tx buffers in pages
>               *---------------------------------------------------*
>               |    8   |    32    |    64   |   256    |   512    |
>*--------------*--------*----------*---------*----------*----------*
>|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>*--------------*---------------------------------------------------- process
>| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>*--------------*----------------------------------------------------
>
>Result in first column(where non-zerocopy works better than zerocopy) happens
>because time, spent in 'read()' system call is smaller that time in 'getsockopt'
>+ 'madvise'. I've checked that.
>
>I think, that results are not so impressive, but at least it is not worse than
>copy mode and there is no need to allocate memory for processing date.
>
>                                 PROBLEMS
>
>	Updated packet's allocation logic creates some problem: when host gets
>data from guest(in vhost-vsock), it allocates at least one page for each packet
>(even if packet has 1 byte payload). I think this could be resolved in several
>ways:
>	1) Make zerocopy rx mode disabled by default, so if user didn't enable
>it, current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)

Yep, but I think we should not allow changing it while we are connected
(see comments in the patches).

>	2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>in this case, we have mix of packets, allocated in two different ways thus
>during zerocopying to user(e.g. mapping pages to vma), such small packets will
>be handled in some stupid way: we need to allocate one page for user, copy data
>to it and then insert page to user's vma.
>
>v1 -> v2:
> 1) Zerocopy receive mode could be enabled/disabled(disabled by default). I
>    didn't use generic SO_ZEROCOPY flag, because in virtio-vsock case this
>    feature depends on transport support. Instead of SO_ZEROCOPY, AF_VSOCK
>    layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while previous meaning of
>    SO_VM_SOCKETS_ZEROCOPY(insert receive buffers to user's vm area) now
>    renamed to SO_VM_SOCKETS_MAP_RX.
> 2) Packet header which is exported to user now get new field: 'copy_len'.
>    This field handles special case:  user reads data from socket in non
>    zerocopy way(with disabled zerocopy) and then enables zerocopy feature.
>    In this case vhost part will switch data buffer allocation logic from
>    'kmalloc()' to direct calls for buddy allocator. But, there could be
>    some pending 'kmalloc()' allocated packets in socket's rx list, and then
>    user tries to read such packets in zerocopy way, dequeue will fail,
>    because SLAB pages could not be inserted to user's vm area. So when such
>    packet is found during zerocopy dequeue, dequeue loop will break and
>    'copy_len' will show size of such "bad" packet. After user detects this
>    case, it must use 'read()/recv()' calls to dequeue such packet.
> 3) Also may be move this features under config option?

Do you mean a build config like CONFIG_VSOCK_ZERO_COPY?

I'm not sure it's needed.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 2/8] vhost/vsock: rework packet allocation logic
  2022-06-09  8:38     ` Stefano Garzarella
  (?)
@ 2022-06-09 12:09     ` Arseniy Krasnov
  -1 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-09 12:09 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On 09.06.2022 11:38, Stefano Garzarella wrote:
> On Fri, Jun 03, 2022 at 05:33:04AM +0000, Arseniy Krasnov wrote:
>> For packets received from virtio RX queue, use buddy
>> allocator instead of 'kmalloc()' to be able to insert
>> such pages to user provided vma. Single call to
>> 'copy_from_iter()' replaced with per-page loop.
>>
>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> ---
>> drivers/vhost/vsock.c | 81 ++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 69 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index e6c9d41db1de..0dc2229f18f7 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -58,6 +58,7 @@ struct vhost_vsock {
>>
>>     u32 guest_cid;
>>     bool seqpacket_allow;
>> +    bool zerocopy_rx_on;
> 
> This is per-device, so a single socket can change the behaviour of all the sockets of this device.

Sure, my mistake

> 
> Can we do something better?
> 
> Maybe we can allocate the header, copy it, find the socket and check if zero-copy is enabled or not for that socket.
> 
> Of course we should change or extend virtio_transport_recv_pkt() to avoid finding the socket again.

Yes, I think so.
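
A rough sketch of that flow (error handling trimmed; the per-socket
'rx_zerocopy' field is an assumption of the sketch, while the lookup
helpers are the existing af_vsock ones):

	struct sockaddr_vm src, dst;
	struct sock *sk;
	bool zerocopy_rx = false;

	/* The header was already copied from the iov at this point,
	 * so the destination socket can be found before the payload
	 * buffer is allocated.
	 */
	vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
			le32_to_cpu(pkt->hdr.src_port));
	vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
			le32_to_cpu(pkt->hdr.dst_port));

	sk = vsock_find_connected_socket(&src, &dst);
	if (sk) {
		zerocopy_rx = vsock_sk(sk)->rx_zerocopy;
		sock_put(sk);
	}

	/* Then choose kmalloc() vs alloc_pages() based on zerocopy_rx. */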

> 
> 
>> };
>>
>> static u32 vhost_transport_get_local_cid(void)
>> @@ -357,6 +358,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
>>               unsigned int out, unsigned int in)
>> {
>>     struct virtio_vsock_pkt *pkt;
>> +    struct vhost_vsock *vsock;
>>     struct iov_iter iov_iter;
>>     size_t nbytes;
>>     size_t len;
>> @@ -393,20 +395,75 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
>>         return NULL;
>>     }
>>
>> -    pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
>> -    if (!pkt->buf) {
>> -        kfree(pkt);
>> -        return NULL;
>> -    }
>> -
>>     pkt->buf_len = pkt->len;
>> +    vsock = container_of(vq->dev, struct vhost_vsock, dev);
>>
>> -    nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>> -    if (nbytes != pkt->len) {
>> -        vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>> -               pkt->len, nbytes);
>> -        virtio_transport_free_pkt(pkt);
>> -        return NULL;
>> +    if (!vsock->zerocopy_rx_on) {
>> +        pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
>> +
>> +        if (!pkt->buf) {
>> +            kfree(pkt);
>> +            return NULL;
>> +        }
>> +
>> +        pkt->slab_buf = true;
>> +        nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>> +        if (nbytes != pkt->len) {
>> +            vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>> +                pkt->len, nbytes);
>> +            virtio_transport_free_pkt(pkt);
>> +            return NULL;
>> +        }
>> +    } else {
>> +        struct page *buf_page;
>> +        ssize_t pkt_len;
>> +        int page_idx;
>> +
>> +        /* This creates memory overhead, as we allocate
>> +         * at least one page for each packet.
>> +         */
>> +        buf_page = alloc_pages(GFP_KERNEL, get_order(pkt->len));
>> +
>> +        if (buf_page == NULL) {
>> +            kfree(pkt);
>> +            return NULL;
>> +        }
>> +
>> +        pkt->buf = page_to_virt(buf_page);
>> +
>> +        page_idx = 0;
>> +        pkt_len = pkt->len;
>> +
>> +        /* As allocated pages are not mapped, process
>> +         * pages one by one.
>> +         */
>> +        while (pkt_len > 0) {
>> +            void *mapped;
>> +            size_t to_copy;
>> +
>> +            mapped = kmap(buf_page + page_idx);
>> +
>> +            if (mapped == NULL) {
>> +                virtio_transport_free_pkt(pkt);
>> +                return NULL;
>> +            }
>> +
>> +            to_copy = min(pkt_len, ((ssize_t)PAGE_SIZE));
>> +
>> +            nbytes = copy_from_iter(mapped, to_copy, &iov_iter);
>> +            if (nbytes != to_copy) {
>> +                vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
>> +                       to_copy, nbytes);
>> +                kunmap(buf_page + page_idx);
>> +                virtio_transport_free_pkt(pkt);
>> +                return NULL;
>> +            }
>> +
>> +            kunmap(buf_page + page_idx);
>> +
>> +            pkt_len -= to_copy;
>> +            page_idx++;
>> +        }
>>     }
>>
>>     return pkt;
>> -- 
>> 2.25.1
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic
  2022-06-09  8:39     ` Stefano Garzarella
  (?)
@ 2022-06-09 12:20     ` Arseniy Krasnov
  2022-06-13  8:50         ` Stefano Garzarella
  -1 siblings, 1 reply; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-09 12:20 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On 09.06.2022 11:39, Stefano Garzarella wrote:
> On Fri, Jun 03, 2022 at 05:35:48AM +0000, Arseniy Krasnov wrote:
>> This:
>> 1) Adds callback for 'mmap()' call on socket. It checks vm
>>   area flags and sets vm area ops.
>> 2) Adds special 'getsockopt()' case which calls transport
>>   zerocopy callback. Input argument is vm area address.
>> 3) Adds 'getsockopt()/setsockopt()' for switching on/off rx
>>   zerocopy mode.
>>
>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> ---
>> include/net/af_vsock.h          |   7 +++
>> include/uapi/linux/vm_sockets.h |   3 +
>> net/vmw_vsock/af_vsock.c        | 100 ++++++++++++++++++++++++++++++++
>> 3 files changed, 110 insertions(+)
>>
>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> index f742e50207fb..f15f84c648ff 100644
>> --- a/include/net/af_vsock.h
>> +++ b/include/net/af_vsock.h
>> @@ -135,6 +135,13 @@ struct vsock_transport {
>>     bool (*stream_is_active)(struct vsock_sock *);
>>     bool (*stream_allow)(u32 cid, u32 port);
>>
>> +    int (*rx_zerocopy_set)(struct vsock_sock *vsk,
>> +                   bool enable);
>> +    int (*rx_zerocopy_get)(struct vsock_sock *vsk);
>> +    int (*zerocopy_dequeue)(struct vsock_sock *vsk,
>> +                struct vm_area_struct *vma,
>> +                unsigned long addr);
>> +
>>     /* SEQ_PACKET. */
>>     ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
>>                      int flags);
>> diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
>> index c60ca33eac59..d1f792bed1a7 100644
>> --- a/include/uapi/linux/vm_sockets.h
>> +++ b/include/uapi/linux/vm_sockets.h
>> @@ -83,6 +83,9 @@
>>
>> #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
>>
>> +#define SO_VM_SOCKETS_MAP_RX 9
>> +#define SO_VM_SOCKETS_ZEROCOPY 10
>> +
>> #if !defined(__KERNEL__)
>> #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
>> #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index f04abf662ec6..10061ef21730 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -1644,6 +1644,17 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>>         }
>>         break;
>>     }
>> +    case SO_VM_SOCKETS_ZEROCOPY: {
>> +        if (!transport || !transport->rx_zerocopy_set) {
>> +            err = -EOPNOTSUPP;
>> +        } else {
>> +            COPY_IN(val);
>> +
>> +            if (transport->rx_zerocopy_set(vsk, val))
>> +                err = -EINVAL;
>> +        }
>> +        break;
>> +    }
>>
>>     default:
>>         err = -ENOPROTOOPT;
>> @@ -1657,6 +1668,48 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>>     return err;
>> }
>>
>> +static const struct vm_operations_struct afvsock_vm_ops = {
>> +};
>> +
>> +static int vsock_recv_zerocopy(struct socket *sock,
>> +                   unsigned long address)
>> +{
>> +    struct sock *sk = sock->sk;
>> +    struct vsock_sock *vsk = vsock_sk(sk);
>> +    struct vm_area_struct *vma;
>> +    const struct vsock_transport *transport;
>> +    int res;
>> +
>> +    transport = vsk->transport;
>> +
>> +    if (!transport->rx_zerocopy_get)
>> +        return -EOPNOTSUPP;
>> +
>> +    if (!transport->rx_zerocopy_get(vsk))
>> +        return -EOPNOTSUPP;
> 
> Maybe we can merge them into:
>         if (!transport->rx_zerocopy_get ||
>             !transport->rx_zerocopy_get(vsk))
>                 return -EOPNOTSUPP;
> 
>> +
>> +    if (!transport->zerocopy_dequeue)
>> +        return -EOPNOTSUPP;
>> +
>> +    lock_sock(sk);
>> +    mmap_write_lock(current->mm);
> 
> So, multiple threads using different sockets are serialized if they use zero-copy?
> 
> IIUC this is necessary because the callback calls vm_insert_page().
> 
> At this point I think it's better not to do this in every transport, but to have the callback return an array of pages to map, and map them here, limiting as much as possible the critical section protected by mmap_write_lock().

Yes, it will be easy for the transport callback to return an array of pages.
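
Something like this, maybe (sketch only; the callback name matches the
series but the signature and the af_vsock side are illustrative):

	/* Transport side: only collect page pointers, no mapping. */
	int (*zerocopy_dequeue)(struct vsock_sock *vsk,
				struct page **pages,
				unsigned long *nr_pages);

	/* af_vsock side: keep the mmap_write_lock() section short. */
	unsigned long nr = 0;
	int err;

	err = transport->zerocopy_dequeue(vsk, pages, &nr);
	if (!err && nr) {
		mmap_write_lock(current->mm);
		vma = vma_lookup(current->mm, addr);
		err = vma ? vm_insert_pages(vma, addr, pages, &nr)
			  : -EINVAL;
		mmap_write_unlock(current->mm);
	}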

> 
>> +
>> +    vma = vma_lookup(current->mm, address);
>> +
>> +    if (!vma || vma->vm_ops != &afvsock_vm_ops) {
>> +        mmap_write_unlock(current->mm);
>> +        release_sock(sk);
>> +        return -EINVAL;
>> +    }
>> +
>> +    res = transport->zerocopy_dequeue(vsk, vma, address);
>> +
>> +    mmap_write_unlock(current->mm);
>> +    release_sock(sk);
>> +
>> +    return res;
>> +}
>> +
>> static int vsock_connectible_getsockopt(struct socket *sock,
>>                     int level, int optname,
>>                     char __user *optval,
>> @@ -1701,6 +1754,39 @@ static int vsock_connectible_getsockopt(struct socket *sock,
>>         lv = sock_get_timeout(vsk->connect_timeout, &v,
>>                       optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
>>         break;
>> +    case SO_VM_SOCKETS_ZEROCOPY: {
>> +        const struct vsock_transport *transport;
>> +        int res;
>> +
>> +        transport = vsk->transport;
>> +
>> +        if (!transport->rx_zerocopy_get)
>> +            return -EOPNOTSUPP;
>> +
>> +        lock_sock(sk);
> 
> I think we should call lock_sock() before reading the transport to avoid races and we should check if it is assigned.
> 
> At that point I think it is better to store this info in vsock_sock and not in the transport.
You mean to store the flag that zerocopy is enabled in 'vsock_sock' and just read it here, without touching the transport?
> 
> And maybe we should allow changing it only if the socket state is SS_UNCONNECTED, inheriting the setting from the parent for sockets that have it.
Ack
> 
>> +
>> +        res = transport->rx_zerocopy_get(vsk);
>> +
>> +        release_sock(sk);
>> +
>> +        if (res < 0)
>> +            return -EINVAL;
>> +
>> +        v.val64 = res;
>> +
>> +        break;
>> +    }
>> +    case SO_VM_SOCKETS_MAP_RX: {
>> +        unsigned long vma_addr;
>> +
>> +        if (len < sizeof(vma_addr))
>> +            return -EINVAL;
>> +
>> +        if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
>> +            return -EFAULT;
>> +
>> +        return vsock_recv_zerocopy(sock, vma_addr);
>> +    }
>>
>>     default:
>>         return -ENOPROTOOPT;
>> @@ -2129,6 +2215,19 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>>     return err;
>> }
>>
>> +static int afvsock_mmap(struct file *file, struct socket *sock,
>> +            struct vm_area_struct *vma)
>> +{
>> +    if (vma->vm_flags & (VM_WRITE | VM_EXEC))
>> +        return -EPERM;
>> +
>> +    vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
>> +    vma->vm_flags |= (VM_MIXEDMAP);
>> +    vma->vm_ops = &afvsock_vm_ops;
>> +
>> +    return 0;
>> +}
>> +
>> static const struct proto_ops vsock_stream_ops = {
>>     .family = PF_VSOCK,
>>     .owner = THIS_MODULE,
>> @@ -2148,6 +2247,7 @@ static const struct proto_ops vsock_stream_ops = {
>>     .recvmsg = vsock_connectible_recvmsg,
>> -    .mmap = sock_no_mmap,
>>     .sendpage = sock_no_sendpage,
>> +    .mmap = afvsock_mmap,
>> };
>>
>> static const struct proto_ops vsock_seqpacket_ops = {
>> -- 
>> 2.25.1
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback
  2022-06-09  8:41     ` Stefano Garzarella
  (?)
@ 2022-06-09 12:23     ` Arseniy Krasnov
  -1 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-09 12:23 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On 09.06.2022 11:41, Stefano Garzarella wrote:
> On Fri, Jun 03, 2022 at 05:37:48AM +0000, Arseniy Krasnov wrote:
>> This adds transport callback which processes rx
>> queue of socket and instead of copying data to
>> user provided buffer, it inserts data pages of
>> each packet to user's vm area.
>>
>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> ---
>> include/linux/virtio_vsock.h            |   4 +
>> include/uapi/linux/virtio_vsock.h       |   6 +
>> net/vmw_vsock/virtio_transport_common.c | 208 +++++++++++++++++++++++-
>> 3 files changed, 215 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index d02cb7aa922f..47a68a2ea838 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -51,6 +51,7 @@ struct virtio_vsock_pkt {
>>     bool reply;
>>     bool tap_delivered;
>>     bool slab_buf;
>> +    bool split;
>> };
>>
>> struct virtio_vsock_pkt_info {
>> @@ -131,6 +132,9 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>>                 struct sockaddr_vm *addr);
>> bool virtio_transport_dgram_allow(u32 cid, u32 port);
>>
>> +int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
>> +                      struct vm_area_struct *vma,
>> +                      unsigned long addr);
>> int virtio_transport_connect(struct vsock_sock *vsk);
>>
>> int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> index 64738838bee5..6775c6c44b5b 100644
>> --- a/include/uapi/linux/virtio_vsock.h
>> +++ b/include/uapi/linux/virtio_vsock.h
>> @@ -66,6 +66,12 @@ struct virtio_vsock_hdr {
>>     __le32    fwd_cnt;
>> } __attribute__((packed));
>>
>> +struct virtio_vsock_usr_hdr {
>> +    u32 flags;
>> +    u32 len;
>> +    u32 copy_len;
>> +} __attribute__((packed));
>> +
>> enum virtio_vsock_type {
>>     VIRTIO_VSOCK_TYPE_STREAM = 1,
>>     VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 278567f748f2..3a3e84176c75 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -12,6 +12,7 @@
>> #include <linux/ctype.h>
>> #include <linux/list.h>
>> #include <linux/virtio_vsock.h>
>> +#include <linux/mm.h>
>> #include <uapi/linux/vsockmon.h>
>>
>> #include <net/sock.h>
>> @@ -347,6 +348,196 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
>>     return err;
>> }
>>
>> +#define MAX_PAGES_TO_MAP 256
>> +
>> +int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
>> +                      struct vm_area_struct *vma,
>> +                      unsigned long addr)
>> +{
>> +    struct virtio_vsock_sock *vvs = vsk->trans;
>> +    struct virtio_vsock_usr_hdr *usr_hdr_buffer;
>> +    unsigned long max_pages_to_insert;
>> +    unsigned long tmp_pages_inserted;
>> +    unsigned long pages_to_insert;
>> +    struct page *usr_hdr_page;
>> +    unsigned long vma_size;
>> +    struct page **pages;
>> +    int max_vma_pages;
>> +    int max_usr_hdrs;
>> +    int res;
>> +    int err;
>> +    int i;
>> +
>> +    /* Only use VMA from first page. */
>> +    if (vma->vm_start != addr)
>> +        return -EFAULT;
>> +
>> +    vma_size = vma->vm_end - vma->vm_start;
>> +
>> +    /* Too small vma(at least one page for headers
>> +     * and one page for data).
>> +     */
>> +    if (vma_size < 2 * PAGE_SIZE)
>> +        return -EFAULT;
>> +
>> +    /* Page for meta data. */
>> +    usr_hdr_page = alloc_page(GFP_KERNEL);
> 
> I think all these checks should be done in af_vsock.c.
> 
> It would be nice to avoid having every transport reimplement the same thing, and especially to ensure that all transports behave the same way.
> 
> If you can, it would be nice to have the transports return an array of pages to map, and af_vsock will handle it and the usr_hdr_page.
> 
> Do you think it's doable?
Yes, as we discussed on the previous patch, this part could be common.
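
For example (sketch; the -EINVAL/-ENOMEM codes are idiomatic guesses,
the current patch returns -EFAULT on all of these paths):

	/* In af_vsock, before calling into the transport: validate
	 * the vma and allocate the shared header page only once.
	 */
	if (vma->vm_start != addr ||
	    vma->vm_end - vma->vm_start < 2 * PAGE_SIZE)
		return -EINVAL;

	usr_hdr_page = alloc_page(GFP_KERNEL);
	if (!usr_hdr_page)
		return -ENOMEM;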
> 
>> +
>> +    if (!usr_hdr_page)
>> +        return -EFAULT;
>> +
>> +    pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(pages[0]), GFP_KERNEL);
>> +
>> +    if (!pages)
>> +        return -EFAULT;
>> +
>> +    pages[pages_to_insert++] = usr_hdr_page;
>> +
>> +    usr_hdr_buffer = page_to_virt(usr_hdr_page);
>> +
>> +    err = 0;
>> +
>> +    /* As we use the first page for headers, the total number of
>> +     * pages for the user is the minimum between the number of
>> +     * headers in the first page and the vma size in pages
>> +     * (excluding the first page).
>> +     */
>> +    max_usr_hdrs = PAGE_SIZE / sizeof(*usr_hdr_buffer);
>> +    max_vma_pages = (vma_size / PAGE_SIZE) - 1;
>> +    max_pages_to_insert = min(max_usr_hdrs, max_vma_pages);
>> +
>> +    if (max_pages_to_insert > MAX_PAGES_TO_MAP)
>> +        max_pages_to_insert = MAX_PAGES_TO_MAP;
>> +
>> +    spin_lock_bh(&vvs->rx_lock);
>> +
>> +    while (!list_empty(&vvs->rx_queue) &&
>> +           pages_to_insert < max_pages_to_insert) {
>> +        struct virtio_vsock_pkt *pkt;
>> +        ssize_t rest_data_bytes;
>> +        size_t moved_data_bytes;
>> +        unsigned long pg_offs;
>> +
>> +        pkt = list_first_entry(&vvs->rx_queue,
>> +                       struct virtio_vsock_pkt, list);
>> +
>> +        /* Buffer was allocated by 'kmalloc()'. This could
>> +         * happen when zerocopy was enabled, but we still
>> +         * have a pending packet created before that.
>> +         */
>> +        if (pkt->slab_buf) {
>> +            usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
>> +            usr_hdr_buffer->len = 0;
>> +            usr_hdr_buffer->copy_len = le32_to_cpu(pkt->hdr.len);
>> +            /* Report user to read it using copy. */
> 
> Is it a "to do"?
No, it seems I need to move my comment under the opening bracket.
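
I.e. (sketch of the relocated comment):

	if (pkt->slab_buf) {
		/* Report to user: read this packet with copy. */
		usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
		usr_hdr_buffer->len = 0;
		usr_hdr_buffer->copy_len = le32_to_cpu(pkt->hdr.len);
		break;
	}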
> 
>> +            break;
>> +        }
>> +
>> +        /* This could happen when the packet was partially
>> +         * dequeued earlier by an ordinary 'read()' call. We
>> +         * can't handle such a packet. Drop it.
> 
> We can't drop packets.
> I think we should only allow enabling/disabling this new feature before the connection is established.
Yes, I will allow enabling/disabling it only in the non-connected state.
> 
>> +         */
>> +        if (pkt->off % PAGE_SIZE) {
>> +            list_del(&pkt->list);
>> +            virtio_transport_dec_rx_pkt(vvs, pkt);
>> +            virtio_transport_free_pkt(pkt);
>> +            continue;
>> +        }
>> +
>> +        rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
>> +
>> +        /* For packets bigger than one page, split the
>> +         * high-order buffer into 0-order pages. Otherwise
>> +         * 'vm_insert_pages()' will fail for all pages
>> +         * except the first one.
>> +         */
>> +        if (rest_data_bytes > PAGE_SIZE) {
>> +            /* High order buffer not split yet. */
>> +            if (!pkt->split) {
>> +                split_page(virt_to_page(pkt->buf),
>> +                       get_order(le32_to_cpu(pkt->hdr.len)));
>> +                pkt->split = true;
>> +            }
>> +        }
>> +
>> +        pg_offs = pkt->off;
>> +        moved_data_bytes = 0;
>> +
>> +        while (rest_data_bytes &&
>> +               pages_to_insert < max_pages_to_insert) {
>> +            struct page *buf_page;
>> +
>> +            buf_page = virt_to_page(pkt->buf + pg_offs);
>> +
>> +            pages[pages_to_insert++] = buf_page;
>> +            /* Get reference to prevent this page from being
>> +             * returned to the page allocator when the packet
>> +             * is freed. Ref count will be 2.
>> +             */
>> +            get_page(buf_page);
>> +            pg_offs += PAGE_SIZE;
>> +
>> +            if (rest_data_bytes >= PAGE_SIZE) {
>> +                moved_data_bytes += PAGE_SIZE;
>> +                rest_data_bytes -= PAGE_SIZE;
>> +            } else {
>> +                moved_data_bytes += rest_data_bytes;
>> +                rest_data_bytes = 0;
>> +            }
>> +        }
>> +
>> +        usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
>> +        usr_hdr_buffer->len = moved_data_bytes;
>> +        usr_hdr_buffer->copy_len = 0;
>> +        usr_hdr_buffer++;
>> +
>> +        pkt->off = pg_offs;
>> +
>> +        if (rest_data_bytes == 0) {
>> +            list_del(&pkt->list);
>> +            virtio_transport_dec_rx_pkt(vvs, pkt);
>> +            virtio_transport_free_pkt(pkt);
>> +        }
>> +
>> +        /* Now ref count for all pages of packet is 1. */
>> +    }
>> +
>> +    /* Set last buffer empty(if we have one). */
>> +    if (pages_to_insert - 1 < max_usr_hdrs)
>> +        usr_hdr_buffer->len = 0;
>> +
>> +    spin_unlock_bh(&vvs->rx_lock);
>> +
>> +    tmp_pages_inserted = pages_to_insert;
>> +
>> +    res = vm_insert_pages(vma, addr, pages, &tmp_pages_inserted);
>> +
>> +    if (res || tmp_pages_inserted) {
>> +        /* Failed to insert some pages, we have a "partially"
>> +         * mapped vma. Do not return yet, just set the error
>> +         * code that will be returned to user. User needs to
>> +         * call 'madvise()/mmap()' to clear this vma. Anyway,
>> +         * references to all pages will be dropped below.
>> +         */
>> +        err = -EFAULT;
>> +    }
>> +
>> +    /* Put reference for every page. */
>> +    for (i = 0; i < pages_to_insert; i++) {
>> +        /* Ref count is 2 ('get_page()' + 'vm_insert_pages()' above).
>> +         * Put reference once, page will be returned to allocator
>> +         * after user's 'madvise()/munmap()' call (or it wasn't mapped
>> +         * if 'vm_insert_pages()' failed).
>> +         */
>> +        put_page(pages[i]);
>> +    }
>> +
>> +    virtio_transport_send_credit_update(vsk);
>> +    kfree(pages);
>> +
>> +    return err;
>> +}
>> +EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
>> +
>> static ssize_t
>> virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>>                    struct msghdr *msg,
>> @@ -1344,10 +1535,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>> {
>>     if (pkt->buf_len) {
>> -        if (pkt->slab_buf)
>> +        if (pkt->slab_buf) {
>>             kfree(pkt->buf);
>> -        else
>> -            free_pages(buf, get_order(pkt->buf_len));
>> +        } else {
>> +            unsigned int order = get_order(pkt->buf_len);
>> +            unsigned long buf = (unsigned long)pkt->buf;
>> +
>> +            if (pkt->split) {
>> +                int i;
>> +
>> +                for (i = 0; i < (1 << order); i++)
>> +                    free_page(buf + i * PAGE_SIZE);
>> +            } else {
>> +                free_pages(buf, order);
>> +            }
>> +        }
>>     }
>>
>>     kfree(pkt);
>> -- 
>> 2.25.1
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
  2022-06-09  8:54   ` Stefano Garzarella
  (?)
@ 2022-06-09 12:33   ` Arseniy Krasnov
  2022-06-13  8:54       ` Stefano Garzarella
  -1 siblings, 1 reply; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-09 12:33 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On 09.06.2022 11:54, Stefano Garzarella wrote:
> Hi Arseniy,
> I left some comments in the patches, and I'm adding something also here:
Thanks for the comments.
> 
> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>                              INTRODUCTION
>>
>>     Hello, this is experimental implementation of virtio vsock zerocopy
>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>> fill provided vma area with pages of virtio RX buffers. After received data was
>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
> 
> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
You mean compare the APIs in more detail?
> 
>>
>>                                 DETAILS
>>
>>     Here is how mapping with mapped pages looks exactly: first page mapping
>> contains array of trimmed virtio vsock packet headers (in contains only length
>> of data on the corresponding page and 'flags' field):
>>
>>     struct virtio_vsock_usr_hdr {
>>         uint32_t length;
>>         uint32_t flags;
>>         uint32_t copy_len;
>>     };
>>
>> Field  'length' allows user to know exact size of payload within each sequence
>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>> All other pages are data pages from RX queue.
>>
>>             Page 0      Page 1      Page N
>>
>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>           |        |       ^           ^
>>           |        |       |           |
>>           |        *-------------------*
>>           |                |
>>           |                |
>>           *----------------*
>>
>>     Of course, single header could represent array of pages (when packet's
>> buffer is bigger than one page).So here is example of detailed mapping layout
>> for some set of packages. Lets consider that we have the following sequence  of
>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>> be inserted to user's vma(vma is large enough).
> 
> In order to have a "userspace polling-friendly approach" and reduce the number of syscalls, can we allow, for example, the userspace to mmap at least the first header page before packets arrive?
> Then the userspace can poll a flag or other fields in the header to understand that there are new packets.
You mean that, to avoid the 'poll()' syscall, the user will spin on some flag provided by the kernel in a mapped page? I think yes, this is ok. Also I think that I can avoid the 'madvise()' call
to clear the memory mapping before each 'getsockopt()' - let 'getsockopt()' do the 'madvise()' job by removing the pages of the previous data. In this case only one system call is needed - 'getsockopt()'.
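
The per-iteration userspace flow would then shrink to something like
this (sketch; 'consume()' is a placeholder for parsing the header page
and the data pages):

	/* One mmap() up front on the socket... */
	void *addr = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, 0);

	/* ...then a single syscall per batch: 'getsockopt()' would zap
	 * the previously inserted pages and map the new ones.
	 */
	for (;;) {
		socklen_t len = sizeof(addr);

		if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
			       &addr, &len))
			break;

		consume(addr);
	}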
> 
> That would be cool, but in the meantime it would be nice to behave similarly to TCP, which is why the comparison table I mentioned earlier would be useful.
> 
>>
>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>     Page 1: [ 56 ]
>>     Page 2: [ 4096 ]
>>     Page 3: [ 4096 ]
>>     Page 4: [ 4096 ]
>>     Page 5: [ 8 ]
>>
>>     Page 0 contains only array of headers:
>>     'hdr0' has 56 in length field.
>>     'hdr1' has 4096 in length field.
>>     'hdr2' has 8200 in length field.
>>     'hdr3' has 0 in length field(this is end of data marker).
>>
>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>     Page 2 corresponds to 'hdr1' and filled with data.
>>     Page 3 corresponds to 'hdr2' and filled with data.
>>     Page 4 corresponds to 'hdr2' and filled with data.
>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>
>>     This patchset also changes packets allocation way: today implementation
>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>> "not large" could be bigger than one page). So to avoid this, data buffers now
>> allocated using 'alloc_pages()' call.
>>
>>                                   TESTS
>>
>>     This patchset updates 'vsock_test' utility: two tests for new feature
>> were added. First test covers invalid cases. Second checks valid transmission
>> case.
>>
>>                                BENCHMARKING
>>
>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>> client/server mode. When client connects to server, server starts sending exact
>> amount of data to client(amount is set as input argument).Client reads data and
>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>> is better. For server, we can set size of tx buffer and for client we can set
>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>> is quiet simple:
>>
>> For client mode:
>>
>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>
>> For server mode:
>>
>> ./rx_zerocopy --mode server [--mb] [--tx]
>>
>> [--mb] sets number of megabytes to transfer.
>> [--rx] sets size of receive buffer/mapping in pages.
>> [--tx] sets size of transmit buffer in pages.
>>
>> I checked for transmission of 4000mb of data. Here are some results:
>>
>>                           size of rx/tx buffers in pages
>>               *---------------------------------------------------*
>>               |    8   |    32    |    64   |   256    |   512    |
>> *--------------*--------*----------*---------*----------*----------*
>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>> *--------------*---------------------------------------------------- process
>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>> *--------------*----------------------------------------------------
>>
>> Result in first column(where non-zerocopy works better than zerocopy) happens
>> because time, spent in 'read()' system call is smaller that time in 'getsockopt'
>> + 'madvise'. I've checked that.
>>
>> I think, that results are not so impressive, but at least it is not worse than
>> copy mode and there is no need to allocate memory for processing date.
>>
>>                                 PROBLEMS
>>
>>     Updated packet's allocation logic creates some problem: when host gets
>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>> (even if packet has 1 byte payload). I think this could be resolved in several
>> ways:
>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>> it, current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)
> 
> Yep, but I think we should not allow changing it while we are connected (see comments in the patches).
> 
>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>> in this case, we have mix of packets, allocated in two different ways thus
>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>> be handled in some stupid way: we need to allocate one page for user, copy data
>> to it and then insert page to user's vma.
>>
>> v1 -> v2:
>> 1) Zerocopy receive mode could be enabled/disabled(disabled by default). I
>>    didn't use generic SO_ZEROCOPY flag, because in virtio-vsock case this
>>    feature depends on transport support. Instead of SO_ZEROCOPY, AF_VSOCK
>>    layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while previous meaning of
>>    SO_VM_SOCKETS_ZEROCOPY(insert receive buffers to user's vm area) now
>>    renamed to SO_VM_SOCKETS_MAP_RX.
>> 2) Packet header which is exported to user now get new field: 'copy_len'.
>>    This field handles special case:  user reads data from socket in non
>>    zerocopy way(with disabled zerocopy) and then enables zerocopy feature.
>>    In this case vhost part will switch data buffer allocation logic from
>>    'kmalloc()' to direct calls for buddy allocator. But, there could be
>>    some pending 'kmalloc()' allocated packets in socket's rx list, and then
>>    user tries to read such packets in zerocopy way, dequeue will fail,
>>    because SLAB pages could not be inserted to user's vm area. So when such
>>    packet is found during zerocopy dequeue, dequeue loop will break and
>>    'copy_len' will show size of such "bad" packet. After user detects this
>>    case, it must use 'read()/recv()' calls to dequeue such packet.
>> 3) Also may be move this features under config option?
> 
> Do you mean a build config like CONFIG_VSOCK_ZERO_COPY?
> 
> I'm not sure it's needed.
Ack
> 
> Thanks,
> Stefano
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic
  2022-06-09 12:20     ` Arseniy Krasnov
@ 2022-06-13  8:50         ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-13  8:50 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On Thu, Jun 09, 2022 at 12:20:22PM +0000, Arseniy Krasnov wrote:
>On 09.06.2022 11:39, Stefano Garzarella wrote:
>> On Fri, Jun 03, 2022 at 05:35:48AM +0000, Arseniy Krasnov wrote:
>>> This:
>>> 1) Adds callback for 'mmap()' call on socket. It checks vm
>>>   area flags and sets vm area ops.
>>> 2) Adds special 'getsockopt()' case which calls transport
>>>   zerocopy callback. Input argument is vm area address.
>>> 3) Adds 'getsockopt()/setsockopt()' for switching on/off rx
>>>   zerocopy mode.
>>>
>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>> ---
>>> include/net/af_vsock.h          |   7 +++
>>> include/uapi/linux/vm_sockets.h |   3 +
>>> net/vmw_vsock/af_vsock.c        | 100 ++++++++++++++++++++++++++++++++
>>> 3 files changed, 110 insertions(+)
>>>
>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>> index f742e50207fb..f15f84c648ff 100644
>>> --- a/include/net/af_vsock.h
>>> +++ b/include/net/af_vsock.h
>>> @@ -135,6 +135,13 @@ struct vsock_transport {
>>>     bool (*stream_is_active)(struct vsock_sock *);
>>>     bool (*stream_allow)(u32 cid, u32 port);
>>>
>>> +    int (*rx_zerocopy_set)(struct vsock_sock *vsk,
>>> +                   bool enable);
>>> +    int (*rx_zerocopy_get)(struct vsock_sock *vsk);
>>> +    int (*zerocopy_dequeue)(struct vsock_sock *vsk,
>>> +                struct vm_area_struct *vma,
>>> +                unsigned long addr);
>>> +
>>>     /* SEQ_PACKET. */
>>>     ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
>>>                      int flags);
>>> diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
>>> index c60ca33eac59..d1f792bed1a7 100644
>>> --- a/include/uapi/linux/vm_sockets.h
>>> +++ b/include/uapi/linux/vm_sockets.h
>>> @@ -83,6 +83,9 @@
>>>
>>> #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
>>>
>>> +#define SO_VM_SOCKETS_MAP_RX 9
>>> +#define SO_VM_SOCKETS_ZEROCOPY 10
>>> +
>>> #if !defined(__KERNEL__)
>>> #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
>>> #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
>>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>> index f04abf662ec6..10061ef21730 100644
>>> --- a/net/vmw_vsock/af_vsock.c
>>> +++ b/net/vmw_vsock/af_vsock.c
>>> @@ -1644,6 +1644,17 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>>>         }
>>>         break;
>>>     }
>>> +    case SO_VM_SOCKETS_ZEROCOPY: {
>>> +        if (!transport || !transport->rx_zerocopy_set) {
>>> +            err = -EOPNOTSUPP;
>>> +        } else {
>>> +            COPY_IN(val);
>>> +
>>> +            if (transport->rx_zerocopy_set(vsk, val))
>>> +                err = -EINVAL;
>>> +        }
>>> +        break;
>>> +    }
>>>
>>>     default:
>>>         err = -ENOPROTOOPT;
>>> @@ -1657,6 +1668,48 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>>>     return err;
>>> }
>>>
>>> +static const struct vm_operations_struct afvsock_vm_ops = {
>>> +};
>>> +
>>> +static int vsock_recv_zerocopy(struct socket *sock,
>>> +                   unsigned long address)
>>> +{
>>> +    struct sock *sk = sock->sk;
>>> +    struct vsock_sock *vsk = vsock_sk(sk);
>>> +    struct vm_area_struct *vma;
>>> +    const struct vsock_transport *transport;
>>> +    int res;
>>> +
>>> +    transport = vsk->transport;
>>> +
>>> +    if (!transport->rx_zerocopy_get)
>>> +        return -EOPNOTSUPP;
>>> +
>>> +    if (!transport->rx_zerocopy_get(vsk))
>>> +        return -EOPNOTSUPP;
>>
>> Maybe we can merge them into:
>>         if (!transport->rx_zerocopy_get ||
>>             !transport->rx_zerocopy_get(vsk))
>>                 return -EOPNOTSUPP;
>>
>>> +
>>> +    if (!transport->zerocopy_dequeue)
>>> +        return -EOPNOTSUPP;
>>> +
>>> +    lock_sock(sk);
>>> +    mmap_write_lock(current->mm);
>>
>> So, multiple threads using different sockets are serialized if they use zero-copy?
>>
>> IIUC this is necessary because the callback calls vm_insert_page().
>>
>> At this point I think it's better not to do this in every transport, but to have the callback return an array of pages to map, and map them here, limiting as much as possible the critical section protected by mmap_write_lock().
>
>Yes, it will be easy for the transport callback to return an array of pages.
>
>>
>>> +
>>> +    vma = vma_lookup(current->mm, address);
>>> +
>>> +    if (!vma || vma->vm_ops != &afvsock_vm_ops) {
>>> +        mmap_write_unlock(current->mm);
>>> +        release_sock(sk);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    res = transport->zerocopy_dequeue(vsk, vma, address);
>>> +
>>> +    mmap_write_unlock(current->mm);
>>> +    release_sock(sk);
>>> +
>>> +    return res;
>>> +}
>>> +
>>> static int vsock_connectible_getsockopt(struct socket *sock,
>>>                     int level, int optname,
>>>                     char __user *optval,
>>> @@ -1701,6 +1754,39 @@ static int vsock_connectible_getsockopt(struct socket *sock,
>>>         lv = sock_get_timeout(vsk->connect_timeout, &v,
>>>                       optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
>>>         break;
>>> +    case SO_VM_SOCKETS_ZEROCOPY: {
>>> +        const struct vsock_transport *transport;
>>> +        int res;
>>> +
>>> +        transport = vsk->transport;
>>> +
>>> +        if (!transport->rx_zerocopy_get)
>>> +            return -EOPNOTSUPP;
>>> +
>>> +        lock_sock(sk);
>>
>> I think we should call lock_sock() before reading the transport to avoid races and we should check if it is assigned.
>>
>> At that point I think it's better to store this info in vsock_sock and not in the transport.
>You mean to store a flag that zerocopy is enabled in 'vsock_sock', and just read it here, without touching the transport?

Yep.

Thanks,
Stefano
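
For illustration, here is a minimal sketch of the refactoring suggested above: the transport callback only collects page pointers, and the af_vsock core maps them, so that only the vm_insert_page() loop runs under mmap_write_lock(). The callback name 'zerocopy_fill_pages', its signature, and the MAX_ZC_PAGES bound are assumptions made up for this sketch, not part of the patchset:

	#define MAX_ZC_PAGES 32	/* arbitrary bound for the sketch */

	static int vsock_recv_zerocopy(struct socket *sock,
				       unsigned long address)
	{
		struct sock *sk = sock->sk;
		struct vsock_sock *vsk = vsock_sk(sk);
		struct page *pages[MAX_ZC_PAGES];
		struct vm_area_struct *vma;
		int cnt, i, res = 0;

		lock_sock(sk);

		/* Assumed callback: dequeue up to MAX_ZC_PAGES pages of RX
		 * data without touching current->mm. */
		cnt = vsk->transport->zerocopy_fill_pages(vsk, pages,
							  MAX_ZC_PAGES);
		if (cnt < 0) {
			release_sock(sk);
			return cnt;
		}

		mmap_write_lock(current->mm);

		vma = vma_lookup(current->mm, address);
		if (!vma || vma->vm_ops != &afvsock_vm_ops) {
			res = -EINVAL;
		} else {
			for (i = 0; i < cnt && !res; i++)
				res = vm_insert_page(vma,
						     address + i * PAGE_SIZE,
						     pages[i]);
		}

		mmap_write_unlock(current->mm);
		release_sock(sk);

		return res;
	}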


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
  2022-06-09 12:33   ` Arseniy Krasnov
@ 2022-06-13  8:54       ` Stefano Garzarella
  0 siblings, 0 replies; 29+ messages in thread
From: Stefano Garzarella @ 2022-06-13  8:54 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On Thu, Jun 09, 2022 at 12:33:32PM +0000, Arseniy Krasnov wrote:
>On 09.06.2022 11:54, Stefano Garzarella wrote:
>> Hi Arseniy,
>> I left some comments in the patches, and I'm adding something also here:
>Thanks for comments
>>
>> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>>                              INTRODUCTION
>>>
>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>
>> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
>You mean compare the APIs in more detail?

Yes, maybe a comparison from the user's point of view to do zero-copy 
with TCP and VSOCK.

>>
>>>
>>>                                 DETAILS
>>>
>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>> of data on the corresponding page and 'flags' field):
>>>
>>>     struct virtio_vsock_usr_hdr {
>>>         uint32_t length;
>>>         uint32_t flags;
>>>         uint32_t copy_len;
>>>     };
>>>
>>> Field  'length' allows user to know exact size of payload within each sequence
>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>>> All other pages are data pages from RX queue.
>>>
>>>             Page 0      Page 1      Page N
>>>
>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>           |        |       ^           ^
>>>           |        |       |           |
>>>           |        *-------------------*
>>>           |                |
>>>           |                |
>>>           *----------------*
>>>
>>>     Of course, single header could represent array of pages (when packet's
>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>> for some set of packages. Lets consider that we have the following sequence  of
>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>> be inserted to user's vma(vma is large enough).
>>
>> In order to have a "userspace polling-friendly approach" and reduce the number of syscalls, can we allow, for example, userspace to mmap at least the first header before packets arrive?
>> Then userspace can poll a flag or other fields in the header to see that new packets have arrived.
>You mean that, to avoid the 'poll()' syscall, the user will spin on some flag provided by the kernel on a mapped page? I think yes, this is ok. Also I think I can avoid the 'madvise()' call
>that clears the memory mapping before each 'getsockopt()' - let 'getsockopt()' do the 'madvise()' job by removing the pages of the previous data. In this case only one system call is needed - 'getsockopt()'.

Yes, that's right. I mean to support both: poll() for interrupt-based 
applications, and the ability to actively poll a variable in the shared 
memory for applications that want to minimize latency.

Thanks,
Stefano
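
For illustration, here is a rough sketch of the user-side flow discussed in this thread. The following are assumptions, not confirmed by the patchset: that SO_VM_SOCKETS_MAP_RX is the 'getsockopt()' that fills the mapping with RX pages, that the option level is AF_VSOCK, and that a zero 'length' terminates the header array as in the cover letter's example; 'consume()' is a hypothetical callback and error handling is omitted:

	#include <stdint.h>
	#include <sys/mman.h>
	#include <sys/socket.h>
	#include <unistd.h>

	#ifndef SO_VM_SOCKETS_MAP_RX
	#define SO_VM_SOCKETS_MAP_RX	9	/* values proposed by this patchset */
	#define SO_VM_SOCKETS_ZEROCOPY	10
	#endif

	struct virtio_vsock_usr_hdr {	/* proposed UAPI, see cover letter */
		uint32_t length;
		uint32_t flags;
		uint32_t copy_len;
	};

	void consume(const void *data, uint32_t len);	/* hypothetical */

	/* Setup (done once, not shown): enable zerocopy with
	 * setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY, ...) and map
	 * 'area' of 'map_len' bytes with mmap(..., MAP_SHARED, fd, 0). */
	static void zc_recv_once(int fd, void *area, size_t map_len)
	{
		long psize = sysconf(_SC_PAGESIZE);
		uintptr_t addr = (uintptr_t)area;
		socklen_t alen = sizeof(addr);
		struct virtio_vsock_usr_hdr *hdr = area;	/* page 0 */
		char *data = (char *)area + psize;		/* page 1 */

		/* Assumed semantics: fill the mapping at 'addr' with RX pages. */
		getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, &addr, &alen);

		for (; hdr->length; hdr++) {	/* zero length ends the batch */
			consume(data, hdr->length);
			/* each packet covers a whole number of pages */
			data += (hdr->length + psize - 1) / psize * psize;
		}

		/* Return the pages, otherwise the next getsockopt() fails. */
		madvise(area, map_len, MADV_DONTNEED);
	}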


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
  2022-06-13  8:54       ` Stefano Garzarella
  (?)
@ 2022-06-14  4:23       ` Arseniy Krasnov
  -1 siblings, 0 replies; 29+ messages in thread
From: Arseniy Krasnov @ 2022-06-14  4:23 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel, Krasnov Arseniy

On 13.06.2022 11:54, Stefano Garzarella wrote:
> On Thu, Jun 09, 2022 at 12:33:32PM +0000, Arseniy Krasnov wrote:
>> On 09.06.2022 11:54, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>> I left some comments in the patches, and I'm adding something also here:
>> Thanks for comments
>>>
>>> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>>>                              INTRODUCTION
>>>>
>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>
>>> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
>> You mean compare the APIs in more detail?
> 
> Yes, maybe a comparison from the user's point of view to do zero-copy with TCP and VSOCK.
> 
>>>
>>>>
>>>>                                 DETAILS
>>>>
>>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>>> of data on the corresponding page and 'flags' field):
>>>>
>>>>     struct virtio_vsock_usr_hdr {
>>>>         uint32_t length;
>>>>         uint32_t flags;
>>>>         uint32_t copy_len;
>>>>     };
>>>>
>>>> Field  'length' allows user to know exact size of payload within each sequence
>>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>>> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>>>> All other pages are data pages from RX queue.
>>>>
>>>>             Page 0      Page 1      Page N
>>>>
>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>           |        |       ^           ^
>>>>           |        |       |           |
>>>>           |        *-------------------*
>>>>           |                |
>>>>           |                |
>>>>           *----------------*
>>>>
>>>>     Of course, single header could represent array of pages (when packet's
>>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>>> for some set of packages. Lets consider that we have the following sequence  of
>>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>>> be inserted to user's vma(vma is large enough).
>>>
>>> In order to have a "userspace polling-friendly approach" and reduce the number of syscalls, can we allow, for example, userspace to mmap at least the first header before packets arrive?
>>> Then userspace can poll a flag or other fields in the header to see that new packets have arrived.
>> You mean that, to avoid the 'poll()' syscall, the user will spin on some flag provided by the kernel on a mapped page? I think yes, this is ok. Also I think I can avoid the 'madvise()' call
>> that clears the memory mapping before each 'getsockopt()' - let 'getsockopt()' do the 'madvise()' job by removing the pages of the previous data. In this case only one system call is needed - 'getsockopt()'.
> 
> Yes, that's right. I mean to support both: poll() for interrupt-based applications, and the ability to actively poll a variable in the shared memory for applications that want to minimize latency.
I see, in this case it seems 'vsock_sock' will have to maintain a list of such shared pages, updating each page when new data is available, and sometimes checking that the mapping was removed
by the user (because we don't have a munmap callback in 'proto_ops', only mmap), for example by using a ref counter on each shared page (a sketch of this idea follows below).

Thanks
> 
> Thanks,
> Stefano
> 
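
A possible shape for the idea above, as a sketch only (the 'vsock_zc_poll' struct, the flag bit, and the helper names are made up here; 'virtio_vsock_usr_hdr' is the proposed UAPI struct quoted earlier): the kernel keeps its own reference on the shared header page, updates a flag in it when new data arrives, and infers that userspace dropped the mapping once the page refcount falls back to the kernel's own reference:

	#include <linux/mm.h>

	#define VSOCK_ZC_DATA_READY	0x1	/* made-up flag bit */

	/* Hypothetical per-socket state for a header page that is shared
	 * with userspace for low-latency polling. */
	struct vsock_zc_poll {
		struct page *usr_hdr_page;	/* kernel holds one reference */
	};

	/* Called when new RX data is queued: set the flag that userspace
	 * spins on. */
	static void vsock_zc_poll_signal(struct vsock_zc_poll *zp)
	{
		struct virtio_vsock_usr_hdr *hdr = page_address(zp->usr_hdr_page);

		WRITE_ONCE(hdr->flags, READ_ONCE(hdr->flags) | VSOCK_ZC_DATA_READY);
	}

	/* Without a ->munmap() hook in 'proto_ops', the mapping's lifetime
	 * can only be inferred: once the refcount drops back to the kernel's
	 * single reference, userspace no longer maps the page. */
	static bool vsock_zc_poll_unmapped(struct vsock_zc_poll *zp)
	{
		return page_count(zp->usr_hdr_page) == 1;
	}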


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-06-14  4:23 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-03  5:27 [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
2022-06-03  5:31 ` [RFC PATCH v2 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
2022-06-09  8:37   ` Stefano Garzarella
2022-06-03  5:33 ` [RFC PATCH v2 2/8] vhost/vsock: " Arseniy Krasnov
2022-06-09  8:38   ` Stefano Garzarella
2022-06-09 12:09     ` Arseniy Krasnov
2022-06-03  5:35 ` [RFC PATCH v2 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
2022-06-09  8:39   ` Stefano Garzarella
2022-06-09 12:20     ` Arseniy Krasnov
2022-06-13  8:50       ` Stefano Garzarella
2022-06-03  5:37 ` [RFC PATCH v2 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
2022-06-09  8:41   ` Stefano Garzarella
2022-06-09 12:23     ` Arseniy Krasnov
2022-06-03  5:39 ` [RFC PATCH v2 5/8] vhost/vsock: enable " Arseniy Krasnov
2022-06-03  5:41 ` [RFC PATCH v2 6/8] virtio/vsock: " Arseniy Krasnov
2022-06-03  5:43 ` [RFC PATCH v2 7/8] test/vsock: add receive zerocopy tests Arseniy Krasnov
2022-06-03  5:44 ` [RFC PATCH v2 8/8] test/vsock: vsock rx zerocopy utility Arseniy Krasnov
2022-06-08 15:59 ` [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive Zhao Liu
2022-06-09  8:54 ` Stefano Garzarella
2022-06-09 12:33   ` Arseniy Krasnov
2022-06-13  8:54     ` Stefano Garzarella
2022-06-14  4:23       ` Arseniy Krasnov
