netdev.vger.kernel.org archive mirror
* [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
@ 2022-05-12  5:04 Arseniy Krasnov
  2022-05-12  5:06 ` [RFC PATCH v1 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:04 UTC (permalink / raw)
  To: Stefano Garzarella, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel

                              INTRODUCTION

	Hello, this is an experimental implementation of virtio vsock zerocopy
receive, inspired by Eric Dumazet's TCP zerocopy receive. The API uses the
same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
fills the provided vma with pages of virtio RX buffers. After the received
data has been processed, the user must free the pages with an 'madvise()' call
with the MADV_DONTNEED flag set (if the user doesn't call 'madvise()', the
next 'getsockopt()' will fail).

                                 DETAILS

	Here is how the mapping looks: the first page contains an array of
trimmed virtio vsock packet headers (each contains only the length of the
data in the corresponding page(s) and a 'flags' field):

	struct virtio_vsock_usr_hdr {
		uint32_t length;
		uint32_t flags;
	};

The 'length' field tells the user the exact payload size within each sequence
of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such as
message or record boundaries). All other pages are data pages from the RX
queue.

             Page 0      Page 1      Page N

	[ hdr1 .. hdrN ][ data ] .. [ data ]
           |        |       ^           ^
           |        |       |           |
           |        *-------------------*
           |                |
           |                |
           *----------------*

	Of course, a single header may describe an array of pages (when the
packet's buffer is bigger than one page). Here is an example of the detailed
mapping layout for a set of packets. Suppose we have the following sequence
of packets: 56 bytes, 4096 bytes and 8200 bytes. Pages 0 to 5 will all be
inserted into the user's vma (assuming the vma is large enough).

	Page 0: [[ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
	Page 1: [ 56 ]
	Page 2: [ 4096 ]
	Page 3: [ 4096 ]
	Page 4: [ 4096 ]
	Page 5: [ 8 ]

	Page 0 contains only the array of headers:
	'hdr0' has 56 in the length field.
	'hdr1' has 4096 in the length field.
	'hdr2' has 8200 in the length field.
	'hdr3' has 0 in the length field (this is the end-of-data marker).

	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
	Page 2 corresponds to 'hdr1' and filled with data.
	Page 3 corresponds to 'hdr2' and filled with data.
	Page 4 corresponds to 'hdr2' and filled with data.
	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
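The layout above can be sanity-checked with a few lines of arithmetic. The
helper below is hypothetical (its name and signature are mine, not the
patchset's): for packet 'i' of a lengths array it computes the index of its
first data page, with page 0 always reserved for the header array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096 /* assumed page size for the worked example */

/* Return the index of the first data page of packet 'i'; '*total'
 * receives the total number of mapped pages, header page included.
 */
static size_t first_data_page(const uint32_t *lens, size_t n, size_t i,
			      size_t *total)
{
	size_t page = 1;	/* data starts right after the header page */
	size_t start = 1;
	size_t j;

	for (j = 0; j < n; j++) {
		if (j == i)
			start = page;
		/* each packet occupies a whole number of pages */
		page += (lens[j] + PAGE_SZ - 1) / PAGE_SZ;
	}

	if (total)
		*total = page;
	return start;
}
```

For the example lengths {56, 4096, 8200} this yields start pages 1, 2 and 3
and a total of 6 mapped pages, matching pages 0 to 5 above.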

	This patchset also changes the packet allocation scheme: the current
implementation uses only 'kmalloc()' to create the data buffer. A problem
arises when we try to map such buffers into a user's vma: the kernel forbids
mapping slab pages into user vmas (pages of "not large" 'kmalloc()'
allocations are marked with the PageSlab flag, and "not large" can still be
bigger than one page). To avoid this, data buffers are now allocated with
'alloc_pages()'.

                                   TESTS

	This patchset updates the 'vsock_test' utility: two tests for the new
feature were added. The first test covers invalid cases; the second checks a
valid transmission case.

                                BENCHMARKING

	For benchmarking I've added a small utility, 'rx_zerocopy'. It works
in client/server mode. When a client connects, the server sends it a fixed
amount of data (the amount is set as an input argument). The client reads the
data and waits for the next portion. The client works in two modes: copy and
zero-copy. In copy mode the client uses the 'read()' call, while in zerocopy
mode the 'mmap()'/'getsockopt()'/'madvise()' sequence is used. A smaller
transmission time is better. For the server we can set the size of the tx
buffer, and for the client we can set the size of the rx buffer or rx mapping
(in zerocopy mode). Usage of this utility is quite simple:

For client mode:

./rx_zerocopy --mode client [--zerocopy] [--rx <pages>]

For server mode:

./rx_zerocopy --mode server [--mb <megabytes>] [--tx <pages>]

'--mb' sets the number of megabytes to transfer.
'--rx' sets the size of the receive buffer/mapping in pages.
'--tx' sets the size of the transmit buffer in pages.

I measured the transmission of 4000 MB of data. Here are some results:

                           size of rx/tx buffers in pages
               *--------*----------*---------*----------*----------*
               |    8   |    32    |    64   |   256    |   512    |
*--------------*--------*----------*---------*----------*----------*
|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
*--------------*--------*----------*---------*----------*----------* process
| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 MB
*--------------*--------*----------*---------*----------*----------*

I think the results are not that impressive, but for most buffer sizes
zerocopy is no worse than copy mode, and no memory needs to be allocated for
the processed data.

                                 PROBLEMS

	The updated packet allocation logic creates a problem: when the host
receives data from the guest (in vhost-vsock), it allocates at least one page
for each packet (even if the packet has a 1-byte payload). I think this could
be resolved in several ways:
	1) Make zerocopy rx mode disabled by default, so that if the user
didn't enable it, the current 'kmalloc()' path is used.
	2) Use 'kmalloc()' for "small" packets and the page allocator
otherwise. But in this case we have a mix of packets allocated in two
different ways, so during zerocopy to user space (i.e. mapping pages into the
vma) such small packets have to be handled in a clumsy way: allocate one page
for the user, copy the data into it, and then insert the page into the
user's vma.

P.S.: of course this is an experimental RFC, so what do you think?

Arseniy Krasnov (8):
 virtio/vsock: rework packet allocation logic
 vhost/vsock: rework packet allocation logic
 af_vsock: add zerocopy receive logic
 virtio/vsock: add transport zerocopy callback
 vhost/vsock: enable zerocopy callback.
 virtio/vsock: enable zerocopy callback.
 test/vsock: add receive zerocopy tests
 test/vsock: vsock rx zerocopy utility

 drivers/vhost/vsock.c                   |  50 ++++-
 include/linux/virtio_vsock.h            |   4 +
 include/net/af_vsock.h                  |   4 +
 include/uapi/linux/virtio_vsock.h       |   5 +
 include/uapi/linux/vm_sockets.h         |   2 +
 net/vmw_vsock/af_vsock.c                |  61 ++++++
 net/vmw_vsock/virtio_transport.c        |   1 +
 net/vmw_vsock/virtio_transport_common.c | 195 ++++++++++++++++-
 tools/include/uapi/linux/virtio_vsock.h |  10 +
 tools/include/uapi/linux/vm_sockets.h   |   7 +
 tools/testing/vsock/Makefile            |   1 +
 tools/testing/vsock/control.c           |  34 +++
 tools/testing/vsock/control.h           |   2 +
 tools/testing/vsock/rx_zerocopy.c       | 356 ++++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test.c        | 284 +++++++++++++++++++++++++
 15 files changed, 1005 insertions(+), 11 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v1 1/8] virtio/vsock: rework packet allocation logic
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
@ 2022-05-12  5:06 ` Arseniy Krasnov
  2022-05-12  5:09 ` [RFC PATCH v1 2/8] vhost/vsock: " Arseniy Krasnov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:06 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel

To support zerocopy receive, packet buffer allocation
is changed: buffers which may be mapped into a user's
vma can't come from 'kmalloc()' (the kernel forbids
mapping slab pages into a user's vma), so the raw buddy
allocator is called instead. For tx packets (which won't
be mapped to user space) the previous 'kmalloc()' path
is kept, with a special flag in the packet structure to
distinguish 'kmalloc()' buffers from raw pages.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 include/linux/virtio_vsock.h            | 1 +
 net/vmw_vsock/virtio_transport.c        | 8 ++++++--
 net/vmw_vsock/virtio_transport_common.c | 9 ++++++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 35d7eedb5e8e..d02cb7aa922f 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
 	u32 off;
 	bool reply;
 	bool tap_delivered;
+	bool slab_buf;
 };
 
 struct virtio_vsock_pkt_info {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index fb3302fff627..43b7b09b4a0a 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -254,16 +254,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
 	vq = vsock->vqs[VSOCK_VQ_RX];
 
 	do {
+		struct page *buf_page;
+
 		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
 		if (!pkt)
 			break;
 
-		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
-		if (!pkt->buf) {
+		buf_page = alloc_page(GFP_KERNEL);
+
+		if (!buf_page) {
 			virtio_transport_free_pkt(pkt);
 			break;
 		}
 
+		pkt->buf = page_to_virt(buf_page);
 		pkt->buf_len = buf_len;
 		pkt->len = buf_len;
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index ec2c2afbf0d0..278567f748f2 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
 		if (!pkt->buf)
 			goto out_pkt;
 
+		pkt->slab_buf = true;
 		pkt->buf_len = len;
 
 		err = memcpy_from_msg(pkt->buf, info->msg, len);
@@ -1342,7 +1343,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
 
 void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
 {
-	kfree(pkt->buf);
+	if (pkt->buf_len) {
+		if (pkt->slab_buf)
+			kfree(pkt->buf);
+		else
+		free_pages((unsigned long)pkt->buf, get_order(pkt->buf_len));
+	}
+
 	kfree(pkt);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
-- 
2.25.1

* [RFC PATCH v1 2/8] vhost/vsock: rework packet allocation logic
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
  2022-05-12  5:06 ` [RFC PATCH v1 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
@ 2022-05-12  5:09 ` Arseniy Krasnov
  2022-05-12  5:12 ` [RFC PATCH v1 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:09 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel

For packets received from the virtio RX queue, use the
buddy allocator instead of 'kmalloc()' so that such
pages can be inserted into a user-provided vma. The
single call to 'copy_from_iter()' is replaced with a
per-page loop.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 drivers/vhost/vsock.c | 49 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 37f0b4274113..157798985389 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -360,6 +360,9 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
 	struct iov_iter iov_iter;
 	size_t nbytes;
 	size_t len;
+	struct page *buf_page;
+	ssize_t pkt_len;
+	int page_idx;
 
 	if (in != 0) {
 		vq_err(vq, "Expected 0 input buffers, got %u\n", in);
@@ -393,20 +396,50 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
 		return NULL;
 	}
 
-	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
-	if (!pkt->buf) {
+	/* This creates memory overhead, as we allocate
+	 * at least one page for each packet.
+	 */
+	buf_page = alloc_pages(GFP_KERNEL, get_order(pkt->len));
+
+	if (buf_page == NULL) {
 		kfree(pkt);
 		return NULL;
 	}
 
+	pkt->buf = page_to_virt(buf_page);
 	pkt->buf_len = pkt->len;
 
-	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
-	if (nbytes != pkt->len) {
-		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
-		       pkt->len, nbytes);
-		virtio_transport_free_pkt(pkt);
-		return NULL;
+	page_idx = 0;
+	pkt_len = pkt->len;
+
+	/* The allocated pages may not be mapped (e.g. highmem),
+	 * so process them one by one via 'kmap()'.
+	 */
+	while (pkt_len > 0) {
+		void *mapped;
+		size_t to_copy;
+
+		mapped = kmap(buf_page + page_idx);
+
+		if (mapped == NULL) {
+			virtio_transport_free_pkt(pkt);
+			return NULL;
+		}
+
+		to_copy = min(pkt_len, ((ssize_t)PAGE_SIZE));
+
+		nbytes = copy_from_iter(mapped, to_copy, &iov_iter);
+		if (nbytes != to_copy) {
+			vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
+			       to_copy, nbytes);
+			virtio_transport_free_pkt(pkt);
+			return NULL;
+		}
+
+		kunmap(buf_page + page_idx);
+
+		pkt_len -= to_copy;
+		page_idx++;
 	}
 
 	return pkt;
-- 
2.25.1

* [RFC PATCH v1 3/8] af_vsock: add zerocopy receive logic
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
  2022-05-12  5:06 ` [RFC PATCH v1 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
  2022-05-12  5:09 ` [RFC PATCH v1 2/8] vhost/vsock: " Arseniy Krasnov
@ 2022-05-12  5:12 ` Arseniy Krasnov
  2022-05-12  5:14 ` [RFC PATCH v1 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:12 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Arseniy Krasnov, Krasnov Arseniy

This:
1) Adds a callback for the 'mmap()' call on a socket. It
   checks the vm area flags and sets the vm area ops.
2) Adds a special 'getsockopt()' case which calls the
   transport zerocopy callback. The input argument is the
   vm area address.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 include/net/af_vsock.h          |  4 +++
 include/uapi/linux/vm_sockets.h |  2 ++
 net/vmw_vsock/af_vsock.c        | 61 +++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index ab207677e0a8..d0aefb9ee4cf 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -135,6 +135,10 @@ struct vsock_transport {
 	bool (*stream_is_active)(struct vsock_sock *);
 	bool (*stream_allow)(u32 cid, u32 port);
 
+	int (*zerocopy_dequeue)(struct vsock_sock *vsk,
+				struct vm_area_struct *vma,
+				unsigned long addr);
+
 	/* SEQ_PACKET. */
 	ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
 				     int flags);
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index c60ca33eac59..62aec51a2bc3 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -83,6 +83,8 @@
 
 #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
 
+#define SO_VM_SOCKETS_ZEROCOPY 9
+
 #if !defined(__KERNEL__)
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
 #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 38baeb189d4e..3f98477ea546 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1652,6 +1652,42 @@ static int vsock_connectible_setsockopt(struct socket *sock,
 	return err;
 }
 
+static const struct vm_operations_struct afvsock_vm_ops = {
+};
+
+static int vsock_recv_zerocopy(struct socket *sock,
+			       unsigned long address)
+{
+	struct sock *sk = sock->sk;
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vm_area_struct *vma;
+	const struct vsock_transport *transport;
+	int res;
+
+	transport = vsk->transport;
+
+	if (!transport->zerocopy_dequeue)
+		return -EOPNOTSUPP;
+
+	lock_sock(sk);
+	mmap_write_lock(current->mm);
+
+	vma = vma_lookup(current->mm, address);
+
+	if (!vma || vma->vm_ops != &afvsock_vm_ops) {
+		mmap_write_unlock(current->mm);
+		release_sock(sk);
+		return -EINVAL;
+	}
+
+	res = transport->zerocopy_dequeue(vsk, vma, address);
+
+	mmap_write_unlock(current->mm);
+	release_sock(sk);
+
+	return res;
+}
+
 static int vsock_connectible_getsockopt(struct socket *sock,
 					int level, int optname,
 					char __user *optval,
@@ -1696,6 +1732,17 @@ static int vsock_connectible_getsockopt(struct socket *sock,
 		lv = sock_get_timeout(vsk->connect_timeout, &v,
 				      optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
 		break;
+	case SO_VM_SOCKETS_ZEROCOPY: {
+		unsigned long vma_addr;
+
+		if (len < sizeof(vma_addr))
+			return -EINVAL;
+
+		if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
+			return -EFAULT;
+
+		return vsock_recv_zerocopy(sock, vma_addr);
+	}
 
 	default:
 		return -ENOPROTOOPT;
@@ -2124,6 +2171,19 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	return err;
 }
 
+static int afvsock_mmap(struct file *file, struct socket *sock,
+			struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+		return -EPERM;
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= (VM_MIXEDMAP);
+	vma->vm_ops = &afvsock_vm_ops;
+
+	return 0;
+}
+
 static const struct proto_ops vsock_stream_ops = {
 	.family = PF_VSOCK,
 	.owner = THIS_MODULE,
@@ -2143,6 +2203,6 @@ static const struct proto_ops vsock_stream_ops = {
 	.recvmsg = vsock_connectible_recvmsg,
-	.mmap = sock_no_mmap,
+	.mmap = afvsock_mmap,
 	.sendpage = sock_no_sendpage,
 };
 
 static const struct proto_ops vsock_seqpacket_ops = {
-- 
2.25.1

* [RFC PATCH v1 4/8] virtio/vsock: add transport zerocopy callback
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (2 preceding siblings ...)
  2022-05-12  5:12 ` [RFC PATCH v1 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
@ 2022-05-12  5:14 ` Arseniy Krasnov
  2022-05-12  5:17 ` [RFC PATCH v1 5/8] vhost/vsock: enable " Arseniy Krasnov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:14 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds a transport callback which processes the rx
queue of a socket and, instead of copying data to a
user-provided buffer, inserts the data pages of each
packet into the user's vm area.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 include/linux/virtio_vsock.h            |   4 +
 include/uapi/linux/virtio_vsock.h       |   5 +
 net/vmw_vsock/virtio_transport_common.c | 195 +++++++++++++++++++++++-
 3 files changed, 201 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index d02cb7aa922f..47a68a2ea838 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -51,6 +51,7 @@ struct virtio_vsock_pkt {
 	bool reply;
 	bool tap_delivered;
 	bool slab_buf;
+	bool split;
 };
 
 struct virtio_vsock_pkt_info {
@@ -131,6 +132,9 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
 				struct sockaddr_vm *addr);
 bool virtio_transport_dgram_allow(u32 cid, u32 port);
 
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+				      struct vm_area_struct *vma,
+				      unsigned long addr);
 int virtio_transport_connect(struct vsock_sock *vsk);
 
 int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..214ac9727307 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -66,6 +66,11 @@ struct virtio_vsock_hdr {
 	__le32	fwd_cnt;
 } __attribute__((packed));
 
+struct virtio_vsock_usr_hdr {
+	__u32 flags;
+	__u32 len;
+} __attribute__((packed));
+
 enum virtio_vsock_type {
 	VIRTIO_VSOCK_TYPE_STREAM = 1,
 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 278567f748f2..3c7ac47a8672 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -12,6 +12,7 @@
 #include <linux/ctype.h>
 #include <linux/list.h>
 #include <linux/virtio_vsock.h>
+#include <linux/mm.h>
 #include <uapi/linux/vsockmon.h>
 
 #include <net/sock.h>
@@ -347,6 +348,183 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
 	return err;
 }
 
+#define MAX_PAGES_TO_MAP 256
+
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+				      struct vm_area_struct *vma,
+				      unsigned long addr)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct virtio_vsock_usr_hdr *usr_hdr_buffer;
+	unsigned long max_pages_to_insert;
+	unsigned long tmp_pages_inserted;
+	unsigned long pages_to_insert = 0;
+	struct page *usr_hdr_page;
+	unsigned long vma_size;
+	struct page **pages;
+	int max_vma_pages;
+	int max_usr_hdrs;
+	int res;
+	int err;
+	int i;
+
+	/* The mapping must start at the beginning of the vma. */
+	if (vma->vm_start != addr)
+		return -EFAULT;
+
+	vma_size = vma->vm_end - vma->vm_start;
+
+	/* The vma is too small: we need at least one page for
+	 * headers and one page for data.
+	 */
+	if (vma_size < 2 * PAGE_SIZE)
+		return -EFAULT;
+
+	/* Page for meta data. */
+	usr_hdr_page = alloc_page(GFP_KERNEL);
+
+	if (!usr_hdr_page)
+		return -EFAULT;
+
+	pages = kmalloc_array(MAX_PAGES_TO_MAP, sizeof(pages[0]), GFP_KERNEL);
+
+	if (!pages) {
+		__free_page(usr_hdr_page);
+		return -EFAULT;
+	}
+
+	pages[pages_to_insert++] = usr_hdr_page;
+
+	usr_hdr_buffer = page_to_virt(usr_hdr_page);
+
+	err = 0;
+
+	/* The first page is used for headers, so the number of pages
+	 * for the user is the min of the number of headers that fit
+	 * in the first page and the vma size in pages (excluding the
+	 * first page).
+	 */
+	max_usr_hdrs = PAGE_SIZE / sizeof(*usr_hdr_buffer);
+	max_vma_pages = (vma_size / PAGE_SIZE) - 1;
+	max_pages_to_insert = min(max_usr_hdrs, max_vma_pages);
+
+	if (max_pages_to_insert > MAX_PAGES_TO_MAP)
+		max_pages_to_insert = MAX_PAGES_TO_MAP;
+
+	spin_lock_bh(&vvs->rx_lock);
+
+	while (!list_empty(&vvs->rx_queue) &&
+	       pages_to_insert < max_pages_to_insert) {
+		struct virtio_vsock_pkt *pkt;
+		ssize_t rest_data_bytes;
+		size_t moved_data_bytes;
+		unsigned long pg_offs;
+
+		pkt = list_first_entry(&vvs->rx_queue,
+				       struct virtio_vsock_pkt, list);
+
+		/* This can happen when the packet was partially
+		 * dequeued earlier by an ordinary 'read()' call.
+		 * We can't handle such a packet; drop it.
+		 */
+		if (pkt->off % PAGE_SIZE) {
+			list_del(&pkt->list);
+			virtio_transport_dec_rx_pkt(vvs, pkt);
+			virtio_transport_free_pkt(pkt);
+			continue;
+		}
+
+		rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
+
+		/* For packets bigger than one page, split the
+		 * high-order buffer into 0-order pages; otherwise
+		 * 'vm_insert_pages()' will fail for all pages
+		 * except the first.
+		 */
+		if (rest_data_bytes > PAGE_SIZE) {
+			/* High order buffer not split yet. */
+			if (!pkt->split) {
+				split_page(virt_to_page(pkt->buf),
+					   get_order(le32_to_cpu(pkt->hdr.len)));
+				pkt->split = true;
+			}
+		}
+
+		pg_offs = pkt->off;
+		moved_data_bytes = 0;
+
+		while (rest_data_bytes &&
+		       pages_to_insert < max_pages_to_insert) {
+			struct page *buf_page;
+
+			buf_page = virt_to_page(pkt->buf + pg_offs);
+
+			pages[pages_to_insert++] = buf_page;
+			/* Take a reference to prevent this page from
+			 * being returned to the page allocator when
+			 * the packet is freed. Ref count will be 2.
+			 */
+			get_page(buf_page);
+			pg_offs += PAGE_SIZE;
+
+			if (rest_data_bytes >= PAGE_SIZE) {
+				moved_data_bytes += PAGE_SIZE;
+				rest_data_bytes -= PAGE_SIZE;
+			} else {
+				moved_data_bytes += rest_data_bytes;
+				rest_data_bytes = 0;
+			}
+		}
+
+		usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
+		usr_hdr_buffer->len = moved_data_bytes;
+		usr_hdr_buffer++;
+
+		pkt->off = pg_offs;
+
+		if (rest_data_bytes == 0) {
+			list_del(&pkt->list);
+			virtio_transport_dec_rx_pkt(vvs, pkt);
+			virtio_transport_free_pkt(pkt);
+		}
+
+		/* Now the ref count of every page of the packet is 1. */
+	}
+
+	/* Mark the last header empty (if there is room for one). */
+	if (pages_to_insert - 1 < max_usr_hdrs)
+		usr_hdr_buffer->len = 0;
+
+	spin_unlock_bh(&vvs->rx_lock);
+
+	tmp_pages_inserted = pages_to_insert;
+
+	res = vm_insert_pages(vma, addr, pages, &tmp_pages_inserted);
+
+	if (res || tmp_pages_inserted) {
+		/* Failed to insert some pages: the vma is only
+		 * partially mapped. Don't return yet; set the error
+		 * code, which will be returned to the user. The user
+		 * then needs 'madvise()/munmap()' to clear this vma.
+		 * References to all pages are dropped below anyway.
+		 */
+		err = -EFAULT;
+	}
+
+	/* Put reference for every page. */
+	for (i = 0; i < pages_to_insert; i++) {
+		/* Ref count is 2 ('get_page()' + 'vm_insert_pages()' above).
+		 * Put the reference once; the page will be returned to the
+		 * allocator after the user's 'madvise()/munmap()' call (or
+		 * it was never mapped, if 'vm_insert_pages()' failed).
+		 */
+		put_page(pages[i]);
+	}
+
+	virtio_transport_send_credit_update(vsk);
+	kfree(pages);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
+
 static ssize_t
 virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 				   struct msghdr *msg,
@@ -1344,10 +1522,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
 void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
 {
 	if (pkt->buf_len) {
-		if (pkt->slab_buf)
+		if (pkt->slab_buf) {
 			kfree(pkt->buf);
-		else
-			free_pages((unsigned long)pkt->buf, get_order(pkt->buf_len));
+		} else {
+			unsigned int order = get_order(pkt->buf_len);
+			unsigned long buf = (unsigned long)pkt->buf;
+
+			if (pkt->split) {
+				int i;
+
+				for (i = 0; i < (1 << order); i++)
+					free_page(buf + i * PAGE_SIZE);
+			} else {
+				free_pages(buf, order);
+			}
+		}
 	}
 
 	kfree(pkt);
-- 
2.25.1

* [RFC PATCH v1 5/8] vhost/vsock: enable zerocopy callback
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (3 preceding siblings ...)
  2022-05-12  5:14 ` [RFC PATCH v1 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
@ 2022-05-12  5:17 ` Arseniy Krasnov
  2022-05-12  5:18 ` [RFC PATCH v1 6/8] virtio/vsock: " Arseniy Krasnov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:17 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds the zerocopy callback to the vhost transport.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 drivers/vhost/vsock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 157798985389..93119d529fb0 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -484,6 +484,7 @@ static struct virtio_transport vhost_transport = {
 		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
 		.stream_is_active         = virtio_transport_stream_is_active,
 		.stream_allow             = virtio_transport_stream_allow,
+		.zerocopy_dequeue	  = virtio_transport_zerocopy_dequeue,
 
 		.seqpacket_dequeue        = virtio_transport_seqpacket_dequeue,
 		.seqpacket_enqueue        = virtio_transport_seqpacket_enqueue,
-- 
2.25.1

* [RFC PATCH v1 6/8] virtio/vsock: enable zerocopy callback
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (4 preceding siblings ...)
  2022-05-12  5:17 ` [RFC PATCH v1 5/8] vhost/vsock: enable " Arseniy Krasnov
@ 2022-05-12  5:18 ` Arseniy Krasnov
  2022-05-12  5:20 ` [RFC PATCH v1 7/8] test/vsock: add receive zerocopy tests Arseniy Krasnov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:18 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Arseniy Krasnov, Krasnov Arseniy

This adds the zerocopy callback to the virtio transport.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 net/vmw_vsock/virtio_transport.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 43b7b09b4a0a..ea0e1567cfa8 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -478,6 +478,7 @@ static struct virtio_transport virtio_transport = {
 		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
 		.stream_is_active         = virtio_transport_stream_is_active,
 		.stream_allow             = virtio_transport_stream_allow,
+		.zerocopy_dequeue	  = virtio_transport_zerocopy_dequeue,
 
 		.seqpacket_dequeue        = virtio_transport_seqpacket_dequeue,
 		.seqpacket_enqueue        = virtio_transport_seqpacket_enqueue,
-- 
2.25.1

* [RFC PATCH v1 7/8] test/vsock: add receive zerocopy tests
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (5 preceding siblings ...)
  2022-05-12  5:18 ` [RFC PATCH v1 6/8] virtio/vsock: " Arseniy Krasnov
@ 2022-05-12  5:20 ` Arseniy Krasnov
  2022-05-12  5:22 ` [RFC PATCH v1 8/8] test/vsock: vsock rx zerocopy utility Arseniy Krasnov
  2022-05-17 15:14 ` [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Stefano Garzarella
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:20 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Krasnov Arseniy, Arseniy Krasnov

This adds tests for the zerocopy feature: the first test
checks data transmission with simple integrity control;
the second covers the 'error' branches of the zerocopy
logic (invalid argument handling).

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 tools/include/uapi/linux/virtio_vsock.h |  10 +
 tools/include/uapi/linux/vm_sockets.h   |   7 +
 tools/testing/vsock/control.c           |  34 +++
 tools/testing/vsock/control.h           |   2 +
 tools/testing/vsock/vsock_test.c        | 284 ++++++++++++++++++++++++
 5 files changed, 337 insertions(+)
 create mode 100644 tools/include/uapi/linux/virtio_vsock.h
 create mode 100644 tools/include/uapi/linux/vm_sockets.h

diff --git a/tools/include/uapi/linux/virtio_vsock.h b/tools/include/uapi/linux/virtio_vsock.h
new file mode 100644
index 000000000000..df04e50b3dd7
--- /dev/null
+++ b/tools/include/uapi/linux/virtio_vsock.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
+#define _UAPI_LINUX_VIRTIO_VSOCK_H
+#include <linux/types.h>
+
+struct virtio_vsock_usr_hdr {
+	u32 flags;
+	u32 len;
+} __attribute__((packed));
+#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
diff --git a/tools/include/uapi/linux/vm_sockets.h b/tools/include/uapi/linux/vm_sockets.h
new file mode 100644
index 000000000000..001aaba03eea
--- /dev/null
+++ b/tools/include/uapi/linux/vm_sockets.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _UAPI_LINUX_VM_SOCKETS_H
+#define _UAPI_LINUX_VM_SOCKETS_H
+
+#define SO_VM_SOCKETS_ZEROCOPY 9
+
+#endif /* _UAPI_LINUX_VM_SOCKETS_H */
diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
index 4874872fc5a3..00a654e8f137 100644
--- a/tools/testing/vsock/control.c
+++ b/tools/testing/vsock/control.c
@@ -141,6 +141,40 @@ void control_writeln(const char *str)
 	timeout_end();
 }
 
+void control_writelong(long value)
+{
+	char str[32];
+
+	if (snprintf(str, sizeof(str), "%li", value) >= (int)sizeof(str)) {
+		fprintf(stderr, "control_writelong: value too long\n");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln(str);
+}
+
+long control_readlong(bool *ok)
+{
+	long value = -1;
+	char *str;
+
+	if (ok)
+		*ok = false;
+
+	str = control_readln();
+
+	if (str == NULL)
+		return value;
+
+	value = strtol(str, NULL, 10);
+	free(str);
+
+	if (ok)
+		*ok = true;
+
+	return value;
+}
+
 /* Return the next line from the control socket (without the trailing newline).
  *
  * The program terminates if a timeout occurs.
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
index 51814b4f9ac1..5272ad20e850 100644
--- a/tools/testing/vsock/control.h
+++ b/tools/testing/vsock/control.h
@@ -9,7 +9,9 @@ void control_init(const char *control_host, const char *control_port,
 void control_cleanup(void);
 void control_writeln(const char *str);
 char *control_readln(void);
+long control_readlong(bool *ok);
 void control_expectln(const char *str);
 bool control_cmpln(char *line, const char *str, bool fail);
+void control_writelong(long value);
 
 #endif /* CONTROL_H */
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index dc577461afc2..0d887c7d9474 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -18,11 +18,16 @@
 #include <sys/socket.h>
 #include <time.h>
 #include <sys/mman.h>
+#include <poll.h>
+#include <uapi/linux/virtio_vsock.h>
+#include <uapi/linux/vm_sockets.h>
 
 #include "timeout.h"
 #include "control.h"
 #include "util.h"
 
+#define PAGE_SIZE 4096
+
 static void test_stream_connection_reset(const struct test_opts *opts)
 {
 	union {
@@ -596,6 +601,274 @@ static void test_seqpacket_invalid_rec_buffer_server(const struct test_opts *opt
 	close(fd);
 }
 
+static void test_stream_zerocopy_rx_client(const struct test_opts *opts)
+{
+	unsigned long total_sum;
+	size_t rx_map_len;
+	long rec_value;
+	void *rx_va;
+	int fd;
+
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	rx_map_len = PAGE_SIZE * 3;
+
+	rx_va = mmap(NULL, rx_map_len, PROT_READ, MAP_SHARED, fd, 0);
+	if (rx_va == MAP_FAILED) {
+		perror("mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	total_sum = 0;
+
+	while (1) {
+		struct pollfd fds = { 0 };
+		int hungup = 0;
+		int res;
+
+		fds.fd = fd;
+		fds.events = POLLIN | POLLERR | POLLHUP |
+			     POLLRDHUP | POLLNVAL;
+
+		res = poll(&fds, 1, -1);
+
+		if (res < 0) {
+			perror("poll");
+			exit(EXIT_FAILURE);
+		}
+
+		if (fds.revents & POLLERR) {
+			perror("poll error");
+			exit(EXIT_FAILURE);
+		}
+
+		if (fds.revents & POLLIN) {
+			struct virtio_vsock_usr_hdr *hdr;
+			uintptr_t tmp_rx_va = (uintptr_t)rx_va;
+			unsigned char *data_va;
+			unsigned char *end_va;
+			socklen_t len = sizeof(tmp_rx_va);
+
+			if (getsockopt(fd, AF_VSOCK,
+					SO_VM_SOCKETS_ZEROCOPY,
+					&tmp_rx_va, &len) < 0) {
+				perror("getsockopt");
+				exit(EXIT_FAILURE);
+			}
+
+			hdr = (struct virtio_vsock_usr_hdr *)rx_va;
+			/* Skip headers page for data. */
+			data_va = rx_va + PAGE_SIZE;
+			end_va = (unsigned char *)(tmp_rx_va + rx_map_len);
+
+			while (data_va != end_va) {
+				int data_len = hdr->len;
+
+				if (!hdr->len) {
+					if (fds.revents & (POLLHUP | POLLRDHUP)) {
+						if (hdr == rx_va)
+							hungup = 1;
+					}
+
+					break;
+				}
+
+				while (data_len > 0) {
+					int i;
+					int to_read = (data_len < PAGE_SIZE) ?
+							data_len : PAGE_SIZE;
+
+					for (i = 0; i < to_read; i++)
+						total_sum += data_va[i];
+
+					data_va += PAGE_SIZE;
+					data_len -= PAGE_SIZE;
+				}
+
+				hdr++;
+			}
+
+			if (madvise((void *)rx_va, rx_map_len,
+					MADV_DONTNEED)) {
+				perror("madvise");
+				exit(EXIT_FAILURE);
+			}
+
+			if (hungup)
+				break;
+		}
+	}
+
+	if (munmap(rx_va, rx_map_len)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	rec_value = control_readlong(NULL);
+
+	if (total_sum != (unsigned long)rec_value) {
+		fprintf(stderr, "sum mismatch %lu != %ld\n",
+				total_sum, rec_value);
+		exit(EXIT_FAILURE);
+	}
+
+	close(fd);
+}
+
+static void test_stream_zerocopy_rx_server(const struct test_opts *opts)
+{
+	size_t max_buf_size = 40000;
+	long total_sum = 0;
+	int n = 10;
+	int fd;
+
+	fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+	if (fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	while (n) {
+		unsigned char *data;
+		size_t buf_size;
+		int i;
+
+		buf_size = 1 + rand() % max_buf_size;
+
+		data = malloc(buf_size);
+
+		if (!data) {
+			perror("malloc");
+			exit(EXIT_FAILURE);
+		}
+
+		for (i = 0; i < buf_size; i++) {
+			data[i] = rand() & 0xff;
+			total_sum += data[i];
+		}
+
+		if (write(fd, data, buf_size) != buf_size) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+
+		free(data);
+		n--;
+	}
+
+	control_writelong(total_sum);
+
+	close(fd);
+}
+
+static void test_stream_zerocopy_rx_inv_client(const struct test_opts *opts)
+{
+	size_t map_size = PAGE_SIZE * 5;
+	socklen_t len;
+	void *map_va;
+	int fd;
+
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	len = sizeof(map_va);
+	map_va = 0;
+
+	/* Try zerocopy with invalid mapping address. */
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with valid, but not socket mapping. */
+	map_va = mmap(NULL, map_size, PROT_READ,
+		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (map_va == MAP_FAILED) {
+		perror("anon mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (munmap(map_va, map_size)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with valid, but too small mapping. */
+	map_va = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+	if (map_va == MAP_FAILED) {
+		perror("socket mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (munmap(map_va, PAGE_SIZE)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Try zerocopy with valid mapping, but not from first byte. */
+	map_va = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (map_va == MAP_FAILED) {
+		perror("socket mmap");
+		exit(EXIT_FAILURE);
+	}
+
+	map_va += PAGE_SIZE;
+
+	if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+			&map_va, &len) == 0) {
+		perror("getsockopt");
+		exit(EXIT_FAILURE);
+	}
+
+	if (munmap(map_va - PAGE_SIZE, map_size)) {
+		perror("munmap");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln("DONE");
+
+	close(fd);
+}
+
+static void test_stream_zerocopy_rx_inv_server(const struct test_opts *opts)
+{
+	int fd;
+
+	fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+
+	if (fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	control_expectln("DONE");
+
+	close(fd);
+}
+
 static struct test_case test_cases[] = {
 	{
 		.name = "SOCK_STREAM connection reset",
@@ -646,6 +919,16 @@ static struct test_case test_cases[] = {
 		.run_client = test_seqpacket_invalid_rec_buffer_client,
 		.run_server = test_seqpacket_invalid_rec_buffer_server,
 	},
+	{
+		.name = "SOCK_STREAM zerocopy receive",
+		.run_client = test_stream_zerocopy_rx_client,
+		.run_server = test_stream_zerocopy_rx_server,
+	},
+	{
+		.name = "SOCK_STREAM zerocopy invalid",
+		.run_client = test_stream_zerocopy_rx_inv_client,
+		.run_server = test_stream_zerocopy_rx_inv_server,
+	},
 	{},
 };
 
@@ -729,6 +1012,7 @@ int main(int argc, char **argv)
 		.peer_cid = VMADDR_CID_ANY,
 	};
 
+	srand(time(NULL));
 	init_signals();
 
 	for (;;) {
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v1 8/8] test/vsock: vsock rx zerocopy utility
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (6 preceding siblings ...)
  2022-05-12  5:20 ` [RFC PATCH v1 7/8] test/vsock: add receive zerocopy tests Arseniy Krasnov
@ 2022-05-12  5:22 ` Arseniy Krasnov
  2022-05-17 15:14 ` [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Stefano Garzarella
  8 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-12  5:22 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-kernel, kvm, virtualization, netdev, kernel,
	Arseniy Krasnov, Krasnov Arseniy

This adds a simple utility for benchmarking zerocopy receive.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 tools/testing/vsock/Makefile      |   1 +
 tools/testing/vsock/rx_zerocopy.c | 356 ++++++++++++++++++++++++++++++
 2 files changed, 357 insertions(+)
 create mode 100644 tools/testing/vsock/rx_zerocopy.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index f8293c6910c9..2cb5820ca2f3 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -3,6 +3,7 @@ all: test
 test: vsock_test vsock_diag_test
 vsock_test: vsock_test.o timeout.o control.o util.o
 vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
+rx_zerocopy: rx_zerocopy.o timeout.o control.o util.o
 
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
diff --git a/tools/testing/vsock/rx_zerocopy.c b/tools/testing/vsock/rx_zerocopy.c
new file mode 100644
index 000000000000..4db1a7f3a1af
--- /dev/null
+++ b/tools/testing/vsock/rx_zerocopy.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * rx_zerocopy - benchmark utility for zerocopy
+ * receive.
+ *
+ * Copyright (C) 2022 SberDevices.
+ *
+ * Author: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
+ */
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <time.h>
+#include <sys/mman.h>
+#include <stdint.h>
+#include <poll.h>
+#include <uapi/linux/virtio_vsock.h>
+#include <uapi/linux/vm_sockets.h>
+
+#include "util.h"
+
+#define PAGE_SIZE		4096
+
+#define DEFAULT_TX_SIZE		128
+#define DEFAULT_RX_SIZE		128
+#define DEFAULT_PORT		1234
+
+static int client_mode = 1;
+static int peer_cid = -1;
+static int port = DEFAULT_PORT;
+static unsigned long tx_buf_size;
+static unsigned long rx_buf_size;
+static unsigned long mb_to_send = 40;
+
+static uint64_t current_nsec(void)
+{
+	struct timespec ts;
+
+	if (clock_gettime(CLOCK_REALTIME, &ts)) {
+		perror("clock_gettime");
+		exit(EXIT_FAILURE);
+	}
+
+	return (ts.tv_sec * 1000000000ULL) + ts.tv_nsec;
+}
+
+/* Server accepts a connection and sends the requested amount of random data. */
+static void run_server(void)
+{
+	int fd;
+	char *data;
+	int client_fd;
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} clientaddr;
+
+	socklen_t clientaddr_len = sizeof(clientaddr.svm);
+	uint64_t tx_begin_ns;
+	ssize_t total_send = 0;
+	unsigned long sum;
+
+	fprintf(stderr, "Running server, listen %i, mb %lu tx buf %lu\n",
+			port, mb_to_send, tx_buf_size);
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
+
+	if (client_fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	data = malloc(tx_buf_size);
+
+	if (data == NULL) {
+		fprintf(stderr, "malloc failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	sum = 0;
+	tx_begin_ns = current_nsec();
+
+	while (1) {
+		int i;
+		ssize_t sent;
+
+		if (total_send > mb_to_send * 1024 * 1024ULL)
+			break;
+
+		for (i = 0; i < tx_buf_size; i++) {
+			data[i] = rand() & 0xff;
+			sum += data[i];
+		}
+
+		sent = write(client_fd, data, tx_buf_size);
+
+		if (sent <= 0) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+
+		total_send += sent;
+	}
+
+	free(data);
+
+	fprintf(stderr, "Total %lu MB, time %f\n", mb_to_send,
+			(float)(current_nsec() - tx_begin_ns)/1000.0/1000.0/1000.0);
+
+	close(fd);
+	close(client_fd);
+}
+
+static void run_client(int zerocopy)
+{
+	int fd;
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = peer_cid,
+		},
+	};
+	unsigned long sum = 0;
+	void *rx_va = NULL;
+
+	printf("Running client, %s mode, peer %i:%i, rx buf %lu\n",
+		zerocopy ? "zerocopy" : "copy", peer_cid, port,
+		rx_buf_size);
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	if (connect(fd, &addr.sa, sizeof(addr.svm))) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	if (zerocopy) {
+		rx_va = mmap(NULL, rx_buf_size,
+				PROT_READ, MAP_SHARED, fd, 0);
+
+		if (rx_va == MAP_FAILED) {
+			perror("mmap");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	while (1) {
+		struct pollfd fds = { 0 };
+		int done = 0;
+
+		fds.fd = fd;
+		fds.events = POLLIN | POLLERR | POLLHUP |
+			     POLLRDHUP | POLLNVAL;
+
+		if (poll(&fds, 1, -1) < 0) {
+			perror("poll");
+			exit(EXIT_FAILURE);
+		}
+
+		if (fds.revents & (POLLHUP | POLLRDHUP))
+			done = 1;
+
+		if (fds.revents & POLLERR) {
+			fprintf(stderr, "Done error\n");
+			break;
+		}
+
+		if (fds.revents & POLLIN) {
+			if (zerocopy) {
+				struct virtio_vsock_usr_hdr *hdr;
+				uintptr_t tmp_rx_va = (uintptr_t)rx_va;
+				socklen_t len = sizeof(tmp_rx_va);
+
+				if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+						&tmp_rx_va, &len) < 0) {
+					perror("getsockopt");
+					exit(EXIT_FAILURE);
+				}
+
+				hdr = (struct virtio_vsock_usr_hdr *)tmp_rx_va;
+
+				if (!hdr->len) {
+					if (done) {
+						fprintf(stderr, "Done, sum %lu\n", sum);
+						break;
+					}
+				}
+
+				tmp_rx_va += PAGE_SIZE;
+
+				if (madvise((void *)rx_va, rx_buf_size,
+						MADV_DONTNEED)) {
+					perror("madvise");
+					exit(EXIT_FAILURE);
+				}
+			} else {
+				char data[rx_buf_size - PAGE_SIZE];
+				ssize_t bytes_read;
+
+				bytes_read = read(fd, data, sizeof(data));
+
+				if (bytes_read <= 0)
+					break;
+			}
+		}
+	}
+}
+
+static const char optstring[] = "";
+static const struct option longopts[] = {
+	{
+		.name = "mode",
+		.has_arg = required_argument,
+		.val = 'm',
+	},
+	{
+		.name = "zerocopy",
+		.has_arg = no_argument,
+		.val = 'z',
+	},
+	{
+		.name = "cid",
+		.has_arg = required_argument,
+		.val = 'c',
+	},
+	{
+		.name = "port",
+		.has_arg = required_argument,
+		.val = 'p',
+	},
+	{
+		.name = "mb",
+		.has_arg = required_argument,
+		.val = 's',
+	},
+	{
+		.name = "tx",
+		.has_arg = required_argument,
+		.val = 't',
+	},
+	{
+		.name = "rx",
+		.has_arg = required_argument,
+		.val = 'r',
+	},
+	{
+		.name = "help",
+		.has_arg = no_argument,
+		.val = '?',
+	},
+	{},
+};
+
+int main(int argc, char **argv)
+{
+	int zerocopy = 0;
+
+	for (;;) {
+		int opt = getopt_long(argc, argv, optstring, longopts, NULL);
+
+		if (opt == -1)
+			break;
+
+		switch (opt) {
+		case 's':
+			mb_to_send = atoi(optarg);
+			break;
+		case 'c':
+			peer_cid = atoi(optarg);
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		case 'r':
+			rx_buf_size = atoi(optarg);
+			break;
+		case 't':
+			tx_buf_size = atoi(optarg);
+			break;
+		case 'm':
+			if (strcmp(optarg, "client") == 0)
+				client_mode = 1;
+			else if (strcmp(optarg, "server") == 0)
+				client_mode = 0;
+			else {
+				fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
+				return EXIT_FAILURE;
+			}
+			break;
+		case 'z':
+			zerocopy = 1;
+			break;
+		default:
+			break;
+		}
+
+	}
+
+	if (!tx_buf_size)
+		tx_buf_size = DEFAULT_TX_SIZE;
+
+	if (!rx_buf_size)
+		rx_buf_size = DEFAULT_RX_SIZE;
+
+	tx_buf_size *= PAGE_SIZE;
+	rx_buf_size *= PAGE_SIZE;
+
+	srand(time(NULL));
+
+	if (client_mode)
+		run_client(zerocopy);
+	else
+		run_server();
+
+	return 0;
+}
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
  2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
                   ` (7 preceding siblings ...)
  2022-05-12  5:22 ` [RFC PATCH v1 8/8] test/vsock: vsock rx zerocopy utility Arseniy Krasnov
@ 2022-05-17 15:14 ` Stefano Garzarella
  2022-05-18 11:04   ` Arseniy Krasnov
  8 siblings, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2022-05-17 15:14 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel

Hi Arseniy,

On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>                              INTRODUCTION
>
>	Hello, this is an experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>the same idea: call 'mmap()' on the socket's descriptor, then every
>'getsockopt()' will fill the provided vma area with pages of virtio RX buffers.
>After the received data has been processed by the user, the pages must be freed
>by an 'madvise()' call with the MADV_DONTNEED flag set (if the user doesn't
>call 'madvise()', the next 'getsockopt()' will fail).

Sounds cool, but maybe we would need some socket/net experts here for 
review.

Could we do something similar for the sending path as well?

>
>                                 DETAILS
>
>	Here is how mapping with mapped pages looks exactly: first page mapping
>contains an array of trimmed virtio vsock packet headers (each contains only
>the length of data on the corresponding page and a 'flags' field):
>
>	struct virtio_vsock_usr_hdr {
>		uint32_t length;
>		uint32_t flags;
>	};
>
>Field 'length' allows the user to know the exact size of the payload within
>each sequence of pages, and 'flags' allows the user to handle SOCK_SEQPACKET
>flags (such as message bounds or record bounds). All other pages are data pages
>from the RX queue.
>
>             Page 0      Page 1      Page N
>
>	[ hdr1 .. hdrN ][ data ] .. [ data ]
>           |        |       ^           ^
>           |        |       |           |
>           |        *-------------------*
>           |                |
>           |                |
>           *----------------*
>
>	Of course, a single header can represent an array of pages (when the
>packet's buffer is bigger than one page). So here is an example of a detailed
>mapping layout for some set of packets. Let's consider that we have the
>following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages:
>0, 1, 2, 3, 4 and 5 will be inserted into the user's vma (the vma is large
>enough).
>
>	Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>	Page 1: [ 56 ]
>	Page 2: [ 4096 ]
>	Page 3: [ 4096 ]
>	Page 4: [ 4096 ]
>	Page 5: [ 8 ]
>
>	Page 0 contains only array of headers:
>	'hdr0' has 56 in length field.
>	'hdr1' has 4096 in length field.
>	'hdr2' has 8200 in length field.
>	'hdr3' has 0 in length field(this is end of data marker).
>
>	Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>	Page 2 corresponds to 'hdr1' and filled with data.
>	Page 3 corresponds to 'hdr2' and filled with data.
>	Page 4 corresponds to 'hdr2' and filled with data.
>	Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
>	This patchset also changes the packet allocation way: the current
>implementation uses only 'kmalloc()' to create the data buffer. A problem
>happens when we try to map such buffers to the user's vma - the kernel forbids
>mapping slab pages to a user's vma (as pages of "not large" 'kmalloc()'
>allocations are marked with the PageSlab flag, and "not large" can be bigger
>than one page). So to avoid this, data buffers are now allocated using the
>'alloc_pages()' call.
>
>                                   TESTS
>
>	This patchset updates 'vsock_test' utility: two tests for new feature
>were added. First test covers invalid cases. Second checks valid transmission
>case.

Thanks for adding the test!

>
>                                BENCHMARKING
>
>	For benchmarking I've added a small utility 'rx_zerocopy'. It works in
>client/server mode. When the client connects to the server, the server starts
>sending the exact amount of data to the client (the amount is set as an input
>argument). The client reads the data and waits for the next portion of it. The
>client works in two modes: copy and zero-copy. In copy mode the client uses the
>'read()' call, while in zerocopy mode the sequence 'mmap()'/'getsockopt()'/
>'madvise()' is used. A smaller amount of time for transmission is better. For
>the server, we can set the size of the tx buffer, and for the client we can set
>the size of the rx buffer or rx mapping size (in zerocopy mode). Usage of this
>utility is quite simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets number of megabytes to transfer.
>[--rx] sets size of receive buffer/mapping in pages.
>[--tx] sets size of transmit buffer in pages.
>
>I checked for transmission of 4000mb of data. Here are some results:
>
>                           size of rx/tx buffers in pages
>               *---------------------------------------------------*
>               |    8   |    32    |    64   |   256    |   512    |
>*--------------*--------*----------*---------*----------*----------*
>|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>*--------------*---------------------------------------------------- process
>| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>*--------------*----------------------------------------------------
>
>I think the results are not so impressive, but at least it is not worse than
>copy mode, and there is no need to allocate memory for processing data.

Why is it twice as slow in the first column?

>
>                                 PROBLEMS
>
>	The updated packet allocation logic creates a problem: when the host
>gets data from the guest (in vhost-vsock), it allocates at least one page for
>each packet (even if the packet has a 1 byte payload). I think this could be
>resolved in several ways:

Can we somehow copy the incoming packets into the payload of the already 
queued packet?

This reminds me that we have yet to fix a similar problem with kmalloc() 
as well...

https://bugzilla.kernel.org/show_bug.cgi?id=215329

>	1) Make zerocopy rx mode disabled by default, so if the user didn't
>enable it, the current 'kmalloc()' way will be used.

That sounds reasonable to me, I guess also TCP needs a setsockopt() call 
to enable the feature, right?

>	2) Use 'kmalloc()' for "small" packets, otherwise call the page
>allocator. But in this case we have a mix of packets allocated in two different
>ways, so during zerocopying to the user (e.g. mapping pages to the vma), such
>small packets will be handled in a clumsy way: we need to allocate one page for
>the user, copy the data to it and then insert the page into the user's vma.

It seems more difficult to me, but at the same time doable. I would lean 
more toward option 1, though.

>
>P.S: of course this is an experimental RFC, so what do you think, guys?

It seems cool :-)

But I would like some feedback from the net guys to have some TCP-like 
things.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
  2022-05-17 15:14 ` [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Stefano Garzarella
@ 2022-05-18 11:04   ` Arseniy Krasnov
  2022-05-19  7:42     ` Stefano Garzarella
  0 siblings, 1 reply; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-18 11:04 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel

Hello Stefano,

On 17.05.2022 18:14, Stefano Garzarella wrote:
> Hi Arseniy,
> 
> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>                              INTRODUCTION
>>
>>     Hello, this is an experimental implementation of virtio vsock zerocopy
>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API
>> uses the same idea: call 'mmap()' on the socket's descriptor, then every
>> 'getsockopt()' will fill the provided vma area with pages of virtio RX
>> buffers. After the received data has been processed by the user, the pages
>> must be freed by an 'madvise()' call with the MADV_DONTNEED flag set (if the
>> user doesn't call 'madvise()', the next 'getsockopt()' will fail).
> 
> Sounds cool, but maybe we would need some socket/net experts here for review.

Yes, that would be great

> 
> Could we do something similar for the sending path as well?

Here are my thoughts about zerocopy transmission:

I tried to implement this feature in the following way: the user creates
a page-aligned buffer, then during tx packet allocation, instead of
creating the data buffer with 'kmalloc()', I tried to add the user's buffer
to the virtio queue directly. But I found a problem: the kernel virtio API
takes virtual addresses when adding new buffers, and deep inside the virtio
subsystem 'virt_to_phys()' is called to get the physical address of the
buffer, so a user virtual address won't be translated correctly to a
physical address. In theory, I could walk the page tables for such a user
VA, get the physical address and pass some "fake" virtual address to the
virtio API so that 'virt_to_phys()' returns a valid physical address, but
I think this is ugly.
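
An alternative to the "fake virtual address" trick might be to pin the
user pages and queue them by page rather than by virtual address. A rough,
untested kernel-side sketch (the surrounding 'vq', 'pkt', 'uaddr' and
'nr_pages' names are placeholders, not from this patchset):

```c
/* Untested sketch: pin the user buffer and add it to the virtqueue as
 * pages, avoiding virt_to_phys() on a user virtual address entirely.
 * Error handling and bounds checks are omitted. */
struct page *pages[16];
struct scatterlist sg[16];
int i, npages;

npages = get_user_pages_fast(uaddr, nr_pages, 0, pages);
if (npages < 0)
	return npages;

sg_init_table(sg, npages);
for (i = 0; i < npages; i++)
	sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

/* pages stay pinned until tx completion, then put_page() each one */
virtqueue_add_outbuf(vq, sg, npages, pkt, GFP_KERNEL);
```

The cost here is page pinning on every send and a completion path that
must release the pages, which is exactly the bookkeeping MSG_ZEROCOPY
does for TCP.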


If we are talking about the 'mmap()' way, I think we can do the following:
the user calls 'mmap()' on the socket, and the kernel fills the newly
created mapping with allocated pages (all pages have rw permissions). Now
the user can use the pages of this mapping (e.g. fill them with data).
Finally, to start transmission, the user calls 'getsockopt()' or some
'ioctl()' and the kernel processes the data of this mapping. As this call
returns immediately (i.e. it is asynchronous), some completion logic must
be implemented. For example, use the same way as MSG_ZEROCOPY does - poll
the socket's error queue for a message that the pages may be reused - or
don't allow the user to work with these pages at all: unmap them, perform
the transmission and finally free the pages. To start a new transmission,
the user needs to call 'mmap()' again.
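
From user space, the mmap()-based scheme just described might look roughly
like this (everything here is hypothetical: the 'SO_VM_SOCKETS_ZEROCOPY_TX'
name and the trigger semantics are invented only to illustrate the idea):

```c
/* Hypothetical user-space flow for mmap()-based zerocopy tx; the
 * SO_VM_SOCKETS_ZEROCOPY_TX option does not exist in this patchset. */
size_t tx_len = 4 * PAGE_SIZE;
socklen_t len = sizeof(tx_len);
void *tx_va;

tx_va = mmap(NULL, tx_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (tx_va == MAP_FAILED)
	exit(EXIT_FAILURE);

memset(tx_va, 'x', tx_len);	/* user fills the pages with payload */

/* kick the kernel: transmission starts, call returns immediately */
if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY_TX, &tx_len, &len) < 0)
	exit(EXIT_FAILURE);

/* completion would then be signalled MSG_ZEROCOPY-style via the error
 * queue, or the mapping is simply torn down and mmap()'ed again */
```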

                            OR

I think there is another, unusual way to do zerocopy tx: let's use
'vmsplice()'/'splice()'. In this approach, to transmit something, the user
does the following steps:
1) Creates a pipe.
2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, inserting the data pages
   into it. SPLICE_F_GIFT allows the user to forget about the allocated
   pages - the kernel will free them.
3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
   will move the pages from the pipe to the socket (e.g. in a special
   socket callback we get the set of the pipe's pages as an input argument
   and all of them are inserted into the virtio queue).

But as SPLICE_F_MOVE support is disabled, it must be repaired first.
                                                                  
> 
>>
>>                                 DETAILS
>>
>>     Here is how mapping with mapped pages looks exactly: first page mapping
>> contains an array of trimmed virtio vsock packet headers (each contains only
>> the length of data on the corresponding page and a 'flags' field):
>>
>>     struct virtio_vsock_usr_hdr {
>>         uint32_t length;
>>         uint32_t flags;
>>     };
>>
>> Field 'length' allows the user to know the exact size of the payload
>> within each sequence of pages, and 'flags' allows the user to handle
>> SOCK_SEQPACKET flags (such as message bounds or record bounds). All other
>> pages are data pages from the RX queue.
>>
>>             Page 0      Page 1      Page N
>>
>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>           |        |       ^           ^
>>           |        |       |           |
>>           |        *-------------------*
>>           |                |
>>           |                |
>>           *----------------*
>>
>>     Of course, a single header can represent an array of pages (when the
>> packet's buffer is bigger than one page). So here is an example of a
>> detailed mapping layout for some set of packets. Let's consider that we
>> have the following sequence of packets: 56 bytes, 4096 bytes and 8200
>> bytes. All pages: 0, 1, 2, 3, 4 and 5 will be inserted into the user's
>> vma (the vma is large enough).
>>
>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>     Page 1: [ 56 ]
>>     Page 2: [ 4096 ]
>>     Page 3: [ 4096 ]
>>     Page 4: [ 4096 ]
>>     Page 5: [ 8 ]
>>
>>     Page 0 contains only array of headers:
>>     'hdr0' has 56 in length field.
>>     'hdr1' has 4096 in length field.
>>     'hdr2' has 8200 in length field.
>>     'hdr3' has 0 in length field(this is end of data marker).
>>
>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>     Page 2 corresponds to 'hdr1' and filled with data.
>>     Page 3 corresponds to 'hdr2' and filled with data.
>>     Page 4 corresponds to 'hdr2' and filled with data.
>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>
>>     This patchset also changes the packet allocation way: the current
>> implementation uses only 'kmalloc()' to create the data buffer. A problem
>> happens when we try to map such buffers to the user's vma - the kernel
>> forbids mapping slab pages to a user's vma (as pages of "not large"
>> 'kmalloc()' allocations are marked with the PageSlab flag, and "not large"
>> can be bigger than one page). So to avoid this, data buffers are now
>> allocated using the 'alloc_pages()' call.
>>
>>                                   TESTS
>>
>>     This patchset updates 'vsock_test' utility: two tests for new feature
>> were added. First test covers invalid cases. Second checks valid transmission
>> case.
> 
> Thanks for adding the test!
> 
>>
>>                                BENCHMARKING
>>
>>     For benchmarking I've added a small utility 'rx_zerocopy'. It works in
>> client/server mode. When the client connects to the server, the server
>> starts sending the exact amount of data to the client (the amount is set as
>> an input argument). The client reads the data and waits for the next
>> portion of it. The client works in two modes: copy and zero-copy. In copy
>> mode the client uses the 'read()' call, while in zerocopy mode the sequence
>> 'mmap()'/'getsockopt()'/'madvise()' is used. A smaller amount of time for
>> transmission is better. For the server, we can set the size of the tx
>> buffer, and for the client we can set the size of the rx buffer or rx
>> mapping size (in zerocopy mode). Usage of this utility is quite simple:
>>
>> For client mode:
>>
>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>
>> For server mode:
>>
>> ./rx_zerocopy --mode server [--mb] [--tx]
>>
>> [--mb] sets number of megabytes to transfer.
>> [--rx] sets size of receive buffer/mapping in pages.
>> [--tx] sets size of transmit buffer in pages.
>>
>> I measured the transmission of 4000 MB of data. Here are some results:
>>
>>                           size of rx/tx buffers in pages
>>               *---------------------------------------------------*
>>               |    8   |    32    |    64   |   256    |   512    |
>> *--------------*--------*----------*---------*----------*----------*
>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>> *--------------*--------*----------*---------*----------*----------* process
>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>> *--------------*--------*----------*---------*----------*----------*
>>
>> I think the results are not so impressive, but at least zerocopy is not worse
>> than copy mode, and there is no need to allocate memory for processing the data.
> 
> Why is it twice as slow in the first column?

Maybe this is because memory copying for small buffers is very fast... I'll
analyze it more deeply.

> 
>>
>>                                 PROBLEMS
>>
>>     The updated packet allocation logic creates a problem: when the host gets
>> data from the guest (in vhost-vsock), it allocates at least one page for each
>> packet (even if the packet has a 1-byte payload). I think this could be resolved
>> in several ways:
> 
> Can we somehow copy the incoming packets into the payload of the already queued packet?

Maybe, I'll analyze it...

> 
> This reminds me that we have yet to fix a similar problem with kmalloc() as well...
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=215329

Yes, but it is a slightly different case: IIUC that bug happens because 'kmalloc()'
serves allocations from memory chunks of preallocated size classes.

> 
>>     1) Make zerocopy rx mode disabled by default, so if the user didn't enable
>> it, the current 'kmalloc()' path will be used.
> 
> That sounds reasonable to me, I guess also TCP needs a setsockopt() call to enable the feature, right?

Yes, you're right. I think I'll add this to v2.

> 
>>     2) Use 'kmalloc()' for "small" packets and the page allocator otherwise.
>> But in this case we have a mix of packets allocated in two different ways, so
>> during zerocopy to the user (i.e. when mapping pages into the vma) such small
>> packets must be handled in a clumsy way: we need to allocate one page for the
>> user, copy the data to it and then insert the page into the user's vma.
> 
> It seems more difficult to me, but at the same time doable. I would go more on option 1, though.
> 
>>
>> P.S.: of course this is an experimental RFC, so what do you think, guys?
> 
> It seems cool :-)
> 
> But I would like some feedback from the net guys to have some TCP-like things.

Ok, I'll prepare v2 anyway: I need to analyze performance, maybe add more test
coverage, rebase over the latest kernel and work on the packet allocation problem
(from above).

> 
> Thanks,
> Stefano
> 

Thanks


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
  2022-05-18 11:04   ` Arseniy Krasnov
@ 2022-05-19  7:42     ` Stefano Garzarella
  2022-05-20 11:09       ` Arseniy Krasnov
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2022-05-19  7:42 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel

On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>Hello Stefano,
>
>On 17.05.2022 18:14, Stefano Garzarella wrote:
>> Hi Arseniy,
>>
>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>                              INTRODUCTION
>>>
>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>
>> Sounds cool, but maybe we would need some socket/net experts here for review.
>
>Yes, that would be great
>
>>
>> Could we do something similar for the sending path as well?
>
>Here are thoughts about zerocopy transmission:
>
>I tried to implement this feature in the following way: the user creates
>some page-aligned buffer, then during tx packet allocation, instead of
>creating the data buffer with 'kmalloc()', I tried to add the user's buffer
>to the virtio queue. But I found a problem: as the kernel virtio API uses
>virtual addresses to add new buffers, deep in the virtio subsystem
>'virt_to_phys()' is called to get the physical address of the buffer, so the
>user's virtual address won't be translated correctly to a physical address
>(in theory, I could perform a page walk for such a user va, get the physical
>address and pass some "fake" virtual address to the virtio API to make
>'virt_to_phys()' return a valid physical address, but I think this is ugly).

And maybe we should also pin the pages to prevent them from being 
replaced.

I think we should do something similar to what we do in vhost-vdpa.
Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c

>
>
>If we are talking about the 'mmap()' way, I think we can do the following:
>the user calls 'mmap()' on the socket, and the kernel fills the newly created
>mapping with allocated pages (all pages have rw permissions). Now the user can
>use the pages of this mapping (e.g. fill them with data). Finally, to start
>transmission, the user calls 'getsockopt()' or some 'ioctl()' and the kernel
>processes the data of this mapping. Also, as this call returns immediately
>(i.e. it is asynchronous), some completion logic must be implemented. For
>example, use the same way MSG_ZEROCOPY does - poll the socket's error queue
>for a message that the pages can be reused; or don't allow the user to work
>with these pages at all: unmap them, perform the transmission and finally
>free the pages. To start a new transmission the user needs to call 'mmap()'
>again.
>
>                            OR
>
>I think there is another, unusual way for zerocopy tx: let's use 'vmsplice()'
>/'splice()'. In this approach, to transmit something the user does the
>following steps:
>1) Creates a pipe.
>2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe to insert the data pages into
>   it. SPLICE_F_GIFT allows the user to forget about the allocated pages - the
>   kernel will free them.
>3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
>   will move pages from the pipe to the socket (i.e. in a special socket
>   callback we get the set of pipe pages as an input argument, and all pages
>   will be inserted into the virtio queue).
>
>But as SPLICE_F_MOVE support is currently disabled, it would have to be fixed
>first.

Splice seems interesting, but it would be nice if we did something 
similar to TCP. IIUC they use a flag for send(2):

     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

  
>
>>
>>>
>>>                                 DETAILS
>>>
>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>> of data on the corresponding page and 'flags' field):
>>>
>>>     struct virtio_vsock_usr_hdr {
>>>         uint32_t length;
>>>         uint32_t flags;
>>>     };
>>>
>>> Field  'length' allows user to know exact size of payload within each sequence
>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>> bounds or record bounds). All other pages are data pages from RX queue.
>>>
>>>             Page 0      Page 1      Page N
>>>
>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>           |        |       ^           ^
>>>           |        |       |           |
>>>           |        *-------------------*
>>>           |                |
>>>           |                |
>>>           *----------------*
>>>
>>>     Of course, single header could represent array of pages (when packet's
>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>> for some set of packages. Lets consider that we have the following sequence  of
>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>> be inserted to user's vma(vma is large enough).
>>>
>>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>     Page 1: [ 56 ]
>>>     Page 2: [ 4096 ]
>>>     Page 3: [ 4096 ]
>>>     Page 4: [ 4096 ]
>>>     Page 5: [ 8 ]
>>>
>>>     Page 0 contains only array of headers:
>>>     'hdr0' has 56 in length field.
>>>     'hdr1' has 4096 in length field.
>>>     'hdr2' has 8200 in length field.
>>>     'hdr3' has 0 in length field(this is end of data marker).
>>>
>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>     Page 2 corresponds to 'hdr1' and filled with data.
>>>     Page 3 corresponds to 'hdr2' and filled with data.
>>>     Page 4 corresponds to 'hdr2' and filled with data.
>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>
>>>     This patchset also changes packets allocation way: today implementation
>>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>>> "not large" could be bigger than one page). So to avoid this, data buffers now
>>> allocated using 'alloc_pages()' call.
>>>
>>>                                   TESTS
>>>
>>>     This patchset updates 'vsock_test' utility: two tests for new feature
>>> were added. First test covers invalid cases. Second checks valid transmission
>>> case.
>>
>> Thanks for adding the test!
>>
>>>
>>>                                BENCHMARKING
>>>
>>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>>> client/server mode. When client connects to server, server starts sending exact
>>> amount of data to client(amount is set as input argument).Client reads data and
>>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>>> is better. For server, we can set size of tx buffer and for client we can set
>>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>>> is quiet simple:
>>>
>>> For client mode:
>>>
>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>
>>> For server mode:
>>>
>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>
>>> [--mb] sets number of megabytes to transfer.
>>> [--rx] sets size of receive buffer/mapping in pages.
>>> [--tx] sets size of transmit buffer in pages.
>>>
>>> I checked for transmission of 4000mb of data. Here are some results:
>>>
>>>                           size of rx/tx buffers in pages
>>>               *---------------------------------------------------*
>>>               |    8   |    32    |    64   |   256    |   512    |
>>> *--------------*--------*----------*---------*----------*----------*
>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>> *--------------*---------------------------------------------------- process
>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>> *--------------*----------------------------------------------------
>>>
>>> I think, that results are not so impressive, but at least it is not worse than
>>> copy mode and there is no need to allocate memory for processing date.
>>
>> Why is it twice as slow in the first column?
>
>May be this is because memory copying for small buffers is very fast... i'll
>analyze it deeply.

Maybe I misunderstood - what do you mean by small buffers here?

I thought 8 was the number of pages, so 32KB buffers.

>
>>
>>>
>>>                                 PROBLEMS
>>>
>>>     Updated packet's allocation logic creates some problem: when host gets
>>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>>> (even if packet has 1 byte payload). I think this could be resolved in several
>>> ways:
>>
>> Can we somehow copy the incoming packets into the payload of the already queued packet?
>
>May be, i'll analyze it...

Thanks!

>
>>
>> This reminds me that we have yet to fix a similar problem with kmalloc() as well...
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=215329
>
>Yes, but it is a little bit different case: IIUC this bug happens because 'kmalloc()'
>uses memory chunks of some preallocated size.

Yep, I mean I think the problem is that when we receive 1-byte packets, 
we have all the headers queued up that we don't account for in the credit 
mechanism. A little bit different.

>
>>
>>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>>> it, current 'kmalloc()' way will be used.
>>
>> That sounds reasonable to me, I guess also TCP needs a setsockopt() call to enable the feature, right?
>
>Yes, You're right. I think i'll add this to v2.
>
>>
>>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>>> in this case, we have mix of packets, allocated in two different ways thus
>>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>>> be handled in some stupid way: we need to allocate one page for user, copy data
>>> to it and then insert page to user's vma.
>>
>> It seems more difficult to me, but at the same time doable. I would go more on option 1, though.
>>
>>>
>>> P.S: of course this is experimental RFC, so what do You think guys?
>>
>> It seems cool :-)
>>
>> But I would like some feedback from the net guys to have some TCP-like things.
>
>Ok, i'll prepare v2 anyway: i need to analyze performance, may be more test coverage, rebase
>over latest kernel and work on packet allocation problem(from above).

If you have time, it would be cool to modify some performance tool that 
already supports vsock to take advantage of this feature and take a 
closer look at the performance.

We currently have both iperf3 (I have a modified fork for vsock [1]) and 
uperf (they have merged upstream the vsock support).

Perhaps the easiest to tweak is iperf-vsock; it should already have a 
--zerocopy option.

Thanks,
Stefano

[1] https://github.com/stefano-garzarella/iperf-vsock



* Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
  2022-05-19  7:42     ` Stefano Garzarella
@ 2022-05-20 11:09       ` Arseniy Krasnov
  2022-05-24  7:32         ` Stefano Garzarella
  0 siblings, 1 reply; 15+ messages in thread
From: Arseniy Krasnov @ 2022-05-20 11:09 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel

Hello Stefano,

On 19.05.2022 10:42, Stefano Garzarella wrote:
> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>> Hello Stefano,
>>
>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>>
>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>                              INTRODUCTION
>>>>
>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>
>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>
>> Yes, that would be great
>>
>>>
>>> Could we do something similar for the sending path as well?
>>
>> Here are thoughts about zerocopy transmission:
>>
>> I tried to implement this feature in the following way: user creates
>> some page aligned buffer, then during tx packet allocation instead of
>> creating data buffer with 'kmalloc()', i tried to add user's buffer
>> to virtio queue. But found problem: as kernel virtio API uses virtual
>> addresses to add new buffers, in the deep of virtio subsystem
>> 'virt_to_phys()' is called to get physical address of buffer, so user's
>> virtual address won't be translated correctly to physical address(in
>> theory, i can perform page walk for such user's va, get physical address
>> and pass some "fake" virtual address to virtio API in order to make
>> 'virt_to_phys()' return valid physical address(but i think this is ugly).
> 
> And maybe we should also pin the pages to prevent them from being replaced.
> 
> I think we should do something similar to what we do in vhost-vdpa.
> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c

Hm, ok. I'll read about vdpa...

> 
>>
>>
>> If we are talking about 'mmap()' way, i think we can do the following:
>> user calls 'mmap()' on socket, kernel fills newly created mapping with
>> allocated pages(all pages have rw permissions). Now user can use pages
>> of this mapping(e.g. fill it with data). Finally, to start transmission,
>> user calls 'getsockopt()' or some 'ioctl()' and kernel processes data of
>> this mapping. Also as this call will return immediately(e.g. it is
>> asynchronous), some completion logic must be implemented. For example
>> use same way as MSG_ZEROCOPY uses - poll error queue of socket to get
>> message that pages could be reused, or don't allow user to work with
>> these pages: unmap it, perform transmission and finally free pages.
>> To start new transmission user need to call 'mmap()' again.
>>
>>                            OR
>>
>> I think there is another unusual way for zerocopy tx: let's use 'vmsplice()'
>> /'splice()'. In this approach to transmit something, user does the following
>> steps:
>> 1) Creates pipe.
>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, insert data pages to it.
>>   SPLICE_F_GIFT allows user to forget about allocated pages - kernel will
>>   free it.
>> 3) Calls 'splice(SPLICE_F_MOVE)' from pipe to socket. SPLICE_F_MOVE will
>>   move pages from pipe to socket(e.g. in special socket callback we got
>>   set of pipe's pages as input argument and all pages will be inserted
>>   to virtio queue).
>>
>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
> 
> Splice seems interesting, but it would be nice If we do something similar to TCP. IIUC they use a flag for send(2):
> 
>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
> 

Yes, but with this approach I think:
1) What is 'buf'? It can't be a user address, since this buffer must be inserted
   into the tx queue. I.e. it must be allocated by the kernel and then returned
   to the user for tx purposes. In the TCP case, 'buf' is a user address (of
   course page aligned) because the TCP logic uses sk_buff, which allows such
   memory to be used as a data buffer.
2) To wait until the tx process is done (i.e. the pages can be used again), this
   API (send + MSG_ZEROCOPY) uses the socket's error queue - the user polls for
   events signalling that tx is finished. The same mechanism must be implemented
   for virtio vsock.

>  
>>
>>>
>>>>
>>>>                                 DETAILS
>>>>
>>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>>> of data on the corresponding page and 'flags' field):
>>>>
>>>>     struct virtio_vsock_usr_hdr {
>>>>         uint32_t length;
>>>>         uint32_t flags;
>>>>     };
>>>>
>>>> Field  'length' allows user to know exact size of payload within each sequence
>>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>>> bounds or record bounds). All other pages are data pages from RX queue.
>>>>
>>>>             Page 0      Page 1      Page N
>>>>
>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>           |        |       ^           ^
>>>>           |        |       |           |
>>>>           |        *-------------------*
>>>>           |                |
>>>>           |                |
>>>>           *----------------*
>>>>
>>>>     Of course, single header could represent array of pages (when packet's
>>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>>> for some set of packages. Lets consider that we have the following sequence  of
>>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>>> be inserted to user's vma(vma is large enough).
>>>>
>>>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>>     Page 1: [ 56 ]
>>>>     Page 2: [ 4096 ]
>>>>     Page 3: [ 4096 ]
>>>>     Page 4: [ 4096 ]
>>>>     Page 5: [ 8 ]
>>>>
>>>>     Page 0 contains only array of headers:
>>>>     'hdr0' has 56 in length field.
>>>>     'hdr1' has 4096 in length field.
>>>>     'hdr2' has 8200 in length field.
>>>>     'hdr3' has 0 in length field(this is end of data marker).
>>>>
>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>     Page 2 corresponds to 'hdr1' and filled with data.
>>>>     Page 3 corresponds to 'hdr2' and filled with data.
>>>>     Page 4 corresponds to 'hdr2' and filled with data.
>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>>
>>>>     This patchset also changes packets allocation way: today implementation
>>>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>>>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>>>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>>>> "not large" could be bigger than one page). So to avoid this, data buffers now
>>>> allocated using 'alloc_pages()' call.
>>>>
>>>>                                   TESTS
>>>>
>>>>     This patchset updates 'vsock_test' utility: two tests for new feature
>>>> were added. First test covers invalid cases. Second checks valid transmission
>>>> case.
>>>
>>> Thanks for adding the test!
>>>
>>>>
>>>>                                BENCHMARKING
>>>>
>>>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>>>> client/server mode. When client connects to server, server starts sending exact
>>>> amount of data to client(amount is set as input argument).Client reads data and
>>>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>>>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>>>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>>>> is better. For server, we can set size of tx buffer and for client we can set
>>>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>>>> is quiet simple:
>>>>
>>>> For client mode:
>>>>
>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>
>>>> For server mode:
>>>>
>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>
>>>> [--mb] sets number of megabytes to transfer.
>>>> [--rx] sets size of receive buffer/mapping in pages.
>>>> [--tx] sets size of transmit buffer in pages.
>>>>
>>>> I checked for transmission of 4000mb of data. Here are some results:
>>>>
>>>>                           size of rx/tx buffers in pages
>>>>               *---------------------------------------------------*
>>>>               |    8   |    32    |    64   |   256    |   512    |
>>>> *--------------*--------*----------*---------*----------*----------*
>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>> *--------------*---------------------------------------------------- process
>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>>> *--------------*----------------------------------------------------
>>>>
>>>> I think, that results are not so impressive, but at least it is not worse than
>>>> copy mode and there is no need to allocate memory for processing date.
>>>
>>> Why is it twice as slow in the first column?
>>
>> May be this is because memory copying for small buffers is very fast... i'll
>> analyze it deeply.
> 
> Maybe I misunderstood, by small buffers here what do you mean?
> 
> I thought 8 was the number of pages, so 32KB buffers.

Yes, 8 is the size in pages. Anyway, I need to check it more deeply.

> 
>>
>>>
>>>>
>>>>                                 PROBLEMS
>>>>
>>>>     Updated packet's allocation logic creates some problem: when host gets
>>>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>>>> (even if packet has 1 byte payload). I think this could be resolved in several
>>>> ways:
>>>
>>> Can we somehow copy the incoming packets into the payload of the already queued packet?
>>
>> May be, i'll analyze it...
> 
> Thanks!
> 
>>
>>>
>>> This reminds me that we have yet to fix a similar problem with kmalloc() as well...
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215329
>>
>> Yes, but it is a little bit different case: IIUC this bug happens because 'kmalloc()'
>> uses memory chunks of some preallocated size.
> 
> Yep, I mean I think the problem is that when we receive 1-byte packets, we have all the header queued up that we don't consider in the credit mechanism. A little bit different.
> 
>>
>>>
>>>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>>>> it, current 'kmalloc()' way will be used.
>>>
>>> That sounds reasonable to me, I guess also TCP needs a setsockopt() call to enable the feature, right?
>>
>> Yes, You're right. I think i'll add this to v2.
>>
>>>
>>>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>>>> in this case, we have mix of packets, allocated in two different ways thus
>>>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>>>> be handled in some stupid way: we need to allocate one page for user, copy data
>>>> to it and then insert page to user's vma.
>>>
>>> It seems more difficult to me, but at the same time doable. I would go more on option 1, though.
>>>
>>>>
>>>> P.S: of course this is experimental RFC, so what do You think guys?
>>>
>>> It seems cool :-)
>>>
>>> But I would like some feedback from the net guys to have some TCP-like things.
>>
>> Ok, i'll prepare v2 anyway: i need to analyze performance, may be more test coverage, rebase
>> over latest kernel and work on packet allocation problem(from above).
> 
> If you have time, it would be cool to modify some performance tool that already supports vsock to take advantage of this feature and look better at performance.
> 
> We currently have both iperf3 (I have a modified fork for vsock [1]) and uperf (they have merged upstream the vsock support).
> 
> Perhaps the easiest to tweak is iperf-vsock, it should already have a --zerocopy option.

Ack

> 
> Thanks,
> Stefano
> 
> [1] https://github.com/stefano-garzarella/iperf-vsock
> 



* Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
  2022-05-20 11:09       ` Arseniy Krasnov
@ 2022-05-24  7:32         ` Stefano Garzarella
  2022-06-07 10:26           ` Arseniy Krasnov
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2022-05-24  7:32 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel

On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>Hello Stefano,
>
>On 19.05.2022 10:42, Stefano Garzarella wrote:
>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>> Hello Stefano,
>>>
>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>> Hi Arseniy,
>>>>
>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>                              INTRODUCTION
>>>>>
>>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>>
>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>
>>> Yes, that would be great
>>>
>>>>
>>>> Could we do something similar for the sending path as well?
>>>
>>> Here are thoughts about zerocopy transmission:
>>>
>>> I tried to implement this feature in the following way: user creates
>>> some page aligned buffer, then during tx packet allocation instead of
>>> creating data buffer with 'kmalloc()', i tried to add user's buffer
>>> to virtio queue. But found problem: as kernel virtio API uses virtual
>>> addresses to add new buffers, in the deep of virtio subsystem
>>> 'virt_to_phys()' is called to get physical address of buffer, so user's
>>> virtual address won't be translated correctly to physical address(in
>>> theory, i can perform page walk for such user's va, get physical address
>>> and pass some "fake" virtual address to virtio API in order to make
>>> 'virt_to_phys()' return valid physical address(but i think this is ugly).
>>
>> And maybe we should also pin the pages to prevent them from being replaced.
>>
>> I think we should do something similar to what we do in vhost-vdpa.
>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>
>Hm, ok. I'll read about vdpa...
>
>>
>>>
>>>
>>> If we are talking about 'mmap()' way, i think we can do the following:
>>> user calls 'mmap()' on socket, kernel fills newly created mapping with
>>> allocated pages(all pages have rw permissions). Now user can use pages
>>> of this mapping(e.g. fill it with data). Finally, to start transmission,
>>> user calls 'getsockopt()' or some 'ioctl()' and kernel processes data of
>>> this mapping. Also as this call will return immediately(e.g. it is
>>> asynchronous), some completion logic must be implemented. For example
>>> use same way as MSG_ZEROCOPY uses - poll error queue of socket to get
>>> message that pages could be reused, or don't allow user to work with
>>> these pages: unmap it, perform transmission and finally free pages.
>>> To start new transmission user need to call 'mmap()' again.
>>>
>>>                            OR
>>>
>>> I think there is another unusual way for zerocopy tx: let's use 'vmsplice()'
>>> /'splice()'. In this approach to transmit something, user does the following
>>> steps:
>>> 1) Creates pipe.
>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, insert data pages to it.
>>>   SPLICE_F_GIFT allows user to forget about allocated pages - kernel will
>>>   free it.
>>> 3) Calls 'splice(SPLICE_F_MOVE)' from pipe to socket. SPLICE_F_MOVE will
>>>   move pages from pipe to socket(e.g. in special socket callback we got
>>>   set of pipe's pages as input argument and all pages will be inserted
>>>   to virtio queue).
>>>
>>> But as SPLICE_F_MOVE support is disabled, it must be repaired first.
>>
>> Splice seems interesting, but it would be nice If we do something similar to TCP. IIUC they use a flag for send(2):
>>
>>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>
>
>Yes, but in this way i think:
>1) What is 'buf'? It can't be user's address, since this buffer must be inserted to tx queue.
>   E.g. it must be allocated by kernel and then returned to user for tx purposes. In TCP
>   case, 'buf' is user's address(of course page aligned) because TCP logic uses sk_buff which
>   allows to use such memory as data buffer.

IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the 
VQ.

>2) To wait until the tx process is done (i.e. the pages can be used again),
>   such an API (send + MSG_ZEROCOPY) uses the socket's error queue - polling
>   for events that signal that tx is finished. So the same mechanism must be
>   implemented for virtio vsock.

Yeah, I think so.

>
>>  
>>>
>>>>
>>>>>
>>>>>                                 DETAILS
>>>>>
>>>>>     Here is exactly how the mapping looks: the first page of the mapping
>>>>> contains an array of trimmed virtio vsock packet headers (each contains only
>>>>> the length of the data on the corresponding page and a 'flags' field):
>>>>>
>>>>>     struct virtio_vsock_usr_hdr {
>>>>>         uint32_t length;
>>>>>         uint32_t flags;
>>>>>     };
>>>>>
>>>>> The 'length' field lets the user know the exact size of the payload within
>>>>> each sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags
>>>>> (such as message bounds or record bounds). All other pages are data pages
>>>>> from the RX queue.
>>>>>
>>>>>             Page 0      Page 1      Page N
>>>>>
>>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>>           |        |       ^           ^
>>>>>           |        |       |           |
>>>>>           |        *-------------------*
>>>>>           |                |
>>>>>           |                |
>>>>>           *----------------*
>>>>>
>>>>>     Of course, a single header could represent an array of pages (when the
>>>>> packet's buffer is bigger than one page). So here is an example of a detailed
>>>>> mapping layout for some set of packets. Let's consider the following sequence
>>>>> of packets: 56 bytes, 4096 bytes and 8200 bytes. All pages 0,1,2,3,4 and 5
>>>>> will be inserted into the user's vma (the vma is large enough).
>>>>>
>>>>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>>>>     Page 1: [ 56 ]
>>>>>     Page 2: [ 4096 ]
>>>>>     Page 3: [ 4096 ]
>>>>>     Page 4: [ 4096 ]
>>>>>     Page 5: [ 8 ]
>>>>>
>>>>>     Page 0 contains only the array of headers:
>>>>>     'hdr0' has 56 in its length field.
>>>>>     'hdr1' has 4096 in its length field.
>>>>>     'hdr2' has 8200 in its length field.
>>>>>     'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>>>
>>>>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>>>>     Page 2 corresponds to 'hdr1' and is filled with data.
>>>>>     Page 3 corresponds to 'hdr2' and is filled with data.
>>>>>     Page 4 corresponds to 'hdr2' and is filled with data.
>>>>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
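For illustration, a consumer of this mapping could walk the header page until the zero-length end marker, charging each packet ceil(length / PAGE_SIZE) data pages; the layout above (56/4096/8200 bytes) yields 5 data pages. A hypothetical userspace sketch (the struct is from this RFC, the function name is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
};

/* Walks the header array on page 0 until the zero-length end marker
 * and returns the total number of data pages the mapping holds;
 * *npkts gets the number of packets seen. */
size_t count_data_pages(const struct virtio_vsock_usr_hdr *hdrs,
			size_t max_hdrs, size_t *npkts)
{
	size_t pages = 0, i;

	for (i = 0; i < max_hdrs && hdrs[i].length != 0; i++)
		pages += (hdrs[i].length + PAGE_SIZE - 1) / PAGE_SIZE;

	*npkts = i;
	return pages;
}
```

The caller would then find packet k's payload starting at page 1 plus the page counts of the preceding headers.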
>>>>>
>>>>>     This patchset also changes the packet allocation scheme: the current
>>>>> implementation uses only 'kmalloc()' to create data buffers. A problem occurs
>>>>> when we try to map such buffers into the user's vma - the kernel forbids
>>>>> mapping slab pages into a user vma (pages of "not large" 'kmalloc()'
>>>>> allocations are marked with the PageSlab flag, and "not large" can still be
>>>>> bigger than one page). So, to avoid this, data buffers are now allocated
>>>>> with 'alloc_pages()'.
>>>>>
>>>>>                                   TESTS
>>>>>
>>>>>     This patchset also updates the 'vsock_test' utility: two tests for the
>>>>> new feature were added. The first test covers invalid cases; the second
>>>>> checks a valid transmission case.
>>>>
>>>> Thanks for adding the test!
>>>>
>>>>>
>>>>>                                BENCHMARKING
>>>>>
>>>>>     For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
>>>>> client/server mode. When a client connects to the server, the server starts
>>>>> sending an exact amount of data to the client (the amount is set as an input
>>>>> argument). The client reads the data and waits for the next portion of it.
>>>>> The client works in two modes: copy and zero-copy. In copy mode the client
>>>>> uses the 'read()' call, while in zerocopy mode the sequence 'mmap()'/
>>>>> 'getsockopt()'/'madvise()' is used. A smaller transmission time is better.
>>>>> For the server we can set the size of the tx buffer, and for the client the
>>>>> size of the rx buffer or the rx mapping size (in zerocopy mode). Usage of
>>>>> this utility is quite simple:
>>>>>
>>>>> For client mode:
>>>>>
>>>>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>>>>
>>>>> For server mode:
>>>>>
>>>>> ./rx_zerocopy --mode server [--mb] [--tx]
>>>>>
>>>>> [--mb] sets number of megabytes to transfer.
>>>>> [--rx] sets size of receive buffer/mapping in pages.
>>>>> [--tx] sets size of transmit buffer in pages.
>>>>>
>>>>> I checked transmission of 4000 MB of data. Here are some results:
>>>>>
>>>>>                           size of rx/tx buffers in pages
>>>>>               *---------------------------------------------------*
>>>>>               |    8   |    32    |    64   |   256    |   512    |
>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>>> *--------------*--------*----------*---------*----------*----------* process
>>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>
>>>>> I think the results are not so impressive, but at least zerocopy is not worse
>>>>> than copy mode, and there is no need to allocate memory for processing the data.
>>>>
>>>> Why is it twice as slow in the first column?
>>>
>>> Maybe this is because memory copying of small buffers is very fast... I'll
>>> analyze it more deeply.
>>
>> Maybe I misunderstood: what do you mean by small buffers here?
>>
>> I thought 8 was the number of pages, so 32KB buffers.
>
>Yes, 8 is the size in pages. Anyway, I need to check it more deeply.

Okay, thanks!

Stefano


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive
  2022-05-24  7:32         ` Stefano Garzarella
@ 2022-06-07 10:26           ` Arseniy Krasnov
  0 siblings, 0 replies; 15+ messages in thread
From: Arseniy Krasnov @ 2022-06-07 10:26 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Paolo Abeni, linux-kernel, kvm, virtualization,
	netdev, kernel

On 24.05.2022 10:32, Stefano Garzarella wrote:
> On Fri, May 20, 2022 at 11:09:11AM +0000, Arseniy Krasnov wrote:
>> Hello Stefano,
>>
>> On 19.05.2022 10:42, Stefano Garzarella wrote:
>>> On Wed, May 18, 2022 at 11:04:30AM +0000, Arseniy Krasnov wrote:
>>>> Hello Stefano,
>>>>
>>>> On 17.05.2022 18:14, Stefano Garzarella wrote:
>>>>> Hi Arseniy,
>>>>>
>>>>> On Thu, May 12, 2022 at 05:04:11AM +0000, Arseniy Krasnov wrote:
>>>>>>                              INTRODUCTION
>>>>>>
>>>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>>>
>>>>> Sounds cool, but maybe we would need some socket/net experts here for review.
>>>>
>>>> Yes, that would be great
>>>>
>>>>>
>>>>> Could we do something similar for the sending path as well?
>>>>
>>>> Here are thoughts about zerocopy transmission:
>>>>
>>>> I tried to implement this feature in the following way: the user creates
>>>> a page-aligned buffer, then during tx packet allocation, instead of
>>>> creating the data buffer with 'kmalloc()', I tried to add the user's
>>>> buffer to the virtio queue. But I found a problem: as the kernel virtio
>>>> API uses virtual addresses to add new buffers, deep in the virtio
>>>> subsystem 'virt_to_phys()' is called to get the physical address of the
>>>> buffer, so a user virtual address won't be translated correctly to a
>>>> physical address. (In theory, I could perform a page walk for such a user
>>>> va, get the physical address and pass some "fake" virtual address to the
>>>> virtio API in order to make 'virt_to_phys()' return a valid physical
>>>> address, but I think this is ugly.)
>>>
>>> And maybe we should also pin the pages to prevent them from being replaced.
>>>
>>> I think we should do something similar to what we do in vhost-vdpa.
>>> Take a look at vhost_vdpa_pa_map() in drivers/vhost/vdpa.c
>>
>> Hm, ok. I'll read about vdpa...
>>
>>>
>>>>
>>>>
>>>> If we are talking about the 'mmap()' way, I think we can do the following:
>>>> the user calls 'mmap()' on the socket, and the kernel fills the newly
>>>> created mapping with allocated pages (all pages have rw permissions). Now
>>>> the user can use the pages of this mapping (e.g. fill them with data).
>>>> Finally, to start transmission, the user calls 'getsockopt()' or some
>>>> 'ioctl()' and the kernel processes the data of this mapping. Also, as this
>>>> call returns immediately (i.e. it is asynchronous), some completion logic
>>>> must be implemented. For example, use the same mechanism as MSG_ZEROCOPY:
>>>> poll the socket's error queue for a message saying that the pages can be
>>>> reused. Or don't allow the user to work with these pages at all: unmap
>>>> them, perform the transmission and finally free the pages. To start a new
>>>> transmission the user needs to call 'mmap()' again.
>>>>
>>>>                            OR
>>>>
>>>> I think there is another, unusual way to do zerocopy tx: let's use
>>>> 'vmsplice()'/'splice()'. In this approach, to transmit something, the user
>>>> does the following steps:
>>>> 1) Creates a pipe.
>>>> 2) Calls 'vmsplice(SPLICE_F_GIFT)' on this pipe, inserting the data pages
>>>>   into it. SPLICE_F_GIFT allows the user to forget about the allocated
>>>>   pages - the kernel will free them.
>>>> 3) Calls 'splice(SPLICE_F_MOVE)' from the pipe to the socket. SPLICE_F_MOVE
>>>>   will move the pages from the pipe to the socket (i.e. in a special socket
>>>>   callback we get the set of pipe pages as an input argument, and all pages
>>>>   will be inserted into the virtio queue).
>>>>
>>>> But as SPLICE_F_MOVE support is currently disabled, it must be restored first.
>>>
>>> Splice seems interesting, but it would be nice if we did something similar to TCP. IIUC they use a flag for send(2):
>>>
>>>     send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
>>>
>>
>> Yes, but with this approach I think:
>> 1) What is 'buf'? It can't be a user address, since this buffer must be inserted into the tx queue.
>>   I.e. it must be allocated by the kernel and then returned to the user for tx purposes. In the TCP
>>   case, 'buf' is a user address (of course page aligned) because TCP logic uses sk_buff, which
>>   allows such memory to be used as a data buffer.
> 
> IIUC we can pin that buffer like we do in vhost-vdpa, and use it in the VQ.

Now I see how to use 'pin_user_pages()'. I'm going to implement zerocopy tx with
the same API as TCP MSG_ZEROCOPY.

> 
>> 2) To wait until the tx process is done (i.e. the pages can be used again), such an
>>   API (send + MSG_ZEROCOPY) uses the socket's error queue - polling for events that
>>   signal that tx is finished. So the same mechanism must be implemented for virtio vsock.
> 
> Yeah, I think so.
> 
>>
>>>  
>>>>
>>>>>
>>>>>>
>>>>>> [...]
>>>>>> I checked for transmission of 4000mb of data. Here are some results:
>>>>>>
>>>>>>                           size of rx/tx buffers in pages
>>>>>>               *---------------------------------------------------*
>>>>>>               |    8   |    32    |    64   |   256    |   512    |
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>>>>>> *--------------*--------*----------*---------*----------*----------* process
>>>>>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>>>>>> *--------------*--------*----------*---------*----------*----------*
>>>>>>
>>>>>> I think the results are not so impressive, but at least zerocopy is not worse
>>>>>> than copy mode, and there is no need to allocate memory for processing the data.
>>>>>
>>>>> Why is it twice as slow in the first column?
>>>>
>>>> Maybe this is because memory copying of small buffers is very fast... I'll
>>>> analyze it more deeply.
>>>
>>> Maybe I misunderstood: what do you mean by small buffers here?
>>>
>>> I thought 8 was the number of pages, so 32KB buffers.
>>
>> Yes, 8 is the size in pages. Anyway, I need to check it more deeply.
> 
> Okay, thanks!
> 
> Stefano
> 



end of thread, other threads:[~2022-06-07 10:26 UTC | newest]

Thread overview: 15+ messages
2022-05-12  5:04 [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Arseniy Krasnov
2022-05-12  5:06 ` [RFC PATCH v1 1/8] virtio/vsock: rework packet allocation logic Arseniy Krasnov
2022-05-12  5:09 ` [RFC PATCH v1 2/8] vhost/vsock: " Arseniy Krasnov
2022-05-12  5:12 ` [RFC PATCH v1 3/8] af_vsock: add zerocopy receive logic Arseniy Krasnov
2022-05-12  5:14 ` [RFC PATCH v1 4/8] virtio/vsock: add transport zerocopy callback Arseniy Krasnov
2022-05-12  5:17 ` [RFC PATCH v1 5/8] vhost/vsock: enable " Arseniy Krasnov
2022-05-12  5:18 ` [RFC PATCH v1 6/8] virtio/vsock: " Arseniy Krasnov
2022-05-12  5:20 ` [RFC PATCH v1 7/8] test/vsock: add receive zerocopy tests Arseniy Krasnov
2022-05-12  5:22 ` [RFC PATCH v1 8/8] test/vsock: vsock rx zerocopy utility Arseniy Krasnov
2022-05-17 15:14 ` [RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive Stefano Garzarella
2022-05-18 11:04   ` Arseniy Krasnov
2022-05-19  7:42     ` Stefano Garzarella
2022-05-20 11:09       ` Arseniy Krasnov
2022-05-24  7:32         ` Stefano Garzarella
2022-06-07 10:26           ` Arseniy Krasnov
