netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v3 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations
@ 2023-07-20 21:42 Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 1/4] vsock/virtio/vhost: read data from non-linear skb Arseniy Krasnov
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-20 21:42 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman
  Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
	avkrasnov, Arseniy Krasnov

Hello,

this patchset is first of three parts of another big patchset for
MSG_ZEROCOPY flag support:
https://lore.kernel.org/netdev/20230701063947.3422088-1-AVKrasnov@sberdevices.ru/

During review of this series, Stefano Garzarella <sgarzare@redhat.com>
suggested to split it for three parts to simplify review and merging:

1) virtio and vhost updates (for fragged skbs) <--- this patchset
2) AF_VSOCK updates (allows to enable MSG_ZEROCOPY mode and read
   tx completions) and update for Documentation/.
3) Updates for tests and utils.

This series enables handling of fragged skbs in virtio and vhost parts.
Newly logic won't be triggered, because SO_ZEROCOPY options is still
impossible to enable at this moment (next bunch of patches from big
set above will enable it).

I've included changelog to some patches anyway, because there were some
comments during review of last big patchset from the link above.

Head for this patchset is 60cc1f7d0605598b47ee3c0c2b4b6fbd4da50a06

Link to v1:
https://lore.kernel.org/netdev/20230717210051.856388-1-AVKrasnov@sberdevices.ru/
Link to v2:
https://lore.kernel.org/netdev/20230718180237.3248179-1-AVKrasnov@sberdevices.ru/

Changelog:
 * See per-patch changelog after ---.

Arseniy Krasnov (4):
  vsock/virtio/vhost: read data from non-linear skb
  vsock/virtio: support to send non-linear skb
  vsock/virtio: non-linear skb handling for tap
  vsock/virtio: MSG_ZEROCOPY flag support

 drivers/vhost/vsock.c                   |  14 +-
 include/linux/virtio_vsock.h            |   1 +
 include/net/af_vsock.h                  |   3 +
 net/vmw_vsock/virtio_transport.c        |  80 +++++-
 net/vmw_vsock/virtio_transport_common.c | 307 +++++++++++++++++++-----
 5 files changed, 328 insertions(+), 77 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH net-next v3 1/4] vsock/virtio/vhost: read data from non-linear skb
  2023-07-20 21:42 [PATCH net-next v3 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
@ 2023-07-20 21:42 ` Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 2/4] vsock/virtio: support to send " Arseniy Krasnov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-20 21:42 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman
  Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
	avkrasnov, Arseniy Krasnov

This is preparation patch for MSG_ZEROCOPY support. It adds handling of
non-linear skbs by replacing direct calls of 'memcpy_to_msg()' with
'skb_copy_datagram_iter()'. Main advantage of the second one is that it
can handle paged part of the skb by using 'kmap()' on each page, but if
there are no pages in the skb, it behaves like simple copying to iov
iterator. This patch also adds new field to the control block of skb -
this value shows current offset in the skb to read next portion of data
(it doesn't matter linear it or not). Idea behind this field is that
'skb_copy_datagram_iter()' handles both types of skb internally - it
just needs an offset from which to copy data from the given skb. This
offset is incremented on each read from skb. This approach allows to
avoid special handling of non-linear skbs:
1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
2) We need to update 'data_len' also on each read from this skb.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 Changelog:
 v5(big patchset) -> v1:
  * Merge 'virtio_transport_common.c' and 'vhost/vsock.c' patches into
    this single patch.
  * Commit message update: grammar fix and remark that this patch is
    MSG_ZEROCOPY preparation.
  * Use 'min_t()' instead of comparison using '<>' operators.
 v1 -> v2:
  * R-b tag added.

 drivers/vhost/vsock.c                   | 14 ++++++++-----
 include/linux/virtio_vsock.h            |  1 +
 net/vmw_vsock/virtio_transport_common.c | 27 ++++++++++++++++---------
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 817d377a3f36..8c917be32b5d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		struct sk_buff *skb;
 		unsigned out, in;
 		size_t nbytes;
+		u32 frag_off;
 		int head;
 
 		skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		}
 
 		iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
-		payload_len = skb->len;
+		frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+		payload_len = skb->len - frag_off;
 		hdr = virtio_vsock_hdr(skb);
 
 		/* If the packet is greater than the space available in the
@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 			break;
 		}
 
-		nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
-		if (nbytes != payload_len) {
+		if (skb_copy_datagram_iter(skb,
+					   frag_off,
+					   &iov_iter,
+					   payload_len)) {
 			kfree_skb(skb);
 			vq_err(vq, "Faulted on copying pkt buf\n");
 			break;
@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
 		added = true;
 
-		skb_pull(skb, payload_len);
+		VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
 		total_len += payload_len;
 
 		/* If we didn't send all the payload we can requeue the packet
 		 * to send it with the next available buffer.
 		 */
-		if (skb->len > 0) {
+		if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
 			hdr->flags |= cpu_to_le32(flags_to_restore);
 
 			/* We are queueing the same skb to handle
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..17dbb7176e37 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,7 @@
 struct virtio_vsock_skb_cb {
 	bool reply;
 	bool tap_delivered;
+	u32 frag_off;
 };
 
 #define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b769fc258931..1a376f808ae6 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
 	spin_lock_bh(&vvs->rx_lock);
 
 	skb_queue_walk_safe(&vvs->rx_queue, skb,  tmp) {
-		off = 0;
+		off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
 
 		if (total == len)
 			break;
@@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
 			 */
 			spin_unlock_bh(&vvs->rx_lock);
 
-			err = memcpy_to_msg(msg, skb->data + off, bytes);
+			err = skb_copy_datagram_iter(skb, off,
+						     &msg->msg_iter,
+						     bytes);
+
 			if (err)
 				goto out;
 
@@ -413,25 +416,28 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 	while (total < len && !skb_queue_empty(&vvs->rx_queue)) {
 		skb = skb_peek(&vvs->rx_queue);
 
-		bytes = len - total;
-		if (bytes > skb->len)
-			bytes = skb->len;
+		bytes = min_t(size_t, len - total,
+			      skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off);
 
 		/* sk_lock is held by caller so no one else can dequeue.
 		 * Unlock rx_lock since memcpy_to_msg() may sleep.
 		 */
 		spin_unlock_bh(&vvs->rx_lock);
 
-		err = memcpy_to_msg(msg, skb->data, bytes);
+		err = skb_copy_datagram_iter(skb,
+					     VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+					     &msg->msg_iter, bytes);
+
 		if (err)
 			goto out;
 
 		spin_lock_bh(&vvs->rx_lock);
 
 		total += bytes;
-		skb_pull(skb, bytes);
 
-		if (skb->len == 0) {
+		VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
+
+		if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
 			u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);
 
 			virtio_transport_dec_rx_pkt(vvs, pkt_len);
@@ -503,7 +509,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
 				 */
 				spin_unlock_bh(&vvs->rx_lock);
 
-				err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
+				err = skb_copy_datagram_iter(skb, 0,
+							     &msg->msg_iter,
+							     bytes_to_copy);
+
 				if (err) {
 					/* Copy of message failed. Rest of
 					 * fragments will be freed without copy.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v3 2/4] vsock/virtio: support to send non-linear skb
  2023-07-20 21:42 [PATCH net-next v3 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 1/4] vsock/virtio/vhost: read data from non-linear skb Arseniy Krasnov
@ 2023-07-20 21:42 ` Arseniy Krasnov
  2023-07-25  8:17   ` Stefano Garzarella
  2023-07-20 21:42 ` [PATCH net-next v3 3/4] vsock/virtio: non-linear skb handling for tap Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
  3 siblings, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-20 21:42 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman
  Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
	avkrasnov, Arseniy Krasnov

For non-linear skb use its pages from fragment array as buffers in
virtio tx queue. These pages are already pinned by 'get_user_pages()'
during such skb creation.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 Changelog:
 v2 -> v3:
  * Comment about 'page_to_virt()' is updated. I don't remove R-b,
    as this change is quiet small I guess.

 net/vmw_vsock/virtio_transport.c | 41 +++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..7bbcc8093e51 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 	vq = vsock->vqs[VSOCK_VQ_TX];
 
 	for (;;) {
-		struct scatterlist hdr, buf, *sgs[2];
+		/* +1 is for packet header. */
+		struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+		struct scatterlist bufs[MAX_SKB_FRAGS + 1];
 		int ret, in_sg = 0, out_sg = 0;
 		struct sk_buff *skb;
 		bool reply;
@@ -111,12 +113,39 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 
 		virtio_transport_deliver_tap_pkt(skb);
 		reply = virtio_vsock_skb_reply(skb);
+		sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
+			    sizeof(*virtio_vsock_hdr(skb)));
+		sgs[out_sg] = &bufs[out_sg];
+		out_sg++;
+
+		if (!skb_is_nonlinear(skb)) {
+			if (skb->len > 0) {
+				sg_init_one(&bufs[out_sg], skb->data, skb->len);
+				sgs[out_sg] = &bufs[out_sg];
+				out_sg++;
+			}
+		} else {
+			struct skb_shared_info *si;
+			int i;
+
+			si = skb_shinfo(skb);
+
+			for (i = 0; i < si->nr_frags; i++) {
+				skb_frag_t *skb_frag = &si->frags[i];
+				void *va;
 
-		sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
-		sgs[out_sg++] = &hdr;
-		if (skb->len > 0) {
-			sg_init_one(&buf, skb->data, skb->len);
-			sgs[out_sg++] = &buf;
+				/* We will use 'page_to_virt()' for the userspace page
+				 * here, because virtio or dma-mapping layers will call
+				 * 'virt_to_phys()' later to fill the buffer descriptor.
+				 * We don't touch memory at "virtual" address of this page.
+				 */
+				va = page_to_virt(skb_frag->bv_page);
+				sg_init_one(&bufs[out_sg],
+					    va + skb_frag->bv_offset,
+					    skb_frag->bv_len);
+				sgs[out_sg] = &bufs[out_sg];
+				out_sg++;
+			}
 		}
 
 		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v3 3/4] vsock/virtio: non-linear skb handling for tap
  2023-07-20 21:42 [PATCH net-next v3 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 1/4] vsock/virtio/vhost: read data from non-linear skb Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 2/4] vsock/virtio: support to send " Arseniy Krasnov
@ 2023-07-20 21:42 ` Arseniy Krasnov
  2023-07-20 21:42 ` [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
  3 siblings, 0 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-20 21:42 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman
  Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
	avkrasnov, Arseniy Krasnov

For tap device new skb is created and data from the current skb is
copied to it. This adds copying data from non-linear skb to new
the skb.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
---
 net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 1a376f808ae6..26a4d10da205 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
 	return NULL;
 }
 
+static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
+						void *dst,
+						size_t len)
+{
+	struct iov_iter iov_iter = { 0 };
+	struct kvec kvec;
+	size_t to_copy;
+
+	kvec.iov_base = dst;
+	kvec.iov_len = len;
+
+	iov_iter.iter_type = ITER_KVEC;
+	iov_iter.kvec = &kvec;
+	iov_iter.nr_segs = 1;
+
+	to_copy = min_t(size_t, len, skb->len);
+
+	skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+			       &iov_iter, to_copy);
+}
+
 /* Packet capture */
 static struct sk_buff *virtio_transport_build_skb(void *opaque)
 {
@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
 	struct af_vsockmon_hdr *hdr;
 	struct sk_buff *skb;
 	size_t payload_len;
-	void *payload_buf;
 
 	/* A packet could be split to fit the RX buffer, so we can retrieve
 	 * the payload length from the header and the buffer pointer taking
@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
 	 */
 	pkt_hdr = virtio_vsock_hdr(pkt);
 	payload_len = pkt->len;
-	payload_buf = pkt->data;
 
 	skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
 			GFP_ATOMIC);
@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
 	skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
 
 	if (payload_len) {
-		skb_put_data(skb, payload_buf, payload_len);
+		if (skb_is_nonlinear(pkt)) {
+			void *data = skb_put(skb, payload_len);
+
+			virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
+		} else {
+			skb_put_data(skb, pkt->data, payload_len);
+		}
 	}
 
 	return skb;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-20 21:42 [PATCH net-next v3 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
                   ` (2 preceding siblings ...)
  2023-07-20 21:42 ` [PATCH net-next v3 3/4] vsock/virtio: non-linear skb handling for tap Arseniy Krasnov
@ 2023-07-20 21:42 ` Arseniy Krasnov
  2023-07-21  5:09   ` Arseniy Krasnov
  2023-07-25  8:25   ` Michael S. Tsirkin
  3 siblings, 2 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-20 21:42 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman
  Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa,
	avkrasnov, Arseniy Krasnov

This adds handling of MSG_ZEROCOPY flag on transmission path: if this
flag is set and zerocopy transmission is possible (enabled in socket
options and transport allows zerocopy), then non-linear skb will be
created and filled with the pages of user's buffer. Pages of user's
buffer are locked in memory by 'get_user_pages()'. Second thing that
this patch does is replace type of skb owning: instead of calling
'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
of socket, so to decrease this field correctly proper skb destructor is
needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.

Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
---
 Changelog:
 v5(big patchset) -> v1:
  * Refactorings of 'if' conditions.
  * Remove extra blank line.
  * Remove 'frag_off' field unneeded init.
  * Add function 'virtio_transport_fill_skb()' which fills both linear
    and non-linear skb with provided data.
 v1 -> v2:
  * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
 v2 -> v3:
  * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
    provided 'iov_iter' with data could be sent in a zerocopy mode.
    If this callback is not set in transport - transport allows to send
    any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
    then zerocopy is allowed. Reason of this callback is that in case of
    G2H transmission we insert whole skb to the tx virtio queue and such
    skb must fit to the size of the virtio queue to be sent in a single
    iteration (may be tx logic in 'virtio_transport.c' could be reworked
    as in vhost to support partial send of current skb). This callback
    will be enabled only for G2H path. For details pls see comment 
    'Check that tx queue...' below.

 include/net/af_vsock.h                  |   3 +
 net/vmw_vsock/virtio_transport.c        |  39 ++++
 net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
 3 files changed, 241 insertions(+), 58 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0e7504a42925..a6b346eeeb8e 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -177,6 +177,9 @@ struct vsock_transport {
 
 	/* Read a single skb */
 	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
+
+	/* Zero-copy. */
+	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
 };
 
 /**** CORE ****/
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 7bbcc8093e51..23cb8ed638c4 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
 	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
+static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
+{
+	struct virtio_vsock *vsock;
+	bool res = false;
+
+	rcu_read_lock();
+
+	vsock = rcu_dereference(the_virtio_vsock);
+	if (vsock) {
+		struct virtqueue *vq;
+		int iov_pages;
+
+		vq = vsock->vqs[VSOCK_VQ_TX];
+
+		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
+
+		/* Check that tx queue is large enough to keep whole
+		 * data to send. This is needed, because when there is
+		 * not enough free space in the queue, current skb to
+		 * send will be reinserted to the head of tx list of
+		 * the socket to retry transmission later, so if skb
+		 * is bigger than whole queue, it will be reinserted
+		 * again and again, thus blocking other skbs to be sent.
+		 * Each page of the user provided buffer will be added
+		 * as a single buffer to the tx virtqueue, so compare
+		 * number of pages against maximum capacity of the queue.
+		 * +1 means buffer for the packet header.
+		 */
+		if (iov_pages + 1 <= vq->num_max)
+			res = true;
+	}
+
+	rcu_read_unlock();
+
+	return res;
+}
+
 static bool virtio_transport_seqpacket_allow(u32 remote_cid);
 
 static struct virtio_transport virtio_transport = {
@@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
 		.seqpacket_allow          = virtio_transport_seqpacket_allow,
 		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
 
+		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
+
 		.notify_poll_in           = virtio_transport_notify_poll_in,
 		.notify_poll_out          = virtio_transport_notify_poll_out,
 		.notify_recv_init         = virtio_transport_notify_recv_init,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 26a4d10da205..e4e3d541aff4 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
 	return container_of(t, struct virtio_transport, transport);
 }
 
-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */
-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
-			   size_t len,
-			   u32 src_cid,
-			   u32 src_port,
-			   u32 dst_cid,
-			   u32 dst_port)
-{
-	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
-	struct virtio_vsock_hdr *hdr;
-	struct sk_buff *skb;
-	void *payload;
-	int err;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+				       size_t max_to_send)
+{
+	const struct vsock_transport *t;
+	struct iov_iter *iov_iter;
 
-	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
-	if (!skb)
-		return NULL;
+	if (!info->msg)
+		return false;
 
-	hdr = virtio_vsock_hdr(skb);
-	hdr->type	= cpu_to_le16(info->type);
-	hdr->op		= cpu_to_le16(info->op);
-	hdr->src_cid	= cpu_to_le64(src_cid);
-	hdr->dst_cid	= cpu_to_le64(dst_cid);
-	hdr->src_port	= cpu_to_le32(src_port);
-	hdr->dst_port	= cpu_to_le32(dst_port);
-	hdr->flags	= cpu_to_le32(info->flags);
-	hdr->len	= cpu_to_le32(len);
+	iov_iter = &info->msg->msg_iter;
 
-	if (info->msg && len > 0) {
-		payload = skb_put(skb, len);
-		err = memcpy_from_msg(payload, info->msg, len);
-		if (err)
-			goto out;
+	t = vsock_core_get_transport(info->vsk);
 
-		if (msg_data_left(info->msg) == 0 &&
-		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
-			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+	if (t->msgzerocopy_check_iov &&
+	    !t->msgzerocopy_check_iov(iov_iter))
+		return false;
 
-			if (info->msg->msg_flags & MSG_EOR)
-				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
-		}
+	/* Data is simple buffer. */
+	if (iter_is_ubuf(iov_iter))
+		return true;
+
+	if (!iter_is_iovec(iov_iter))
+		return false;
+
+	if (iov_iter->iov_offset)
+		return false;
+
+	/* We can't send whole iov. */
+	if (iov_iter->count > max_to_send)
+		return false;
+
+	return true;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+					   struct sk_buff *skb,
+					   struct msghdr *msg,
+					   bool zerocopy)
+{
+	struct ubuf_info *uarg;
+
+	if (msg->msg_ubuf) {
+		uarg = msg->msg_ubuf;
+		net_zcopy_get(uarg);
+	} else {
+		struct iov_iter *iter = &msg->msg_iter;
+		struct ubuf_info_msgzc *uarg_zc;
+		int len;
+
+		/* Only ITER_IOVEC or ITER_UBUF are allowed and
+		 * checked before.
+		 */
+		if (iter_is_iovec(iter))
+			len = iov_length(iter->__iov, iter->nr_segs);
+		else
+			len = iter->count;
+
+		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+					    len,
+					    NULL);
+		if (!uarg)
+			return -1;
+
+		uarg_zc = uarg_to_msgzc(uarg);
+		uarg_zc->zerocopy = zerocopy ? 1 : 0;
 	}
 
-	if (info->reply)
-		virtio_vsock_skb_set_reply(skb);
+	skb_zcopy_init(skb, uarg);
 
-	trace_virtio_transport_alloc_pkt(src_cid, src_port,
-					 dst_cid, dst_port,
-					 len,
-					 info->type,
-					 info->op,
-					 info->flags);
+	return 0;
+}
 
-	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
-		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
-		goto out;
+static int virtio_transport_fill_skb(struct sk_buff *skb,
+				     struct virtio_vsock_pkt_info *info,
+				     size_t len,
+				     bool zcopy)
+{
+	if (zcopy) {
+		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
+					      &info->msg->msg_iter,
+					      len);
+	} else {
+		void *payload;
+		int err;
+
+		payload = skb_put(skb, len);
+		err = memcpy_from_msg(payload, info->msg, len);
+		if (err)
+			return -1;
+
+		if (msg_data_left(info->msg))
+			return 0;
+
+		return 0;
 	}
+}
 
-	return skb;
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+				      struct virtio_vsock_pkt_info *info,
+				      u32 src_cid,
+				      u32 src_port,
+				      u32 dst_cid,
+				      u32 dst_port,
+				      size_t len)
+{
+	struct virtio_vsock_hdr *hdr;
 
-out:
-	kfree_skb(skb);
-	return NULL;
+	hdr = virtio_vsock_hdr(skb);
+	hdr->type	= cpu_to_le16(info->type);
+	hdr->op		= cpu_to_le16(info->op);
+	hdr->src_cid	= cpu_to_le64(src_cid);
+	hdr->dst_cid	= cpu_to_le64(dst_cid);
+	hdr->src_port	= cpu_to_le32(src_port);
+	hdr->dst_port	= cpu_to_le32(dst_port);
+	hdr->flags	= cpu_to_le32(info->flags);
+	hdr->len	= cpu_to_le32(len);
 }
 
 static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
@@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
 		return VIRTIO_VSOCK_TYPE_SEQPACKET;
 }
 
+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
+						  struct virtio_vsock_pkt_info *info,
+						  size_t payload_len,
+						  bool zcopy,
+						  u32 src_cid,
+						  u32 src_port,
+						  u32 dst_cid,
+						  u32 dst_port)
+{
+	struct sk_buff *skb;
+	size_t skb_len;
+
+	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
+
+	if (!zcopy)
+		skb_len += payload_len;
+
+	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	virtio_transport_init_hdr(skb, info, src_cid, src_port,
+				  dst_cid, dst_port,
+				  payload_len);
+
+	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
+	 * owner of skb without check to update 'sk_wmem_alloc'.
+	 */
+	if (vsk)
+		skb_set_owner_w(skb, sk_vsock(vsk));
+
+	if (info->msg && payload_len > 0) {
+		int err;
+
+		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
+		if (err)
+			goto out;
+
+		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
+			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
+
+			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+			if (info->msg->msg_flags & MSG_EOR)
+				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+		}
+	}
+
+	if (info->reply)
+		virtio_vsock_skb_set_reply(skb);
+
+	trace_virtio_transport_alloc_pkt(src_cid, src_port,
+					 dst_cid, dst_port,
+					 payload_len,
+					 info->type,
+					 info->op,
+					 info->flags);
+
+	return skb;
+out:
+	kfree_skb(skb);
+	return NULL;
+}
+
 /* This function can only be used on connecting/connected sockets,
  * since a socket assigned to a transport is required.
  *
@@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
 static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 					  struct virtio_vsock_pkt_info *info)
 {
+	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
 	u32 src_cid, src_port, dst_cid, dst_port;
 	const struct virtio_transport *t_ops;
 	struct virtio_vsock_sock *vvs;
 	u32 pkt_len = info->pkt_len;
+	bool can_zcopy = false;
 	u32 rest_len;
 	int ret;
 
@@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
 		return pkt_len;
 
+	if (info->msg) {
+		/* If zerocopy is not enabled by 'setsockopt()', we behave as
+		 * there is no MSG_ZEROCOPY flag set.
+		 */
+		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
+			info->msg->msg_flags &= ~MSG_ZEROCOPY;
+
+		if (info->msg->msg_flags & MSG_ZEROCOPY)
+			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
+
+		if (can_zcopy)
+			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
+					    (MAX_SKB_FRAGS * PAGE_SIZE));
+	}
+
 	rest_len = pkt_len;
 
 	do {
 		struct sk_buff *skb;
 		size_t skb_len;
 
-		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
+		skb_len = min(max_skb_len, rest_len);
 
-		skb = virtio_transport_alloc_skb(info, skb_len,
+		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
 						 src_cid, src_port,
 						 dst_cid, dst_port);
 		if (!skb) {
@@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 			break;
 		}
 
+		/* This is last skb to send this portion of data. */
+		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
+		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
+			if (virtio_transport_init_zcopy_skb(vsk, skb,
+							    info->msg,
+							    can_zcopy)) {
+				ret = -ENOMEM;
+				break;
+			}
+		}
+
 		virtio_transport_inc_tx_pkt(vvs, skb);
 
 		ret = t_ops->send_pkt(skb);
@@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 	if (!t)
 		return -ENOTCONN;
 
-	reply = virtio_transport_alloc_skb(&info, 0,
+	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
 					   le64_to_cpu(hdr->dst_cid),
 					   le32_to_cpu(hdr->dst_port),
 					   le64_to_cpu(hdr->src_cid),
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-20 21:42 ` [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
@ 2023-07-21  5:09   ` Arseniy Krasnov
  2023-07-25  8:43     ` Stefano Garzarella
  2023-07-25 11:50     ` Michael S. Tsirkin
  2023-07-25  8:25   ` Michael S. Tsirkin
  1 sibling, 2 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-21  5:09 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman
  Cc: kvm, virtualization, netdev, linux-kernel, kernel, oxffffaa



On 21.07.2023 00:42, Arseniy Krasnov wrote:
> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> flag is set and zerocopy transmission is possible (enabled in socket
> options and transport allows zerocopy), then non-linear skb will be
> created and filled with the pages of user's buffer. Pages of user's
> buffer are locked in memory by 'get_user_pages()'. Second thing that
> this patch does is replace type of skb owning: instead of calling
> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> of socket, so to decrease this field correctly proper skb destructor is
> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> 
> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> ---
>  Changelog:
>  v5(big patchset) -> v1:
>   * Refactorings of 'if' conditions.
>   * Remove extra blank line.
>   * Remove 'frag_off' field unneeded init.
>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>     and non-linear skb with provided data.
>  v1 -> v2:
>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>  v2 -> v3:
>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>     If this callback is not set in transport - transport allows to send
>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>     then zerocopy is allowed. Reason of this callback is that in case of
>     G2H transmission we insert whole skb to the tx virtio queue and such
>     skb must fit to the size of the virtio queue to be sent in a single
>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>     as in vhost to support partial send of current skb). This callback
>     will be enabled only for G2H path. For details pls see comment 
>     'Check that tx queue...' below.
> 
>  include/net/af_vsock.h                  |   3 +
>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>  3 files changed, 241 insertions(+), 58 deletions(-)
> 
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 0e7504a42925..a6b346eeeb8e 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -177,6 +177,9 @@ struct vsock_transport {
>  
>  	/* Read a single skb */
>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> +
> +	/* Zero-copy. */
> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>  };
>  
>  /**** CORE ****/
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 7bbcc8093e51..23cb8ed638c4 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> +{
> +	struct virtio_vsock *vsock;
> +	bool res = false;
> +
> +	rcu_read_lock();
> +
> +	vsock = rcu_dereference(the_virtio_vsock);
> +	if (vsock) {
> +		struct virtqueue *vq;
> +		int iov_pages;
> +
> +		vq = vsock->vqs[VSOCK_VQ_TX];
> +
> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> +
> +		/* Check that tx queue is large enough to keep whole
> +		 * data to send. This is needed, because when there is
> +		 * not enough free space in the queue, current skb to
> +		 * send will be reinserted to the head of tx list of
> +		 * the socket to retry transmission later, so if skb
> +		 * is bigger than whole queue, it will be reinserted
> +		 * again and again, thus blocking other skbs to be sent.
> +		 * Each page of the user provided buffer will be added
> +		 * as a single buffer to the tx virtqueue, so compare
> +		 * number of pages against maximum capacity of the queue.
> +		 * +1 means buffer for the packet header.
> +		 */
> +		if (iov_pages + 1 <= vq->num_max)

I think this check is actual only for case one we don't have indirect buffer feature.
With indirect mode whole data to send will be packed into one indirect buffer.

Thanks, Arseniy

> +			res = true;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return res;
> +}
> +
>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport virtio_transport = {
> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>  
> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> +
>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>  		.notify_recv_init         = virtio_transport_notify_recv_init,
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 26a4d10da205..e4e3d541aff4 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>  	return container_of(t, struct virtio_transport, transport);
>  }
>  
> -/* Returns a new packet on success, otherwise returns NULL.
> - *
> - * If NULL is returned, errp is set to a negative errno.
> - */
> -static struct sk_buff *
> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> -			   size_t len,
> -			   u32 src_cid,
> -			   u32 src_port,
> -			   u32 dst_cid,
> -			   u32 dst_port)
> -{
> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> -	struct virtio_vsock_hdr *hdr;
> -	struct sk_buff *skb;
> -	void *payload;
> -	int err;
> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> +				       size_t max_to_send)
> +{
> +	const struct vsock_transport *t;
> +	struct iov_iter *iov_iter;
>  
> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> -	if (!skb)
> -		return NULL;
> +	if (!info->msg)
> +		return false;
>  
> -	hdr = virtio_vsock_hdr(skb);
> -	hdr->type	= cpu_to_le16(info->type);
> -	hdr->op		= cpu_to_le16(info->op);
> -	hdr->src_cid	= cpu_to_le64(src_cid);
> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> -	hdr->src_port	= cpu_to_le32(src_port);
> -	hdr->dst_port	= cpu_to_le32(dst_port);
> -	hdr->flags	= cpu_to_le32(info->flags);
> -	hdr->len	= cpu_to_le32(len);
> +	iov_iter = &info->msg->msg_iter;
>  
> -	if (info->msg && len > 0) {
> -		payload = skb_put(skb, len);
> -		err = memcpy_from_msg(payload, info->msg, len);
> -		if (err)
> -			goto out;
> +	t = vsock_core_get_transport(info->vsk);
>  
> -		if (msg_data_left(info->msg) == 0 &&
> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +	if (t->msgzerocopy_check_iov &&
> +	    !t->msgzerocopy_check_iov(iov_iter))
> +		return false;
>  
> -			if (info->msg->msg_flags & MSG_EOR)
> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> -		}
> +	/* Data is simple buffer. */
> +	if (iter_is_ubuf(iov_iter))
> +		return true;
> +
> +	if (!iter_is_iovec(iov_iter))
> +		return false;
> +
> +	if (iov_iter->iov_offset)
> +		return false;
> +
> +	/* We can't send whole iov. */
> +	if (iov_iter->count > max_to_send)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> +					   struct sk_buff *skb,
> +					   struct msghdr *msg,
> +					   bool zerocopy)
> +{
> +	struct ubuf_info *uarg;
> +
> +	if (msg->msg_ubuf) {
> +		uarg = msg->msg_ubuf;
> +		net_zcopy_get(uarg);
> +	} else {
> +		struct iov_iter *iter = &msg->msg_iter;
> +		struct ubuf_info_msgzc *uarg_zc;
> +		int len;
> +
> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> +		 * checked before.
> +		 */
> +		if (iter_is_iovec(iter))
> +			len = iov_length(iter->__iov, iter->nr_segs);
> +		else
> +			len = iter->count;
> +
> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> +					    len,
> +					    NULL);
> +		if (!uarg)
> +			return -1;
> +
> +		uarg_zc = uarg_to_msgzc(uarg);
> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>  	}
>  
> -	if (info->reply)
> -		virtio_vsock_skb_set_reply(skb);
> +	skb_zcopy_init(skb, uarg);
>  
> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> -					 dst_cid, dst_port,
> -					 len,
> -					 info->type,
> -					 info->op,
> -					 info->flags);
> +	return 0;
> +}
>  
> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> -		goto out;
> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> +				     struct virtio_vsock_pkt_info *info,
> +				     size_t len,
> +				     bool zcopy)
> +{
> +	if (zcopy) {
> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> +					      &info->msg->msg_iter,
> +					      len);
> +	} else {
> +		void *payload;
> +		int err;
> +
> +		payload = skb_put(skb, len);
> +		err = memcpy_from_msg(payload, info->msg, len);
> +		if (err)
> +			return -1;
> +
> +		if (msg_data_left(info->msg))
> +			return 0;
> +
> +		return 0;
>  	}
> +}
>  
> -	return skb;
> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> +				      struct virtio_vsock_pkt_info *info,
> +				      u32 src_cid,
> +				      u32 src_port,
> +				      u32 dst_cid,
> +				      u32 dst_port,
> +				      size_t len)
> +{
> +	struct virtio_vsock_hdr *hdr;
>  
> -out:
> -	kfree_skb(skb);
> -	return NULL;
> +	hdr = virtio_vsock_hdr(skb);
> +	hdr->type	= cpu_to_le16(info->type);
> +	hdr->op		= cpu_to_le16(info->op);
> +	hdr->src_cid	= cpu_to_le64(src_cid);
> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> +	hdr->src_port	= cpu_to_le32(src_port);
> +	hdr->dst_port	= cpu_to_le32(dst_port);
> +	hdr->flags	= cpu_to_le32(info->flags);
> +	hdr->len	= cpu_to_le32(len);
>  }
>  
>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>  }
>  
> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> +						  struct virtio_vsock_pkt_info *info,
> +						  size_t payload_len,
> +						  bool zcopy,
> +						  u32 src_cid,
> +						  u32 src_port,
> +						  u32 dst_cid,
> +						  u32 dst_port)
> +{
> +	struct sk_buff *skb;
> +	size_t skb_len;
> +
> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> +
> +	if (!zcopy)
> +		skb_len += payload_len;
> +
> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> +	if (!skb)
> +		return NULL;
> +
> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> +				  dst_cid, dst_port,
> +				  payload_len);
> +
> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> +	 * owner of skb without check to update 'sk_wmem_alloc'.
> +	 */
> +	if (vsk)
> +		skb_set_owner_w(skb, sk_vsock(vsk));
> +
> +	if (info->msg && payload_len > 0) {
> +		int err;
> +
> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> +		if (err)
> +			goto out;
> +
> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> +
> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +
> +			if (info->msg->msg_flags & MSG_EOR)
> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> +		}
> +	}
> +
> +	if (info->reply)
> +		virtio_vsock_skb_set_reply(skb);
> +
> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> +					 dst_cid, dst_port,
> +					 payload_len,
> +					 info->type,
> +					 info->op,
> +					 info->flags);
> +
> +	return skb;
> +out:
> +	kfree_skb(skb);
> +	return NULL;
> +}
> +
>  /* This function can only be used on connecting/connected sockets,
>   * since a socket assigned to a transport is required.
>   *
> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  					  struct virtio_vsock_pkt_info *info)
>  {
> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>  	u32 src_cid, src_port, dst_cid, dst_port;
>  	const struct virtio_transport *t_ops;
>  	struct virtio_vsock_sock *vvs;
>  	u32 pkt_len = info->pkt_len;
> +	bool can_zcopy = false;
>  	u32 rest_len;
>  	int ret;
>  
> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>  		return pkt_len;
>  
> +	if (info->msg) {
> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> +		 * there is no MSG_ZEROCOPY flag set.
> +		 */
> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> +
> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> +
> +		if (can_zcopy)
> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> +	}
> +
>  	rest_len = pkt_len;
>  
>  	do {
>  		struct sk_buff *skb;
>  		size_t skb_len;
>  
> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> +		skb_len = min(max_skb_len, rest_len);
>  
> -		skb = virtio_transport_alloc_skb(info, skb_len,
> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>  						 src_cid, src_port,
>  						 dst_cid, dst_port);
>  		if (!skb) {
> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  			break;
>  		}
>  
> +		/* This is last skb to send this portion of data. */
> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> +							    info->msg,
> +							    can_zcopy)) {
> +				ret = -ENOMEM;
> +				break;
> +			}
> +		}
> +
>  		virtio_transport_inc_tx_pkt(vvs, skb);
>  
>  		ret = t_ops->send_pkt(skb);
> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  	if (!t)
>  		return -ENOTCONN;
>  
> -	reply = virtio_transport_alloc_skb(&info, 0,
> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>  					   le64_to_cpu(hdr->dst_cid),
>  					   le32_to_cpu(hdr->dst_port),
>  					   le64_to_cpu(hdr->src_cid),

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 2/4] vsock/virtio: support to send non-linear skb
  2023-07-20 21:42 ` [PATCH net-next v3 2/4] vsock/virtio: support to send " Arseniy Krasnov
@ 2023-07-25  8:17   ` Stefano Garzarella
  0 siblings, 0 replies; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-25  8:17 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Fri, Jul 21, 2023 at 12:42:43AM +0300, Arseniy Krasnov wrote:
>For non-linear skb use its pages from fragment array as buffers in
>virtio tx queue. These pages are already pinned by 'get_user_pages()'
>during such skb creation.
>
>Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
>---
> Changelog:
> v2 -> v3:
>  * Comment about 'page_to_virt()' is updated. I don't remove R-b,
>    as this change is quiet small I guess.

Ack!

Thanks,
Stefano

>
> net/vmw_vsock/virtio_transport.c | 41 +++++++++++++++++++++++++++-----
> 1 file changed, 35 insertions(+), 6 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index e95df847176b..7bbcc8093e51 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> 	vq = vsock->vqs[VSOCK_VQ_TX];
>
> 	for (;;) {
>-		struct scatterlist hdr, buf, *sgs[2];
>+		/* +1 is for packet header. */
>+		struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
>+		struct scatterlist bufs[MAX_SKB_FRAGS + 1];
> 		int ret, in_sg = 0, out_sg = 0;
> 		struct sk_buff *skb;
> 		bool reply;
>@@ -111,12 +113,39 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>
> 		virtio_transport_deliver_tap_pkt(skb);
> 		reply = virtio_vsock_skb_reply(skb);
>+		sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
>+			    sizeof(*virtio_vsock_hdr(skb)));
>+		sgs[out_sg] = &bufs[out_sg];
>+		out_sg++;
>+
>+		if (!skb_is_nonlinear(skb)) {
>+			if (skb->len > 0) {
>+				sg_init_one(&bufs[out_sg], skb->data, skb->len);
>+				sgs[out_sg] = &bufs[out_sg];
>+				out_sg++;
>+			}
>+		} else {
>+			struct skb_shared_info *si;
>+			int i;
>+
>+			si = skb_shinfo(skb);
>+
>+			for (i = 0; i < si->nr_frags; i++) {
>+				skb_frag_t *skb_frag = &si->frags[i];
>+				void *va;
>
>-		sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
>-		sgs[out_sg++] = &hdr;
>-		if (skb->len > 0) {
>-			sg_init_one(&buf, skb->data, skb->len);
>-			sgs[out_sg++] = &buf;
>+				/* We will use 'page_to_virt()' for the userspace page
>+				 * here, because virtio or dma-mapping layers will call
>+				 * 'virt_to_phys()' later to fill the buffer descriptor.
>+				 * We don't touch memory at "virtual" address of this page.
>+				 */
>+				va = page_to_virt(skb_frag->bv_page);
>+				sg_init_one(&bufs[out_sg],
>+					    va + skb_frag->bv_offset,
>+					    skb_frag->bv_len);
>+				sgs[out_sg] = &bufs[out_sg];
>+				out_sg++;
>+			}
> 		}
>
> 		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
>-- 
>2.25.1
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-20 21:42 ` [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
  2023-07-21  5:09   ` Arseniy Krasnov
@ 2023-07-25  8:25   ` Michael S. Tsirkin
  2023-07-25  8:39     ` Arseniy Krasnov
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25  8:25 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa

On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> flag is set and zerocopy transmission is possible (enabled in socket
> options and transport allows zerocopy), then non-linear skb will be
> created and filled with the pages of user's buffer. Pages of user's
> buffer are locked in memory by 'get_user_pages()'. Second thing that
> this patch does is replace type of skb owning: instead of calling
> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> of socket, so to decrease this field correctly proper skb destructor is
> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> 
> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> ---
>  Changelog:
>  v5(big patchset) -> v1:
>   * Refactorings of 'if' conditions.
>   * Remove extra blank line.
>   * Remove 'frag_off' field unneeded init.
>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>     and non-linear skb with provided data.
>  v1 -> v2:
>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>  v2 -> v3:
>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>     If this callback is not set in transport - transport allows to send
>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>     then zerocopy is allowed. Reason of this callback is that in case of
>     G2H transmission we insert whole skb to the tx virtio queue and such
>     skb must fit to the size of the virtio queue to be sent in a single
>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>     as in vhost to support partial send of current skb). This callback
>     will be enabled only for G2H path. For details pls see comment 
>     'Check that tx queue...' below.
> 
>  include/net/af_vsock.h                  |   3 +
>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>  3 files changed, 241 insertions(+), 58 deletions(-)
> 
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 0e7504a42925..a6b346eeeb8e 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -177,6 +177,9 @@ struct vsock_transport {
>  
>  	/* Read a single skb */
>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> +
> +	/* Zero-copy. */
> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>  };
>  
>  /**** CORE ****/
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 7bbcc8093e51..23cb8ed638c4 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> +{
> +	struct virtio_vsock *vsock;
> +	bool res = false;
> +
> +	rcu_read_lock();
> +
> +	vsock = rcu_dereference(the_virtio_vsock);
> +	if (vsock) {
> +		struct virtqueue *vq;
> +		int iov_pages;
> +
> +		vq = vsock->vqs[VSOCK_VQ_TX];
> +
> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> +
> +		/* Check that tx queue is large enough to keep whole
> +		 * data to send. This is needed, because when there is
> +		 * not enough free space in the queue, current skb to
> +		 * send will be reinserted to the head of tx list of
> +		 * the socket to retry transmission later, so if skb
> +		 * is bigger than whole queue, it will be reinserted
> +		 * again and again, thus blocking other skbs to be sent.
> +		 * Each page of the user provided buffer will be added
> +		 * as a single buffer to the tx virtqueue, so compare
> +		 * number of pages against maximum capacity of the queue.
> +		 * +1 means buffer for the packet header.
> +		 */
> +		if (iov_pages + 1 <= vq->num_max)
> +			res = true;


Yes but can't there already be buffers in the queue?
Then you can't stick num_max there.


> +	}
> +
> +	rcu_read_unlock();
> +
> +	return res;
> +}
> +
>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport virtio_transport = {
> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>  
> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> +
>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>  		.notify_recv_init         = virtio_transport_notify_recv_init,
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 26a4d10da205..e4e3d541aff4 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>  	return container_of(t, struct virtio_transport, transport);
>  }
>  
> -/* Returns a new packet on success, otherwise returns NULL.
> - *
> - * If NULL is returned, errp is set to a negative errno.
> - */
> -static struct sk_buff *
> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> -			   size_t len,
> -			   u32 src_cid,
> -			   u32 src_port,
> -			   u32 dst_cid,
> -			   u32 dst_port)
> -{
> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> -	struct virtio_vsock_hdr *hdr;
> -	struct sk_buff *skb;
> -	void *payload;
> -	int err;
> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> +				       size_t max_to_send)
> +{
> +	const struct vsock_transport *t;
> +	struct iov_iter *iov_iter;
>  
> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> -	if (!skb)
> -		return NULL;
> +	if (!info->msg)
> +		return false;
>  
> -	hdr = virtio_vsock_hdr(skb);
> -	hdr->type	= cpu_to_le16(info->type);
> -	hdr->op		= cpu_to_le16(info->op);
> -	hdr->src_cid	= cpu_to_le64(src_cid);
> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> -	hdr->src_port	= cpu_to_le32(src_port);
> -	hdr->dst_port	= cpu_to_le32(dst_port);
> -	hdr->flags	= cpu_to_le32(info->flags);
> -	hdr->len	= cpu_to_le32(len);
> +	iov_iter = &info->msg->msg_iter;
>  
> -	if (info->msg && len > 0) {
> -		payload = skb_put(skb, len);
> -		err = memcpy_from_msg(payload, info->msg, len);
> -		if (err)
> -			goto out;
> +	t = vsock_core_get_transport(info->vsk);
>  
> -		if (msg_data_left(info->msg) == 0 &&
> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +	if (t->msgzerocopy_check_iov &&
> +	    !t->msgzerocopy_check_iov(iov_iter))
> +		return false;
>  
> -			if (info->msg->msg_flags & MSG_EOR)
> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> -		}
> +	/* Data is simple buffer. */
> +	if (iter_is_ubuf(iov_iter))
> +		return true;
> +
> +	if (!iter_is_iovec(iov_iter))
> +		return false;
> +
> +	if (iov_iter->iov_offset)
> +		return false;
> +
> +	/* We can't send whole iov. */
> +	if (iov_iter->count > max_to_send)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> +					   struct sk_buff *skb,
> +					   struct msghdr *msg,
> +					   bool zerocopy)
> +{
> +	struct ubuf_info *uarg;
> +
> +	if (msg->msg_ubuf) {
> +		uarg = msg->msg_ubuf;
> +		net_zcopy_get(uarg);
> +	} else {
> +		struct iov_iter *iter = &msg->msg_iter;
> +		struct ubuf_info_msgzc *uarg_zc;
> +		int len;
> +
> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> +		 * checked before.
> +		 */
> +		if (iter_is_iovec(iter))
> +			len = iov_length(iter->__iov, iter->nr_segs);
> +		else
> +			len = iter->count;
> +
> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> +					    len,
> +					    NULL);
> +		if (!uarg)
> +			return -1;
> +
> +		uarg_zc = uarg_to_msgzc(uarg);
> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>  	}
>  
> -	if (info->reply)
> -		virtio_vsock_skb_set_reply(skb);
> +	skb_zcopy_init(skb, uarg);
>  
> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> -					 dst_cid, dst_port,
> -					 len,
> -					 info->type,
> -					 info->op,
> -					 info->flags);
> +	return 0;
> +}
>  
> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> -		goto out;
> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> +				     struct virtio_vsock_pkt_info *info,
> +				     size_t len,
> +				     bool zcopy)
> +{
> +	if (zcopy) {
> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> +					      &info->msg->msg_iter,
> +					      len);
> +	} else {
> +		void *payload;
> +		int err;
> +
> +		payload = skb_put(skb, len);
> +		err = memcpy_from_msg(payload, info->msg, len);
> +		if (err)
> +			return -1;
> +
> +		if (msg_data_left(info->msg))
> +			return 0;
> +
> +		return 0;
>  	}
> +}
>  
> -	return skb;
> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> +				      struct virtio_vsock_pkt_info *info,
> +				      u32 src_cid,
> +				      u32 src_port,
> +				      u32 dst_cid,
> +				      u32 dst_port,
> +				      size_t len)
> +{
> +	struct virtio_vsock_hdr *hdr;
>  
> -out:
> -	kfree_skb(skb);
> -	return NULL;
> +	hdr = virtio_vsock_hdr(skb);
> +	hdr->type	= cpu_to_le16(info->type);
> +	hdr->op		= cpu_to_le16(info->op);
> +	hdr->src_cid	= cpu_to_le64(src_cid);
> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> +	hdr->src_port	= cpu_to_le32(src_port);
> +	hdr->dst_port	= cpu_to_le32(dst_port);
> +	hdr->flags	= cpu_to_le32(info->flags);
> +	hdr->len	= cpu_to_le32(len);
>  }
>  
>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>  }
>  
> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> +						  struct virtio_vsock_pkt_info *info,
> +						  size_t payload_len,
> +						  bool zcopy,
> +						  u32 src_cid,
> +						  u32 src_port,
> +						  u32 dst_cid,
> +						  u32 dst_port)
> +{
> +	struct sk_buff *skb;
> +	size_t skb_len;
> +
> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> +
> +	if (!zcopy)
> +		skb_len += payload_len;
> +
> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> +	if (!skb)
> +		return NULL;
> +
> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> +				  dst_cid, dst_port,
> +				  payload_len);
> +
> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> +	 * owner of skb without check to update 'sk_wmem_alloc'.
> +	 */
> +	if (vsk)
> +		skb_set_owner_w(skb, sk_vsock(vsk));
> +
> +	if (info->msg && payload_len > 0) {
> +		int err;
> +
> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> +		if (err)
> +			goto out;
> +
> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> +
> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +
> +			if (info->msg->msg_flags & MSG_EOR)
> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> +		}
> +	}
> +
> +	if (info->reply)
> +		virtio_vsock_skb_set_reply(skb);
> +
> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> +					 dst_cid, dst_port,
> +					 payload_len,
> +					 info->type,
> +					 info->op,
> +					 info->flags);
> +
> +	return skb;
> +out:
> +	kfree_skb(skb);
> +	return NULL;
> +}
> +
>  /* This function can only be used on connecting/connected sockets,
>   * since a socket assigned to a transport is required.
>   *
> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  					  struct virtio_vsock_pkt_info *info)
>  {
> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>  	u32 src_cid, src_port, dst_cid, dst_port;
>  	const struct virtio_transport *t_ops;
>  	struct virtio_vsock_sock *vvs;
>  	u32 pkt_len = info->pkt_len;
> +	bool can_zcopy = false;
>  	u32 rest_len;
>  	int ret;
>  
> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>  		return pkt_len;
>  
> +	if (info->msg) {
> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> +		 * there is no MSG_ZEROCOPY flag set.
> +		 */
> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> +
> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> +
> +		if (can_zcopy)
> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> +	}
> +
>  	rest_len = pkt_len;
>  
>  	do {
>  		struct sk_buff *skb;
>  		size_t skb_len;
>  
> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> +		skb_len = min(max_skb_len, rest_len);
>  
> -		skb = virtio_transport_alloc_skb(info, skb_len,
> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>  						 src_cid, src_port,
>  						 dst_cid, dst_port);
>  		if (!skb) {
> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  			break;
>  		}
>  
> +		/* This is last skb to send this portion of data. */
> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> +							    info->msg,
> +							    can_zcopy)) {
> +				ret = -ENOMEM;
> +				break;
> +			}
> +		}
> +
>  		virtio_transport_inc_tx_pkt(vvs, skb);
>  
>  		ret = t_ops->send_pkt(skb);
> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  	if (!t)
>  		return -ENOTCONN;
>  
> -	reply = virtio_transport_alloc_skb(&info, 0,
> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>  					   le64_to_cpu(hdr->dst_cid),
>  					   le32_to_cpu(hdr->dst_port),
>  					   le64_to_cpu(hdr->src_cid),
> -- 
> 2.25.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25  8:25   ` Michael S. Tsirkin
@ 2023-07-25  8:39     ` Arseniy Krasnov
  2023-07-25 11:59       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25  8:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa



On 25.07.2023 11:25, Michael S. Tsirkin wrote:
> On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> flag is set and zerocopy transmission is possible (enabled in socket
>> options and transport allows zerocopy), then non-linear skb will be
>> created and filled with the pages of user's buffer. Pages of user's
>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>> this patch does is replace type of skb owning: instead of calling
>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>> of socket, so to decrease this field correctly proper skb destructor is
>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>
>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> ---
>>  Changelog:
>>  v5(big patchset) -> v1:
>>   * Refactorings of 'if' conditions.
>>   * Remove extra blank line.
>>   * Remove 'frag_off' field unneeded init.
>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>     and non-linear skb with provided data.
>>  v1 -> v2:
>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>  v2 -> v3:
>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>     If this callback is not set in transport - transport allows to send
>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>     then zerocopy is allowed. Reason of this callback is that in case of
>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>     skb must fit to the size of the virtio queue to be sent in a single
>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>     as in vhost to support partial send of current skb). This callback
>>     will be enabled only for G2H path. For details pls see comment 
>>     'Check that tx queue...' below.
>>
>>  include/net/af_vsock.h                  |   3 +
>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>
>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> index 0e7504a42925..a6b346eeeb8e 100644
>> --- a/include/net/af_vsock.h
>> +++ b/include/net/af_vsock.h
>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>  
>>  	/* Read a single skb */
>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>> +
>> +	/* Zero-copy. */
>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>  };
>>  
>>  /**** CORE ****/
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 7bbcc8093e51..23cb8ed638c4 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>  }
>>  
>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>> +{
>> +	struct virtio_vsock *vsock;
>> +	bool res = false;
>> +
>> +	rcu_read_lock();
>> +
>> +	vsock = rcu_dereference(the_virtio_vsock);
>> +	if (vsock) {
>> +		struct virtqueue *vq;
>> +		int iov_pages;
>> +
>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>> +
>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>> +
>> +		/* Check that tx queue is large enough to keep whole
>> +		 * data to send. This is needed, because when there is
>> +		 * not enough free space in the queue, current skb to
>> +		 * send will be reinserted to the head of tx list of
>> +		 * the socket to retry transmission later, so if skb
>> +		 * is bigger than whole queue, it will be reinserted
>> +		 * again and again, thus blocking other skbs to be sent.
>> +		 * Each page of the user provided buffer will be added
>> +		 * as a single buffer to the tx virtqueue, so compare
>> +		 * number of pages against maximum capacity of the queue.
>> +		 * +1 means buffer for the packet header.
>> +		 */
>> +		if (iov_pages + 1 <= vq->num_max)
>> +			res = true;
> 
> 
> Yes but can't there already be buffers in the queue?
> Then you can't stick num_max there.

I think, that it is not critical, because vhost part always tries to process all
incoming buffers (yes, 'vhost_exceeds_weight()' breaks at some moment, but it will
reschedule tx kick ('vhost_vsock_handle_tx_kick()') work again), so current "too
big" skb will wait until there will be enough space in queue and as it is requeued
to the head of tx list it will be inserted to tx queue first.

But anyway, I agree that comparing to 'num_free' may be more effective to the whole
system performance...

Thanks, Arseniy

> 
> 
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	return res;
>> +}
>> +
>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>  
>>  static struct virtio_transport virtio_transport = {
>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>  
>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>> +
>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 26a4d10da205..e4e3d541aff4 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>  	return container_of(t, struct virtio_transport, transport);
>>  }
>>  
>> -/* Returns a new packet on success, otherwise returns NULL.
>> - *
>> - * If NULL is returned, errp is set to a negative errno.
>> - */
>> -static struct sk_buff *
>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>> -			   size_t len,
>> -			   u32 src_cid,
>> -			   u32 src_port,
>> -			   u32 dst_cid,
>> -			   u32 dst_port)
>> -{
>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>> -	struct virtio_vsock_hdr *hdr;
>> -	struct sk_buff *skb;
>> -	void *payload;
>> -	int err;
>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>> +				       size_t max_to_send)
>> +{
>> +	const struct vsock_transport *t;
>> +	struct iov_iter *iov_iter;
>>  
>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> -	if (!skb)
>> -		return NULL;
>> +	if (!info->msg)
>> +		return false;
>>  
>> -	hdr = virtio_vsock_hdr(skb);
>> -	hdr->type	= cpu_to_le16(info->type);
>> -	hdr->op		= cpu_to_le16(info->op);
>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>> -	hdr->src_port	= cpu_to_le32(src_port);
>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>> -	hdr->flags	= cpu_to_le32(info->flags);
>> -	hdr->len	= cpu_to_le32(len);
>> +	iov_iter = &info->msg->msg_iter;
>>  
>> -	if (info->msg && len > 0) {
>> -		payload = skb_put(skb, len);
>> -		err = memcpy_from_msg(payload, info->msg, len);
>> -		if (err)
>> -			goto out;
>> +	t = vsock_core_get_transport(info->vsk);
>>  
>> -		if (msg_data_left(info->msg) == 0 &&
>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +	if (t->msgzerocopy_check_iov &&
>> +	    !t->msgzerocopy_check_iov(iov_iter))
>> +		return false;
>>  
>> -			if (info->msg->msg_flags & MSG_EOR)
>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> -		}
>> +	/* Data is simple buffer. */
>> +	if (iter_is_ubuf(iov_iter))
>> +		return true;
>> +
>> +	if (!iter_is_iovec(iov_iter))
>> +		return false;
>> +
>> +	if (iov_iter->iov_offset)
>> +		return false;
>> +
>> +	/* We can't send whole iov. */
>> +	if (iov_iter->count > max_to_send)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>> +					   struct sk_buff *skb,
>> +					   struct msghdr *msg,
>> +					   bool zerocopy)
>> +{
>> +	struct ubuf_info *uarg;
>> +
>> +	if (msg->msg_ubuf) {
>> +		uarg = msg->msg_ubuf;
>> +		net_zcopy_get(uarg);
>> +	} else {
>> +		struct iov_iter *iter = &msg->msg_iter;
>> +		struct ubuf_info_msgzc *uarg_zc;
>> +		int len;
>> +
>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>> +		 * checked before.
>> +		 */
>> +		if (iter_is_iovec(iter))
>> +			len = iov_length(iter->__iov, iter->nr_segs);
>> +		else
>> +			len = iter->count;
>> +
>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> +					    len,
>> +					    NULL);
>> +		if (!uarg)
>> +			return -1;
>> +
>> +		uarg_zc = uarg_to_msgzc(uarg);
>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>  	}
>>  
>> -	if (info->reply)
>> -		virtio_vsock_skb_set_reply(skb);
>> +	skb_zcopy_init(skb, uarg);
>>  
>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> -					 dst_cid, dst_port,
>> -					 len,
>> -					 info->type,
>> -					 info->op,
>> -					 info->flags);
>> +	return 0;
>> +}
>>  
>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>> -		goto out;
>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>> +				     struct virtio_vsock_pkt_info *info,
>> +				     size_t len,
>> +				     bool zcopy)
>> +{
>> +	if (zcopy) {
>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>> +					      &info->msg->msg_iter,
>> +					      len);
>> +	} else {
>> +		void *payload;
>> +		int err;
>> +
>> +		payload = skb_put(skb, len);
>> +		err = memcpy_from_msg(payload, info->msg, len);
>> +		if (err)
>> +			return -1;
>> +
>> +		if (msg_data_left(info->msg))
>> +			return 0;
>> +
>> +		return 0;
>>  	}
>> +}
>>  
>> -	return skb;
>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>> +				      struct virtio_vsock_pkt_info *info,
>> +				      u32 src_cid,
>> +				      u32 src_port,
>> +				      u32 dst_cid,
>> +				      u32 dst_port,
>> +				      size_t len)
>> +{
>> +	struct virtio_vsock_hdr *hdr;
>>  
>> -out:
>> -	kfree_skb(skb);
>> -	return NULL;
>> +	hdr = virtio_vsock_hdr(skb);
>> +	hdr->type	= cpu_to_le16(info->type);
>> +	hdr->op		= cpu_to_le16(info->op);
>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>> +	hdr->src_port	= cpu_to_le32(src_port);
>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>> +	hdr->flags	= cpu_to_le32(info->flags);
>> +	hdr->len	= cpu_to_le32(len);
>>  }
>>  
>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>  }
>>  
>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>> +						  struct virtio_vsock_pkt_info *info,
>> +						  size_t payload_len,
>> +						  bool zcopy,
>> +						  u32 src_cid,
>> +						  u32 src_port,
>> +						  u32 dst_cid,
>> +						  u32 dst_port)
>> +{
>> +	struct sk_buff *skb;
>> +	size_t skb_len;
>> +
>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>> +
>> +	if (!zcopy)
>> +		skb_len += payload_len;
>> +
>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> +	if (!skb)
>> +		return NULL;
>> +
>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>> +				  dst_cid, dst_port,
>> +				  payload_len);
>> +
>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>> +	 */
>> +	if (vsk)
>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>> +
>> +	if (info->msg && payload_len > 0) {
>> +		int err;
>> +
>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>> +		if (err)
>> +			goto out;
>> +
>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>> +
>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +
>> +			if (info->msg->msg_flags & MSG_EOR)
>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> +		}
>> +	}
>> +
>> +	if (info->reply)
>> +		virtio_vsock_skb_set_reply(skb);
>> +
>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> +					 dst_cid, dst_port,
>> +					 payload_len,
>> +					 info->type,
>> +					 info->op,
>> +					 info->flags);
>> +
>> +	return skb;
>> +out:
>> +	kfree_skb(skb);
>> +	return NULL;
>> +}
>> +
>>  /* This function can only be used on connecting/connected sockets,
>>   * since a socket assigned to a transport is required.
>>   *
>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  					  struct virtio_vsock_pkt_info *info)
>>  {
>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>  	const struct virtio_transport *t_ops;
>>  	struct virtio_vsock_sock *vvs;
>>  	u32 pkt_len = info->pkt_len;
>> +	bool can_zcopy = false;
>>  	u32 rest_len;
>>  	int ret;
>>  
>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>  		return pkt_len;
>>  
>> +	if (info->msg) {
>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>> +		 * there is no MSG_ZEROCOPY flag set.
>> +		 */
>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>> +
>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>> +
>> +		if (can_zcopy)
>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>> +	}
>> +
>>  	rest_len = pkt_len;
>>  
>>  	do {
>>  		struct sk_buff *skb;
>>  		size_t skb_len;
>>  
>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>> +		skb_len = min(max_skb_len, rest_len);
>>  
>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>  						 src_cid, src_port,
>>  						 dst_cid, dst_port);
>>  		if (!skb) {
>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  			break;
>>  		}
>>  
>> +		/* This is last skb to send this portion of data. */
>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>> +							    info->msg,
>> +							    can_zcopy)) {
>> +				ret = -ENOMEM;
>> +				break;
>> +			}
>> +		}
>> +
>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>  
>>  		ret = t_ops->send_pkt(skb);
>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>  	if (!t)
>>  		return -ENOTCONN;
>>  
>> -	reply = virtio_transport_alloc_skb(&info, 0,
>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>  					   le64_to_cpu(hdr->dst_cid),
>>  					   le32_to_cpu(hdr->dst_port),
>>  					   le64_to_cpu(hdr->src_cid),
>> -- 
>> 2.25.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-21  5:09   ` Arseniy Krasnov
@ 2023-07-25  8:43     ` Stefano Garzarella
  2023-07-25  8:46       ` Arseniy Krasnov
  2023-07-25 11:50     ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-25  8:43 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>
>
>On 21.07.2023 00:42, Arseniy Krasnov wrote:
>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> flag is set and zerocopy transmission is possible (enabled in socket
>> options and transport allows zerocopy), then non-linear skb will be
>> created and filled with the pages of user's buffer. Pages of user's
>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>> this patch does is replace type of skb owning: instead of calling
>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>> of socket, so to decrease this field correctly proper skb destructor is
>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>
>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> ---
>>  Changelog:
>>  v5(big patchset) -> v1:
>>   * Refactorings of 'if' conditions.
>>   * Remove extra blank line.
>>   * Remove 'frag_off' field unneeded init.
>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>     and non-linear skb with provided data.
>>  v1 -> v2:
>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>  v2 -> v3:
>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>     If this callback is not set in transport - transport allows to send
>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>     then zerocopy is allowed. Reason of this callback is that in case of
>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>     skb must fit to the size of the virtio queue to be sent in a single
>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>     as in vhost to support partial send of current skb). This callback
>>     will be enabled only for G2H path. For details pls see comment
>>     'Check that tx queue...' below.
>>
>>  include/net/af_vsock.h                  |   3 +
>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>
>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> index 0e7504a42925..a6b346eeeb8e 100644
>> --- a/include/net/af_vsock.h
>> +++ b/include/net/af_vsock.h
>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>
>>  	/* Read a single skb */
>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>> +
>> +	/* Zero-copy. */
>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>  };
>>
>>  /**** CORE ****/
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 7bbcc8093e51..23cb8ed638c4 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>  }
>>
>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>> +{
>> +	struct virtio_vsock *vsock;
>> +	bool res = false;
>> +
>> +	rcu_read_lock();
>> +
>> +	vsock = rcu_dereference(the_virtio_vsock);
>> +	if (vsock) {
>> +		struct virtqueue *vq;
>> +		int iov_pages;
>> +
>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>> +
>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>> +
>> +		/* Check that tx queue is large enough to keep whole
>> +		 * data to send. This is needed, because when there is
>> +		 * not enough free space in the queue, current skb to
>> +		 * send will be reinserted to the head of tx list of
>> +		 * the socket to retry transmission later, so if skb
>> +		 * is bigger than whole queue, it will be reinserted
>> +		 * again and again, thus blocking other skbs to be sent.
>> +		 * Each page of the user provided buffer will be added
>> +		 * as a single buffer to the tx virtqueue, so compare
>> +		 * number of pages against maximum capacity of the queue.
>> +		 * +1 means buffer for the packet header.
>> +		 */
>> +		if (iov_pages + 1 <= vq->num_max)
>
>I think this check is actual only for case one we don't have indirect buffer feature.
>With indirect mode whole data to send will be packed into one indirect buffer.

I think so.
So, should we check also that here?

>
>Thanks, Arseniy
>
>> +			res = true;
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	return res;
>> +}
>> +
>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>
>>  static struct virtio_transport virtio_transport = {
>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>
>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>> +
>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 26a4d10da205..e4e3d541aff4 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>  	return container_of(t, struct virtio_transport, transport);
>>  }
>>
>> -/* Returns a new packet on success, otherwise returns NULL.
>> - *
>> - * If NULL is returned, errp is set to a negative errno.
>> - */
>> -static struct sk_buff *
>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>> -			   size_t len,
>> -			   u32 src_cid,
>> -			   u32 src_port,
>> -			   u32 dst_cid,
>> -			   u32 dst_port)
>> -{
>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>> -	struct virtio_vsock_hdr *hdr;
>> -	struct sk_buff *skb;
>> -	void *payload;
>> -	int err;
>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>> +				       size_t max_to_send)
>> +{
>> +	const struct vsock_transport *t;
>> +	struct iov_iter *iov_iter;
>>
>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> -	if (!skb)
>> -		return NULL;
>> +	if (!info->msg)
>> +		return false;
>>
>> -	hdr = virtio_vsock_hdr(skb);
>> -	hdr->type	= cpu_to_le16(info->type);
>> -	hdr->op		= cpu_to_le16(info->op);
>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>> -	hdr->src_port	= cpu_to_le32(src_port);
>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>> -	hdr->flags	= cpu_to_le32(info->flags);
>> -	hdr->len	= cpu_to_le32(len);
>> +	iov_iter = &info->msg->msg_iter;
>>
>> -	if (info->msg && len > 0) {
>> -		payload = skb_put(skb, len);
>> -		err = memcpy_from_msg(payload, info->msg, len);
>> -		if (err)
>> -			goto out;
>> +	t = vsock_core_get_transport(info->vsk);
>>
>> -		if (msg_data_left(info->msg) == 0 &&
>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +	if (t->msgzerocopy_check_iov &&
>> +	    !t->msgzerocopy_check_iov(iov_iter))
>> +		return false;

I'd avoid adding a new transport callback used only internally in virtio
transports.

Usually the transport callbacks are used in af_vsock.c, if we need a
callback just for virtio transports, maybe better to add it in struct
virtio_vsock_pkt_info or struct virtio_vsock_sock.

Maybe better the last one so we don't have to allocate pointer space
for each packet and you should reach it via info.

Thanks,
Stefano

>>
>> -			if (info->msg->msg_flags & MSG_EOR)
>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> -		}
>> +	/* Data is simple buffer. */
>> +	if (iter_is_ubuf(iov_iter))
>> +		return true;
>> +
>> +	if (!iter_is_iovec(iov_iter))
>> +		return false;
>> +
>> +	if (iov_iter->iov_offset)
>> +		return false;
>> +
>> +	/* We can't send whole iov. */
>> +	if (iov_iter->count > max_to_send)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>> +					   struct sk_buff *skb,
>> +					   struct msghdr *msg,
>> +					   bool zerocopy)
>> +{
>> +	struct ubuf_info *uarg;
>> +
>> +	if (msg->msg_ubuf) {
>> +		uarg = msg->msg_ubuf;
>> +		net_zcopy_get(uarg);
>> +	} else {
>> +		struct iov_iter *iter = &msg->msg_iter;
>> +		struct ubuf_info_msgzc *uarg_zc;
>> +		int len;
>> +
>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>> +		 * checked before.
>> +		 */
>> +		if (iter_is_iovec(iter))
>> +			len = iov_length(iter->__iov, iter->nr_segs);
>> +		else
>> +			len = iter->count;
>> +
>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> +					    len,
>> +					    NULL);
>> +		if (!uarg)
>> +			return -1;
>> +
>> +		uarg_zc = uarg_to_msgzc(uarg);
>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>  	}
>>
>> -	if (info->reply)
>> -		virtio_vsock_skb_set_reply(skb);
>> +	skb_zcopy_init(skb, uarg);
>>
>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> -					 dst_cid, dst_port,
>> -					 len,
>> -					 info->type,
>> -					 info->op,
>> -					 info->flags);
>> +	return 0;
>> +}
>>
>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>> -		goto out;
>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>> +				     struct virtio_vsock_pkt_info *info,
>> +				     size_t len,
>> +				     bool zcopy)
>> +{
>> +	if (zcopy) {
>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>> +					      &info->msg->msg_iter,
>> +					      len);
>> +	} else {
>> +		void *payload;
>> +		int err;
>> +
>> +		payload = skb_put(skb, len);
>> +		err = memcpy_from_msg(payload, info->msg, len);
>> +		if (err)
>> +			return -1;
>> +
>> +		if (msg_data_left(info->msg))
>> +			return 0;
>> +
>> +		return 0;
>>  	}
>> +}
>>
>> -	return skb;
>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>> +				      struct virtio_vsock_pkt_info *info,
>> +				      u32 src_cid,
>> +				      u32 src_port,
>> +				      u32 dst_cid,
>> +				      u32 dst_port,
>> +				      size_t len)
>> +{
>> +	struct virtio_vsock_hdr *hdr;
>>
>> -out:
>> -	kfree_skb(skb);
>> -	return NULL;
>> +	hdr = virtio_vsock_hdr(skb);
>> +	hdr->type	= cpu_to_le16(info->type);
>> +	hdr->op		= cpu_to_le16(info->op);
>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>> +	hdr->src_port	= cpu_to_le32(src_port);
>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>> +	hdr->flags	= cpu_to_le32(info->flags);
>> +	hdr->len	= cpu_to_le32(len);
>>  }
>>
>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>  }
>>
>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>> +						  struct virtio_vsock_pkt_info *info,
>> +						  size_t payload_len,
>> +						  bool zcopy,
>> +						  u32 src_cid,
>> +						  u32 src_port,
>> +						  u32 dst_cid,
>> +						  u32 dst_port)
>> +{
>> +	struct sk_buff *skb;
>> +	size_t skb_len;
>> +
>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>> +
>> +	if (!zcopy)
>> +		skb_len += payload_len;
>> +
>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> +	if (!skb)
>> +		return NULL;
>> +
>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>> +				  dst_cid, dst_port,
>> +				  payload_len);
>> +
>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>> +	 */
>> +	if (vsk)
>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>> +
>> +	if (info->msg && payload_len > 0) {
>> +		int err;
>> +
>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>> +		if (err)
>> +			goto out;
>> +
>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>> +
>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +
>> +			if (info->msg->msg_flags & MSG_EOR)
>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> +		}
>> +	}
>> +
>> +	if (info->reply)
>> +		virtio_vsock_skb_set_reply(skb);
>> +
>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> +					 dst_cid, dst_port,
>> +					 payload_len,
>> +					 info->type,
>> +					 info->op,
>> +					 info->flags);
>> +
>> +	return skb;
>> +out:
>> +	kfree_skb(skb);
>> +	return NULL;
>> +}
>> +
>>  /* This function can only be used on connecting/connected sockets,
>>   * since a socket assigned to a transport is required.
>>   *
>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  					  struct virtio_vsock_pkt_info *info)
>>  {
>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>  	const struct virtio_transport *t_ops;
>>  	struct virtio_vsock_sock *vvs;
>>  	u32 pkt_len = info->pkt_len;
>> +	bool can_zcopy = false;
>>  	u32 rest_len;
>>  	int ret;
>>
>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>  		return pkt_len;
>>
>> +	if (info->msg) {
>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>> +		 * there is no MSG_ZEROCOPY flag set.
>> +		 */
>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>> +
>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>> +
>> +		if (can_zcopy)
>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>> +	}
>> +
>>  	rest_len = pkt_len;
>>
>>  	do {
>>  		struct sk_buff *skb;
>>  		size_t skb_len;
>>
>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>> +		skb_len = min(max_skb_len, rest_len);
>>
>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>  						 src_cid, src_port,
>>  						 dst_cid, dst_port);
>>  		if (!skb) {
>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  			break;
>>  		}
>>
>> +		/* This is last skb to send this portion of data. */
>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>> +							    info->msg,
>> +							    can_zcopy)) {
>> +				ret = -ENOMEM;
>> +				break;
>> +			}
>> +		}
>> +
>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>
>>  		ret = t_ops->send_pkt(skb);
>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>  	if (!t)
>>  		return -ENOTCONN;
>>
>> -	reply = virtio_transport_alloc_skb(&info, 0,
>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>  					   le64_to_cpu(hdr->dst_cid),
>>  					   le32_to_cpu(hdr->dst_port),
>>  					   le64_to_cpu(hdr->src_cid),
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25  8:43     ` Stefano Garzarella
@ 2023-07-25  8:46       ` Arseniy Krasnov
  2023-07-25  9:16         ` Arseniy Krasnov
  0 siblings, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25  8:46 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa



On 25.07.2023 11:43, Stefano Garzarella wrote:
> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>> flag is set and zerocopy transmission is possible (enabled in socket
>>> options and transport allows zerocopy), then non-linear skb will be
>>> created and filled with the pages of user's buffer. Pages of user's
>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>> this patch does is replace type of skb owning: instead of calling
>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>> of socket, so to decrease this field correctly proper skb destructor is
>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>
>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>> ---
>>>  Changelog:
>>>  v5(big patchset) -> v1:
>>>   * Refactorings of 'if' conditions.
>>>   * Remove extra blank line.
>>>   * Remove 'frag_off' field unneeded init.
>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>     and non-linear skb with provided data.
>>>  v1 -> v2:
>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>  v2 -> v3:
>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>     If this callback is not set in transport - transport allows to send
>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>     as in vhost to support partial send of current skb). This callback
>>>     will be enabled only for G2H path. For details pls see comment
>>>     'Check that tx queue...' below.
>>>
>>>  include/net/af_vsock.h                  |   3 +
>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>
>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>> index 0e7504a42925..a6b346eeeb8e 100644
>>> --- a/include/net/af_vsock.h
>>> +++ b/include/net/af_vsock.h
>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>
>>>      /* Read a single skb */
>>>      int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>> +
>>> +    /* Zero-copy. */
>>> +    bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>  };
>>>
>>>  /**** CORE ****/
>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>> --- a/net/vmw_vsock/virtio_transport.c
>>> +++ b/net/vmw_vsock/virtio_transport.c
>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>      queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>  }
>>>
>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>> +{
>>> +    struct virtio_vsock *vsock;
>>> +    bool res = false;
>>> +
>>> +    rcu_read_lock();
>>> +
>>> +    vsock = rcu_dereference(the_virtio_vsock);
>>> +    if (vsock) {
>>> +        struct virtqueue *vq;
>>> +        int iov_pages;
>>> +
>>> +        vq = vsock->vqs[VSOCK_VQ_TX];
>>> +
>>> +        iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>> +
>>> +        /* Check that tx queue is large enough to keep whole
>>> +         * data to send. This is needed, because when there is
>>> +         * not enough free space in the queue, current skb to
>>> +         * send will be reinserted to the head of tx list of
>>> +         * the socket to retry transmission later, so if skb
>>> +         * is bigger than whole queue, it will be reinserted
>>> +         * again and again, thus blocking other skbs to be sent.
>>> +         * Each page of the user provided buffer will be added
>>> +         * as a single buffer to the tx virtqueue, so compare
>>> +         * number of pages against maximum capacity of the queue.
>>> +         * +1 means buffer for the packet header.
>>> +         */
>>> +        if (iov_pages + 1 <= vq->num_max)
>>
>> I think this check is actual only for case one we don't have indirect buffer feature.
>> With indirect mode whole data to send will be packed into one indirect buffer.
> 
> I think so.
> So, should we check also that here?
> 
>>
>> Thanks, Arseniy
>>
>>> +            res = true;
>>> +    }
>>> +
>>> +    rcu_read_unlock();
>>> +
>>> +    return res;
>>> +}
>>> +
>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>
>>>  static struct virtio_transport virtio_transport = {
>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>          .seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>          .seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>
>>> +        .msgzerocopy_check_iov      = virtio_transport_msgzerocopy_check_iov,
>>> +
>>>          .notify_poll_in           = virtio_transport_notify_poll_in,
>>>          .notify_poll_out          = virtio_transport_notify_poll_out,
>>>          .notify_recv_init         = virtio_transport_notify_recv_init,
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index 26a4d10da205..e4e3d541aff4 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>      return container_of(t, struct virtio_transport, transport);
>>>  }
>>>
>>> -/* Returns a new packet on success, otherwise returns NULL.
>>> - *
>>> - * If NULL is returned, errp is set to a negative errno.
>>> - */
>>> -static struct sk_buff *
>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>> -               size_t len,
>>> -               u32 src_cid,
>>> -               u32 src_port,
>>> -               u32 dst_cid,
>>> -               u32 dst_port)
>>> -{
>>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>> -    struct virtio_vsock_hdr *hdr;
>>> -    struct sk_buff *skb;
>>> -    void *payload;
>>> -    int err;
>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>> +                       size_t max_to_send)
>>> +{
>>> +    const struct vsock_transport *t;
>>> +    struct iov_iter *iov_iter;
>>>
>>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>> -    if (!skb)
>>> -        return NULL;
>>> +    if (!info->msg)
>>> +        return false;
>>>
>>> -    hdr = virtio_vsock_hdr(skb);
>>> -    hdr->type    = cpu_to_le16(info->type);
>>> -    hdr->op        = cpu_to_le16(info->op);
>>> -    hdr->src_cid    = cpu_to_le64(src_cid);
>>> -    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>> -    hdr->src_port    = cpu_to_le32(src_port);
>>> -    hdr->dst_port    = cpu_to_le32(dst_port);
>>> -    hdr->flags    = cpu_to_le32(info->flags);
>>> -    hdr->len    = cpu_to_le32(len);
>>> +    iov_iter = &info->msg->msg_iter;
>>>
>>> -    if (info->msg && len > 0) {
>>> -        payload = skb_put(skb, len);
>>> -        err = memcpy_from_msg(payload, info->msg, len);
>>> -        if (err)
>>> -            goto out;
>>> +    t = vsock_core_get_transport(info->vsk);
>>>
>>> -        if (msg_data_left(info->msg) == 0 &&
>>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>> +    if (t->msgzerocopy_check_iov &&
>>> +        !t->msgzerocopy_check_iov(iov_iter))
>>> +        return false;
> 
> I'd avoid adding a new transport callback used only internally in virtio
> transports.

Ok, I see.

> 
> Usually the transport callbacks are used in af_vsock.c, if we need a
> callback just for virtio transports, maybe better to add it in struct
> virtio_vsock_pkt_info or struct virtio_vsock_sock.
> 
> Maybe better the last one so we don't have to allocate pointer space
> for each packet and you should reach it via info.

Ok, thanks, I'll try this way

Thanks, Arseniy

> 
> Thanks,
> Stefano
> 
>>>
>>> -            if (info->msg->msg_flags & MSG_EOR)
>>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>> -        }
>>> +    /* Data is simple buffer. */
>>> +    if (iter_is_ubuf(iov_iter))
>>> +        return true;
>>> +
>>> +    if (!iter_is_iovec(iov_iter))
>>> +        return false;
>>> +
>>> +    if (iov_iter->iov_offset)
>>> +        return false;
>>> +
>>> +    /* We can't send whole iov. */
>>> +    if (iov_iter->count > max_to_send)
>>> +        return false;
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>> +                       struct sk_buff *skb,
>>> +                       struct msghdr *msg,
>>> +                       bool zerocopy)
>>> +{
>>> +    struct ubuf_info *uarg;
>>> +
>>> +    if (msg->msg_ubuf) {
>>> +        uarg = msg->msg_ubuf;
>>> +        net_zcopy_get(uarg);
>>> +    } else {
>>> +        struct iov_iter *iter = &msg->msg_iter;
>>> +        struct ubuf_info_msgzc *uarg_zc;
>>> +        int len;
>>> +
>>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>> +         * checked before.
>>> +         */
>>> +        if (iter_is_iovec(iter))
>>> +            len = iov_length(iter->__iov, iter->nr_segs);
>>> +        else
>>> +            len = iter->count;
>>> +
>>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> +                        len,
>>> +                        NULL);
>>> +        if (!uarg)
>>> +            return -1;
>>> +
>>> +        uarg_zc = uarg_to_msgzc(uarg);
>>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>      }
>>>
>>> -    if (info->reply)
>>> -        virtio_vsock_skb_set_reply(skb);
>>> +    skb_zcopy_init(skb, uarg);
>>>
>>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>> -                     dst_cid, dst_port,
>>> -                     len,
>>> -                     info->type,
>>> -                     info->op,
>>> -                     info->flags);
>>> +    return 0;
>>> +}
>>>
>>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>> -        goto out;
>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>> +                     struct virtio_vsock_pkt_info *info,
>>> +                     size_t len,
>>> +                     bool zcopy)
>>> +{
>>> +    if (zcopy) {
>>> +        return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>> +                          &info->msg->msg_iter,
>>> +                          len);
>>> +    } else {
>>> +        void *payload;
>>> +        int err;
>>> +
>>> +        payload = skb_put(skb, len);
>>> +        err = memcpy_from_msg(payload, info->msg, len);
>>> +        if (err)
>>> +            return -1;
>>> +
>>> +        if (msg_data_left(info->msg))
>>> +            return 0;
>>> +
>>> +        return 0;
>>>      }
>>> +}
>>>
>>> -    return skb;
>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>> +                      struct virtio_vsock_pkt_info *info,
>>> +                      u32 src_cid,
>>> +                      u32 src_port,
>>> +                      u32 dst_cid,
>>> +                      u32 dst_port,
>>> +                      size_t len)
>>> +{
>>> +    struct virtio_vsock_hdr *hdr;
>>>
>>> -out:
>>> -    kfree_skb(skb);
>>> -    return NULL;
>>> +    hdr = virtio_vsock_hdr(skb);
>>> +    hdr->type    = cpu_to_le16(info->type);
>>> +    hdr->op        = cpu_to_le16(info->op);
>>> +    hdr->src_cid    = cpu_to_le64(src_cid);
>>> +    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>> +    hdr->src_port    = cpu_to_le32(src_port);
>>> +    hdr->dst_port    = cpu_to_le32(dst_port);
>>> +    hdr->flags    = cpu_to_le32(info->flags);
>>> +    hdr->len    = cpu_to_le32(len);
>>>  }
>>>
>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>          return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>  }
>>>
>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>> +                          struct virtio_vsock_pkt_info *info,
>>> +                          size_t payload_len,
>>> +                          bool zcopy,
>>> +                          u32 src_cid,
>>> +                          u32 src_port,
>>> +                          u32 dst_cid,
>>> +                          u32 dst_port)
>>> +{
>>> +    struct sk_buff *skb;
>>> +    size_t skb_len;
>>> +
>>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>> +
>>> +    if (!zcopy)
>>> +        skb_len += payload_len;
>>> +
>>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>> +    if (!skb)
>>> +        return NULL;
>>> +
>>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>> +                  dst_cid, dst_port,
>>> +                  payload_len);
>>> +
>>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>>> +     */
>>> +    if (vsk)
>>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>>> +
>>> +    if (info->msg && payload_len > 0) {
>>> +        int err;
>>> +
>>> +        err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>> +        if (err)
>>> +            goto out;
>>> +
>>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>> +            struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>> +
>>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>> +
>>> +            if (info->msg->msg_flags & MSG_EOR)
>>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>> +        }
>>> +    }
>>> +
>>> +    if (info->reply)
>>> +        virtio_vsock_skb_set_reply(skb);
>>> +
>>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>> +                     dst_cid, dst_port,
>>> +                     payload_len,
>>> +                     info->type,
>>> +                     info->op,
>>> +                     info->flags);
>>> +
>>> +    return skb;
>>> +out:
>>> +    kfree_skb(skb);
>>> +    return NULL;
>>> +}
>>> +
>>>  /* This function can only be used on connecting/connected sockets,
>>>   * since a socket assigned to a transport is required.
>>>   *
>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>                        struct virtio_vsock_pkt_info *info)
>>>  {
>>> +    u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>      u32 src_cid, src_port, dst_cid, dst_port;
>>>      const struct virtio_transport *t_ops;
>>>      struct virtio_vsock_sock *vvs;
>>>      u32 pkt_len = info->pkt_len;
>>> +    bool can_zcopy = false;
>>>      u32 rest_len;
>>>      int ret;
>>>
>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>      if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>          return pkt_len;
>>>
>>> +    if (info->msg) {
>>> +        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>> +         * there is no MSG_ZEROCOPY flag set.
>>> +         */
>>> +        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>> +            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>> +
>>> +        if (info->msg->msg_flags & MSG_ZEROCOPY)
>>> +            can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>> +
>>> +        if (can_zcopy)
>>> +            max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>> +                        (MAX_SKB_FRAGS * PAGE_SIZE));
>>> +    }
>>> +
>>>      rest_len = pkt_len;
>>>
>>>      do {
>>>          struct sk_buff *skb;
>>>          size_t skb_len;
>>>
>>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>> +        skb_len = min(max_skb_len, rest_len);
>>>
>>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>                           src_cid, src_port,
>>>                           dst_cid, dst_port);
>>>          if (!skb) {
>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>              break;
>>>          }
>>>
>>> +        /* This is last skb to send this portion of data. */
>>> +        if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>> +            skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>>> +                                info->msg,
>>> +                                can_zcopy)) {
>>> +                ret = -ENOMEM;
>>> +                break;
>>> +            }
>>> +        }
>>> +
>>>          virtio_transport_inc_tx_pkt(vvs, skb);
>>>
>>>          ret = t_ops->send_pkt(skb);
>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>      if (!t)
>>>          return -ENOTCONN;
>>>
>>> -    reply = virtio_transport_alloc_skb(&info, 0,
>>> +    reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>                         le64_to_cpu(hdr->dst_cid),
>>>                         le32_to_cpu(hdr->dst_port),
>>>                         le64_to_cpu(hdr->src_cid),
>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25  8:46       ` Arseniy Krasnov
@ 2023-07-25  9:16         ` Arseniy Krasnov
  2023-07-25 12:28           ` Stefano Garzarella
  0 siblings, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25  9:16 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa



On 25.07.2023 11:46, Arseniy Krasnov wrote:
> 
> 
> On 25.07.2023 11:43, Stefano Garzarella wrote:
>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>>
>>>
>>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>> options and transport allows zerocopy), then non-linear skb will be
>>>> created and filled with the pages of user's buffer. Pages of user's
>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>> this patch does is replace type of skb owning: instead of calling
>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>
>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>> ---
>>>>  Changelog:
>>>>  v5(big patchset) -> v1:
>>>>   * Refactorings of 'if' conditions.
>>>>   * Remove extra blank line.
>>>>   * Remove 'frag_off' field unneeded init.
>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>     and non-linear skb with provided data.
>>>>  v1 -> v2:
>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>  v2 -> v3:
>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>     If this callback is not set in transport - transport allows to send
>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>     as in vhost to support partial send of current skb). This callback
>>>>     will be enabled only for G2H path. For details pls see comment
>>>>     'Check that tx queue...' below.
>>>>
>>>>  include/net/af_vsock.h                  |   3 +
>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>> --- a/include/net/af_vsock.h
>>>> +++ b/include/net/af_vsock.h
>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>
>>>>      /* Read a single skb */
>>>>      int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>> +
>>>> +    /* Zero-copy. */
>>>> +    bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>  };
>>>>
>>>>  /**** CORE ****/
>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>      queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>  }
>>>>
>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>> +{
>>>> +    struct virtio_vsock *vsock;
>>>> +    bool res = false;
>>>> +
>>>> +    rcu_read_lock();
>>>> +
>>>> +    vsock = rcu_dereference(the_virtio_vsock);
>>>> +    if (vsock) {
>>>> +        struct virtqueue *vq;
>>>> +        int iov_pages;
>>>> +
>>>> +        vq = vsock->vqs[VSOCK_VQ_TX];
>>>> +
>>>> +        iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>> +
>>>> +        /* Check that tx queue is large enough to keep whole
>>>> +         * data to send. This is needed, because when there is
>>>> +         * not enough free space in the queue, current skb to
>>>> +         * send will be reinserted to the head of tx list of
>>>> +         * the socket to retry transmission later, so if skb
>>>> +         * is bigger than whole queue, it will be reinserted
>>>> +         * again and again, thus blocking other skbs to be sent.
>>>> +         * Each page of the user provided buffer will be added
>>>> +         * as a single buffer to the tx virtqueue, so compare
>>>> +         * number of pages against maximum capacity of the queue.
>>>> +         * +1 means buffer for the packet header.
>>>> +         */
>>>> +        if (iov_pages + 1 <= vq->num_max)
>>>
>>> I think this check is actual only for case one we don't have indirect buffer feature.
>>> With indirect mode whole data to send will be packed into one indirect buffer.
>>
>> I think so.
>> So, should we check also that here?
>>
>>>
>>> Thanks, Arseniy
>>>
>>>> +            res = true;
>>>> +    }
>>>> +
>>>> +    rcu_read_unlock();
>>>> +
>>>> +    return res;
>>>> +}
>>>> +
>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>
>>>>  static struct virtio_transport virtio_transport = {
>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>          .seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>          .seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>
>>>> +        .msgzerocopy_check_iov      = virtio_transport_msgzerocopy_check_iov,
>>>> +
>>>>          .notify_poll_in           = virtio_transport_notify_poll_in,
>>>>          .notify_poll_out          = virtio_transport_notify_poll_out,
>>>>          .notify_recv_init         = virtio_transport_notify_recv_init,
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>      return container_of(t, struct virtio_transport, transport);
>>>>  }
>>>>
>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>> - *
>>>> - * If NULL is returned, errp is set to a negative errno.
>>>> - */
>>>> -static struct sk_buff *
>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>> -               size_t len,
>>>> -               u32 src_cid,
>>>> -               u32 src_port,
>>>> -               u32 dst_cid,
>>>> -               u32 dst_port)
>>>> -{
>>>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>> -    struct virtio_vsock_hdr *hdr;
>>>> -    struct sk_buff *skb;
>>>> -    void *payload;
>>>> -    int err;
>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>> +                       size_t max_to_send)
>>>> +{
>>>> +    const struct vsock_transport *t;
>>>> +    struct iov_iter *iov_iter;
>>>>
>>>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> -    if (!skb)
>>>> -        return NULL;
>>>> +    if (!info->msg)
>>>> +        return false;
>>>>
>>>> -    hdr = virtio_vsock_hdr(skb);
>>>> -    hdr->type    = cpu_to_le16(info->type);
>>>> -    hdr->op        = cpu_to_le16(info->op);
>>>> -    hdr->src_cid    = cpu_to_le64(src_cid);
>>>> -    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>>> -    hdr->src_port    = cpu_to_le32(src_port);
>>>> -    hdr->dst_port    = cpu_to_le32(dst_port);
>>>> -    hdr->flags    = cpu_to_le32(info->flags);
>>>> -    hdr->len    = cpu_to_le32(len);
>>>> +    iov_iter = &info->msg->msg_iter;
>>>>
>>>> -    if (info->msg && len > 0) {
>>>> -        payload = skb_put(skb, len);
>>>> -        err = memcpy_from_msg(payload, info->msg, len);
>>>> -        if (err)
>>>> -            goto out;
>>>> +    t = vsock_core_get_transport(info->vsk);
>>>>
>>>> -        if (msg_data_left(info->msg) == 0 &&
>>>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +    if (t->msgzerocopy_check_iov &&
>>>> +        !t->msgzerocopy_check_iov(iov_iter))
>>>> +        return false;
>>
>> I'd avoid adding a new transport callback used only internally in virtio
>> transports.
> 
> Ok, I see.
> 
>>
>> Usually the transport callbacks are used in af_vsock.c, if we need a
>> callback just for virtio transports, maybe better to add it in struct
>> virtio_vsock_pkt_info or struct virtio_vsock_sock.

Hm, may be I just need to move this callback from 'struct vsock_transport' to parent 'struct virtio_transport',
after 'send_pkt' callback. In this case:
1) AF_VSOCK part is not touched.
2) This callback stays in 'virtio_transport.c' and is set also in this file.
   vhost and loopback are unchanged - only 'send_pkt' still enabled in both
   files for these two transports.

Thanks, Arseniy

>>
>> Maybe better the last one so we don't have to allocate pointer space
>> for each packet and you should reach it via info.
> 
> Ok, thanks, I'll try this way
> 
> Thanks, Arseniy
> 
>>
>> Thanks,
>> Stefano
>>
>>>>
>>>> -            if (info->msg->msg_flags & MSG_EOR)
>>>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> -        }
>>>> +    /* Data is simple buffer. */
>>>> +    if (iter_is_ubuf(iov_iter))
>>>> +        return true;
>>>> +
>>>> +    if (!iter_is_iovec(iov_iter))
>>>> +        return false;
>>>> +
>>>> +    if (iov_iter->iov_offset)
>>>> +        return false;
>>>> +
>>>> +    /* We can't send whole iov. */
>>>> +    if (iov_iter->count > max_to_send)
>>>> +        return false;
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>> +                       struct sk_buff *skb,
>>>> +                       struct msghdr *msg,
>>>> +                       bool zerocopy)
>>>> +{
>>>> +    struct ubuf_info *uarg;
>>>> +
>>>> +    if (msg->msg_ubuf) {
>>>> +        uarg = msg->msg_ubuf;
>>>> +        net_zcopy_get(uarg);
>>>> +    } else {
>>>> +        struct iov_iter *iter = &msg->msg_iter;
>>>> +        struct ubuf_info_msgzc *uarg_zc;
>>>> +        int len;
>>>> +
>>>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>> +         * checked before.
>>>> +         */
>>>> +        if (iter_is_iovec(iter))
>>>> +            len = iov_length(iter->__iov, iter->nr_segs);
>>>> +        else
>>>> +            len = iter->count;
>>>> +
>>>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>> +                        len,
>>>> +                        NULL);
>>>> +        if (!uarg)
>>>> +            return -1;
>>>> +
>>>> +        uarg_zc = uarg_to_msgzc(uarg);
>>>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>      }
>>>>
>>>> -    if (info->reply)
>>>> -        virtio_vsock_skb_set_reply(skb);
>>>> +    skb_zcopy_init(skb, uarg);
>>>>
>>>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> -                     dst_cid, dst_port,
>>>> -                     len,
>>>> -                     info->type,
>>>> -                     info->op,
>>>> -                     info->flags);
>>>> +    return 0;
>>>> +}
>>>>
>>>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>> -        goto out;
>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>> +                     struct virtio_vsock_pkt_info *info,
>>>> +                     size_t len,
>>>> +                     bool zcopy)
>>>> +{
>>>> +    if (zcopy) {
>>>> +        return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>> +                          &info->msg->msg_iter,
>>>> +                          len);
>>>> +    } else {
>>>> +        void *payload;
>>>> +        int err;
>>>> +
>>>> +        payload = skb_put(skb, len);
>>>> +        err = memcpy_from_msg(payload, info->msg, len);
>>>> +        if (err)
>>>> +            return -1;
>>>> +
>>>> +        if (msg_data_left(info->msg))
>>>> +            return 0;
>>>> +
>>>> +        return 0;
>>>>      }
>>>> +}
>>>>
>>>> -    return skb;
>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>> +                      struct virtio_vsock_pkt_info *info,
>>>> +                      u32 src_cid,
>>>> +                      u32 src_port,
>>>> +                      u32 dst_cid,
>>>> +                      u32 dst_port,
>>>> +                      size_t len)
>>>> +{
>>>> +    struct virtio_vsock_hdr *hdr;
>>>>
>>>> -out:
>>>> -    kfree_skb(skb);
>>>> -    return NULL;
>>>> +    hdr = virtio_vsock_hdr(skb);
>>>> +    hdr->type    = cpu_to_le16(info->type);
>>>> +    hdr->op        = cpu_to_le16(info->op);
>>>> +    hdr->src_cid    = cpu_to_le64(src_cid);
>>>> +    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>>> +    hdr->src_port    = cpu_to_le32(src_port);
>>>> +    hdr->dst_port    = cpu_to_le32(dst_port);
>>>> +    hdr->flags    = cpu_to_le32(info->flags);
>>>> +    hdr->len    = cpu_to_le32(len);
>>>>  }
>>>>
>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>          return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>  }
>>>>
>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>> +                          struct virtio_vsock_pkt_info *info,
>>>> +                          size_t payload_len,
>>>> +                          bool zcopy,
>>>> +                          u32 src_cid,
>>>> +                          u32 src_port,
>>>> +                          u32 dst_cid,
>>>> +                          u32 dst_port)
>>>> +{
>>>> +    struct sk_buff *skb;
>>>> +    size_t skb_len;
>>>> +
>>>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>> +
>>>> +    if (!zcopy)
>>>> +        skb_len += payload_len;
>>>> +
>>>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> +    if (!skb)
>>>> +        return NULL;
>>>> +
>>>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>> +                  dst_cid, dst_port,
>>>> +                  payload_len);
>>>> +
>>>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>>>> +     */
>>>> +    if (vsk)
>>>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>>>> +
>>>> +    if (info->msg && payload_len > 0) {
>>>> +        int err;
>>>> +
>>>> +        err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>> +        if (err)
>>>> +            goto out;
>>>> +
>>>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> +            struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>> +
>>>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +
>>>> +            if (info->msg->msg_flags & MSG_EOR)
>>>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (info->reply)
>>>> +        virtio_vsock_skb_set_reply(skb);
>>>> +
>>>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> +                     dst_cid, dst_port,
>>>> +                     payload_len,
>>>> +                     info->type,
>>>> +                     info->op,
>>>> +                     info->flags);
>>>> +
>>>> +    return skb;
>>>> +out:
>>>> +    kfree_skb(skb);
>>>> +    return NULL;
>>>> +}
>>>> +
>>>>  /* This function can only be used on connecting/connected sockets,
>>>>   * since a socket assigned to a transport is required.
>>>>   *
>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>                        struct virtio_vsock_pkt_info *info)
>>>>  {
>>>> +    u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>      u32 src_cid, src_port, dst_cid, dst_port;
>>>>      const struct virtio_transport *t_ops;
>>>>      struct virtio_vsock_sock *vvs;
>>>>      u32 pkt_len = info->pkt_len;
>>>> +    bool can_zcopy = false;
>>>>      u32 rest_len;
>>>>      int ret;
>>>>
>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>      if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>          return pkt_len;
>>>>
>>>> +    if (info->msg) {
>>>> +        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>> +         * there is no MSG_ZEROCOPY flag set.
>>>> +         */
>>>> +        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>> +            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>> +
>>>> +        if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>> +            can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>> +
>>>> +        if (can_zcopy)
>>>> +            max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>> +                        (MAX_SKB_FRAGS * PAGE_SIZE));
>>>> +    }
>>>> +
>>>>      rest_len = pkt_len;
>>>>
>>>>      do {
>>>>          struct sk_buff *skb;
>>>>          size_t skb_len;
>>>>
>>>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>> +        skb_len = min(max_skb_len, rest_len);
>>>>
>>>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>>>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>                           src_cid, src_port,
>>>>                           dst_cid, dst_port);
>>>>          if (!skb) {
>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>              break;
>>>>          }
>>>>
>>>> +        /* This is last skb to send this portion of data. */
>>>> +        if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>> +            skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>> +                                info->msg,
>>>> +                                can_zcopy)) {
>>>> +                ret = -ENOMEM;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>>          virtio_transport_inc_tx_pkt(vvs, skb);
>>>>
>>>>          ret = t_ops->send_pkt(skb);
>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>      if (!t)
>>>>          return -ENOTCONN;
>>>>
>>>> -    reply = virtio_transport_alloc_skb(&info, 0,
>>>> +    reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>                         le64_to_cpu(hdr->dst_cid),
>>>>                         le32_to_cpu(hdr->dst_port),
>>>>                         le64_to_cpu(hdr->src_cid),
>>>
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-21  5:09   ` Arseniy Krasnov
  2023-07-25  8:43     ` Stefano Garzarella
@ 2023-07-25 11:50     ` Michael S. Tsirkin
  2023-07-25 12:53       ` Stefano Garzarella
  2023-07-25 13:04       ` Arseniy Krasnov
  1 sibling, 2 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 11:50 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa

On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
> 
> 
> On 21.07.2023 00:42, Arseniy Krasnov wrote:
> > This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> > flag is set and zerocopy transmission is possible (enabled in socket
> > options and transport allows zerocopy), then non-linear skb will be
> > created and filled with the pages of user's buffer. Pages of user's
> > buffer are locked in memory by 'get_user_pages()'. Second thing that
> > this patch does is replace type of skb owning: instead of calling
> > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> > of socket, so to decrease this field correctly proper skb destructor is
> > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> > 
> > Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> > ---
> >  Changelog:
> >  v5(big patchset) -> v1:
> >   * Refactorings of 'if' conditions.
> >   * Remove extra blank line.
> >   * Remove 'frag_off' field unneeded init.
> >   * Add function 'virtio_transport_fill_skb()' which fills both linear
> >     and non-linear skb with provided data.
> >  v1 -> v2:
> >   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> >  v2 -> v3:
> >   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> >     provided 'iov_iter' with data could be sent in a zerocopy mode.
> >     If this callback is not set in transport - transport allows to send
> >     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> >     then zerocopy is allowed. Reason of this callback is that in case of
> >     G2H transmission we insert whole skb to the tx virtio queue and such
> >     skb must fit to the size of the virtio queue to be sent in a single
> >     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> >     as in vhost to support partial send of current skb). This callback
> >     will be enabled only for G2H path. For details pls see comment 
> >     'Check that tx queue...' below.
> > 
> >  include/net/af_vsock.h                  |   3 +
> >  net/vmw_vsock/virtio_transport.c        |  39 ++++
> >  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> >  3 files changed, 241 insertions(+), 58 deletions(-)
> > 
> > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > index 0e7504a42925..a6b346eeeb8e 100644
> > --- a/include/net/af_vsock.h
> > +++ b/include/net/af_vsock.h
> > @@ -177,6 +177,9 @@ struct vsock_transport {
> >  
> >  	/* Read a single skb */
> >  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> > +
> > +	/* Zero-copy. */
> > +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> >  };
> >  
> >  /**** CORE ****/
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index 7bbcc8093e51..23cb8ed638c4 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> >  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> >  }
> >  
> > +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> > +{
> > +	struct virtio_vsock *vsock;
> > +	bool res = false;
> > +
> > +	rcu_read_lock();
> > +
> > +	vsock = rcu_dereference(the_virtio_vsock);
> > +	if (vsock) {
> > +		struct virtqueue *vq;
> > +		int iov_pages;
> > +
> > +		vq = vsock->vqs[VSOCK_VQ_TX];
> > +
> > +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> > +
> > +		/* Check that tx queue is large enough to keep whole
> > +		 * data to send. This is needed, because when there is
> > +		 * not enough free space in the queue, current skb to
> > +		 * send will be reinserted to the head of tx list of
> > +		 * the socket to retry transmission later, so if skb
> > +		 * is bigger than whole queue, it will be reinserted
> > +		 * again and again, thus blocking other skbs to be sent.
> > +		 * Each page of the user provided buffer will be added
> > +		 * as a single buffer to the tx virtqueue, so compare
> > +		 * number of pages against maximum capacity of the queue.
> > +		 * +1 means buffer for the packet header.
> > +		 */
> > +		if (iov_pages + 1 <= vq->num_max)
> 
> I think this check is actual only for case one we don't have indirect buffer feature.
> With indirect mode whole data to send will be packed into one indirect buffer.
> 
> Thanks, Arseniy

Actually the reverse. With indirect you are limited to num_max.
Without you are limited to whatever space is left in the
queue (which you did not check here, so you should).


> > +			res = true;
> > +	}
> > +
> > +	rcu_read_unlock();

Just curious:
is the point of all this RCU dance to allow vsock
to change from under us? then why is it ok to
have it change? the virtio_transport_msgzerocopy_check_iov
will then refer to the old vsock ...


> > +
> > +	return res;
> > +}
> > +
> >  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
> >  
> >  static struct virtio_transport virtio_transport = {
> > @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
> >  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
> >  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
> >  
> > +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> > +
> >  		.notify_poll_in           = virtio_transport_notify_poll_in,
> >  		.notify_poll_out          = virtio_transport_notify_poll_out,
> >  		.notify_recv_init         = virtio_transport_notify_recv_init,
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 26a4d10da205..e4e3d541aff4 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> >  	return container_of(t, struct virtio_transport, transport);
> >  }
> >  
> > -/* Returns a new packet on success, otherwise returns NULL.
> > - *
> > - * If NULL is returned, errp is set to a negative errno.
> > - */
> > -static struct sk_buff *
> > -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> > -			   size_t len,
> > -			   u32 src_cid,
> > -			   u32 src_port,
> > -			   u32 dst_cid,
> > -			   u32 dst_port)
> > -{
> > -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> > -	struct virtio_vsock_hdr *hdr;
> > -	struct sk_buff *skb;
> > -	void *payload;
> > -	int err;
> > +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> > +				       size_t max_to_send)
> > +{
> > +	const struct vsock_transport *t;
> > +	struct iov_iter *iov_iter;
> >  
> > -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> > -	if (!skb)
> > -		return NULL;
> > +	if (!info->msg)
> > +		return false;
> >  
> > -	hdr = virtio_vsock_hdr(skb);
> > -	hdr->type	= cpu_to_le16(info->type);
> > -	hdr->op		= cpu_to_le16(info->op);
> > -	hdr->src_cid	= cpu_to_le64(src_cid);
> > -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> > -	hdr->src_port	= cpu_to_le32(src_port);
> > -	hdr->dst_port	= cpu_to_le32(dst_port);
> > -	hdr->flags	= cpu_to_le32(info->flags);
> > -	hdr->len	= cpu_to_le32(len);
> > +	iov_iter = &info->msg->msg_iter;
> >  
> > -	if (info->msg && len > 0) {
> > -		payload = skb_put(skb, len);
> > -		err = memcpy_from_msg(payload, info->msg, len);
> > -		if (err)
> > -			goto out;
> > +	t = vsock_core_get_transport(info->vsk);
> >  
> > -		if (msg_data_left(info->msg) == 0 &&
> > -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> > -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> > +	if (t->msgzerocopy_check_iov &&
> > +	    !t->msgzerocopy_check_iov(iov_iter))
> > +		return false;
> >  
> > -			if (info->msg->msg_flags & MSG_EOR)
> > -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> > -		}
> > +	/* Data is simple buffer. */
> > +	if (iter_is_ubuf(iov_iter))
> > +		return true;
> > +
> > +	if (!iter_is_iovec(iov_iter))
> > +		return false;
> > +
> > +	if (iov_iter->iov_offset)
> > +		return false;
> > +
> > +	/* We can't send whole iov. */
> > +	if (iov_iter->count > max_to_send)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> > +					   struct sk_buff *skb,
> > +					   struct msghdr *msg,
> > +					   bool zerocopy)
> > +{
> > +	struct ubuf_info *uarg;
> > +
> > +	if (msg->msg_ubuf) {
> > +		uarg = msg->msg_ubuf;
> > +		net_zcopy_get(uarg);
> > +	} else {
> > +		struct iov_iter *iter = &msg->msg_iter;
> > +		struct ubuf_info_msgzc *uarg_zc;
> > +		int len;
> > +
> > +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> > +		 * checked before.
> > +		 */
> > +		if (iter_is_iovec(iter))
> > +			len = iov_length(iter->__iov, iter->nr_segs);
> > +		else
> > +			len = iter->count;
> > +
> > +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> > +					    len,
> > +					    NULL);
> > +		if (!uarg)
> > +			return -1;
> > +
> > +		uarg_zc = uarg_to_msgzc(uarg);
> > +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
> >  	}
> >  
> > -	if (info->reply)
> > -		virtio_vsock_skb_set_reply(skb);
> > +	skb_zcopy_init(skb, uarg);
> >  
> > -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> > -					 dst_cid, dst_port,
> > -					 len,
> > -					 info->type,
> > -					 info->op,
> > -					 info->flags);
> > +	return 0;
> > +}
> >  
> > -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> > -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> > -		goto out;
> > +static int virtio_transport_fill_skb(struct sk_buff *skb,
> > +				     struct virtio_vsock_pkt_info *info,
> > +				     size_t len,
> > +				     bool zcopy)
> > +{
> > +	if (zcopy) {
> > +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> > +					      &info->msg->msg_iter,
> > +					      len);
> > +	} else {
> > +		void *payload;
> > +		int err;
> > +
> > +		payload = skb_put(skb, len);
> > +		err = memcpy_from_msg(payload, info->msg, len);
> > +		if (err)
> > +			return -1;
> > +
> > +		if (msg_data_left(info->msg))
> > +			return 0;
> > +
> > +		return 0;
> >  	}
> > +}
> >  
> > -	return skb;
> > +static void virtio_transport_init_hdr(struct sk_buff *skb,
> > +				      struct virtio_vsock_pkt_info *info,
> > +				      u32 src_cid,
> > +				      u32 src_port,
> > +				      u32 dst_cid,
> > +				      u32 dst_port,
> > +				      size_t len)
> > +{
> > +	struct virtio_vsock_hdr *hdr;
> >  
> > -out:
> > -	kfree_skb(skb);
> > -	return NULL;
> > +	hdr = virtio_vsock_hdr(skb);
> > +	hdr->type	= cpu_to_le16(info->type);
> > +	hdr->op		= cpu_to_le16(info->op);
> > +	hdr->src_cid	= cpu_to_le64(src_cid);
> > +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> > +	hdr->src_port	= cpu_to_le32(src_port);
> > +	hdr->dst_port	= cpu_to_le32(dst_port);
> > +	hdr->flags	= cpu_to_le32(info->flags);
> > +	hdr->len	= cpu_to_le32(len);
> >  }
> >  
> >  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> > @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >  }
> >  
> > +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> > +						  struct virtio_vsock_pkt_info *info,
> > +						  size_t payload_len,
> > +						  bool zcopy,
> > +						  u32 src_cid,
> > +						  u32 src_port,
> > +						  u32 dst_cid,
> > +						  u32 dst_port)
> > +{
> > +	struct sk_buff *skb;
> > +	size_t skb_len;
> > +
> > +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> > +
> > +	if (!zcopy)
> > +		skb_len += payload_len;
> > +
> > +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> > +	if (!skb)
> > +		return NULL;
> > +
> > +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> > +				  dst_cid, dst_port,
> > +				  payload_len);
> > +
> > +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> > +	 * owner of skb without check to update 'sk_wmem_alloc'.
> > +	 */
> > +	if (vsk)
> > +		skb_set_owner_w(skb, sk_vsock(vsk));
> > +
> > +	if (info->msg && payload_len > 0) {
> > +		int err;
> > +
> > +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> > +		if (err)
> > +			goto out;
> > +
> > +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> > +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> > +
> > +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> > +
> > +			if (info->msg->msg_flags & MSG_EOR)
> > +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> > +		}
> > +	}
> > +
> > +	if (info->reply)
> > +		virtio_vsock_skb_set_reply(skb);
> > +
> > +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> > +					 dst_cid, dst_port,
> > +					 payload_len,
> > +					 info->type,
> > +					 info->op,
> > +					 info->flags);
> > +
> > +	return skb;
> > +out:
> > +	kfree_skb(skb);
> > +	return NULL;
> > +}
> > +
> >  /* This function can only be used on connecting/connected sockets,
> >   * since a socket assigned to a transport is required.
> >   *
> > @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >  					  struct virtio_vsock_pkt_info *info)
> >  {
> > +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >  	u32 src_cid, src_port, dst_cid, dst_port;
> >  	const struct virtio_transport *t_ops;
> >  	struct virtio_vsock_sock *vvs;
> >  	u32 pkt_len = info->pkt_len;
> > +	bool can_zcopy = false;
> >  	u32 rest_len;
> >  	int ret;
> >  
> > @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >  		return pkt_len;
> >  
> > +	if (info->msg) {
> > +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> > +		 * there is no MSG_ZEROCOPY flag set.
> > +		 */
> > +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> > +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> > +
> > +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> > +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> > +
> > +		if (can_zcopy)
> > +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> > +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> > +	}
> > +
> >  	rest_len = pkt_len;
> >  
> >  	do {
> >  		struct sk_buff *skb;
> >  		size_t skb_len;
> >  
> > -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> > +		skb_len = min(max_skb_len, rest_len);
> >  
> > -		skb = virtio_transport_alloc_skb(info, skb_len,
> > +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
> >  						 src_cid, src_port,
> >  						 dst_cid, dst_port);
> >  		if (!skb) {
> > @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >  			break;
> >  		}
> >  
> > +		/* This is last skb to send this portion of data. */
> > +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> > +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> > +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> > +							    info->msg,
> > +							    can_zcopy)) {
> > +				ret = -ENOMEM;
> > +				break;
> > +			}
> > +		}
> > +
> >  		virtio_transport_inc_tx_pkt(vvs, skb);
> >  
> >  		ret = t_ops->send_pkt(skb);
> > @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> >  	if (!t)
> >  		return -ENOTCONN;
> >  
> > -	reply = virtio_transport_alloc_skb(&info, 0,
> > +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> >  					   le64_to_cpu(hdr->dst_cid),
> >  					   le32_to_cpu(hdr->dst_port),
> >  					   le64_to_cpu(hdr->src_cid),


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25  8:39     ` Arseniy Krasnov
@ 2023-07-25 11:59       ` Michael S. Tsirkin
  2023-07-25 13:10         ` Arseniy Krasnov
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 11:59 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa

On Tue, Jul 25, 2023 at 11:39:22AM +0300, Arseniy Krasnov wrote:
> 
> 
> On 25.07.2023 11:25, Michael S. Tsirkin wrote:
> > On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
> >> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> >> flag is set and zerocopy transmission is possible (enabled in socket
> >> options and transport allows zerocopy), then non-linear skb will be
> >> created and filled with the pages of user's buffer. Pages of user's
> >> buffer are locked in memory by 'get_user_pages()'. Second thing that
> >> this patch does is replace type of skb owning: instead of calling
> >> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> >> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> >> of socket, so to decrease this field correctly proper skb destructor is
> >> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> >>
> >> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> >> ---
> >>  Changelog:
> >>  v5(big patchset) -> v1:
> >>   * Refactorings of 'if' conditions.
> >>   * Remove extra blank line.
> >>   * Remove 'frag_off' field unneeded init.
> >>   * Add function 'virtio_transport_fill_skb()' which fills both linear
> >>     and non-linear skb with provided data.
> >>  v1 -> v2:
> >>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> >>  v2 -> v3:
> >>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> >>     provided 'iov_iter' with data could be sent in a zerocopy mode.
> >>     If this callback is not set in transport - transport allows to send
> >>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> >>     then zerocopy is allowed. Reason of this callback is that in case of
> >>     G2H transmission we insert whole skb to the tx virtio queue and such
> >>     skb must fit to the size of the virtio queue to be sent in a single
> >>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> >>     as in vhost to support partial send of current skb). This callback
> >>     will be enabled only for G2H path. For details pls see comment 
> >>     'Check that tx queue...' below.
> >>
> >>  include/net/af_vsock.h                  |   3 +
> >>  net/vmw_vsock/virtio_transport.c        |  39 ++++
> >>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> >>  3 files changed, 241 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >> index 0e7504a42925..a6b346eeeb8e 100644
> >> --- a/include/net/af_vsock.h
> >> +++ b/include/net/af_vsock.h
> >> @@ -177,6 +177,9 @@ struct vsock_transport {
> >>  
> >>  	/* Read a single skb */
> >>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> >> +
> >> +	/* Zero-copy. */
> >> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> >>  };
> >>  
> >>  /**** CORE ****/
> >> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >> index 7bbcc8093e51..23cb8ed638c4 100644
> >> --- a/net/vmw_vsock/virtio_transport.c
> >> +++ b/net/vmw_vsock/virtio_transport.c
> >> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> >>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> >>  }
> >>  
> >> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> >> +{
> >> +	struct virtio_vsock *vsock;
> >> +	bool res = false;
> >> +
> >> +	rcu_read_lock();
> >> +
> >> +	vsock = rcu_dereference(the_virtio_vsock);
> >> +	if (vsock) {
> >> +		struct virtqueue *vq;
> >> +		int iov_pages;
> >> +
> >> +		vq = vsock->vqs[VSOCK_VQ_TX];
> >> +
> >> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> >> +
> >> +		/* Check that tx queue is large enough to keep whole
> >> +		 * data to send. This is needed, because when there is
> >> +		 * not enough free space in the queue, current skb to
> >> +		 * send will be reinserted to the head of tx list of
> >> +		 * the socket to retry transmission later, so if skb
> >> +		 * is bigger than whole queue, it will be reinserted
> >> +		 * again and again, thus blocking other skbs to be sent.
> >> +		 * Each page of the user provided buffer will be added
> >> +		 * as a single buffer to the tx virtqueue, so compare
> >> +		 * number of pages against maximum capacity of the queue.
> >> +		 * +1 means buffer for the packet header.
> >> +		 */
> >> +		if (iov_pages + 1 <= vq->num_max)
> >> +			res = true;
> > 
> > 
> > Yes but can't there already be buffers in the queue?
> > Then you can't stick num_max there.
> 
> I think, that it is not critical, because vhost part always tries to process all
> incoming buffers (yes, 'vhost_exceeds_weight()' breaks at some moment, but it will
> reschedule tx kick ('vhost_vsock_handle_tx_kick()') work again), so current "too
> big" skb will wait until there will be enough space in queue and as it is requeued
> to the head of tx list it will be inserted to tx queue first.
> 
> But anyway, I agree that comparing to 'num_free' may be more effective to the whole
> system performance...
> 
> Thanks, Arseniy

Oh I see. It makes sense then - instead of copying just so we can
stick it in the queue, wait a bit and send later.
Also - for stream transports can't the message be split
and sent chunk by chunk? Better than copying ...


> > 
> > 
> >> +	}
> >> +
> >> +	rcu_read_unlock();
> >> +
> >> +	return res;
> >> +}
> >> +
> >>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
> >>  
> >>  static struct virtio_transport virtio_transport = {
> >> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
> >>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
> >>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
> >>  
> >> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> >> +
> >>  		.notify_poll_in           = virtio_transport_notify_poll_in,
> >>  		.notify_poll_out          = virtio_transport_notify_poll_out,
> >>  		.notify_recv_init         = virtio_transport_notify_recv_init,
> >> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >> index 26a4d10da205..e4e3d541aff4 100644
> >> --- a/net/vmw_vsock/virtio_transport_common.c
> >> +++ b/net/vmw_vsock/virtio_transport_common.c
> >> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> >>  	return container_of(t, struct virtio_transport, transport);
> >>  }
> >>  
> >> -/* Returns a new packet on success, otherwise returns NULL.
> >> - *
> >> - * If NULL is returned, errp is set to a negative errno.
> >> - */
> >> -static struct sk_buff *
> >> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> >> -			   size_t len,
> >> -			   u32 src_cid,
> >> -			   u32 src_port,
> >> -			   u32 dst_cid,
> >> -			   u32 dst_port)
> >> -{
> >> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> >> -	struct virtio_vsock_hdr *hdr;
> >> -	struct sk_buff *skb;
> >> -	void *payload;
> >> -	int err;
> >> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> >> +				       size_t max_to_send)
> >> +{
> >> +	const struct vsock_transport *t;
> >> +	struct iov_iter *iov_iter;
> >>  
> >> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >> -	if (!skb)
> >> -		return NULL;
> >> +	if (!info->msg)
> >> +		return false;
> >>  
> >> -	hdr = virtio_vsock_hdr(skb);
> >> -	hdr->type	= cpu_to_le16(info->type);
> >> -	hdr->op		= cpu_to_le16(info->op);
> >> -	hdr->src_cid	= cpu_to_le64(src_cid);
> >> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >> -	hdr->src_port	= cpu_to_le32(src_port);
> >> -	hdr->dst_port	= cpu_to_le32(dst_port);
> >> -	hdr->flags	= cpu_to_le32(info->flags);
> >> -	hdr->len	= cpu_to_le32(len);
> >> +	iov_iter = &info->msg->msg_iter;
> >>  
> >> -	if (info->msg && len > 0) {
> >> -		payload = skb_put(skb, len);
> >> -		err = memcpy_from_msg(payload, info->msg, len);
> >> -		if (err)
> >> -			goto out;
> >> +	t = vsock_core_get_transport(info->vsk);
> >>  
> >> -		if (msg_data_left(info->msg) == 0 &&
> >> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >> +	if (t->msgzerocopy_check_iov &&
> >> +	    !t->msgzerocopy_check_iov(iov_iter))
> >> +		return false;
> >>  
> >> -			if (info->msg->msg_flags & MSG_EOR)
> >> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >> -		}
> >> +	/* Data is simple buffer. */
> >> +	if (iter_is_ubuf(iov_iter))
> >> +		return true;
> >> +
> >> +	if (!iter_is_iovec(iov_iter))
> >> +		return false;
> >> +
> >> +	if (iov_iter->iov_offset)
> >> +		return false;
> >> +
> >> +	/* We can't send whole iov. */
> >> +	if (iov_iter->count > max_to_send)
> >> +		return false;
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> >> +					   struct sk_buff *skb,
> >> +					   struct msghdr *msg,
> >> +					   bool zerocopy)
> >> +{
> >> +	struct ubuf_info *uarg;
> >> +
> >> +	if (msg->msg_ubuf) {
> >> +		uarg = msg->msg_ubuf;
> >> +		net_zcopy_get(uarg);
> >> +	} else {
> >> +		struct iov_iter *iter = &msg->msg_iter;
> >> +		struct ubuf_info_msgzc *uarg_zc;
> >> +		int len;
> >> +
> >> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> >> +		 * checked before.
> >> +		 */
> >> +		if (iter_is_iovec(iter))
> >> +			len = iov_length(iter->__iov, iter->nr_segs);
> >> +		else
> >> +			len = iter->count;
> >> +
> >> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> >> +					    len,
> >> +					    NULL);
> >> +		if (!uarg)
> >> +			return -1;
> >> +
> >> +		uarg_zc = uarg_to_msgzc(uarg);
> >> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
> >>  	}
> >>  
> >> -	if (info->reply)
> >> -		virtio_vsock_skb_set_reply(skb);
> >> +	skb_zcopy_init(skb, uarg);
> >>  
> >> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >> -					 dst_cid, dst_port,
> >> -					 len,
> >> -					 info->type,
> >> -					 info->op,
> >> -					 info->flags);
> >> +	return 0;
> >> +}
> >>  
> >> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> >> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> >> -		goto out;
> >> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> >> +				     struct virtio_vsock_pkt_info *info,
> >> +				     size_t len,
> >> +				     bool zcopy)
> >> +{
> >> +	if (zcopy) {
> >> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> >> +					      &info->msg->msg_iter,
> >> +					      len);
> >> +	} else {
> >> +		void *payload;
> >> +		int err;
> >> +
> >> +		payload = skb_put(skb, len);
> >> +		err = memcpy_from_msg(payload, info->msg, len);
> >> +		if (err)
> >> +			return -1;
> >> +
> >> +		if (msg_data_left(info->msg))
> >> +			return 0;
> >> +
> >> +		return 0;
> >>  	}
> >> +}
> >>  
> >> -	return skb;
> >> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> >> +				      struct virtio_vsock_pkt_info *info,
> >> +				      u32 src_cid,
> >> +				      u32 src_port,
> >> +				      u32 dst_cid,
> >> +				      u32 dst_port,
> >> +				      size_t len)
> >> +{
> >> +	struct virtio_vsock_hdr *hdr;
> >>  
> >> -out:
> >> -	kfree_skb(skb);
> >> -	return NULL;
> >> +	hdr = virtio_vsock_hdr(skb);
> >> +	hdr->type	= cpu_to_le16(info->type);
> >> +	hdr->op		= cpu_to_le16(info->op);
> >> +	hdr->src_cid	= cpu_to_le64(src_cid);
> >> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >> +	hdr->src_port	= cpu_to_le32(src_port);
> >> +	hdr->dst_port	= cpu_to_le32(dst_port);
> >> +	hdr->flags	= cpu_to_le32(info->flags);
> >> +	hdr->len	= cpu_to_le32(len);
> >>  }
> >>  
> >>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> >> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >>  }
> >>  
> >> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> >> +						  struct virtio_vsock_pkt_info *info,
> >> +						  size_t payload_len,
> >> +						  bool zcopy,
> >> +						  u32 src_cid,
> >> +						  u32 src_port,
> >> +						  u32 dst_cid,
> >> +						  u32 dst_port)
> >> +{
> >> +	struct sk_buff *skb;
> >> +	size_t skb_len;
> >> +
> >> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> >> +
> >> +	if (!zcopy)
> >> +		skb_len += payload_len;
> >> +
> >> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >> +	if (!skb)
> >> +		return NULL;
> >> +
> >> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> >> +				  dst_cid, dst_port,
> >> +				  payload_len);
> >> +
> >> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> >> +	 * owner of skb without check to update 'sk_wmem_alloc'.
> >> +	 */
> >> +	if (vsk)
> >> +		skb_set_owner_w(skb, sk_vsock(vsk));
> >> +
> >> +	if (info->msg && payload_len > 0) {
> >> +		int err;
> >> +
> >> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> >> +		if (err)
> >> +			goto out;
> >> +
> >> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> >> +
> >> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >> +
> >> +			if (info->msg->msg_flags & MSG_EOR)
> >> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >> +		}
> >> +	}
> >> +
> >> +	if (info->reply)
> >> +		virtio_vsock_skb_set_reply(skb);
> >> +
> >> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >> +					 dst_cid, dst_port,
> >> +					 payload_len,
> >> +					 info->type,
> >> +					 info->op,
> >> +					 info->flags);
> >> +
> >> +	return skb;
> >> +out:
> >> +	kfree_skb(skb);
> >> +	return NULL;
> >> +}
> >> +
> >>  /* This function can only be used on connecting/connected sockets,
> >>   * since a socket assigned to a transport is required.
> >>   *
> >> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>  					  struct virtio_vsock_pkt_info *info)
> >>  {
> >> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>  	u32 src_cid, src_port, dst_cid, dst_port;
> >>  	const struct virtio_transport *t_ops;
> >>  	struct virtio_vsock_sock *vvs;
> >>  	u32 pkt_len = info->pkt_len;
> >> +	bool can_zcopy = false;
> >>  	u32 rest_len;
> >>  	int ret;
> >>  
> >> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>  		return pkt_len;
> >>  
> >> +	if (info->msg) {
> >> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> >> +		 * there is no MSG_ZEROCOPY flag set.
> >> +		 */
> >> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> >> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> >> +
> >> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> >> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> >> +
> >> +		if (can_zcopy)
> >> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> >> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> >> +	}
> >> +
> >>  	rest_len = pkt_len;
> >>  
> >>  	do {
> >>  		struct sk_buff *skb;
> >>  		size_t skb_len;
> >>  
> >> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> >> +		skb_len = min(max_skb_len, rest_len);
> >>  
> >> -		skb = virtio_transport_alloc_skb(info, skb_len,
> >> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
> >>  						 src_cid, src_port,
> >>  						 dst_cid, dst_port);
> >>  		if (!skb) {
> >> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>  			break;
> >>  		}
> >>  
> >> +		/* This is last skb to send this portion of data. */
> >> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> >> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> >> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> >> +							    info->msg,
> >> +							    can_zcopy)) {
> >> +				ret = -ENOMEM;
> >> +				break;
> >> +			}
> >> +		}
> >> +
> >>  		virtio_transport_inc_tx_pkt(vvs, skb);
> >>  
> >>  		ret = t_ops->send_pkt(skb);
> >> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> >>  	if (!t)
> >>  		return -ENOTCONN;
> >>  
> >> -	reply = virtio_transport_alloc_skb(&info, 0,
> >> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> >>  					   le64_to_cpu(hdr->dst_cid),
> >>  					   le32_to_cpu(hdr->dst_port),
> >>  					   le64_to_cpu(hdr->src_cid),
> >> -- 
> >> 2.25.1
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25  9:16         ` Arseniy Krasnov
@ 2023-07-25 12:28           ` Stefano Garzarella
  2023-07-25 12:39             ` Michael S. Tsirkin
  2023-07-27  8:32             ` Arseniy Krasnov
  0 siblings, 2 replies; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-25 12:28 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Tue, Jul 25, 2023 at 12:16:11PM +0300, Arseniy Krasnov wrote:
>
>
>On 25.07.2023 11:46, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 11:43, Stefano Garzarella wrote:
>>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>>> options and transport allows zerocopy), then non-linear skb will be
>>>>> created and filled with the pages of user's buffer. Pages of user's
>>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>>> this patch does is replace type of skb owning: instead of calling
>>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>>
>>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>>> ---
>>>>>  Changelog:
>>>>>  v5(big patchset) -> v1:
>>>>>   * Refactorings of 'if' conditions.
>>>>>   * Remove extra blank line.
>>>>>   * Remove 'frag_off' field unneeded init.
>>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>>     and non-linear skb with provided data.
>>>>>  v1 -> v2:
>>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>>  v2 -> v3:
>>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>>     If this callback is not set in transport - transport allows to send
>>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>>     as in vhost to support partial send of current skb). This callback
>>>>>     will be enabled only for G2H path. For details pls see comment
>>>>>     'Check that tx queue...' below.
>>>>>
>>>>>  include/net/af_vsock.h                  |   3 +
>>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>>
>>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>>> --- a/include/net/af_vsock.h
>>>>> +++ b/include/net/af_vsock.h
>>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>>
>>>>>      /* Read a single skb */
>>>>>      int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>>> +
>>>>> +    /* Zero-copy. */
>>>>> +    bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>>  };
>>>>>
>>>>>  /**** CORE ****/
>>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>>      queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>>  }
>>>>>
>>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct 
>>>>> iov_iter *iov)
>>>>> +{
>>>>> +    struct virtio_vsock *vsock;
>>>>> +    bool res = false;
>>>>> +
>>>>> +    rcu_read_lock();
>>>>> +
>>>>> +    vsock = rcu_dereference(the_virtio_vsock);
>>>>> +    if (vsock) {

Just noted, what about the following to reduce the indentation?

         if (!vsock) {
             goto out;
         }
             ...
             ...
     out:
         rcu_read_unlock();
         return res;

>>>>> +        struct virtqueue *vq;
>>>>> +        int iov_pages;
>>>>> +
>>>>> +        vq = vsock->vqs[VSOCK_VQ_TX];
>>>>> +
>>>>> +        iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>>> +
>>>>> +        /* Check that tx queue is large enough to keep whole
>>>>> +         * data to send. This is needed, because when there is
>>>>> +         * not enough free space in the queue, current skb to
>>>>> +         * send will be reinserted to the head of tx list of
>>>>> +         * the socket to retry transmission later, so if skb
>>>>> +         * is bigger than whole queue, it will be reinserted
>>>>> +         * again and again, thus blocking other skbs to be sent.
>>>>> +         * Each page of the user provided buffer will be added
>>>>> +         * as a single buffer to the tx virtqueue, so compare
>>>>> +         * number of pages against maximum capacity of the queue.
>>>>> +         * +1 means buffer for the packet header.
>>>>> +         */
>>>>> +        if (iov_pages + 1 <= vq->num_max)
>>>>
>>>> I think this check is actual only for case one we don't have indirect buffer feature.
>>>> With indirect mode whole data to send will be packed into one indirect buffer.
>>>
>>> I think so.
>>> So, should we check also that here?
>>>
>>>>
>>>> Thanks, Arseniy
>>>>
>>>>> +            res = true;
>>>>> +    }
>>>>> +
>>>>> +    rcu_read_unlock();
>>>>> +
>>>>> +    return res;
>>>>> +}
>>>>> +
>>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>>
>>>>>  static struct virtio_transport virtio_transport = {
>>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>>          .seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>>          .seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>>
>>>>> +        .msgzerocopy_check_iov      = virtio_transport_msgzerocopy_check_iov,
>>>>> +
>>>>>          .notify_poll_in           = virtio_transport_notify_poll_in,
>>>>>          .notify_poll_out          = virtio_transport_notify_poll_out,
>>>>>          .notify_recv_init         = virtio_transport_notify_recv_init,
>>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>>      return container_of(t, struct virtio_transport, transport);
>>>>>  }
>>>>>
>>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>>> - *
>>>>> - * If NULL is returned, errp is set to a negative errno.
>>>>> - */
>>>>> -static struct sk_buff *
>>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>> -               size_t len,
>>>>> -               u32 src_cid,
>>>>> -               u32 src_port,
>>>>> -               u32 dst_cid,
>>>>> -               u32 dst_port)
>>>>> -{
>>>>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>>> -    struct virtio_vsock_hdr *hdr;
>>>>> -    struct sk_buff *skb;
>>>>> -    void *payload;
>>>>> -    int err;
>>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>>> +                       size_t max_to_send)
>>>>> +{
>>>>> +    const struct vsock_transport *t;
>>>>> +    struct iov_iter *iov_iter;
>>>>>
>>>>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>> -    if (!skb)
>>>>> -        return NULL;
>>>>> +    if (!info->msg)
>>>>> +        return false;
>>>>>
>>>>> -    hdr = virtio_vsock_hdr(skb);
>>>>> -    hdr->type    = cpu_to_le16(info->type);
>>>>> -    hdr->op        = cpu_to_le16(info->op);
>>>>> -    hdr->src_cid    = cpu_to_le64(src_cid);
>>>>> -    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>>>> -    hdr->src_port    = cpu_to_le32(src_port);
>>>>> -    hdr->dst_port    = cpu_to_le32(dst_port);
>>>>> -    hdr->flags    = cpu_to_le32(info->flags);
>>>>> -    hdr->len    = cpu_to_le32(len);
>>>>> +    iov_iter = &info->msg->msg_iter;
>>>>>
>>>>> -    if (info->msg && len > 0) {
>>>>> -        payload = skb_put(skb, len);
>>>>> -        err = memcpy_from_msg(payload, info->msg, len);
>>>>> -        if (err)
>>>>> -            goto out;
>>>>> +    t = vsock_core_get_transport(info->vsk);
>>>>>
>>>>> -        if (msg_data_left(info->msg) == 0 &&
>>>>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>> +    if (t->msgzerocopy_check_iov &&
>>>>> +        !t->msgzerocopy_check_iov(iov_iter))
>>>>> +        return false;
>>>
>>> I'd avoid adding a new transport callback used only internally in virtio
>>> transports.
>>
>> Ok, I see.
>>
>>>
>>> Usually the transport callbacks are used in af_vsock.c, if we need a
>>> callback just for virtio transports, maybe better to add it in struct
>>> virtio_vsock_pkt_info or struct virtio_vsock_sock.
>
>Hm, may be I just need to move this callback from 'struct vsock_transport' to parent 'struct virtio_transport',
>after 'send_pkt' callback. In this case:
>1) AF_VSOCK part is not touched.
>2) This callback stays in 'virtio_transport.c' and is set also in this file.
>   vhost and loopback are unchanged - only 'send_pkt' still enabled in both
>   files for these two transports.

Yep, this could also work!

Stefano

>
>Thanks, Arseniy
>
>>>
>>> Maybe better the last one so we don't have to allocate pointer space
>>> for each packet and you should reach it via info.
>>
>> Ok, thanks, I'll try this way
>>
>> Thanks, Arseniy
>>
>>>
>>> Thanks,
>>> Stefano
>>>
>>>>>
>>>>> -            if (info->msg->msg_flags & MSG_EOR)
>>>>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>> -        }
>>>>> +    /* Data is simple buffer. */
>>>>> +    if (iter_is_ubuf(iov_iter))
>>>>> +        return true;
>>>>> +
>>>>> +    if (!iter_is_iovec(iov_iter))
>>>>> +        return false;
>>>>> +
>>>>> +    if (iov_iter->iov_offset)
>>>>> +        return false;
>>>>> +
>>>>> +    /* We can't send whole iov. */
>>>>> +    if (iov_iter->count > max_to_send)
>>>>> +        return false;
>>>>> +
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>>> +                       struct sk_buff *skb,
>>>>> +                       struct msghdr *msg,
>>>>> +                       bool zerocopy)
>>>>> +{
>>>>> +    struct ubuf_info *uarg;
>>>>> +
>>>>> +    if (msg->msg_ubuf) {
>>>>> +        uarg = msg->msg_ubuf;
>>>>> +        net_zcopy_get(uarg);
>>>>> +    } else {
>>>>> +        struct iov_iter *iter = &msg->msg_iter;
>>>>> +        struct ubuf_info_msgzc *uarg_zc;
>>>>> +        int len;
>>>>> +
>>>>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>>> +         * checked before.
>>>>> +         */
>>>>> +        if (iter_is_iovec(iter))
>>>>> +            len = iov_length(iter->__iov, iter->nr_segs);
>>>>> +        else
>>>>> +            len = iter->count;
>>>>> +
>>>>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>>> +                        len,
>>>>> +                        NULL);
>>>>> +        if (!uarg)
>>>>> +            return -1;
>>>>> +
>>>>> +        uarg_zc = uarg_to_msgzc(uarg);
>>>>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>>      }
>>>>>
>>>>> -    if (info->reply)
>>>>> -        virtio_vsock_skb_set_reply(skb);
>>>>> +    skb_zcopy_init(skb, uarg);
>>>>>
>>>>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>> -                     dst_cid, dst_port,
>>>>> -                     len,
>>>>> -                     info->type,
>>>>> -                     info->op,
>>>>> -                     info->flags);
>>>>> +    return 0;
>>>>> +}
>>>>>
>>>>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>>> -        goto out;
>>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>> +                     struct virtio_vsock_pkt_info *info,
>>>>> +                     size_t len,
>>>>> +                     bool zcopy)
>>>>> +{
>>>>> +    if (zcopy) {
>>>>> +        return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>>> +                          &info->msg->msg_iter,
>>>>> +                          len);
>>>>> +    } else {
>>>>> +        void *payload;
>>>>> +        int err;
>>>>> +
>>>>> +        payload = skb_put(skb, len);
>>>>> +        err = memcpy_from_msg(payload, info->msg, len);
>>>>> +        if (err)
>>>>> +            return -1;
>>>>> +
>>>>> +        if (msg_data_left(info->msg))
>>>>> +            return 0;
>>>>> +
>>>>> +        return 0;
>>>>>      }
>>>>> +}
>>>>>
>>>>> -    return skb;
>>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>>> +                      struct virtio_vsock_pkt_info *info,
>>>>> +                      u32 src_cid,
>>>>> +                      u32 src_port,
>>>>> +                      u32 dst_cid,
>>>>> +                      u32 dst_port,
>>>>> +                      size_t len)
>>>>> +{
>>>>> +    struct virtio_vsock_hdr *hdr;
>>>>>
>>>>> -out:
>>>>> -    kfree_skb(skb);
>>>>> -    return NULL;
>>>>> +    hdr = virtio_vsock_hdr(skb);
>>>>> +    hdr->type    = cpu_to_le16(info->type);
>>>>> +    hdr->op        = cpu_to_le16(info->op);
>>>>> +    hdr->src_cid    = cpu_to_le64(src_cid);
>>>>> +    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>>>> +    hdr->src_port    = cpu_to_le32(src_port);
>>>>> +    hdr->dst_port    = cpu_to_le32(dst_port);
>>>>> +    hdr->flags    = cpu_to_le32(info->flags);
>>>>> +    hdr->len    = cpu_to_le32(len);
>>>>>  }
>>>>>
>>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>          return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>>  }
>>>>>
>>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>>> +                          struct virtio_vsock_pkt_info *info,
>>>>> +                          size_t payload_len,
>>>>> +                          bool zcopy,
>>>>> +                          u32 src_cid,
>>>>> +                          u32 src_port,
>>>>> +                          u32 dst_cid,
>>>>> +                          u32 dst_port)
>>>>> +{
>>>>> +    struct sk_buff *skb;
>>>>> +    size_t skb_len;
>>>>> +
>>>>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>>> +
>>>>> +    if (!zcopy)
>>>>> +        skb_len += payload_len;
>>>>> +
>>>>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>> +    if (!skb)
>>>>> +        return NULL;
>>>>> +
>>>>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>>> +                  dst_cid, dst_port,
>>>>> +                  payload_len);
>>>>> +
>>>>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>>>>> +     */
>>>>> +    if (vsk)
>>>>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>>>>> +
>>>>> +    if (info->msg && payload_len > 0) {
>>>>> +        int err;
>>>>> +
>>>>> +        err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>>> +        if (err)
>>>>> +            goto out;
>>>>> +
>>>>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>> +            struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>>> +
>>>>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>> +
>>>>> +            if (info->msg->msg_flags & MSG_EOR)
>>>>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    if (info->reply)
>>>>> +        virtio_vsock_skb_set_reply(skb);
>>>>> +
>>>>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>> +                     dst_cid, dst_port,
>>>>> +                     payload_len,
>>>>> +                     info->type,
>>>>> +                     info->op,
>>>>> +                     info->flags);
>>>>> +
>>>>> +    return skb;
>>>>> +out:
>>>>> +    kfree_skb(skb);
>>>>> +    return NULL;
>>>>> +}
>>>>> +
>>>>>  /* This function can only be used on connecting/connected sockets,
>>>>>   * since a socket assigned to a transport is required.
>>>>>   *
>>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>                        struct virtio_vsock_pkt_info *info)
>>>>>  {
>>>>> +    u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>>      u32 src_cid, src_port, dst_cid, dst_port;
>>>>>      const struct virtio_transport *t_ops;
>>>>>      struct virtio_vsock_sock *vvs;
>>>>>      u32 pkt_len = info->pkt_len;
>>>>> +    bool can_zcopy = false;
>>>>>      u32 rest_len;
>>>>>      int ret;
>>>>>
>>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>      if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>>          return pkt_len;
>>>>>
>>>>> +    if (info->msg) {
>>>>> +        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>>> +         * there is no MSG_ZEROCOPY flag set.
>>>>> +         */
>>>>> +        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>>> +            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>> +
>>>>> +        if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>> +            can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>>> +
>>>>> +        if (can_zcopy)
>>>>> +            max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>> +                        (MAX_SKB_FRAGS * PAGE_SIZE));
>>>>> +    }
>>>>> +
>>>>>      rest_len = pkt_len;
>>>>>
>>>>>      do {
>>>>>          struct sk_buff *skb;
>>>>>          size_t skb_len;
>>>>>
>>>>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>>> +        skb_len = min(max_skb_len, rest_len);
>>>>>
>>>>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>>>>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>>                           src_cid, src_port,
>>>>>                           dst_cid, dst_port);
>>>>>          if (!skb) {
>>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>              break;
>>>>>          }
>>>>>
>>>>> +        /* This is last skb to send this portion of data. */
>>>>> +        if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>>> +            skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>>> +                                info->msg,
>>>>> +                                can_zcopy)) {
>>>>> +                ret = -ENOMEM;
>>>>> +                break;
>>>>> +            }
>>>>> +        }
>>>>> +
>>>>>          virtio_transport_inc_tx_pkt(vvs, skb);
>>>>>
>>>>>          ret = t_ops->send_pkt(skb);
>>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>>      if (!t)
>>>>>          return -ENOTCONN;
>>>>>
>>>>> -    reply = virtio_transport_alloc_skb(&info, 0,
>>>>> +    reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>>                         le64_to_cpu(hdr->dst_cid),
>>>>>                         le32_to_cpu(hdr->dst_port),
>>>>>                         le64_to_cpu(hdr->src_cid),
>>>>
>>>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 12:28           ` Stefano Garzarella
@ 2023-07-25 12:39             ` Michael S. Tsirkin
  2023-07-25 12:45               ` Stefano Garzarella
  2023-07-27  8:32             ` Arseniy Krasnov
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 12:39 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Arseniy Krasnov, Stefan Hajnoczi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Tue, Jul 25, 2023 at 02:28:02PM +0200, Stefano Garzarella wrote:
> On Tue, Jul 25, 2023 at 12:16:11PM +0300, Arseniy Krasnov wrote:
> > 
> > 
> > On 25.07.2023 11:46, Arseniy Krasnov wrote:
> > > 
> > > 
> > > On 25.07.2023 11:43, Stefano Garzarella wrote:
> > > > On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
> > > > > 
> > > > > 
> > > > > On 21.07.2023 00:42, Arseniy Krasnov wrote:
> > > > > > This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> > > > > > flag is set and zerocopy transmission is possible (enabled in socket
> > > > > > options and transport allows zerocopy), then non-linear skb will be
> > > > > > created and filled with the pages of user's buffer. Pages of user's
> > > > > > buffer are locked in memory by 'get_user_pages()'. Second thing that
> > > > > > this patch does is replace type of skb owning: instead of calling
> > > > > > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> > > > > > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> > > > > > of socket, so to decrease this field correctly proper skb destructor is
> > > > > > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> > > > > > 
> > > > > > Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> > > > > > ---
> > > > > >  Changelog:
> > > > > >  v5(big patchset) -> v1:
> > > > > >   * Refactorings of 'if' conditions.
> > > > > >   * Remove extra blank line.
> > > > > >   * Remove 'frag_off' field unneeded init.
> > > > > >   * Add function 'virtio_transport_fill_skb()' which fills both linear
> > > > > >     and non-linear skb with provided data.
> > > > > >  v1 -> v2:
> > > > > >   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> > > > > >  v2 -> v3:
> > > > > >   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> > > > > >     provided 'iov_iter' with data could be sent in a zerocopy mode.
> > > > > >     If this callback is not set in transport - transport allows to send
> > > > > >     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> > > > > >     then zerocopy is allowed. Reason of this callback is that in case of
> > > > > >     G2H transmission we insert whole skb to the tx virtio queue and such
> > > > > >     skb must fit to the size of the virtio queue to be sent in a single
> > > > > >     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> > > > > >     as in vhost to support partial send of current skb). This callback
> > > > > >     will be enabled only for G2H path. For details pls see comment
> > > > > >     'Check that tx queue...' below.
> > > > > > 
> > > > > >  include/net/af_vsock.h                  |   3 +
> > > > > >  net/vmw_vsock/virtio_transport.c        |  39 ++++
> > > > > >  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> > > > > >  3 files changed, 241 insertions(+), 58 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > > > index 0e7504a42925..a6b346eeeb8e 100644
> > > > > > --- a/include/net/af_vsock.h
> > > > > > +++ b/include/net/af_vsock.h
> > > > > > @@ -177,6 +177,9 @@ struct vsock_transport {
> > > > > > 
> > > > > >      /* Read a single skb */
> > > > > >      int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> > > > > > +
> > > > > > +    /* Zero-copy. */
> > > > > > +    bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> > > > > >  };
> > > > > > 
> > > > > >  /**** CORE ****/
> > > > > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > > > > > index 7bbcc8093e51..23cb8ed638c4 100644
> > > > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > > > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> > > > > >      queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> > > > > >  }
> > > > > > 
> > > > > > +static bool
> > > > > > virtio_transport_msgzerocopy_check_iov(const struct
> > > > > > iov_iter *iov)
> > > > > > +{
> > > > > > +    struct virtio_vsock *vsock;
> > > > > > +    bool res = false;
> > > > > > +
> > > > > > +    rcu_read_lock();
> > > > > > +
> > > > > > +    vsock = rcu_dereference(the_virtio_vsock);
> > > > > > +    if (vsock) {
> 
> Just noted, what about the following to reduce the indentation?
> 
>         if (!vsock) {
>             goto out;
>         }

no {} pls

>             ...
>             ...
>     out:
>         rcu_read_unlock();
>         return res;

indentation is quite modest here. Not sure goto is worth it.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 12:39             ` Michael S. Tsirkin
@ 2023-07-25 12:45               ` Stefano Garzarella
  0 siblings, 0 replies; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-25 12:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Arseniy Krasnov, Stefan Hajnoczi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Tue, Jul 25, 2023 at 08:39:17AM -0400, Michael S. Tsirkin wrote:
>On Tue, Jul 25, 2023 at 02:28:02PM +0200, Stefano Garzarella wrote:
>> On Tue, Jul 25, 2023 at 12:16:11PM +0300, Arseniy Krasnov wrote:
>> >
>> >
>> > On 25.07.2023 11:46, Arseniy Krasnov wrote:
>> > >
>> > >
>> > > On 25.07.2023 11:43, Stefano Garzarella wrote:
>> > > > On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>> > > > >
>> > > > >
>> > > > > On 21.07.2023 00:42, Arseniy Krasnov wrote:
>> > > > > > This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> > > > > > flag is set and zerocopy transmission is possible (enabled in socket
>> > > > > > options and transport allows zerocopy), then non-linear skb will be
>> > > > > > created and filled with the pages of user's buffer. Pages of user's
>> > > > > > buffer are locked in memory by 'get_user_pages()'. Second thing that
>> > > > > > this patch does is replace type of skb owning: instead of calling
>> > > > > > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>> > > > > > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>> > > > > > of socket, so to decrease this field correctly proper skb destructor is
>> > > > > > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>> > > > > >
>> > > > > > Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> > > > > > ---
>> > > > > >  Changelog:
>> > > > > >  v5(big patchset) -> v1:
>> > > > > >   * Refactorings of 'if' conditions.
>> > > > > >   * Remove extra blank line.
>> > > > > >   * Remove 'frag_off' field unneeded init.
>> > > > > >   * Add function 'virtio_transport_fill_skb()' which fills both linear
>> > > > > >     and non-linear skb with provided data.
>> > > > > >  v1 -> v2:
>> > > > > >   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>> > > > > >  v2 -> v3:
>> > > > > >   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>> > > > > >     provided 'iov_iter' with data could be sent in a zerocopy mode.
>> > > > > >     If this callback is not set in transport - transport allows to send
>> > > > > >     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>> > > > > >     then zerocopy is allowed. Reason of this callback is that in case of
>> > > > > >     G2H transmission we insert whole skb to the tx virtio queue and such
>> > > > > >     skb must fit to the size of the virtio queue to be sent in a single
>> > > > > >     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>> > > > > >     as in vhost to support partial send of current skb). This callback
>> > > > > >     will be enabled only for G2H path. For details pls see comment
>> > > > > >     'Check that tx queue...' below.
>> > > > > >
>> > > > > >  include/net/af_vsock.h                  |   3 +
>> > > > > >  net/vmw_vsock/virtio_transport.c        |  39 ++++
>> > > > > >  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>> > > > > >  3 files changed, 241 insertions(+), 58 deletions(-)
>> > > > > >
>> > > > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> > > > > > index 0e7504a42925..a6b346eeeb8e 100644
>> > > > > > --- a/include/net/af_vsock.h
>> > > > > > +++ b/include/net/af_vsock.h
>> > > > > > @@ -177,6 +177,9 @@ struct vsock_transport {
>> > > > > >
>> > > > > >      /* Read a single skb */
>> > > > > >      int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>> > > > > > +
>> > > > > > +    /* Zero-copy. */
>> > > > > > +    bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>> > > > > >  };
>> > > > > >
>> > > > > >  /**** CORE ****/
>> > > > > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> > > > > > index 7bbcc8093e51..23cb8ed638c4 100644
>> > > > > > --- a/net/vmw_vsock/virtio_transport.c
>> > > > > > +++ b/net/vmw_vsock/virtio_transport.c
>> > > > > > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>> > > > > >      queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>> > > > > >  }
>> > > > > >
>> > > > > > +static bool
>> > > > > > virtio_transport_msgzerocopy_check_iov(const struct
>> > > > > > iov_iter *iov)
>> > > > > > +{
>> > > > > > +    struct virtio_vsock *vsock;
>> > > > > > +    bool res = false;
>> > > > > > +
>> > > > > > +    rcu_read_lock();
>> > > > > > +
>> > > > > > +    vsock = rcu_dereference(the_virtio_vsock);
>> > > > > > +    if (vsock) {
>>
>> Just noted, what about the following to reduce the indentation?
>>
>>         if (!vsock) {
>>             goto out;
>>         }
>
>no {} pls

ooops, true, too much QEMU code today, but luckily checkpatch would have
spotted it ;-)

>
>>             ...
>>             ...
>>     out:
>>         rcu_read_unlock();
>>         return res;
>
>indentation is quite modest here. Not sure goto is worth it.

It's a pattern we follow a lot in this file, and I find the early
return/goto more readable.
Anyway, I don't have a strong opinion, it's fine the way it is now,
actually we don't have too many indentations for now in this function.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 11:50     ` Michael S. Tsirkin
@ 2023-07-25 12:53       ` Stefano Garzarella
  2023-07-25 13:06         ` Michael S. Tsirkin
  2023-07-25 13:04       ` Arseniy Krasnov
  1 sibling, 1 reply; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-25 12:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Arseniy Krasnov, Stefan Hajnoczi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Tue, Jul 25, 2023 at 07:50:53AM -0400, Michael S. Tsirkin wrote:
>On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>> > This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> > flag is set and zerocopy transmission is possible (enabled in socket
>> > options and transport allows zerocopy), then non-linear skb will be
>> > created and filled with the pages of user's buffer. Pages of user's
>> > buffer are locked in memory by 'get_user_pages()'. Second thing that
>> > this patch does is replace type of skb owning: instead of calling
>> > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>> > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>> > of socket, so to decrease this field correctly proper skb destructor is
>> > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>> >
>> > Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> > ---
>> >  Changelog:
>> >  v5(big patchset) -> v1:
>> >   * Refactorings of 'if' conditions.
>> >   * Remove extra blank line.
>> >   * Remove 'frag_off' field unneeded init.
>> >   * Add function 'virtio_transport_fill_skb()' which fills both linear
>> >     and non-linear skb with provided data.
>> >  v1 -> v2:
>> >   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>> >  v2 -> v3:
>> >   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>> >     provided 'iov_iter' with data could be sent in a zerocopy mode.
>> >     If this callback is not set in transport - transport allows to send
>> >     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>> >     then zerocopy is allowed. Reason of this callback is that in case of
>> >     G2H transmission we insert whole skb to the tx virtio queue and such
>> >     skb must fit to the size of the virtio queue to be sent in a single
>> >     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>> >     as in vhost to support partial send of current skb). This callback
>> >     will be enabled only for G2H path. For details pls see comment
>> >     'Check that tx queue...' below.
>> >
>> >  include/net/af_vsock.h                  |   3 +
>> >  net/vmw_vsock/virtio_transport.c        |  39 ++++
>> >  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>> >  3 files changed, 241 insertions(+), 58 deletions(-)
>> >
>> > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> > index 0e7504a42925..a6b346eeeb8e 100644
>> > --- a/include/net/af_vsock.h
>> > +++ b/include/net/af_vsock.h
>> > @@ -177,6 +177,9 @@ struct vsock_transport {
>> >
>> >  	/* Read a single skb */
>> >  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>> > +
>> > +	/* Zero-copy. */
>> > +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>> >  };
>> >
>> >  /**** CORE ****/
>> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> > index 7bbcc8093e51..23cb8ed638c4 100644
>> > --- a/net/vmw_vsock/virtio_transport.c
>> > +++ b/net/vmw_vsock/virtio_transport.c
>> > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>> >  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>> >  }
>> >
>> > +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>> > +{
>> > +	struct virtio_vsock *vsock;
>> > +	bool res = false;
>> > +
>> > +	rcu_read_lock();
>> > +
>> > +	vsock = rcu_dereference(the_virtio_vsock);
>> > +	if (vsock) {
>> > +		struct virtqueue *vq;
>> > +		int iov_pages;
>> > +
>> > +		vq = vsock->vqs[VSOCK_VQ_TX];
>> > +
>> > +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>> > +
>> > +		/* Check that tx queue is large enough to keep whole
>> > +		 * data to send. This is needed, because when there is
>> > +		 * not enough free space in the queue, current skb to
>> > +		 * send will be reinserted to the head of tx list of
>> > +		 * the socket to retry transmission later, so if skb
>> > +		 * is bigger than whole queue, it will be reinserted
>> > +		 * again and again, thus blocking other skbs to be sent.
>> > +		 * Each page of the user provided buffer will be added
>> > +		 * as a single buffer to the tx virtqueue, so compare
>> > +		 * number of pages against maximum capacity of the queue.
>> > +		 * +1 means buffer for the packet header.
>> > +		 */
>> > +		if (iov_pages + 1 <= vq->num_max)
>>
>> I think this check is actual only for case one we don't have indirect buffer feature.
>> With indirect mode whole data to send will be packed into one indirect buffer.
>>
>> Thanks, Arseniy
>
>Actually the reverse. With indirect you are limited to num_max.
>Without you are limited to whatever space is left in the
>queue (which you did not check here, so you should).
>
>
>> > +			res = true;
>> > +	}
>> > +
>> > +	rcu_read_unlock();
>
>Just curious:
>is the point of all this RCU dance to allow vsock
>to change from under us? then why is it ok to
>have it change? the virtio_transport_msgzerocopy_check_iov
>will then refer to the old vsock ...

IIRC we introduced the RCU to handle hot-unplug issues:
commit 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on
the_virtio_vsock")

When we remove the device, we flush all the works, etc. so we should
not be in this case (referring the old vsock), except for an irrelevant
transient as the device is disappearing.

Stefano


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 11:50     ` Michael S. Tsirkin
  2023-07-25 12:53       ` Stefano Garzarella
@ 2023-07-25 13:04       ` Arseniy Krasnov
  2023-07-25 13:22         ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25 13:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa



On 25.07.2023 14:50, Michael S. Tsirkin wrote:
> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>> flag is set and zerocopy transmission is possible (enabled in socket
>>> options and transport allows zerocopy), then non-linear skb will be
>>> created and filled with the pages of user's buffer. Pages of user's
>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>> this patch does is replace type of skb owning: instead of calling
>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>> of socket, so to decrease this field correctly proper skb destructor is
>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>
>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>> ---
>>>  Changelog:
>>>  v5(big patchset) -> v1:
>>>   * Refactorings of 'if' conditions.
>>>   * Remove extra blank line.
>>>   * Remove 'frag_off' field unneeded init.
>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>     and non-linear skb with provided data.
>>>  v1 -> v2:
>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>  v2 -> v3:
>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>     If this callback is not set in transport - transport allows to send
>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>     as in vhost to support partial send of current skb). This callback
>>>     will be enabled only for G2H path. For details pls see comment 
>>>     'Check that tx queue...' below.
>>>
>>>  include/net/af_vsock.h                  |   3 +
>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>
>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>> index 0e7504a42925..a6b346eeeb8e 100644
>>> --- a/include/net/af_vsock.h
>>> +++ b/include/net/af_vsock.h
>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>  
>>>  	/* Read a single skb */
>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>> +
>>> +	/* Zero-copy. */
>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>  };
>>>  
>>>  /**** CORE ****/
>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>> --- a/net/vmw_vsock/virtio_transport.c
>>> +++ b/net/vmw_vsock/virtio_transport.c
>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>  }
>>>  
>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>> +{
>>> +	struct virtio_vsock *vsock;
>>> +	bool res = false;
>>> +
>>> +	rcu_read_lock();
>>> +
>>> +	vsock = rcu_dereference(the_virtio_vsock);
>>> +	if (vsock) {
>>> +		struct virtqueue *vq;
>>> +		int iov_pages;
>>> +
>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>>> +
>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>> +
>>> +		/* Check that tx queue is large enough to keep whole
>>> +		 * data to send. This is needed, because when there is
>>> +		 * not enough free space in the queue, current skb to
>>> +		 * send will be reinserted to the head of tx list of
>>> +		 * the socket to retry transmission later, so if skb
>>> +		 * is bigger than whole queue, it will be reinserted
>>> +		 * again and again, thus blocking other skbs to be sent.
>>> +		 * Each page of the user provided buffer will be added
>>> +		 * as a single buffer to the tx virtqueue, so compare
>>> +		 * number of pages against maximum capacity of the queue.
>>> +		 * +1 means buffer for the packet header.
>>> +		 */
>>> +		if (iov_pages + 1 <= vq->num_max)
>>
>> I think this check is actual only for case one we don't have indirect buffer feature.
>> With indirect mode whole data to send will be packed into one indirect buffer.
>>
>> Thanks, Arseniy
> 
> Actually the reverse. With indirect you are limited to num_max.
> Without you are limited to whatever space is left in the
> queue (which you did not check here, so you should).

I mean that with indirect, we only need one buffer, and we can just wait
for enough space - for this single buffer ( as we discussed a little bit before).
But if indirect buffer is not supported - we need that whole packet must fit
to the size of tx queue - otherwise it never be transmitted.

Thanks, Arseniy

> 
> 
>>> +			res = true;
>>> +	}
>>> +
>>> +	rcu_read_unlock();
> 
> Just curious:
> is the point of all this RCU dance to allow vsock
> to change from under us? then why is it ok to
> have it change? the virtio_transport_msgzerocopy_check_iov
> will then refer to the old vsock ...
> 
> 
>>> +
>>> +	return res;
>>> +}
>>> +
>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>  
>>>  static struct virtio_transport virtio_transport = {
>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>  
>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>>> +
>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index 26a4d10da205..e4e3d541aff4 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>  	return container_of(t, struct virtio_transport, transport);
>>>  }
>>>  
>>> -/* Returns a new packet on success, otherwise returns NULL.
>>> - *
>>> - * If NULL is returned, errp is set to a negative errno.
>>> - */
>>> -static struct sk_buff *
>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>> -			   size_t len,
>>> -			   u32 src_cid,
>>> -			   u32 src_port,
>>> -			   u32 dst_cid,
>>> -			   u32 dst_port)
>>> -{
>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>> -	struct virtio_vsock_hdr *hdr;
>>> -	struct sk_buff *skb;
>>> -	void *payload;
>>> -	int err;
>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>> +				       size_t max_to_send)
>>> +{
>>> +	const struct vsock_transport *t;
>>> +	struct iov_iter *iov_iter;
>>>  
>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>> -	if (!skb)
>>> -		return NULL;
>>> +	if (!info->msg)
>>> +		return false;
>>>  
>>> -	hdr = virtio_vsock_hdr(skb);
>>> -	hdr->type	= cpu_to_le16(info->type);
>>> -	hdr->op		= cpu_to_le16(info->op);
>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>> -	hdr->src_port	= cpu_to_le32(src_port);
>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>>> -	hdr->flags	= cpu_to_le32(info->flags);
>>> -	hdr->len	= cpu_to_le32(len);
>>> +	iov_iter = &info->msg->msg_iter;
>>>  
>>> -	if (info->msg && len > 0) {
>>> -		payload = skb_put(skb, len);
>>> -		err = memcpy_from_msg(payload, info->msg, len);
>>> -		if (err)
>>> -			goto out;
>>> +	t = vsock_core_get_transport(info->vsk);
>>>  
>>> -		if (msg_data_left(info->msg) == 0 &&
>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>> +	if (t->msgzerocopy_check_iov &&
>>> +	    !t->msgzerocopy_check_iov(iov_iter))
>>> +		return false;
>>>  
>>> -			if (info->msg->msg_flags & MSG_EOR)
>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>> -		}
>>> +	/* Data is simple buffer. */
>>> +	if (iter_is_ubuf(iov_iter))
>>> +		return true;
>>> +
>>> +	if (!iter_is_iovec(iov_iter))
>>> +		return false;
>>> +
>>> +	if (iov_iter->iov_offset)
>>> +		return false;
>>> +
>>> +	/* We can't send whole iov. */
>>> +	if (iov_iter->count > max_to_send)
>>> +		return false;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>> +					   struct sk_buff *skb,
>>> +					   struct msghdr *msg,
>>> +					   bool zerocopy)
>>> +{
>>> +	struct ubuf_info *uarg;
>>> +
>>> +	if (msg->msg_ubuf) {
>>> +		uarg = msg->msg_ubuf;
>>> +		net_zcopy_get(uarg);
>>> +	} else {
>>> +		struct iov_iter *iter = &msg->msg_iter;
>>> +		struct ubuf_info_msgzc *uarg_zc;
>>> +		int len;
>>> +
>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>>> +		 * checked before.
>>> +		 */
>>> +		if (iter_is_iovec(iter))
>>> +			len = iov_length(iter->__iov, iter->nr_segs);
>>> +		else
>>> +			len = iter->count;
>>> +
>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> +					    len,
>>> +					    NULL);
>>> +		if (!uarg)
>>> +			return -1;
>>> +
>>> +		uarg_zc = uarg_to_msgzc(uarg);
>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>  	}
>>>  
>>> -	if (info->reply)
>>> -		virtio_vsock_skb_set_reply(skb);
>>> +	skb_zcopy_init(skb, uarg);
>>>  
>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>> -					 dst_cid, dst_port,
>>> -					 len,
>>> -					 info->type,
>>> -					 info->op,
>>> -					 info->flags);
>>> +	return 0;
>>> +}
>>>  
>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>> -		goto out;
>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>> +				     struct virtio_vsock_pkt_info *info,
>>> +				     size_t len,
>>> +				     bool zcopy)
>>> +{
>>> +	if (zcopy) {
>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>> +					      &info->msg->msg_iter,
>>> +					      len);
>>> +	} else {
>>> +		void *payload;
>>> +		int err;
>>> +
>>> +		payload = skb_put(skb, len);
>>> +		err = memcpy_from_msg(payload, info->msg, len);
>>> +		if (err)
>>> +			return -1;
>>> +
>>> +		if (msg_data_left(info->msg))
>>> +			return 0;
>>> +
>>> +		return 0;
>>>  	}
>>> +}
>>>  
>>> -	return skb;
>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>> +				      struct virtio_vsock_pkt_info *info,
>>> +				      u32 src_cid,
>>> +				      u32 src_port,
>>> +				      u32 dst_cid,
>>> +				      u32 dst_port,
>>> +				      size_t len)
>>> +{
>>> +	struct virtio_vsock_hdr *hdr;
>>>  
>>> -out:
>>> -	kfree_skb(skb);
>>> -	return NULL;
>>> +	hdr = virtio_vsock_hdr(skb);
>>> +	hdr->type	= cpu_to_le16(info->type);
>>> +	hdr->op		= cpu_to_le16(info->op);
>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>> +	hdr->src_port	= cpu_to_le32(src_port);
>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>>> +	hdr->flags	= cpu_to_le32(info->flags);
>>> +	hdr->len	= cpu_to_le32(len);
>>>  }
>>>  
>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>  }
>>>  
>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>> +						  struct virtio_vsock_pkt_info *info,
>>> +						  size_t payload_len,
>>> +						  bool zcopy,
>>> +						  u32 src_cid,
>>> +						  u32 src_port,
>>> +						  u32 dst_cid,
>>> +						  u32 dst_port)
>>> +{
>>> +	struct sk_buff *skb;
>>> +	size_t skb_len;
>>> +
>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>> +
>>> +	if (!zcopy)
>>> +		skb_len += payload_len;
>>> +
>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>> +	if (!skb)
>>> +		return NULL;
>>> +
>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>> +				  dst_cid, dst_port,
>>> +				  payload_len);
>>> +
>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>>> +	 */
>>> +	if (vsk)
>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>>> +
>>> +	if (info->msg && payload_len > 0) {
>>> +		int err;
>>> +
>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>> +		if (err)
>>> +			goto out;
>>> +
>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>> +
>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>> +
>>> +			if (info->msg->msg_flags & MSG_EOR)
>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>> +		}
>>> +	}
>>> +
>>> +	if (info->reply)
>>> +		virtio_vsock_skb_set_reply(skb);
>>> +
>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>> +					 dst_cid, dst_port,
>>> +					 payload_len,
>>> +					 info->type,
>>> +					 info->op,
>>> +					 info->flags);
>>> +
>>> +	return skb;
>>> +out:
>>> +	kfree_skb(skb);
>>> +	return NULL;
>>> +}
>>> +
>>>  /* This function can only be used on connecting/connected sockets,
>>>   * since a socket assigned to a transport is required.
>>>   *
>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>  					  struct virtio_vsock_pkt_info *info)
>>>  {
>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>>  	const struct virtio_transport *t_ops;
>>>  	struct virtio_vsock_sock *vvs;
>>>  	u32 pkt_len = info->pkt_len;
>>> +	bool can_zcopy = false;
>>>  	u32 rest_len;
>>>  	int ret;
>>>  
>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>  		return pkt_len;
>>>  
>>> +	if (info->msg) {
>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>> +		 * there is no MSG_ZEROCOPY flag set.
>>> +		 */
>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>> +
>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>> +
>>> +		if (can_zcopy)
>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>> +	}
>>> +
>>>  	rest_len = pkt_len;
>>>  
>>>  	do {
>>>  		struct sk_buff *skb;
>>>  		size_t skb_len;
>>>  
>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>> +		skb_len = min(max_skb_len, rest_len);
>>>  
>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>  						 src_cid, src_port,
>>>  						 dst_cid, dst_port);
>>>  		if (!skb) {
>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>  			break;
>>>  		}
>>>  
>>> +		/* This is last skb to send this portion of data. */
>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>>> +							    info->msg,
>>> +							    can_zcopy)) {
>>> +				ret = -ENOMEM;
>>> +				break;
>>> +			}
>>> +		}
>>> +
>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>  
>>>  		ret = t_ops->send_pkt(skb);
>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>  	if (!t)
>>>  		return -ENOTCONN;
>>>  
>>> -	reply = virtio_transport_alloc_skb(&info, 0,
>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>  					   le64_to_cpu(hdr->dst_cid),
>>>  					   le32_to_cpu(hdr->dst_port),
>>>  					   le64_to_cpu(hdr->src_cid),
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 12:53       ` Stefano Garzarella
@ 2023-07-25 13:06         ` Michael S. Tsirkin
  2023-07-25 13:21           ` Stefano Garzarella
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 13:06 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Arseniy Krasnov, Stefan Hajnoczi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Tue, Jul 25, 2023 at 02:53:39PM +0200, Stefano Garzarella wrote:
> On Tue, Jul 25, 2023 at 07:50:53AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
> > > 
> > > 
> > > On 21.07.2023 00:42, Arseniy Krasnov wrote:
> > > > This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> > > > flag is set and zerocopy transmission is possible (enabled in socket
> > > > options and transport allows zerocopy), then non-linear skb will be
> > > > created and filled with the pages of user's buffer. Pages of user's
> > > > buffer are locked in memory by 'get_user_pages()'. Second thing that
> > > > this patch does is replace type of skb owning: instead of calling
> > > > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> > > > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> > > > of socket, so to decrease this field correctly proper skb destructor is
> > > > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> > > >
> > > > Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> > > > ---
> > > >  Changelog:
> > > >  v5(big patchset) -> v1:
> > > >   * Refactorings of 'if' conditions.
> > > >   * Remove extra blank line.
> > > >   * Remove 'frag_off' field unneeded init.
> > > >   * Add function 'virtio_transport_fill_skb()' which fills both linear
> > > >     and non-linear skb with provided data.
> > > >  v1 -> v2:
> > > >   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> > > >  v2 -> v3:
> > > >   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> > > >     provided 'iov_iter' with data could be sent in a zerocopy mode.
> > > >     If this callback is not set in transport - transport allows to send
> > > >     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> > > >     then zerocopy is allowed. Reason of this callback is that in case of
> > > >     G2H transmission we insert whole skb to the tx virtio queue and such
> > > >     skb must fit to the size of the virtio queue to be sent in a single
> > > >     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> > > >     as in vhost to support partial send of current skb). This callback
> > > >     will be enabled only for G2H path. For details pls see comment
> > > >     'Check that tx queue...' below.
> > > >
> > > >  include/net/af_vsock.h                  |   3 +
> > > >  net/vmw_vsock/virtio_transport.c        |  39 ++++
> > > >  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> > > >  3 files changed, 241 insertions(+), 58 deletions(-)
> > > >
> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > index 0e7504a42925..a6b346eeeb8e 100644
> > > > --- a/include/net/af_vsock.h
> > > > +++ b/include/net/af_vsock.h
> > > > @@ -177,6 +177,9 @@ struct vsock_transport {
> > > >
> > > >  	/* Read a single skb */
> > > >  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> > > > +
> > > > +	/* Zero-copy. */
> > > > +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> > > >  };
> > > >
> > > >  /**** CORE ****/
> > > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > > > index 7bbcc8093e51..23cb8ed638c4 100644
> > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> > > >  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> > > >  }
> > > >
> > > > +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> > > > +{
> > > > +	struct virtio_vsock *vsock;
> > > > +	bool res = false;
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	vsock = rcu_dereference(the_virtio_vsock);
> > > > +	if (vsock) {
> > > > +		struct virtqueue *vq;
> > > > +		int iov_pages;
> > > > +
> > > > +		vq = vsock->vqs[VSOCK_VQ_TX];
> > > > +
> > > > +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> > > > +
> > > > +		/* Check that tx queue is large enough to keep whole
> > > > +		 * data to send. This is needed, because when there is
> > > > +		 * not enough free space in the queue, current skb to
> > > > +		 * send will be reinserted to the head of tx list of
> > > > +		 * the socket to retry transmission later, so if skb
> > > > +		 * is bigger than whole queue, it will be reinserted
> > > > +		 * again and again, thus blocking other skbs to be sent.
> > > > +		 * Each page of the user provided buffer will be added
> > > > +		 * as a single buffer to the tx virtqueue, so compare
> > > > +		 * number of pages against maximum capacity of the queue.
> > > > +		 * +1 means buffer for the packet header.
> > > > +		 */
> > > > +		if (iov_pages + 1 <= vq->num_max)
> > > 
> > > I think this check is actual only for case one we don't have indirect buffer feature.
> > > With indirect mode whole data to send will be packed into one indirect buffer.
> > > 
> > > Thanks, Arseniy
> > 
> > Actually the reverse. With indirect you are limited to num_max.
> > Without you are limited to whatever space is left in the
> > queue (which you did not check here, so you should).
> > 
> > 
> > > > +			res = true;
> > > > +	}
> > > > +
> > > > +	rcu_read_unlock();
> > 
> > Just curious:
> > is the point of all this RCU dance to allow vsock
> > to change from under us? then why is it ok to
> > have it change? the virtio_transport_msgzerocopy_check_iov
> > will then refer to the old vsock ...
> 
> IIRC we introduced the RCU to handle hot-unplug issues:
> commit 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on
> the_virtio_vsock")
> 
> When we remove the device, we flush all the works, etc. so we should
> not be in this case (referring the old vsock), except for an irrelevant
> transient as the device is disappearing.
> 
> Stefano

what if old device goes away then new one appears?

-- 
MST


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 11:59       ` Michael S. Tsirkin
@ 2023-07-25 13:10         ` Arseniy Krasnov
  2023-07-25 13:23           ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25 13:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa



On 25.07.2023 14:59, Michael S. Tsirkin wrote:
> On Tue, Jul 25, 2023 at 11:39:22AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 11:25, Michael S. Tsirkin wrote:
>>> On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>> options and transport allows zerocopy), then non-linear skb will be
>>>> created and filled with the pages of user's buffer. Pages of user's
>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>> this patch does is replace type of skb owning: instead of calling
>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>
>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>> ---
>>>>  Changelog:
>>>>  v5(big patchset) -> v1:
>>>>   * Refactorings of 'if' conditions.
>>>>   * Remove extra blank line.
>>>>   * Remove 'frag_off' field unneeded init.
>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>     and non-linear skb with provided data.
>>>>  v1 -> v2:
>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>  v2 -> v3:
>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>     If this callback is not set in transport - transport allows to send
>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>     as in vhost to support partial send of current skb). This callback
>>>>     will be enabled only for G2H path. For details pls see comment 
>>>>     'Check that tx queue...' below.
>>>>
>>>>  include/net/af_vsock.h                  |   3 +
>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>> --- a/include/net/af_vsock.h
>>>> +++ b/include/net/af_vsock.h
>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>  
>>>>  	/* Read a single skb */
>>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>> +
>>>> +	/* Zero-copy. */
>>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>  };
>>>>  
>>>>  /**** CORE ****/
>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>  }
>>>>  
>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>> +{
>>>> +	struct virtio_vsock *vsock;
>>>> +	bool res = false;
>>>> +
>>>> +	rcu_read_lock();
>>>> +
>>>> +	vsock = rcu_dereference(the_virtio_vsock);
>>>> +	if (vsock) {
>>>> +		struct virtqueue *vq;
>>>> +		int iov_pages;
>>>> +
>>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>>>> +
>>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>> +
>>>> +		/* Check that tx queue is large enough to keep whole
>>>> +		 * data to send. This is needed, because when there is
>>>> +		 * not enough free space in the queue, current skb to
>>>> +		 * send will be reinserted to the head of tx list of
>>>> +		 * the socket to retry transmission later, so if skb
>>>> +		 * is bigger than whole queue, it will be reinserted
>>>> +		 * again and again, thus blocking other skbs to be sent.
>>>> +		 * Each page of the user provided buffer will be added
>>>> +		 * as a single buffer to the tx virtqueue, so compare
>>>> +		 * number of pages against maximum capacity of the queue.
>>>> +		 * +1 means buffer for the packet header.
>>>> +		 */
>>>> +		if (iov_pages + 1 <= vq->num_max)
>>>> +			res = true;
>>>
>>>
>>> Yes but can't there already be buffers in the queue?
>>> Then you can't stick num_max there.
>>
>> I think, that it is not critical, because vhost part always tries to process all
>> incoming buffers (yes, 'vhost_exceeds_weight()' breaks at some moment, but it will
>> reschedule tx kick ('vhost_vsock_handle_tx_kick()') work again), so current "too
>> big" skb will wait until there will be enough space in queue and as it is requeued
>> to the head of tx list it will be inserted to tx queue first.
>>
>> But anyway, I agree that comparing to 'num_free' may be more effective to the whole
>> system performance...
>>
>> Thanks, Arseniy
> 
> Oh I see. It makes sense then - instead of copying just so we can
> stick it in the queue, wait a bit and send later.
> Also - for stream transports can't the message be split
> and sent chunk by chunk? Better than copying ...

Technically yes, also we can split message for non-stream sockets (as vhost
does when it copies data to rx buffers of the guest), but it requires to rework
current implementation by adding buffers one by one to the tx queue. I think
it was not implemented here because until MSG_ZEROCOPY all skbs requires one
(if it is control message) or two (with payload) buffers, so there is no big
sense in processing max two buffers in "one-by-one" mode - we can just wait
for space.

May be, I can add this logic for non-linear skb's here: 

if (skb->len > vq->max_num)
    add buffers "one-by-one", incrementing internal offset in skb,
   if (new skb insertion fails)
       requeue skb, wait for space.

In TX done callback I'll call consume skb only when above mentioned internal
offset == skb->len. I think this approach allows to get rid of new 'check_iov'
callback from this patch.


Stefano, what do You think?

Thanks, Arseniy

> 
> 
>>>
>>>
>>>> +	}
>>>> +
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return res;
>>>> +}
>>>> +
>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>  
>>>>  static struct virtio_transport virtio_transport = {
>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>  
>>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>>>> +
>>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>  	return container_of(t, struct virtio_transport, transport);
>>>>  }
>>>>  
>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>> - *
>>>> - * If NULL is returned, errp is set to a negative errno.
>>>> - */
>>>> -static struct sk_buff *
>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>> -			   size_t len,
>>>> -			   u32 src_cid,
>>>> -			   u32 src_port,
>>>> -			   u32 dst_cid,
>>>> -			   u32 dst_port)
>>>> -{
>>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>> -	struct virtio_vsock_hdr *hdr;
>>>> -	struct sk_buff *skb;
>>>> -	void *payload;
>>>> -	int err;
>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>> +				       size_t max_to_send)
>>>> +{
>>>> +	const struct vsock_transport *t;
>>>> +	struct iov_iter *iov_iter;
>>>>  
>>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> -	if (!skb)
>>>> -		return NULL;
>>>> +	if (!info->msg)
>>>> +		return false;
>>>>  
>>>> -	hdr = virtio_vsock_hdr(skb);
>>>> -	hdr->type	= cpu_to_le16(info->type);
>>>> -	hdr->op		= cpu_to_le16(info->op);
>>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>> -	hdr->src_port	= cpu_to_le32(src_port);
>>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>>>> -	hdr->flags	= cpu_to_le32(info->flags);
>>>> -	hdr->len	= cpu_to_le32(len);
>>>> +	iov_iter = &info->msg->msg_iter;
>>>>  
>>>> -	if (info->msg && len > 0) {
>>>> -		payload = skb_put(skb, len);
>>>> -		err = memcpy_from_msg(payload, info->msg, len);
>>>> -		if (err)
>>>> -			goto out;
>>>> +	t = vsock_core_get_transport(info->vsk);
>>>>  
>>>> -		if (msg_data_left(info->msg) == 0 &&
>>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +	if (t->msgzerocopy_check_iov &&
>>>> +	    !t->msgzerocopy_check_iov(iov_iter))
>>>> +		return false;
>>>>  
>>>> -			if (info->msg->msg_flags & MSG_EOR)
>>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> -		}
>>>> +	/* Data is simple buffer. */
>>>> +	if (iter_is_ubuf(iov_iter))
>>>> +		return true;
>>>> +
>>>> +	if (!iter_is_iovec(iov_iter))
>>>> +		return false;
>>>> +
>>>> +	if (iov_iter->iov_offset)
>>>> +		return false;
>>>> +
>>>> +	/* We can't send whole iov. */
>>>> +	if (iov_iter->count > max_to_send)
>>>> +		return false;
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>> +					   struct sk_buff *skb,
>>>> +					   struct msghdr *msg,
>>>> +					   bool zerocopy)
>>>> +{
>>>> +	struct ubuf_info *uarg;
>>>> +
>>>> +	if (msg->msg_ubuf) {
>>>> +		uarg = msg->msg_ubuf;
>>>> +		net_zcopy_get(uarg);
>>>> +	} else {
>>>> +		struct iov_iter *iter = &msg->msg_iter;
>>>> +		struct ubuf_info_msgzc *uarg_zc;
>>>> +		int len;
>>>> +
>>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>> +		 * checked before.
>>>> +		 */
>>>> +		if (iter_is_iovec(iter))
>>>> +			len = iov_length(iter->__iov, iter->nr_segs);
>>>> +		else
>>>> +			len = iter->count;
>>>> +
>>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>> +					    len,
>>>> +					    NULL);
>>>> +		if (!uarg)
>>>> +			return -1;
>>>> +
>>>> +		uarg_zc = uarg_to_msgzc(uarg);
>>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>  	}
>>>>  
>>>> -	if (info->reply)
>>>> -		virtio_vsock_skb_set_reply(skb);
>>>> +	skb_zcopy_init(skb, uarg);
>>>>  
>>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> -					 dst_cid, dst_port,
>>>> -					 len,
>>>> -					 info->type,
>>>> -					 info->op,
>>>> -					 info->flags);
>>>> +	return 0;
>>>> +}
>>>>  
>>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>> -		goto out;
>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>> +				     struct virtio_vsock_pkt_info *info,
>>>> +				     size_t len,
>>>> +				     bool zcopy)
>>>> +{
>>>> +	if (zcopy) {
>>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>> +					      &info->msg->msg_iter,
>>>> +					      len);
>>>> +	} else {
>>>> +		void *payload;
>>>> +		int err;
>>>> +
>>>> +		payload = skb_put(skb, len);
>>>> +		err = memcpy_from_msg(payload, info->msg, len);
>>>> +		if (err)
>>>> +			return -1;
>>>> +
>>>> +		if (msg_data_left(info->msg))
>>>> +			return 0;
>>>> +
>>>> +		return 0;
>>>>  	}
>>>> +}
>>>>  
>>>> -	return skb;
>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>> +				      struct virtio_vsock_pkt_info *info,
>>>> +				      u32 src_cid,
>>>> +				      u32 src_port,
>>>> +				      u32 dst_cid,
>>>> +				      u32 dst_port,
>>>> +				      size_t len)
>>>> +{
>>>> +	struct virtio_vsock_hdr *hdr;
>>>>  
>>>> -out:
>>>> -	kfree_skb(skb);
>>>> -	return NULL;
>>>> +	hdr = virtio_vsock_hdr(skb);
>>>> +	hdr->type	= cpu_to_le16(info->type);
>>>> +	hdr->op		= cpu_to_le16(info->op);
>>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>> +	hdr->src_port	= cpu_to_le32(src_port);
>>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>>>> +	hdr->flags	= cpu_to_le32(info->flags);
>>>> +	hdr->len	= cpu_to_le32(len);
>>>>  }
>>>>  
>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>  }
>>>>  
>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>> +						  struct virtio_vsock_pkt_info *info,
>>>> +						  size_t payload_len,
>>>> +						  bool zcopy,
>>>> +						  u32 src_cid,
>>>> +						  u32 src_port,
>>>> +						  u32 dst_cid,
>>>> +						  u32 dst_port)
>>>> +{
>>>> +	struct sk_buff *skb;
>>>> +	size_t skb_len;
>>>> +
>>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>> +
>>>> +	if (!zcopy)
>>>> +		skb_len += payload_len;
>>>> +
>>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> +	if (!skb)
>>>> +		return NULL;
>>>> +
>>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>> +				  dst_cid, dst_port,
>>>> +				  payload_len);
>>>> +
>>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>>>> +	 */
>>>> +	if (vsk)
>>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>>>> +
>>>> +	if (info->msg && payload_len > 0) {
>>>> +		int err;
>>>> +
>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>> +		if (err)
>>>> +			goto out;
>>>> +
>>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>> +
>>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +
>>>> +			if (info->msg->msg_flags & MSG_EOR)
>>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (info->reply)
>>>> +		virtio_vsock_skb_set_reply(skb);
>>>> +
>>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> +					 dst_cid, dst_port,
>>>> +					 payload_len,
>>>> +					 info->type,
>>>> +					 info->op,
>>>> +					 info->flags);
>>>> +
>>>> +	return skb;
>>>> +out:
>>>> +	kfree_skb(skb);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>>  /* This function can only be used on connecting/connected sockets,
>>>>   * since a socket assigned to a transport is required.
>>>>   *
>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  					  struct virtio_vsock_pkt_info *info)
>>>>  {
>>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>>>  	const struct virtio_transport *t_ops;
>>>>  	struct virtio_vsock_sock *vvs;
>>>>  	u32 pkt_len = info->pkt_len;
>>>> +	bool can_zcopy = false;
>>>>  	u32 rest_len;
>>>>  	int ret;
>>>>  
>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>  		return pkt_len;
>>>>  
>>>> +	if (info->msg) {
>>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>> +		 * there is no MSG_ZEROCOPY flag set.
>>>> +		 */
>>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>> +
>>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>> +
>>>> +		if (can_zcopy)
>>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>> +	}
>>>> +
>>>>  	rest_len = pkt_len;
>>>>  
>>>>  	do {
>>>>  		struct sk_buff *skb;
>>>>  		size_t skb_len;
>>>>  
>>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>> +		skb_len = min(max_skb_len, rest_len);
>>>>  
>>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>  						 src_cid, src_port,
>>>>  						 dst_cid, dst_port);
>>>>  		if (!skb) {
>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  			break;
>>>>  		}
>>>>  
>>>> +		/* This is last skb to send this portion of data. */
>>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>> +							    info->msg,
>>>> +							    can_zcopy)) {
>>>> +				ret = -ENOMEM;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +
>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>  
>>>>  		ret = t_ops->send_pkt(skb);
>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>  	if (!t)
>>>>  		return -ENOTCONN;
>>>>  
>>>> -	reply = virtio_transport_alloc_skb(&info, 0,
>>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>  					   le64_to_cpu(hdr->src_cid),
>>>> -- 
>>>> 2.25.1
>>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:06         ` Michael S. Tsirkin
@ 2023-07-25 13:21           ` Stefano Garzarella
  0 siblings, 0 replies; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-25 13:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Arseniy Krasnov, Stefan Hajnoczi, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Tue, Jul 25, 2023 at 09:06:02AM -0400, Michael S. Tsirkin wrote:
>On Tue, Jul 25, 2023 at 02:53:39PM +0200, Stefano Garzarella wrote:
>> On Tue, Jul 25, 2023 at 07:50:53AM -0400, Michael S. Tsirkin wrote:
>> > On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>> > >
>> > >
>> > > On 21.07.2023 00:42, Arseniy Krasnov wrote:
>> > > > This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> > > > flag is set and zerocopy transmission is possible (enabled in socket
>> > > > options and transport allows zerocopy), then non-linear skb will be
>> > > > created and filled with the pages of user's buffer. Pages of user's
>> > > > buffer are locked in memory by 'get_user_pages()'. Second thing that
>> > > > this patch does is replace type of skb owning: instead of calling
>> > > > 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>> > > > change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>> > > > of socket, so to decrease this field correctly proper skb destructor is
>> > > > needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>> > > >
>> > > > Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>> > > > ---
>> > > >  Changelog:
>> > > >  v5(big patchset) -> v1:
>> > > >   * Refactorings of 'if' conditions.
>> > > >   * Remove extra blank line.
>> > > >   * Remove 'frag_off' field unneeded init.
>> > > >   * Add function 'virtio_transport_fill_skb()' which fills both linear
>> > > >     and non-linear skb with provided data.
>> > > >  v1 -> v2:
>> > > >   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>> > > >  v2 -> v3:
>> > > >   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>> > > >     provided 'iov_iter' with data could be sent in a zerocopy mode.
>> > > >     If this callback is not set in transport - transport allows to send
>> > > >     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>> > > >     then zerocopy is allowed. Reason of this callback is that in case of
>> > > >     G2H transmission we insert whole skb to the tx virtio queue and such
>> > > >     skb must fit to the size of the virtio queue to be sent in a single
>> > > >     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>> > > >     as in vhost to support partial send of current skb). This callback
>> > > >     will be enabled only for G2H path. For details pls see comment
>> > > >     'Check that tx queue...' below.
>> > > >
>> > > >  include/net/af_vsock.h                  |   3 +
>> > > >  net/vmw_vsock/virtio_transport.c        |  39 ++++
>> > > >  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>> > > >  3 files changed, 241 insertions(+), 58 deletions(-)
>> > > >
>> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> > > > index 0e7504a42925..a6b346eeeb8e 100644
>> > > > --- a/include/net/af_vsock.h
>> > > > +++ b/include/net/af_vsock.h
>> > > > @@ -177,6 +177,9 @@ struct vsock_transport {
>> > > >
>> > > >  	/* Read a single skb */
>> > > >  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>> > > > +
>> > > > +	/* Zero-copy. */
>> > > > +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>> > > >  };
>> > > >
>> > > >  /**** CORE ****/
>> > > > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> > > > index 7bbcc8093e51..23cb8ed638c4 100644
>> > > > --- a/net/vmw_vsock/virtio_transport.c
>> > > > +++ b/net/vmw_vsock/virtio_transport.c
>> > > > @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>> > > >  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>> > > >  }
>> > > >
>> > > > +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>> > > > +{
>> > > > +	struct virtio_vsock *vsock;
>> > > > +	bool res = false;
>> > > > +
>> > > > +	rcu_read_lock();
>> > > > +
>> > > > +	vsock = rcu_dereference(the_virtio_vsock);
>> > > > +	if (vsock) {
>> > > > +		struct virtqueue *vq;
>> > > > +		int iov_pages;
>> > > > +
>> > > > +		vq = vsock->vqs[VSOCK_VQ_TX];
>> > > > +
>> > > > +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>> > > > +
>> > > > +		/* Check that tx queue is large enough to keep whole
>> > > > +		 * data to send. This is needed, because when there is
>> > > > +		 * not enough free space in the queue, current skb to
>> > > > +		 * send will be reinserted to the head of tx list of
>> > > > +		 * the socket to retry transmission later, so if skb
>> > > > +		 * is bigger than whole queue, it will be reinserted
>> > > > +		 * again and again, thus blocking other skbs to be sent.
>> > > > +		 * Each page of the user provided buffer will be added
>> > > > +		 * as a single buffer to the tx virtqueue, so compare
>> > > > +		 * number of pages against maximum capacity of the queue.
>> > > > +		 * +1 means buffer for the packet header.
>> > > > +		 */
>> > > > +		if (iov_pages + 1 <= vq->num_max)
>> > >
>> > > I think this check is actual only for case one we don't have indirect buffer feature.
>> > > With indirect mode whole data to send will be packed into one indirect buffer.
>> > >
>> > > Thanks, Arseniy
>> >
>> > Actually the reverse. With indirect you are limited to num_max.
>> > Without you are limited to whatever space is left in the
>> > queue (which you did not check here, so you should).
>> >
>> >
>> > > > +			res = true;
>> > > > +	}
>> > > > +
>> > > > +	rcu_read_unlock();
>> >
>> > Just curious:
>> > is the point of all this RCU dance to allow vsock
>> > to change from under us? then why is it ok to
>> > have it change? the virtio_transport_msgzerocopy_check_iov
>> > will then refer to the old vsock ...
>>
>> IIRC we introduced the RCU to handle hot-unplug issues:
>> commit 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on
>> the_virtio_vsock")
>>
>> When we remove the device, we flush all the works, etc. so we should
>> not be in this case (referring the old vsock), except for an irrelevant
>> transient as the device is disappearing.
>>
>> Stefano
>
>what if old device goes away then new one appears?

In virtio_vsock_remove() (.remove cb) we hold `the_virtio_vsock_mutex`
while flushing all the works/sockets/packets and sync the RCU.

In virtio_vsock_probe (.probe cb) we hold the same lock while adding
the new one and updating the RCU pointer. (only 1 virtio-vsock device
per guest is currently supported)

So when the new one appears, all the previous sockets are closed, all
the queue packets and pending works flushed.

So new packets will see the new vsock device. It looks safe to me.

Stefano


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:04       ` Arseniy Krasnov
@ 2023-07-25 13:22         ` Michael S. Tsirkin
  2023-07-25 13:28           ` Arseniy Krasnov
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 13:22 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa

On Tue, Jul 25, 2023 at 04:04:13PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 25.07.2023 14:50, Michael S. Tsirkin wrote:
> > On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 21.07.2023 00:42, Arseniy Krasnov wrote:
> >>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> >>> flag is set and zerocopy transmission is possible (enabled in socket
> >>> options and transport allows zerocopy), then non-linear skb will be
> >>> created and filled with the pages of user's buffer. Pages of user's
> >>> buffer are locked in memory by 'get_user_pages()'. Second thing that
> >>> this patch does is replace type of skb owning: instead of calling
> >>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> >>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> >>> of socket, so to decrease this field correctly proper skb destructor is
> >>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> >>>
> >>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> >>> ---
> >>>  Changelog:
> >>>  v5(big patchset) -> v1:
> >>>   * Refactorings of 'if' conditions.
> >>>   * Remove extra blank line.
> >>>   * Remove 'frag_off' field unneeded init.
> >>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
> >>>     and non-linear skb with provided data.
> >>>  v1 -> v2:
> >>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> >>>  v2 -> v3:
> >>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> >>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
> >>>     If this callback is not set in transport - transport allows to send
> >>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> >>>     then zerocopy is allowed. Reason of this callback is that in case of
> >>>     G2H transmission we insert whole skb to the tx virtio queue and such
> >>>     skb must fit to the size of the virtio queue to be sent in a single
> >>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> >>>     as in vhost to support partial send of current skb). This callback
> >>>     will be enabled only for G2H path. For details pls see comment 
> >>>     'Check that tx queue...' below.
> >>>
> >>>  include/net/af_vsock.h                  |   3 +
> >>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
> >>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> >>>  3 files changed, 241 insertions(+), 58 deletions(-)
> >>>
> >>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >>> index 0e7504a42925..a6b346eeeb8e 100644
> >>> --- a/include/net/af_vsock.h
> >>> +++ b/include/net/af_vsock.h
> >>> @@ -177,6 +177,9 @@ struct vsock_transport {
> >>>  
> >>>  	/* Read a single skb */
> >>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> >>> +
> >>> +	/* Zero-copy. */
> >>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> >>>  };
> >>>  
> >>>  /**** CORE ****/
> >>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >>> index 7bbcc8093e51..23cb8ed638c4 100644
> >>> --- a/net/vmw_vsock/virtio_transport.c
> >>> +++ b/net/vmw_vsock/virtio_transport.c
> >>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> >>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> >>>  }
> >>>  
> >>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> >>> +{
> >>> +	struct virtio_vsock *vsock;
> >>> +	bool res = false;
> >>> +
> >>> +	rcu_read_lock();
> >>> +
> >>> +	vsock = rcu_dereference(the_virtio_vsock);
> >>> +	if (vsock) {
> >>> +		struct virtqueue *vq;
> >>> +		int iov_pages;
> >>> +
> >>> +		vq = vsock->vqs[VSOCK_VQ_TX];
> >>> +
> >>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> >>> +
> >>> +		/* Check that tx queue is large enough to keep whole
> >>> +		 * data to send. This is needed, because when there is
> >>> +		 * not enough free space in the queue, current skb to
> >>> +		 * send will be reinserted to the head of tx list of
> >>> +		 * the socket to retry transmission later, so if skb
> >>> +		 * is bigger than whole queue, it will be reinserted
> >>> +		 * again and again, thus blocking other skbs to be sent.
> >>> +		 * Each page of the user provided buffer will be added
> >>> +		 * as a single buffer to the tx virtqueue, so compare
> >>> +		 * number of pages against maximum capacity of the queue.
> >>> +		 * +1 means buffer for the packet header.
> >>> +		 */
> >>> +		if (iov_pages + 1 <= vq->num_max)
> >>
> >> I think this check is actual only for case one we don't have indirect buffer feature.
> >> With indirect mode whole data to send will be packed into one indirect buffer.
> >>
> >> Thanks, Arseniy
> > 
> > Actually the reverse. With indirect you are limited to num_max.
> > Without you are limited to whatever space is left in the
> > queue (which you did not check here, so you should).
> 
> I mean that with indirect, we only need one buffer, and we can just wait
> for enough space - for this single buffer ( as we discussed a little bit before).
> But if indirect buffer is not supported - we need that whole packet must fit
> to the size of tx queue - otherwise it never be transmitted.
> 
> Thanks, Arseniy


yes but according to virtio spec it's illegal to add s/g that is bigger
than queue size.

> > 
> > 
> >>> +			res = true;
> >>> +	}
> >>> +
> >>> +	rcu_read_unlock();
> > 
> > Just curious:
> > is the point of all this RCU dance to allow vsock
> > to change from under us? then why is it ok to
> > have it change? the virtio_transport_msgzerocopy_check_iov
> > will then refer to the old vsock ...
> > 
> > 
> >>> +
> >>> +	return res;
> >>> +}
> >>> +
> >>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
> >>>  
> >>>  static struct virtio_transport virtio_transport = {
> >>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
> >>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
> >>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
> >>>  
> >>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> >>> +
> >>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
> >>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
> >>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>> index 26a4d10da205..e4e3d541aff4 100644
> >>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> >>>  	return container_of(t, struct virtio_transport, transport);
> >>>  }
> >>>  
> >>> -/* Returns a new packet on success, otherwise returns NULL.
> >>> - *
> >>> - * If NULL is returned, errp is set to a negative errno.
> >>> - */
> >>> -static struct sk_buff *
> >>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> >>> -			   size_t len,
> >>> -			   u32 src_cid,
> >>> -			   u32 src_port,
> >>> -			   u32 dst_cid,
> >>> -			   u32 dst_port)
> >>> -{
> >>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> >>> -	struct virtio_vsock_hdr *hdr;
> >>> -	struct sk_buff *skb;
> >>> -	void *payload;
> >>> -	int err;
> >>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> >>> +				       size_t max_to_send)
> >>> +{
> >>> +	const struct vsock_transport *t;
> >>> +	struct iov_iter *iov_iter;
> >>>  
> >>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >>> -	if (!skb)
> >>> -		return NULL;
> >>> +	if (!info->msg)
> >>> +		return false;
> >>>  
> >>> -	hdr = virtio_vsock_hdr(skb);
> >>> -	hdr->type	= cpu_to_le16(info->type);
> >>> -	hdr->op		= cpu_to_le16(info->op);
> >>> -	hdr->src_cid	= cpu_to_le64(src_cid);
> >>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >>> -	hdr->src_port	= cpu_to_le32(src_port);
> >>> -	hdr->dst_port	= cpu_to_le32(dst_port);
> >>> -	hdr->flags	= cpu_to_le32(info->flags);
> >>> -	hdr->len	= cpu_to_le32(len);
> >>> +	iov_iter = &info->msg->msg_iter;
> >>>  
> >>> -	if (info->msg && len > 0) {
> >>> -		payload = skb_put(skb, len);
> >>> -		err = memcpy_from_msg(payload, info->msg, len);
> >>> -		if (err)
> >>> -			goto out;
> >>> +	t = vsock_core_get_transport(info->vsk);
> >>>  
> >>> -		if (msg_data_left(info->msg) == 0 &&
> >>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >>> +	if (t->msgzerocopy_check_iov &&
> >>> +	    !t->msgzerocopy_check_iov(iov_iter))
> >>> +		return false;
> >>>  
> >>> -			if (info->msg->msg_flags & MSG_EOR)
> >>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >>> -		}
> >>> +	/* Data is simple buffer. */
> >>> +	if (iter_is_ubuf(iov_iter))
> >>> +		return true;
> >>> +
> >>> +	if (!iter_is_iovec(iov_iter))
> >>> +		return false;
> >>> +
> >>> +	if (iov_iter->iov_offset)
> >>> +		return false;
> >>> +
> >>> +	/* We can't send whole iov. */
> >>> +	if (iov_iter->count > max_to_send)
> >>> +		return false;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> >>> +					   struct sk_buff *skb,
> >>> +					   struct msghdr *msg,
> >>> +					   bool zerocopy)
> >>> +{
> >>> +	struct ubuf_info *uarg;
> >>> +
> >>> +	if (msg->msg_ubuf) {
> >>> +		uarg = msg->msg_ubuf;
> >>> +		net_zcopy_get(uarg);
> >>> +	} else {
> >>> +		struct iov_iter *iter = &msg->msg_iter;
> >>> +		struct ubuf_info_msgzc *uarg_zc;
> >>> +		int len;
> >>> +
> >>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> >>> +		 * checked before.
> >>> +		 */
> >>> +		if (iter_is_iovec(iter))
> >>> +			len = iov_length(iter->__iov, iter->nr_segs);
> >>> +		else
> >>> +			len = iter->count;
> >>> +
> >>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> >>> +					    len,
> >>> +					    NULL);
> >>> +		if (!uarg)
> >>> +			return -1;
> >>> +
> >>> +		uarg_zc = uarg_to_msgzc(uarg);
> >>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
> >>>  	}
> >>>  
> >>> -	if (info->reply)
> >>> -		virtio_vsock_skb_set_reply(skb);
> >>> +	skb_zcopy_init(skb, uarg);
> >>>  
> >>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>> -					 dst_cid, dst_port,
> >>> -					 len,
> >>> -					 info->type,
> >>> -					 info->op,
> >>> -					 info->flags);
> >>> +	return 0;
> >>> +}
> >>>  
> >>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> >>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> >>> -		goto out;
> >>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> >>> +				     struct virtio_vsock_pkt_info *info,
> >>> +				     size_t len,
> >>> +				     bool zcopy)
> >>> +{
> >>> +	if (zcopy) {
> >>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> >>> +					      &info->msg->msg_iter,
> >>> +					      len);
> >>> +	} else {
> >>> +		void *payload;
> >>> +		int err;
> >>> +
> >>> +		payload = skb_put(skb, len);
> >>> +		err = memcpy_from_msg(payload, info->msg, len);
> >>> +		if (err)
> >>> +			return -1;
> >>> +
> >>> +		if (msg_data_left(info->msg))
> >>> +			return 0;
> >>> +
> >>> +		return 0;
> >>>  	}
> >>> +}
> >>>  
> >>> -	return skb;
> >>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> >>> +				      struct virtio_vsock_pkt_info *info,
> >>> +				      u32 src_cid,
> >>> +				      u32 src_port,
> >>> +				      u32 dst_cid,
> >>> +				      u32 dst_port,
> >>> +				      size_t len)
> >>> +{
> >>> +	struct virtio_vsock_hdr *hdr;
> >>>  
> >>> -out:
> >>> -	kfree_skb(skb);
> >>> -	return NULL;
> >>> +	hdr = virtio_vsock_hdr(skb);
> >>> +	hdr->type	= cpu_to_le16(info->type);
> >>> +	hdr->op		= cpu_to_le16(info->op);
> >>> +	hdr->src_cid	= cpu_to_le64(src_cid);
> >>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >>> +	hdr->src_port	= cpu_to_le32(src_port);
> >>> +	hdr->dst_port	= cpu_to_le32(dst_port);
> >>> +	hdr->flags	= cpu_to_le32(info->flags);
> >>> +	hdr->len	= cpu_to_le32(len);
> >>>  }
> >>>  
> >>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> >>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >>>  }
> >>>  
> >>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> >>> +						  struct virtio_vsock_pkt_info *info,
> >>> +						  size_t payload_len,
> >>> +						  bool zcopy,
> >>> +						  u32 src_cid,
> >>> +						  u32 src_port,
> >>> +						  u32 dst_cid,
> >>> +						  u32 dst_port)
> >>> +{
> >>> +	struct sk_buff *skb;
> >>> +	size_t skb_len;
> >>> +
> >>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> >>> +
> >>> +	if (!zcopy)
> >>> +		skb_len += payload_len;
> >>> +
> >>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >>> +	if (!skb)
> >>> +		return NULL;
> >>> +
> >>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> >>> +				  dst_cid, dst_port,
> >>> +				  payload_len);
> >>> +
> >>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> >>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
> >>> +	 */
> >>> +	if (vsk)
> >>> +		skb_set_owner_w(skb, sk_vsock(vsk));
> >>> +
> >>> +	if (info->msg && payload_len > 0) {
> >>> +		int err;
> >>> +
> >>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> >>> +		if (err)
> >>> +			goto out;
> >>> +
> >>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> >>> +
> >>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >>> +
> >>> +			if (info->msg->msg_flags & MSG_EOR)
> >>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (info->reply)
> >>> +		virtio_vsock_skb_set_reply(skb);
> >>> +
> >>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>> +					 dst_cid, dst_port,
> >>> +					 payload_len,
> >>> +					 info->type,
> >>> +					 info->op,
> >>> +					 info->flags);
> >>> +
> >>> +	return skb;
> >>> +out:
> >>> +	kfree_skb(skb);
> >>> +	return NULL;
> >>> +}
> >>> +
> >>>  /* This function can only be used on connecting/connected sockets,
> >>>   * since a socket assigned to a transport is required.
> >>>   *
> >>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>  					  struct virtio_vsock_pkt_info *info)
> >>>  {
> >>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>>  	u32 src_cid, src_port, dst_cid, dst_port;
> >>>  	const struct virtio_transport *t_ops;
> >>>  	struct virtio_vsock_sock *vvs;
> >>>  	u32 pkt_len = info->pkt_len;
> >>> +	bool can_zcopy = false;
> >>>  	u32 rest_len;
> >>>  	int ret;
> >>>  
> >>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>>  		return pkt_len;
> >>>  
> >>> +	if (info->msg) {
> >>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> >>> +		 * there is no MSG_ZEROCOPY flag set.
> >>> +		 */
> >>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> >>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> >>> +
> >>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> >>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> >>> +
> >>> +		if (can_zcopy)
> >>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> >>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> >>> +	}
> >>> +
> >>>  	rest_len = pkt_len;
> >>>  
> >>>  	do {
> >>>  		struct sk_buff *skb;
> >>>  		size_t skb_len;
> >>>  
> >>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> >>> +		skb_len = min(max_skb_len, rest_len);
> >>>  
> >>> -		skb = virtio_transport_alloc_skb(info, skb_len,
> >>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
> >>>  						 src_cid, src_port,
> >>>  						 dst_cid, dst_port);
> >>>  		if (!skb) {
> >>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>  			break;
> >>>  		}
> >>>  
> >>> +		/* This is last skb to send this portion of data. */
> >>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> >>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> >>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> >>> +							    info->msg,
> >>> +							    can_zcopy)) {
> >>> +				ret = -ENOMEM;
> >>> +				break;
> >>> +			}
> >>> +		}
> >>> +
> >>>  		virtio_transport_inc_tx_pkt(vvs, skb);
> >>>  
> >>>  		ret = t_ops->send_pkt(skb);
> >>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> >>>  	if (!t)
> >>>  		return -ENOTCONN;
> >>>  
> >>> -	reply = virtio_transport_alloc_skb(&info, 0,
> >>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> >>>  					   le64_to_cpu(hdr->dst_cid),
> >>>  					   le32_to_cpu(hdr->dst_port),
> >>>  					   le64_to_cpu(hdr->src_cid),
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:10         ` Arseniy Krasnov
@ 2023-07-25 13:23           ` Michael S. Tsirkin
  2023-07-25 13:30             ` Arseniy Krasnov
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 13:23 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa

On Tue, Jul 25, 2023 at 04:10:40PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 25.07.2023 14:59, Michael S. Tsirkin wrote:
> > On Tue, Jul 25, 2023 at 11:39:22AM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 25.07.2023 11:25, Michael S. Tsirkin wrote:
> >>> On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
> >>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> >>>> flag is set and zerocopy transmission is possible (enabled in socket
> >>>> options and transport allows zerocopy), then non-linear skb will be
> >>>> created and filled with the pages of user's buffer. Pages of user's
> >>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
> >>>> this patch does is replace type of skb owning: instead of calling
> >>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> >>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> >>>> of socket, so to decrease this field correctly proper skb destructor is
> >>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> >>>>
> >>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> >>>> ---
> >>>>  Changelog:
> >>>>  v5(big patchset) -> v1:
> >>>>   * Refactorings of 'if' conditions.
> >>>>   * Remove extra blank line.
> >>>>   * Remove 'frag_off' field unneeded init.
> >>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
> >>>>     and non-linear skb with provided data.
> >>>>  v1 -> v2:
> >>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> >>>>  v2 -> v3:
> >>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> >>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
> >>>>     If this callback is not set in transport - transport allows to send
> >>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> >>>>     then zerocopy is allowed. Reason of this callback is that in case of
> >>>>     G2H transmission we insert whole skb to the tx virtio queue and such
> >>>>     skb must fit to the size of the virtio queue to be sent in a single
> >>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> >>>>     as in vhost to support partial send of current skb). This callback
> >>>>     will be enabled only for G2H path. For details pls see comment 
> >>>>     'Check that tx queue...' below.
> >>>>
> >>>>  include/net/af_vsock.h                  |   3 +
> >>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
> >>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> >>>>  3 files changed, 241 insertions(+), 58 deletions(-)
> >>>>
> >>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >>>> index 0e7504a42925..a6b346eeeb8e 100644
> >>>> --- a/include/net/af_vsock.h
> >>>> +++ b/include/net/af_vsock.h
> >>>> @@ -177,6 +177,9 @@ struct vsock_transport {
> >>>>  
> >>>>  	/* Read a single skb */
> >>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> >>>> +
> >>>> +	/* Zero-copy. */
> >>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> >>>>  };
> >>>>  
> >>>>  /**** CORE ****/
> >>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >>>> index 7bbcc8093e51..23cb8ed638c4 100644
> >>>> --- a/net/vmw_vsock/virtio_transport.c
> >>>> +++ b/net/vmw_vsock/virtio_transport.c
> >>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> >>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> >>>>  }
> >>>>  
> >>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> >>>> +{
> >>>> +	struct virtio_vsock *vsock;
> >>>> +	bool res = false;
> >>>> +
> >>>> +	rcu_read_lock();
> >>>> +
> >>>> +	vsock = rcu_dereference(the_virtio_vsock);
> >>>> +	if (vsock) {
> >>>> +		struct virtqueue *vq;
> >>>> +		int iov_pages;
> >>>> +
> >>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
> >>>> +
> >>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> >>>> +
> >>>> +		/* Check that tx queue is large enough to keep whole
> >>>> +		 * data to send. This is needed, because when there is
> >>>> +		 * not enough free space in the queue, current skb to
> >>>> +		 * send will be reinserted to the head of tx list of
> >>>> +		 * the socket to retry transmission later, so if skb
> >>>> +		 * is bigger than whole queue, it will be reinserted
> >>>> +		 * again and again, thus blocking other skbs to be sent.
> >>>> +		 * Each page of the user provided buffer will be added
> >>>> +		 * as a single buffer to the tx virtqueue, so compare
> >>>> +		 * number of pages against maximum capacity of the queue.
> >>>> +		 * +1 means buffer for the packet header.
> >>>> +		 */
> >>>> +		if (iov_pages + 1 <= vq->num_max)
> >>>> +			res = true;
> >>>
> >>>
> >>> Yes but can't there already be buffers in the queue?
> >>> Then you can't stick num_max there.
> >>
> >> I think, that it is not critical, because vhost part always tries to process all
> >> incoming buffers (yes, 'vhost_exceeds_weight()' breaks at some moment, but it will
> >> reschedule tx kick ('vhost_vsock_handle_tx_kick()') work again), so current "too
> >> big" skb will wait until there will be enough space in queue and as it is requeued
> >> to the head of tx list it will be inserted to tx queue first.
> >>
> >> But anyway, I agree that comparing to 'num_free' may be more effective to the whole
> >> system performance...
> >>
> >> Thanks, Arseniy
> > 
> > Oh I see. It makes sense then - instead of copying just so we can
> > stick it in the queue, wait a bit and send later.
> > Also - for stream transports can't the message be split
> > and sent chunk by chunk? Better than copying ...
> 
> Technically yes, also we can split message for non-stream sockets (as vhost
> does when it copies data to rx buffers of the guest),

Won't breaking up messages break applications though?


> but it requires to rework
> current implementation by adding buffers one by one to the tx queue. I think
> it was not implemented here because until MSG_ZEROCOPY all skbs requires one
> (if it is control message) or two (with payload) buffers, so there is no big
> sense in processing max two buffers in "one-by-one" mode - we can just wait
> for space.
> 
> May be, I can add this logic for non-linear skb's here: 
> 
> if (skb->len > vq->max_num)
>     add buffers "one-by-one", incrementing internal offset in skb,
>    if (new skb insertion fails)
>        requeue skb, wait for space.
> 
> In TX done callback I'll call consume skb only when above mentioned internal
> offset == skb->len. I think this approach allows to get rid of new 'check_iov'
> callback from this patch.
> 
> 
> Stefano, what do You think?
> 
> Thanks, Arseniy
> 
> > 
> > 
> >>>
> >>>
> >>>> +	}
> >>>> +
> >>>> +	rcu_read_unlock();
> >>>> +
> >>>> +	return res;
> >>>> +}
> >>>> +
> >>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
> >>>>  
> >>>>  static struct virtio_transport virtio_transport = {
> >>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
> >>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
> >>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
> >>>>  
> >>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> >>>> +
> >>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
> >>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
> >>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
> >>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>>> index 26a4d10da205..e4e3d541aff4 100644
> >>>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> >>>>  	return container_of(t, struct virtio_transport, transport);
> >>>>  }
> >>>>  
> >>>> -/* Returns a new packet on success, otherwise returns NULL.
> >>>> - *
> >>>> - * If NULL is returned, errp is set to a negative errno.
> >>>> - */
> >>>> -static struct sk_buff *
> >>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> >>>> -			   size_t len,
> >>>> -			   u32 src_cid,
> >>>> -			   u32 src_port,
> >>>> -			   u32 dst_cid,
> >>>> -			   u32 dst_port)
> >>>> -{
> >>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> >>>> -	struct virtio_vsock_hdr *hdr;
> >>>> -	struct sk_buff *skb;
> >>>> -	void *payload;
> >>>> -	int err;
> >>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> >>>> +				       size_t max_to_send)
> >>>> +{
> >>>> +	const struct vsock_transport *t;
> >>>> +	struct iov_iter *iov_iter;
> >>>>  
> >>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >>>> -	if (!skb)
> >>>> -		return NULL;
> >>>> +	if (!info->msg)
> >>>> +		return false;
> >>>>  
> >>>> -	hdr = virtio_vsock_hdr(skb);
> >>>> -	hdr->type	= cpu_to_le16(info->type);
> >>>> -	hdr->op		= cpu_to_le16(info->op);
> >>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
> >>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >>>> -	hdr->src_port	= cpu_to_le32(src_port);
> >>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
> >>>> -	hdr->flags	= cpu_to_le32(info->flags);
> >>>> -	hdr->len	= cpu_to_le32(len);
> >>>> +	iov_iter = &info->msg->msg_iter;
> >>>>  
> >>>> -	if (info->msg && len > 0) {
> >>>> -		payload = skb_put(skb, len);
> >>>> -		err = memcpy_from_msg(payload, info->msg, len);
> >>>> -		if (err)
> >>>> -			goto out;
> >>>> +	t = vsock_core_get_transport(info->vsk);
> >>>>  
> >>>> -		if (msg_data_left(info->msg) == 0 &&
> >>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >>>> +	if (t->msgzerocopy_check_iov &&
> >>>> +	    !t->msgzerocopy_check_iov(iov_iter))
> >>>> +		return false;
> >>>>  
> >>>> -			if (info->msg->msg_flags & MSG_EOR)
> >>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >>>> -		}
> >>>> +	/* Data is simple buffer. */
> >>>> +	if (iter_is_ubuf(iov_iter))
> >>>> +		return true;
> >>>> +
> >>>> +	if (!iter_is_iovec(iov_iter))
> >>>> +		return false;
> >>>> +
> >>>> +	if (iov_iter->iov_offset)
> >>>> +		return false;
> >>>> +
> >>>> +	/* We can't send whole iov. */
> >>>> +	if (iov_iter->count > max_to_send)
> >>>> +		return false;
> >>>> +
> >>>> +	return true;
> >>>> +}
> >>>> +
> >>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> >>>> +					   struct sk_buff *skb,
> >>>> +					   struct msghdr *msg,
> >>>> +					   bool zerocopy)
> >>>> +{
> >>>> +	struct ubuf_info *uarg;
> >>>> +
> >>>> +	if (msg->msg_ubuf) {
> >>>> +		uarg = msg->msg_ubuf;
> >>>> +		net_zcopy_get(uarg);
> >>>> +	} else {
> >>>> +		struct iov_iter *iter = &msg->msg_iter;
> >>>> +		struct ubuf_info_msgzc *uarg_zc;
> >>>> +		int len;
> >>>> +
> >>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> >>>> +		 * checked before.
> >>>> +		 */
> >>>> +		if (iter_is_iovec(iter))
> >>>> +			len = iov_length(iter->__iov, iter->nr_segs);
> >>>> +		else
> >>>> +			len = iter->count;
> >>>> +
> >>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> >>>> +					    len,
> >>>> +					    NULL);
> >>>> +		if (!uarg)
> >>>> +			return -1;
> >>>> +
> >>>> +		uarg_zc = uarg_to_msgzc(uarg);
> >>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
> >>>>  	}
> >>>>  
> >>>> -	if (info->reply)
> >>>> -		virtio_vsock_skb_set_reply(skb);
> >>>> +	skb_zcopy_init(skb, uarg);
> >>>>  
> >>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>>> -					 dst_cid, dst_port,
> >>>> -					 len,
> >>>> -					 info->type,
> >>>> -					 info->op,
> >>>> -					 info->flags);
> >>>> +	return 0;
> >>>> +}
> >>>>  
> >>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> >>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> >>>> -		goto out;
> >>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> >>>> +				     struct virtio_vsock_pkt_info *info,
> >>>> +				     size_t len,
> >>>> +				     bool zcopy)
> >>>> +{
> >>>> +	if (zcopy) {
> >>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> >>>> +					      &info->msg->msg_iter,
> >>>> +					      len);
> >>>> +	} else {
> >>>> +		void *payload;
> >>>> +		int err;
> >>>> +
> >>>> +		payload = skb_put(skb, len);
> >>>> +		err = memcpy_from_msg(payload, info->msg, len);
> >>>> +		if (err)
> >>>> +			return -1;
> >>>> +
> >>>> +		if (msg_data_left(info->msg))
> >>>> +			return 0;
> >>>> +
> >>>> +		return 0;
> >>>>  	}
> >>>> +}
> >>>>  
> >>>> -	return skb;
> >>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> >>>> +				      struct virtio_vsock_pkt_info *info,
> >>>> +				      u32 src_cid,
> >>>> +				      u32 src_port,
> >>>> +				      u32 dst_cid,
> >>>> +				      u32 dst_port,
> >>>> +				      size_t len)
> >>>> +{
> >>>> +	struct virtio_vsock_hdr *hdr;
> >>>>  
> >>>> -out:
> >>>> -	kfree_skb(skb);
> >>>> -	return NULL;
> >>>> +	hdr = virtio_vsock_hdr(skb);
> >>>> +	hdr->type	= cpu_to_le16(info->type);
> >>>> +	hdr->op		= cpu_to_le16(info->op);
> >>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
> >>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >>>> +	hdr->src_port	= cpu_to_le32(src_port);
> >>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
> >>>> +	hdr->flags	= cpu_to_le32(info->flags);
> >>>> +	hdr->len	= cpu_to_le32(len);
> >>>>  }
> >>>>  
> >>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> >>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >>>>  }
> >>>>  
> >>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> >>>> +						  struct virtio_vsock_pkt_info *info,
> >>>> +						  size_t payload_len,
> >>>> +						  bool zcopy,
> >>>> +						  u32 src_cid,
> >>>> +						  u32 src_port,
> >>>> +						  u32 dst_cid,
> >>>> +						  u32 dst_port)
> >>>> +{
> >>>> +	struct sk_buff *skb;
> >>>> +	size_t skb_len;
> >>>> +
> >>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> >>>> +
> >>>> +	if (!zcopy)
> >>>> +		skb_len += payload_len;
> >>>> +
> >>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >>>> +	if (!skb)
> >>>> +		return NULL;
> >>>> +
> >>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> >>>> +				  dst_cid, dst_port,
> >>>> +				  payload_len);
> >>>> +
> >>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> >>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
> >>>> +	 */
> >>>> +	if (vsk)
> >>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
> >>>> +
> >>>> +	if (info->msg && payload_len > 0) {
> >>>> +		int err;
> >>>> +
> >>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> >>>> +		if (err)
> >>>> +			goto out;
> >>>> +
> >>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> >>>> +
> >>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >>>> +
> >>>> +			if (info->msg->msg_flags & MSG_EOR)
> >>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	if (info->reply)
> >>>> +		virtio_vsock_skb_set_reply(skb);
> >>>> +
> >>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>>> +					 dst_cid, dst_port,
> >>>> +					 payload_len,
> >>>> +					 info->type,
> >>>> +					 info->op,
> >>>> +					 info->flags);
> >>>> +
> >>>> +	return skb;
> >>>> +out:
> >>>> +	kfree_skb(skb);
> >>>> +	return NULL;
> >>>> +}
> >>>> +
> >>>>  /* This function can only be used on connecting/connected sockets,
> >>>>   * since a socket assigned to a transport is required.
> >>>>   *
> >>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>>  					  struct virtio_vsock_pkt_info *info)
> >>>>  {
> >>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>>>  	u32 src_cid, src_port, dst_cid, dst_port;
> >>>>  	const struct virtio_transport *t_ops;
> >>>>  	struct virtio_vsock_sock *vvs;
> >>>>  	u32 pkt_len = info->pkt_len;
> >>>> +	bool can_zcopy = false;
> >>>>  	u32 rest_len;
> >>>>  	int ret;
> >>>>  
> >>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>>>  		return pkt_len;
> >>>>  
> >>>> +	if (info->msg) {
> >>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> >>>> +		 * there is no MSG_ZEROCOPY flag set.
> >>>> +		 */
> >>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> >>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> >>>> +
> >>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> >>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> >>>> +
> >>>> +		if (can_zcopy)
> >>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> >>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> >>>> +	}
> >>>> +
> >>>>  	rest_len = pkt_len;
> >>>>  
> >>>>  	do {
> >>>>  		struct sk_buff *skb;
> >>>>  		size_t skb_len;
> >>>>  
> >>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> >>>> +		skb_len = min(max_skb_len, rest_len);
> >>>>  
> >>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
> >>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
> >>>>  						 src_cid, src_port,
> >>>>  						 dst_cid, dst_port);
> >>>>  		if (!skb) {
> >>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>>  			break;
> >>>>  		}
> >>>>  
> >>>> +		/* This is last skb to send this portion of data. */
> >>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> >>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> >>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> >>>> +							    info->msg,
> >>>> +							    can_zcopy)) {
> >>>> +				ret = -ENOMEM;
> >>>> +				break;
> >>>> +			}
> >>>> +		}
> >>>> +
> >>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
> >>>>  
> >>>>  		ret = t_ops->send_pkt(skb);
> >>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> >>>>  	if (!t)
> >>>>  		return -ENOTCONN;
> >>>>  
> >>>> -	reply = virtio_transport_alloc_skb(&info, 0,
> >>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> >>>>  					   le64_to_cpu(hdr->dst_cid),
> >>>>  					   le32_to_cpu(hdr->dst_port),
> >>>>  					   le64_to_cpu(hdr->src_cid),
> >>>> -- 
> >>>> 2.25.1
> >>>
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:22         ` Michael S. Tsirkin
@ 2023-07-25 13:28           ` Arseniy Krasnov
  2023-07-25 13:36             ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25 13:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa



On 25.07.2023 16:22, Michael S. Tsirkin wrote:
> On Tue, Jul 25, 2023 at 04:04:13PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 14:50, Michael S. Tsirkin wrote:
>>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>>> options and transport allows zerocopy), then non-linear skb will be
>>>>> created and filled with the pages of user's buffer. Pages of user's
>>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>>> this patch does is replace type of skb owning: instead of calling
>>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>>
>>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>>> ---
>>>>>  Changelog:
>>>>>  v5(big patchset) -> v1:
>>>>>   * Refactorings of 'if' conditions.
>>>>>   * Remove extra blank line.
>>>>>   * Remove 'frag_off' field unneeded init.
>>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>>     and non-linear skb with provided data.
>>>>>  v1 -> v2:
>>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>>  v2 -> v3:
>>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>>     If this callback is not set in transport - transport allows to send
>>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>>     as in vhost to support partial send of current skb). This callback
>>>>>     will be enabled only for G2H path. For details pls see comment 
>>>>>     'Check that tx queue...' below.
>>>>>
>>>>>  include/net/af_vsock.h                  |   3 +
>>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>>
>>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>>> --- a/include/net/af_vsock.h
>>>>> +++ b/include/net/af_vsock.h
>>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>>  
>>>>>  	/* Read a single skb */
>>>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>>> +
>>>>> +	/* Zero-copy. */
>>>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>>  };
>>>>>  
>>>>>  /**** CORE ****/
>>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>>  }
>>>>>  
>>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>>> +{
>>>>> +	struct virtio_vsock *vsock;
>>>>> +	bool res = false;
>>>>> +
>>>>> +	rcu_read_lock();
>>>>> +
>>>>> +	vsock = rcu_dereference(the_virtio_vsock);
>>>>> +	if (vsock) {
>>>>> +		struct virtqueue *vq;
>>>>> +		int iov_pages;
>>>>> +
>>>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>>>>> +
>>>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>>> +
>>>>> +		/* Check that tx queue is large enough to keep whole
>>>>> +		 * data to send. This is needed, because when there is
>>>>> +		 * not enough free space in the queue, current skb to
>>>>> +		 * send will be reinserted to the head of tx list of
>>>>> +		 * the socket to retry transmission later, so if skb
>>>>> +		 * is bigger than whole queue, it will be reinserted
>>>>> +		 * again and again, thus blocking other skbs to be sent.
>>>>> +		 * Each page of the user provided buffer will be added
>>>>> +		 * as a single buffer to the tx virtqueue, so compare
>>>>> +		 * number of pages against maximum capacity of the queue.
>>>>> +		 * +1 means buffer for the packet header.
>>>>> +		 */
>>>>> +		if (iov_pages + 1 <= vq->num_max)
>>>>
>>>> I think this check is actual only for case one we don't have indirect buffer feature.
>>>> With indirect mode whole data to send will be packed into one indirect buffer.
>>>>
>>>> Thanks, Arseniy
>>>
>>> Actually the reverse. With indirect you are limited to num_max.
>>> Without you are limited to whatever space is left in the
>>> queue (which you did not check here, so you should).
>>
>> I mean that with indirect, we only need one buffer, and we can just wait
>> for enough space - for this single buffer ( as we discussed a little bit before).
>> But if indirect buffer is not supported - we need that whole packet must fit
>> to the size of tx queue - otherwise it never be transmitted.
>>
>> Thanks, Arseniy
> 
> 
> yes but according to virtio spec it's illegal to add s/g that is bigger
> than queue size.

Aah, so even in case of indirect buffers feature, buffer descriptors stored in memory
pointed by indirect buffer must be accounted against queue size ?

Thanks, Arseniy

> 
>>>
>>>
>>>>> +			res = true;
>>>>> +	}
>>>>> +
>>>>> +	rcu_read_unlock();
>>>
>>> Just curious:
>>> is the point of all this RCU dance to allow vsock
>>> to change from under us? then why is it ok to
>>> have it change? the virtio_transport_msgzerocopy_check_iov
>>> will then refer to the old vsock ...
>>>
>>>
>>>>> +
>>>>> +	return res;
>>>>> +}
>>>>> +
>>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>>  
>>>>>  static struct virtio_transport virtio_transport = {
>>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>>  
>>>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>>>>> +
>>>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>>  	return container_of(t, struct virtio_transport, transport);
>>>>>  }
>>>>>  
>>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>>> - *
>>>>> - * If NULL is returned, errp is set to a negative errno.
>>>>> - */
>>>>> -static struct sk_buff *
>>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>> -			   size_t len,
>>>>> -			   u32 src_cid,
>>>>> -			   u32 src_port,
>>>>> -			   u32 dst_cid,
>>>>> -			   u32 dst_port)
>>>>> -{
>>>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>>> -	struct virtio_vsock_hdr *hdr;
>>>>> -	struct sk_buff *skb;
>>>>> -	void *payload;
>>>>> -	int err;
>>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>>> +				       size_t max_to_send)
>>>>> +{
>>>>> +	const struct vsock_transport *t;
>>>>> +	struct iov_iter *iov_iter;
>>>>>  
>>>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>> -	if (!skb)
>>>>> -		return NULL;
>>>>> +	if (!info->msg)
>>>>> +		return false;
>>>>>  
>>>>> -	hdr = virtio_vsock_hdr(skb);
>>>>> -	hdr->type	= cpu_to_le16(info->type);
>>>>> -	hdr->op		= cpu_to_le16(info->op);
>>>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>>>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>>> -	hdr->src_port	= cpu_to_le32(src_port);
>>>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>>>>> -	hdr->flags	= cpu_to_le32(info->flags);
>>>>> -	hdr->len	= cpu_to_le32(len);
>>>>> +	iov_iter = &info->msg->msg_iter;
>>>>>  
>>>>> -	if (info->msg && len > 0) {
>>>>> -		payload = skb_put(skb, len);
>>>>> -		err = memcpy_from_msg(payload, info->msg, len);
>>>>> -		if (err)
>>>>> -			goto out;
>>>>> +	t = vsock_core_get_transport(info->vsk);
>>>>>  
>>>>> -		if (msg_data_left(info->msg) == 0 &&
>>>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>> +	if (t->msgzerocopy_check_iov &&
>>>>> +	    !t->msgzerocopy_check_iov(iov_iter))
>>>>> +		return false;
>>>>>  
>>>>> -			if (info->msg->msg_flags & MSG_EOR)
>>>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>> -		}
>>>>> +	/* Data is simple buffer. */
>>>>> +	if (iter_is_ubuf(iov_iter))
>>>>> +		return true;
>>>>> +
>>>>> +	if (!iter_is_iovec(iov_iter))
>>>>> +		return false;
>>>>> +
>>>>> +	if (iov_iter->iov_offset)
>>>>> +		return false;
>>>>> +
>>>>> +	/* We can't send whole iov. */
>>>>> +	if (iov_iter->count > max_to_send)
>>>>> +		return false;
>>>>> +
>>>>> +	return true;
>>>>> +}
>>>>> +
>>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>>> +					   struct sk_buff *skb,
>>>>> +					   struct msghdr *msg,
>>>>> +					   bool zerocopy)
>>>>> +{
>>>>> +	struct ubuf_info *uarg;
>>>>> +
>>>>> +	if (msg->msg_ubuf) {
>>>>> +		uarg = msg->msg_ubuf;
>>>>> +		net_zcopy_get(uarg);
>>>>> +	} else {
>>>>> +		struct iov_iter *iter = &msg->msg_iter;
>>>>> +		struct ubuf_info_msgzc *uarg_zc;
>>>>> +		int len;
>>>>> +
>>>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>>> +		 * checked before.
>>>>> +		 */
>>>>> +		if (iter_is_iovec(iter))
>>>>> +			len = iov_length(iter->__iov, iter->nr_segs);
>>>>> +		else
>>>>> +			len = iter->count;
>>>>> +
>>>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>>> +					    len,
>>>>> +					    NULL);
>>>>> +		if (!uarg)
>>>>> +			return -1;
>>>>> +
>>>>> +		uarg_zc = uarg_to_msgzc(uarg);
>>>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>>  	}
>>>>>  
>>>>> -	if (info->reply)
>>>>> -		virtio_vsock_skb_set_reply(skb);
>>>>> +	skb_zcopy_init(skb, uarg);
>>>>>  
>>>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>> -					 dst_cid, dst_port,
>>>>> -					 len,
>>>>> -					 info->type,
>>>>> -					 info->op,
>>>>> -					 info->flags);
>>>>> +	return 0;
>>>>> +}
>>>>>  
>>>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>>> -		goto out;
>>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>> +				     struct virtio_vsock_pkt_info *info,
>>>>> +				     size_t len,
>>>>> +				     bool zcopy)
>>>>> +{
>>>>> +	if (zcopy) {
>>>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>>> +					      &info->msg->msg_iter,
>>>>> +					      len);
>>>>> +	} else {
>>>>> +		void *payload;
>>>>> +		int err;
>>>>> +
>>>>> +		payload = skb_put(skb, len);
>>>>> +		err = memcpy_from_msg(payload, info->msg, len);
>>>>> +		if (err)
>>>>> +			return -1;
>>>>> +
>>>>> +		if (msg_data_left(info->msg))
>>>>> +			return 0;
>>>>> +
>>>>> +		return 0;
>>>>>  	}
>>>>> +}
>>>>>  
>>>>> -	return skb;
>>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>>> +				      struct virtio_vsock_pkt_info *info,
>>>>> +				      u32 src_cid,
>>>>> +				      u32 src_port,
>>>>> +				      u32 dst_cid,
>>>>> +				      u32 dst_port,
>>>>> +				      size_t len)
>>>>> +{
>>>>> +	struct virtio_vsock_hdr *hdr;
>>>>>  
>>>>> -out:
>>>>> -	kfree_skb(skb);
>>>>> -	return NULL;
>>>>> +	hdr = virtio_vsock_hdr(skb);
>>>>> +	hdr->type	= cpu_to_le16(info->type);
>>>>> +	hdr->op		= cpu_to_le16(info->op);
>>>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>>>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>>> +	hdr->src_port	= cpu_to_le32(src_port);
>>>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>>>>> +	hdr->flags	= cpu_to_le32(info->flags);
>>>>> +	hdr->len	= cpu_to_le32(len);
>>>>>  }
>>>>>  
>>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>>  }
>>>>>  
>>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>>> +						  struct virtio_vsock_pkt_info *info,
>>>>> +						  size_t payload_len,
>>>>> +						  bool zcopy,
>>>>> +						  u32 src_cid,
>>>>> +						  u32 src_port,
>>>>> +						  u32 dst_cid,
>>>>> +						  u32 dst_port)
>>>>> +{
>>>>> +	struct sk_buff *skb;
>>>>> +	size_t skb_len;
>>>>> +
>>>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>>> +
>>>>> +	if (!zcopy)
>>>>> +		skb_len += payload_len;
>>>>> +
>>>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>> +	if (!skb)
>>>>> +		return NULL;
>>>>> +
>>>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>>> +				  dst_cid, dst_port,
>>>>> +				  payload_len);
>>>>> +
>>>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>>>>> +	 */
>>>>> +	if (vsk)
>>>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>>>>> +
>>>>> +	if (info->msg && payload_len > 0) {
>>>>> +		int err;
>>>>> +
>>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>>> +		if (err)
>>>>> +			goto out;
>>>>> +
>>>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>>> +
>>>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>> +
>>>>> +			if (info->msg->msg_flags & MSG_EOR)
>>>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	if (info->reply)
>>>>> +		virtio_vsock_skb_set_reply(skb);
>>>>> +
>>>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>> +					 dst_cid, dst_port,
>>>>> +					 payload_len,
>>>>> +					 info->type,
>>>>> +					 info->op,
>>>>> +					 info->flags);
>>>>> +
>>>>> +	return skb;
>>>>> +out:
>>>>> +	kfree_skb(skb);
>>>>> +	return NULL;
>>>>> +}
>>>>> +
>>>>>  /* This function can only be used on connecting/connected sockets,
>>>>>   * since a socket assigned to a transport is required.
>>>>>   *
>>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>  					  struct virtio_vsock_pkt_info *info)
>>>>>  {
>>>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>>>>  	const struct virtio_transport *t_ops;
>>>>>  	struct virtio_vsock_sock *vvs;
>>>>>  	u32 pkt_len = info->pkt_len;
>>>>> +	bool can_zcopy = false;
>>>>>  	u32 rest_len;
>>>>>  	int ret;
>>>>>  
>>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>>  		return pkt_len;
>>>>>  
>>>>> +	if (info->msg) {
>>>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>>> +		 * there is no MSG_ZEROCOPY flag set.
>>>>> +		 */
>>>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>> +
>>>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>>> +
>>>>> +		if (can_zcopy)
>>>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>>> +	}
>>>>> +
>>>>>  	rest_len = pkt_len;
>>>>>  
>>>>>  	do {
>>>>>  		struct sk_buff *skb;
>>>>>  		size_t skb_len;
>>>>>  
>>>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>>> +		skb_len = min(max_skb_len, rest_len);
>>>>>  
>>>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>>>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>>  						 src_cid, src_port,
>>>>>  						 dst_cid, dst_port);
>>>>>  		if (!skb) {
>>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>  			break;
>>>>>  		}
>>>>>  
>>>>> +		/* This is last skb to send this portion of data. */
>>>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>>> +							    info->msg,
>>>>> +							    can_zcopy)) {
>>>>> +				ret = -ENOMEM;
>>>>> +				break;
>>>>> +			}
>>>>> +		}
>>>>> +
>>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>>  
>>>>>  		ret = t_ops->send_pkt(skb);
>>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>>  	if (!t)
>>>>>  		return -ENOTCONN;
>>>>>  
>>>>> -	reply = virtio_transport_alloc_skb(&info, 0,
>>>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>>  					   le64_to_cpu(hdr->src_cid),
>>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:23           ` Michael S. Tsirkin
@ 2023-07-25 13:30             ` Arseniy Krasnov
  0 siblings, 0 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25 13:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa



On 25.07.2023 16:23, Michael S. Tsirkin wrote:
> On Tue, Jul 25, 2023 at 04:10:40PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 14:59, Michael S. Tsirkin wrote:
>>> On Tue, Jul 25, 2023 at 11:39:22AM +0300, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> On 25.07.2023 11:25, Michael S. Tsirkin wrote:
>>>>> On Fri, Jul 21, 2023 at 12:42:45AM +0300, Arseniy Krasnov wrote:
>>>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>>>> options and transport allows zerocopy), then non-linear skb will be
>>>>>> created and filled with the pages of user's buffer. Pages of user's
>>>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>>>> this patch does is replace type of skb owning: instead of calling
>>>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>>>
>>>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>>>> ---
>>>>>>  Changelog:
>>>>>>  v5(big patchset) -> v1:
>>>>>>   * Refactorings of 'if' conditions.
>>>>>>   * Remove extra blank line.
>>>>>>   * Remove 'frag_off' field unneeded init.
>>>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>>>     and non-linear skb with provided data.
>>>>>>  v1 -> v2:
>>>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>>>  v2 -> v3:
>>>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>>>     If this callback is not set in transport - transport allows to send
>>>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>>>     as in vhost to support partial send of current skb). This callback
>>>>>>     will be enabled only for G2H path. For details pls see comment 
>>>>>>     'Check that tx queue...' below.
>>>>>>
>>>>>>  include/net/af_vsock.h                  |   3 +
>>>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>>>
>>>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>>>> --- a/include/net/af_vsock.h
>>>>>> +++ b/include/net/af_vsock.h
>>>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>>>  
>>>>>>  	/* Read a single skb */
>>>>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>>>> +
>>>>>> +	/* Zero-copy. */
>>>>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>>>  };
>>>>>>  
>>>>>>  /**** CORE ****/
>>>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>>>  }
>>>>>>  
>>>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>>>> +{
>>>>>> +	struct virtio_vsock *vsock;
>>>>>> +	bool res = false;
>>>>>> +
>>>>>> +	rcu_read_lock();
>>>>>> +
>>>>>> +	vsock = rcu_dereference(the_virtio_vsock);
>>>>>> +	if (vsock) {
>>>>>> +		struct virtqueue *vq;
>>>>>> +		int iov_pages;
>>>>>> +
>>>>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>>>>>> +
>>>>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>>>> +
>>>>>> +		/* Check that tx queue is large enough to keep whole
>>>>>> +		 * data to send. This is needed, because when there is
>>>>>> +		 * not enough free space in the queue, current skb to
>>>>>> +		 * send will be reinserted to the head of tx list of
>>>>>> +		 * the socket to retry transmission later, so if skb
>>>>>> +		 * is bigger than whole queue, it will be reinserted
>>>>>> +		 * again and again, thus blocking other skbs to be sent.
>>>>>> +		 * Each page of the user provided buffer will be added
>>>>>> +		 * as a single buffer to the tx virtqueue, so compare
>>>>>> +		 * number of pages against maximum capacity of the queue.
>>>>>> +		 * +1 means buffer for the packet header.
>>>>>> +		 */
>>>>>> +		if (iov_pages + 1 <= vq->num_max)
>>>>>> +			res = true;
>>>>>
>>>>>
>>>>> Yes but can't there already be buffers in the queue?
>>>>> Then you can't stick num_max there.
>>>>
>>>> I think, that it is not critical, because vhost part always tries to process all
>>>> incoming buffers (yes, 'vhost_exceeds_weight()' breaks at some moment, but it will
>>>> reschedule tx kick ('vhost_vsock_handle_tx_kick()') work again), so current "too
>>>> big" skb will wait until there will be enough space in queue and as it is requeued
>>>> to the head of tx list it will be inserted to tx queue first.
>>>>
>>>> But anyway, I agree that comparing to 'num_free' may be more effective to the whole
>>>> system performance...
>>>>
>>>> Thanks, Arseniy
>>>
>>> Oh I see. It makes sense then - instead of copying just so we can
>>> stick it in the queue, wait a bit and send later.
>>> Also - for stream transports can't the message be split
>>> and sent chunk by chunk? Better than copying ...
>>
>> Technically yes, also we can split message for non-stream sockets (as vhost
>> does when it copies data to rx buffers of the guest),
> 
> Won't breaking up messages break applications though?

No, for seqpacket we have special marker + port in each message, so we
can restore original message at the receiver and pass it to the socket.

Thanks, Arseniy

> 
> 
>> but it requires to rework
>> current implementation by adding buffers one by one to the tx queue. I think
>> it was not implemented here because until MSG_ZEROCOPY all skbs requires one
>> (if it is control message) or two (with payload) buffers, so there is no big
>> sense in processing max two buffers in "one-by-one" mode - we can just wait
>> for space.
>>
>> May be, I can add this logic for non-linear skb's here: 
>>
>> if (skb->len > vq->max_num)
>>     add buffers "one-by-one", incrementing internal offset in skb,
>>    if (new skb insertion fails)
>>        requeue skb, wait for space.
>>
>> In TX done callback I'll call consume skb only when above mentioned internal
>> offset == skb->len. I think this approach allows to get rid of new 'check_iov'
>> callback from this patch.
>>
>>
>> Stefano, what do You think?
>>
>> Thanks, Arseniy
>>
>>>
>>>
>>>>>
>>>>>
>>>>>> +	}
>>>>>> +
>>>>>> +	rcu_read_unlock();
>>>>>> +
>>>>>> +	return res;
>>>>>> +}
>>>>>> +
>>>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>>>  
>>>>>>  static struct virtio_transport virtio_transport = {
>>>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>>>  
>>>>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>>>>>> +
>>>>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>>>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>>>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>>>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>>>  	return container_of(t, struct virtio_transport, transport);
>>>>>>  }
>>>>>>  
>>>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>>>> - *
>>>>>> - * If NULL is returned, errp is set to a negative errno.
>>>>>> - */
>>>>>> -static struct sk_buff *
>>>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>>> -			   size_t len,
>>>>>> -			   u32 src_cid,
>>>>>> -			   u32 src_port,
>>>>>> -			   u32 dst_cid,
>>>>>> -			   u32 dst_port)
>>>>>> -{
>>>>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>>>> -	struct virtio_vsock_hdr *hdr;
>>>>>> -	struct sk_buff *skb;
>>>>>> -	void *payload;
>>>>>> -	int err;
>>>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>>>> +				       size_t max_to_send)
>>>>>> +{
>>>>>> +	const struct vsock_transport *t;
>>>>>> +	struct iov_iter *iov_iter;
>>>>>>  
>>>>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>>> -	if (!skb)
>>>>>> -		return NULL;
>>>>>> +	if (!info->msg)
>>>>>> +		return false;
>>>>>>  
>>>>>> -	hdr = virtio_vsock_hdr(skb);
>>>>>> -	hdr->type	= cpu_to_le16(info->type);
>>>>>> -	hdr->op		= cpu_to_le16(info->op);
>>>>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>>>>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>>>> -	hdr->src_port	= cpu_to_le32(src_port);
>>>>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>>>>>> -	hdr->flags	= cpu_to_le32(info->flags);
>>>>>> -	hdr->len	= cpu_to_le32(len);
>>>>>> +	iov_iter = &info->msg->msg_iter;
>>>>>>  
>>>>>> -	if (info->msg && len > 0) {
>>>>>> -		payload = skb_put(skb, len);
>>>>>> -		err = memcpy_from_msg(payload, info->msg, len);
>>>>>> -		if (err)
>>>>>> -			goto out;
>>>>>> +	t = vsock_core_get_transport(info->vsk);
>>>>>>  
>>>>>> -		if (msg_data_left(info->msg) == 0 &&
>>>>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>> +	if (t->msgzerocopy_check_iov &&
>>>>>> +	    !t->msgzerocopy_check_iov(iov_iter))
>>>>>> +		return false;
>>>>>>  
>>>>>> -			if (info->msg->msg_flags & MSG_EOR)
>>>>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>>> -		}
>>>>>> +	/* Data is simple buffer. */
>>>>>> +	if (iter_is_ubuf(iov_iter))
>>>>>> +		return true;
>>>>>> +
>>>>>> +	if (!iter_is_iovec(iov_iter))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	if (iov_iter->iov_offset)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	/* We can't send whole iov. */
>>>>>> +	if (iov_iter->count > max_to_send)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	return true;
>>>>>> +}
>>>>>> +
>>>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>>>> +					   struct sk_buff *skb,
>>>>>> +					   struct msghdr *msg,
>>>>>> +					   bool zerocopy)
>>>>>> +{
>>>>>> +	struct ubuf_info *uarg;
>>>>>> +
>>>>>> +	if (msg->msg_ubuf) {
>>>>>> +		uarg = msg->msg_ubuf;
>>>>>> +		net_zcopy_get(uarg);
>>>>>> +	} else {
>>>>>> +		struct iov_iter *iter = &msg->msg_iter;
>>>>>> +		struct ubuf_info_msgzc *uarg_zc;
>>>>>> +		int len;
>>>>>> +
>>>>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>>>> +		 * checked before.
>>>>>> +		 */
>>>>>> +		if (iter_is_iovec(iter))
>>>>>> +			len = iov_length(iter->__iov, iter->nr_segs);
>>>>>> +		else
>>>>>> +			len = iter->count;
>>>>>> +
>>>>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>>>> +					    len,
>>>>>> +					    NULL);
>>>>>> +		if (!uarg)
>>>>>> +			return -1;
>>>>>> +
>>>>>> +		uarg_zc = uarg_to_msgzc(uarg);
>>>>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>>>  	}
>>>>>>  
>>>>>> -	if (info->reply)
>>>>>> -		virtio_vsock_skb_set_reply(skb);
>>>>>> +	skb_zcopy_init(skb, uarg);
>>>>>>  
>>>>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>>> -					 dst_cid, dst_port,
>>>>>> -					 len,
>>>>>> -					 info->type,
>>>>>> -					 info->op,
>>>>>> -					 info->flags);
>>>>>> +	return 0;
>>>>>> +}
>>>>>>  
>>>>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>>>> -		goto out;
>>>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>>> +				     struct virtio_vsock_pkt_info *info,
>>>>>> +				     size_t len,
>>>>>> +				     bool zcopy)
>>>>>> +{
>>>>>> +	if (zcopy) {
>>>>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>>>> +					      &info->msg->msg_iter,
>>>>>> +					      len);
>>>>>> +	} else {
>>>>>> +		void *payload;
>>>>>> +		int err;
>>>>>> +
>>>>>> +		payload = skb_put(skb, len);
>>>>>> +		err = memcpy_from_msg(payload, info->msg, len);
>>>>>> +		if (err)
>>>>>> +			return -1;
>>>>>> +
>>>>>> +		if (msg_data_left(info->msg))
>>>>>> +			return 0;
>>>>>> +
>>>>>> +		return 0;
>>>>>>  	}
>>>>>> +}
>>>>>>  
>>>>>> -	return skb;
>>>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>>>> +				      struct virtio_vsock_pkt_info *info,
>>>>>> +				      u32 src_cid,
>>>>>> +				      u32 src_port,
>>>>>> +				      u32 dst_cid,
>>>>>> +				      u32 dst_port,
>>>>>> +				      size_t len)
>>>>>> +{
>>>>>> +	struct virtio_vsock_hdr *hdr;
>>>>>>  
>>>>>> -out:
>>>>>> -	kfree_skb(skb);
>>>>>> -	return NULL;
>>>>>> +	hdr = virtio_vsock_hdr(skb);
>>>>>> +	hdr->type	= cpu_to_le16(info->type);
>>>>>> +	hdr->op		= cpu_to_le16(info->op);
>>>>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>>>>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>>>> +	hdr->src_port	= cpu_to_le32(src_port);
>>>>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>>>>>> +	hdr->flags	= cpu_to_le32(info->flags);
>>>>>> +	hdr->len	= cpu_to_le32(len);
>>>>>>  }
>>>>>>  
>>>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>>>  }
>>>>>>  
>>>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>>>> +						  struct virtio_vsock_pkt_info *info,
>>>>>> +						  size_t payload_len,
>>>>>> +						  bool zcopy,
>>>>>> +						  u32 src_cid,
>>>>>> +						  u32 src_port,
>>>>>> +						  u32 dst_cid,
>>>>>> +						  u32 dst_port)
>>>>>> +{
>>>>>> +	struct sk_buff *skb;
>>>>>> +	size_t skb_len;
>>>>>> +
>>>>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>>>> +
>>>>>> +	if (!zcopy)
>>>>>> +		skb_len += payload_len;
>>>>>> +
>>>>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>>> +	if (!skb)
>>>>>> +		return NULL;
>>>>>> +
>>>>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>>>> +				  dst_cid, dst_port,
>>>>>> +				  payload_len);
>>>>>> +
>>>>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>>>>>> +	 */
>>>>>> +	if (vsk)
>>>>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>>>>>> +
>>>>>> +	if (info->msg && payload_len > 0) {
>>>>>> +		int err;
>>>>>> +
>>>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>>>> +		if (err)
>>>>>> +			goto out;
>>>>>> +
>>>>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>>>> +
>>>>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>> +
>>>>>> +			if (info->msg->msg_flags & MSG_EOR)
>>>>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>>> +		}
>>>>>> +	}
>>>>>> +
>>>>>> +	if (info->reply)
>>>>>> +		virtio_vsock_skb_set_reply(skb);
>>>>>> +
>>>>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>>> +					 dst_cid, dst_port,
>>>>>> +					 payload_len,
>>>>>> +					 info->type,
>>>>>> +					 info->op,
>>>>>> +					 info->flags);
>>>>>> +
>>>>>> +	return skb;
>>>>>> +out:
>>>>>> +	kfree_skb(skb);
>>>>>> +	return NULL;
>>>>>> +}
>>>>>> +
>>>>>>  /* This function can only be used on connecting/connected sockets,
>>>>>>   * since a socket assigned to a transport is required.
>>>>>>   *
>>>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>  					  struct virtio_vsock_pkt_info *info)
>>>>>>  {
>>>>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>>>>>  	const struct virtio_transport *t_ops;
>>>>>>  	struct virtio_vsock_sock *vvs;
>>>>>>  	u32 pkt_len = info->pkt_len;
>>>>>> +	bool can_zcopy = false;
>>>>>>  	u32 rest_len;
>>>>>>  	int ret;
>>>>>>  
>>>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>>>  		return pkt_len;
>>>>>>  
>>>>>> +	if (info->msg) {
>>>>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>>>> +		 * there is no MSG_ZEROCOPY flag set.
>>>>>> +		 */
>>>>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>> +
>>>>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>>>> +
>>>>>> +		if (can_zcopy)
>>>>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>>>> +	}
>>>>>> +
>>>>>>  	rest_len = pkt_len;
>>>>>>  
>>>>>>  	do {
>>>>>>  		struct sk_buff *skb;
>>>>>>  		size_t skb_len;
>>>>>>  
>>>>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>>>> +		skb_len = min(max_skb_len, rest_len);
>>>>>>  
>>>>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>>>>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>>>  						 src_cid, src_port,
>>>>>>  						 dst_cid, dst_port);
>>>>>>  		if (!skb) {
>>>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>  			break;
>>>>>>  		}
>>>>>>  
>>>>>> +		/* This is last skb to send this portion of data. */
>>>>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>>>> +							    info->msg,
>>>>>> +							    can_zcopy)) {
>>>>>> +				ret = -ENOMEM;
>>>>>> +				break;
>>>>>> +			}
>>>>>> +		}
>>>>>> +
>>>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>>>  
>>>>>>  		ret = t_ops->send_pkt(skb);
>>>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>>>  	if (!t)
>>>>>>  		return -ENOTCONN;
>>>>>>  
>>>>>> -	reply = virtio_transport_alloc_skb(&info, 0,
>>>>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>>>  					   le64_to_cpu(hdr->src_cid),
>>>>>> -- 
>>>>>> 2.25.1
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:36             ` Michael S. Tsirkin
@ 2023-07-25 13:35               ` Arseniy Krasnov
  0 siblings, 0 replies; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-25 13:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa



On 25.07.2023 16:36, Michael S. Tsirkin wrote:
> On Tue, Jul 25, 2023 at 04:28:14PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 16:22, Michael S. Tsirkin wrote:
>>> On Tue, Jul 25, 2023 at 04:04:13PM +0300, Arseniy Krasnov wrote:
>>>>
>>>>
>>>> On 25.07.2023 14:50, Michael S. Tsirkin wrote:
>>>>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>>>>>
>>>>>>
>>>>>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>>>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>>>>> options and transport allows zerocopy), then non-linear skb will be
>>>>>>> created and filled with the pages of user's buffer. Pages of user's
>>>>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>>>>> this patch does is replace type of skb owning: instead of calling
>>>>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>>>>
>>>>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>>>>> ---
>>>>>>>  Changelog:
>>>>>>>  v5(big patchset) -> v1:
>>>>>>>   * Refactorings of 'if' conditions.
>>>>>>>   * Remove extra blank line.
>>>>>>>   * Remove 'frag_off' field unneeded init.
>>>>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>>>>     and non-linear skb with provided data.
>>>>>>>  v1 -> v2:
>>>>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>>>>  v2 -> v3:
>>>>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>>>>     If this callback is not set in transport - transport allows to send
>>>>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>>>>     as in vhost to support partial send of current skb). This callback
>>>>>>>     will be enabled only for G2H path. For details pls see comment 
>>>>>>>     'Check that tx queue...' below.
>>>>>>>
>>>>>>>  include/net/af_vsock.h                  |   3 +
>>>>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>>>>> --- a/include/net/af_vsock.h
>>>>>>> +++ b/include/net/af_vsock.h
>>>>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>>>>  
>>>>>>>  	/* Read a single skb */
>>>>>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>>>>> +
>>>>>>> +	/* Zero-copy. */
>>>>>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>>>>  };
>>>>>>>  
>>>>>>>  /**** CORE ****/
>>>>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>>>>  }
>>>>>>>  
>>>>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>>>>> +{
>>>>>>> +	struct virtio_vsock *vsock;
>>>>>>> +	bool res = false;
>>>>>>> +
>>>>>>> +	rcu_read_lock();
>>>>>>> +
>>>>>>> +	vsock = rcu_dereference(the_virtio_vsock);
>>>>>>> +	if (vsock) {
>>>>>>> +		struct virtqueue *vq;
>>>>>>> +		int iov_pages;
>>>>>>> +
>>>>>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
>>>>>>> +
>>>>>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>>>>> +
>>>>>>> +		/* Check that tx queue is large enough to keep whole
>>>>>>> +		 * data to send. This is needed, because when there is
>>>>>>> +		 * not enough free space in the queue, current skb to
>>>>>>> +		 * send will be reinserted to the head of tx list of
>>>>>>> +		 * the socket to retry transmission later, so if skb
>>>>>>> +		 * is bigger than whole queue, it will be reinserted
>>>>>>> +		 * again and again, thus blocking other skbs to be sent.
>>>>>>> +		 * Each page of the user provided buffer will be added
>>>>>>> +		 * as a single buffer to the tx virtqueue, so compare
>>>>>>> +		 * number of pages against maximum capacity of the queue.
>>>>>>> +		 * +1 means buffer for the packet header.
>>>>>>> +		 */
>>>>>>> +		if (iov_pages + 1 <= vq->num_max)
>>>>>>
>>>>>> I think this check is actual only for case one we don't have indirect buffer feature.
>>>>>> With indirect mode whole data to send will be packed into one indirect buffer.
>>>>>>
>>>>>> Thanks, Arseniy
>>>>>
>>>>> Actually the reverse. With indirect you are limited to num_max.
>>>>> Without you are limited to whatever space is left in the
>>>>> queue (which you did not check here, so you should).
>>>>
>>>> I mean that with indirect, we only need one buffer, and we can just wait
>>>> for enough space - for this single buffer ( as we discussed a little bit before).
>>>> But if indirect buffer is not supported - we need that whole packet must fit
>>>> to the size of tx queue - otherwise it never be transmitted.
>>>>
>>>> Thanks, Arseniy
>>>
>>>
>>> yes but according to virtio spec it's illegal to add s/g that is bigger
>>> than queue size.
>>
>> Aah, so even in case of indirect buffers feature, buffer descriptors stored in memory
>> pointed by indirect buffer must be accounted against queue size ?
>>
>> Thanks, Arseniy
> 
> a single indirect buffer can't exceed vq size.

I see, so i guess right way is to compare length of data "in buffers"
again 'num_max', as it is implemented. And we should not refer to whether
vq support indirect buffering or not.

Thanks, Arseniy

> 
> 
>>>
>>>>>
>>>>>
>>>>>>> +			res = true;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	rcu_read_unlock();
>>>>>
>>>>> Just curious:
>>>>> is the point of all this RCU dance to allow vsock
>>>>> to change from under us? then why is it ok to
>>>>> have it change? the virtio_transport_msgzerocopy_check_iov
>>>>> will then refer to the old vsock ...
>>>>>
>>>>>
>>>>>>> +
>>>>>>> +	return res;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>>>>  
>>>>>>>  static struct virtio_transport virtio_transport = {
>>>>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>>>>  
>>>>>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
>>>>>>> +
>>>>>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
>>>>>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
>>>>>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
>>>>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>>>>  	return container_of(t, struct virtio_transport, transport);
>>>>>>>  }
>>>>>>>  
>>>>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>>>>> - *
>>>>>>> - * If NULL is returned, errp is set to a negative errno.
>>>>>>> - */
>>>>>>> -static struct sk_buff *
>>>>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>>>> -			   size_t len,
>>>>>>> -			   u32 src_cid,
>>>>>>> -			   u32 src_port,
>>>>>>> -			   u32 dst_cid,
>>>>>>> -			   u32 dst_port)
>>>>>>> -{
>>>>>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>>>>> -	struct virtio_vsock_hdr *hdr;
>>>>>>> -	struct sk_buff *skb;
>>>>>>> -	void *payload;
>>>>>>> -	int err;
>>>>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>>>>> +				       size_t max_to_send)
>>>>>>> +{
>>>>>>> +	const struct vsock_transport *t;
>>>>>>> +	struct iov_iter *iov_iter;
>>>>>>>  
>>>>>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>>>> -	if (!skb)
>>>>>>> -		return NULL;
>>>>>>> +	if (!info->msg)
>>>>>>> +		return false;
>>>>>>>  
>>>>>>> -	hdr = virtio_vsock_hdr(skb);
>>>>>>> -	hdr->type	= cpu_to_le16(info->type);
>>>>>>> -	hdr->op		= cpu_to_le16(info->op);
>>>>>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
>>>>>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>>>>> -	hdr->src_port	= cpu_to_le32(src_port);
>>>>>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
>>>>>>> -	hdr->flags	= cpu_to_le32(info->flags);
>>>>>>> -	hdr->len	= cpu_to_le32(len);
>>>>>>> +	iov_iter = &info->msg->msg_iter;
>>>>>>>  
>>>>>>> -	if (info->msg && len > 0) {
>>>>>>> -		payload = skb_put(skb, len);
>>>>>>> -		err = memcpy_from_msg(payload, info->msg, len);
>>>>>>> -		if (err)
>>>>>>> -			goto out;
>>>>>>> +	t = vsock_core_get_transport(info->vsk);
>>>>>>>  
>>>>>>> -		if (msg_data_left(info->msg) == 0 &&
>>>>>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>>> +	if (t->msgzerocopy_check_iov &&
>>>>>>> +	    !t->msgzerocopy_check_iov(iov_iter))
>>>>>>> +		return false;
>>>>>>>  
>>>>>>> -			if (info->msg->msg_flags & MSG_EOR)
>>>>>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>>>> -		}
>>>>>>> +	/* Data is simple buffer. */
>>>>>>> +	if (iter_is_ubuf(iov_iter))
>>>>>>> +		return true;
>>>>>>> +
>>>>>>> +	if (!iter_is_iovec(iov_iter))
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	if (iov_iter->iov_offset)
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	/* We can't send whole iov. */
>>>>>>> +	if (iov_iter->count > max_to_send)
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	return true;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>>>>> +					   struct sk_buff *skb,
>>>>>>> +					   struct msghdr *msg,
>>>>>>> +					   bool zerocopy)
>>>>>>> +{
>>>>>>> +	struct ubuf_info *uarg;
>>>>>>> +
>>>>>>> +	if (msg->msg_ubuf) {
>>>>>>> +		uarg = msg->msg_ubuf;
>>>>>>> +		net_zcopy_get(uarg);
>>>>>>> +	} else {
>>>>>>> +		struct iov_iter *iter = &msg->msg_iter;
>>>>>>> +		struct ubuf_info_msgzc *uarg_zc;
>>>>>>> +		int len;
>>>>>>> +
>>>>>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>>>>> +		 * checked before.
>>>>>>> +		 */
>>>>>>> +		if (iter_is_iovec(iter))
>>>>>>> +			len = iov_length(iter->__iov, iter->nr_segs);
>>>>>>> +		else
>>>>>>> +			len = iter->count;
>>>>>>> +
>>>>>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>>>>> +					    len,
>>>>>>> +					    NULL);
>>>>>>> +		if (!uarg)
>>>>>>> +			return -1;
>>>>>>> +
>>>>>>> +		uarg_zc = uarg_to_msgzc(uarg);
>>>>>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>>>>  	}
>>>>>>>  
>>>>>>> -	if (info->reply)
>>>>>>> -		virtio_vsock_skb_set_reply(skb);
>>>>>>> +	skb_zcopy_init(skb, uarg);
>>>>>>>  
>>>>>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>>>> -					 dst_cid, dst_port,
>>>>>>> -					 len,
>>>>>>> -					 info->type,
>>>>>>> -					 info->op,
>>>>>>> -					 info->flags);
>>>>>>> +	return 0;
>>>>>>> +}
>>>>>>>  
>>>>>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>>>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>>>>> -		goto out;
>>>>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>>>> +				     struct virtio_vsock_pkt_info *info,
>>>>>>> +				     size_t len,
>>>>>>> +				     bool zcopy)
>>>>>>> +{
>>>>>>> +	if (zcopy) {
>>>>>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>>>>> +					      &info->msg->msg_iter,
>>>>>>> +					      len);
>>>>>>> +	} else {
>>>>>>> +		void *payload;
>>>>>>> +		int err;
>>>>>>> +
>>>>>>> +		payload = skb_put(skb, len);
>>>>>>> +		err = memcpy_from_msg(payload, info->msg, len);
>>>>>>> +		if (err)
>>>>>>> +			return -1;
>>>>>>> +
>>>>>>> +		if (msg_data_left(info->msg))
>>>>>>> +			return 0;
>>>>>>> +
>>>>>>> +		return 0;
>>>>>>>  	}
>>>>>>> +}
>>>>>>>  
>>>>>>> -	return skb;
>>>>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>>>>> +				      struct virtio_vsock_pkt_info *info,
>>>>>>> +				      u32 src_cid,
>>>>>>> +				      u32 src_port,
>>>>>>> +				      u32 dst_cid,
>>>>>>> +				      u32 dst_port,
>>>>>>> +				      size_t len)
>>>>>>> +{
>>>>>>> +	struct virtio_vsock_hdr *hdr;
>>>>>>>  
>>>>>>> -out:
>>>>>>> -	kfree_skb(skb);
>>>>>>> -	return NULL;
>>>>>>> +	hdr = virtio_vsock_hdr(skb);
>>>>>>> +	hdr->type	= cpu_to_le16(info->type);
>>>>>>> +	hdr->op		= cpu_to_le16(info->op);
>>>>>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
>>>>>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
>>>>>>> +	hdr->src_port	= cpu_to_le32(src_port);
>>>>>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
>>>>>>> +	hdr->flags	= cpu_to_le32(info->flags);
>>>>>>> +	hdr->len	= cpu_to_le32(len);
>>>>>>>  }
>>>>>>>  
>>>>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>>>>> +						  struct virtio_vsock_pkt_info *info,
>>>>>>> +						  size_t payload_len,
>>>>>>> +						  bool zcopy,
>>>>>>> +						  u32 src_cid,
>>>>>>> +						  u32 src_port,
>>>>>>> +						  u32 dst_cid,
>>>>>>> +						  u32 dst_port)
>>>>>>> +{
>>>>>>> +	struct sk_buff *skb;
>>>>>>> +	size_t skb_len;
>>>>>>> +
>>>>>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>>>>> +
>>>>>>> +	if (!zcopy)
>>>>>>> +		skb_len += payload_len;
>>>>>>> +
>>>>>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>>>> +	if (!skb)
>>>>>>> +		return NULL;
>>>>>>> +
>>>>>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>>>>> +				  dst_cid, dst_port,
>>>>>>> +				  payload_len);
>>>>>>> +
>>>>>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>>>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
>>>>>>> +	 */
>>>>>>> +	if (vsk)
>>>>>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
>>>>>>> +
>>>>>>> +	if (info->msg && payload_len > 0) {
>>>>>>> +		int err;
>>>>>>> +
>>>>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>>>>> +		if (err)
>>>>>>> +			goto out;
>>>>>>> +
>>>>>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>>>>> +
>>>>>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>>> +
>>>>>>> +			if (info->msg->msg_flags & MSG_EOR)
>>>>>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>>>> +		}
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	if (info->reply)
>>>>>>> +		virtio_vsock_skb_set_reply(skb);
>>>>>>> +
>>>>>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>>>> +					 dst_cid, dst_port,
>>>>>>> +					 payload_len,
>>>>>>> +					 info->type,
>>>>>>> +					 info->op,
>>>>>>> +					 info->flags);
>>>>>>> +
>>>>>>> +	return skb;
>>>>>>> +out:
>>>>>>> +	kfree_skb(skb);
>>>>>>> +	return NULL;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /* This function can only be used on connecting/connected sockets,
>>>>>>>   * since a socket assigned to a transport is required.
>>>>>>>   *
>>>>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>>  					  struct virtio_vsock_pkt_info *info)
>>>>>>>  {
>>>>>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>>>>  	u32 src_cid, src_port, dst_cid, dst_port;
>>>>>>>  	const struct virtio_transport *t_ops;
>>>>>>>  	struct virtio_vsock_sock *vvs;
>>>>>>>  	u32 pkt_len = info->pkt_len;
>>>>>>> +	bool can_zcopy = false;
>>>>>>>  	u32 rest_len;
>>>>>>>  	int ret;
>>>>>>>  
>>>>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>>>>  		return pkt_len;
>>>>>>>  
>>>>>>> +	if (info->msg) {
>>>>>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>>>>> +		 * there is no MSG_ZEROCOPY flag set.
>>>>>>> +		 */
>>>>>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>>>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>>> +
>>>>>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>>>>> +
>>>>>>> +		if (can_zcopy)
>>>>>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>>>>> +	}
>>>>>>> +
>>>>>>>  	rest_len = pkt_len;
>>>>>>>  
>>>>>>>  	do {
>>>>>>>  		struct sk_buff *skb;
>>>>>>>  		size_t skb_len;
>>>>>>>  
>>>>>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>>>>> +		skb_len = min(max_skb_len, rest_len);
>>>>>>>  
>>>>>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
>>>>>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>>>>  						 src_cid, src_port,
>>>>>>>  						 dst_cid, dst_port);
>>>>>>>  		if (!skb) {
>>>>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>>  			break;
>>>>>>>  		}
>>>>>>>  
>>>>>>> +		/* This is last skb to send this portion of data. */
>>>>>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>>>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>>>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>>>>> +							    info->msg,
>>>>>>> +							    can_zcopy)) {
>>>>>>> +				ret = -ENOMEM;
>>>>>>> +				break;
>>>>>>> +			}
>>>>>>> +		}
>>>>>>> +
>>>>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>>>>  
>>>>>>>  		ret = t_ops->send_pkt(skb);
>>>>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>>>>  	if (!t)
>>>>>>>  		return -ENOTCONN;
>>>>>>>  
>>>>>>> -	reply = virtio_transport_alloc_skb(&info, 0,
>>>>>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>>>>  					   le64_to_cpu(hdr->src_cid),
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 13:28           ` Arseniy Krasnov
@ 2023-07-25 13:36             ` Michael S. Tsirkin
  2023-07-25 13:35               ` Arseniy Krasnov
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2023-07-25 13:36 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, kvm, virtualization, netdev, linux-kernel,
	kernel, oxffffaa

On Tue, Jul 25, 2023 at 04:28:14PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 25.07.2023 16:22, Michael S. Tsirkin wrote:
> > On Tue, Jul 25, 2023 at 04:04:13PM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 25.07.2023 14:50, Michael S. Tsirkin wrote:
> >>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
> >>>>
> >>>>
> >>>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
> >>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
> >>>>> flag is set and zerocopy transmission is possible (enabled in socket
> >>>>> options and transport allows zerocopy), then non-linear skb will be
> >>>>> created and filled with the pages of user's buffer. Pages of user's
> >>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
> >>>>> this patch does is replace type of skb owning: instead of calling
> >>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
> >>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
> >>>>> of socket, so to decrease this field correctly proper skb destructor is
> >>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
> >>>>>
> >>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
> >>>>> ---
> >>>>>  Changelog:
> >>>>>  v5(big patchset) -> v1:
> >>>>>   * Refactorings of 'if' conditions.
> >>>>>   * Remove extra blank line.
> >>>>>   * Remove 'frag_off' field unneeded init.
> >>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
> >>>>>     and non-linear skb with provided data.
> >>>>>  v1 -> v2:
> >>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
> >>>>>  v2 -> v3:
> >>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
> >>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
> >>>>>     If this callback is not set in transport - transport allows to send
> >>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
> >>>>>     then zerocopy is allowed. Reason of this callback is that in case of
> >>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
> >>>>>     skb must fit to the size of the virtio queue to be sent in a single
> >>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
> >>>>>     as in vhost to support partial send of current skb). This callback
> >>>>>     will be enabled only for G2H path. For details pls see comment 
> >>>>>     'Check that tx queue...' below.
> >>>>>
> >>>>>  include/net/af_vsock.h                  |   3 +
> >>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
> >>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
> >>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
> >>>>>
> >>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >>>>> index 0e7504a42925..a6b346eeeb8e 100644
> >>>>> --- a/include/net/af_vsock.h
> >>>>> +++ b/include/net/af_vsock.h
> >>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
> >>>>>  
> >>>>>  	/* Read a single skb */
> >>>>>  	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> >>>>> +
> >>>>> +	/* Zero-copy. */
> >>>>> +	bool (*msgzerocopy_check_iov)(const struct iov_iter *);
> >>>>>  };
> >>>>>  
> >>>>>  /**** CORE ****/
> >>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >>>>> index 7bbcc8093e51..23cb8ed638c4 100644
> >>>>> --- a/net/vmw_vsock/virtio_transport.c
> >>>>> +++ b/net/vmw_vsock/virtio_transport.c
> >>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> >>>>>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> >>>>>  }
> >>>>>  
> >>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
> >>>>> +{
> >>>>> +	struct virtio_vsock *vsock;
> >>>>> +	bool res = false;
> >>>>> +
> >>>>> +	rcu_read_lock();
> >>>>> +
> >>>>> +	vsock = rcu_dereference(the_virtio_vsock);
> >>>>> +	if (vsock) {
> >>>>> +		struct virtqueue *vq;
> >>>>> +		int iov_pages;
> >>>>> +
> >>>>> +		vq = vsock->vqs[VSOCK_VQ_TX];
> >>>>> +
> >>>>> +		iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
> >>>>> +
> >>>>> +		/* Check that tx queue is large enough to keep whole
> >>>>> +		 * data to send. This is needed, because when there is
> >>>>> +		 * not enough free space in the queue, current skb to
> >>>>> +		 * send will be reinserted to the head of tx list of
> >>>>> +		 * the socket to retry transmission later, so if skb
> >>>>> +		 * is bigger than whole queue, it will be reinserted
> >>>>> +		 * again and again, thus blocking other skbs to be sent.
> >>>>> +		 * Each page of the user provided buffer will be added
> >>>>> +		 * as a single buffer to the tx virtqueue, so compare
> >>>>> +		 * number of pages against maximum capacity of the queue.
> >>>>> +		 * +1 means buffer for the packet header.
> >>>>> +		 */
> >>>>> +		if (iov_pages + 1 <= vq->num_max)
> >>>>
> >>>> I think this check is actual only for case one we don't have indirect buffer feature.
> >>>> With indirect mode whole data to send will be packed into one indirect buffer.
> >>>>
> >>>> Thanks, Arseniy
> >>>
> >>> Actually the reverse. With indirect you are limited to num_max.
> >>> Without you are limited to whatever space is left in the
> >>> queue (which you did not check here, so you should).
> >>
> >> I mean that with indirect, we only need one buffer, and we can just wait
> >> for enough space - for this single buffer ( as we discussed a little bit before).
> >> But if indirect buffer is not supported - we need that whole packet must fit
> >> to the size of tx queue - otherwise it never be transmitted.
> >>
> >> Thanks, Arseniy
> > 
> > 
> > yes but according to virtio spec it's illegal to add s/g that is bigger
> > than queue size.
> 
> Aah, so even in case of indirect buffers feature, buffer descriptors stored in memory
> pointed by indirect buffer must be accounted against queue size ?
> 
> Thanks, Arseniy

a single indirect buffer can't exceed vq size.


> > 
> >>>
> >>>
> >>>>> +			res = true;
> >>>>> +	}
> >>>>> +
> >>>>> +	rcu_read_unlock();
> >>>
> >>> Just curious:
> >>> is the point of all this RCU dance to allow vsock
> >>> to change from under us? then why is it ok to
> >>> have it change? the virtio_transport_msgzerocopy_check_iov
> >>> will then refer to the old vsock ...
> >>>
> >>>
> >>>>> +
> >>>>> +	return res;
> >>>>> +}
> >>>>> +
> >>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
> >>>>>  
> >>>>>  static struct virtio_transport virtio_transport = {
> >>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
> >>>>>  		.seqpacket_allow          = virtio_transport_seqpacket_allow,
> >>>>>  		.seqpacket_has_data       = virtio_transport_seqpacket_has_data,
> >>>>>  
> >>>>> +		.msgzerocopy_check_iov	  = virtio_transport_msgzerocopy_check_iov,
> >>>>> +
> >>>>>  		.notify_poll_in           = virtio_transport_notify_poll_in,
> >>>>>  		.notify_poll_out          = virtio_transport_notify_poll_out,
> >>>>>  		.notify_recv_init         = virtio_transport_notify_recv_init,
> >>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>>>> index 26a4d10da205..e4e3d541aff4 100644
> >>>>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> >>>>>  	return container_of(t, struct virtio_transport, transport);
> >>>>>  }
> >>>>>  
> >>>>> -/* Returns a new packet on success, otherwise returns NULL.
> >>>>> - *
> >>>>> - * If NULL is returned, errp is set to a negative errno.
> >>>>> - */
> >>>>> -static struct sk_buff *
> >>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> >>>>> -			   size_t len,
> >>>>> -			   u32 src_cid,
> >>>>> -			   u32 src_port,
> >>>>> -			   u32 dst_cid,
> >>>>> -			   u32 dst_port)
> >>>>> -{
> >>>>> -	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
> >>>>> -	struct virtio_vsock_hdr *hdr;
> >>>>> -	struct sk_buff *skb;
> >>>>> -	void *payload;
> >>>>> -	int err;
> >>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
> >>>>> +				       size_t max_to_send)
> >>>>> +{
> >>>>> +	const struct vsock_transport *t;
> >>>>> +	struct iov_iter *iov_iter;
> >>>>>  
> >>>>> -	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >>>>> -	if (!skb)
> >>>>> -		return NULL;
> >>>>> +	if (!info->msg)
> >>>>> +		return false;
> >>>>>  
> >>>>> -	hdr = virtio_vsock_hdr(skb);
> >>>>> -	hdr->type	= cpu_to_le16(info->type);
> >>>>> -	hdr->op		= cpu_to_le16(info->op);
> >>>>> -	hdr->src_cid	= cpu_to_le64(src_cid);
> >>>>> -	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >>>>> -	hdr->src_port	= cpu_to_le32(src_port);
> >>>>> -	hdr->dst_port	= cpu_to_le32(dst_port);
> >>>>> -	hdr->flags	= cpu_to_le32(info->flags);
> >>>>> -	hdr->len	= cpu_to_le32(len);
> >>>>> +	iov_iter = &info->msg->msg_iter;
> >>>>>  
> >>>>> -	if (info->msg && len > 0) {
> >>>>> -		payload = skb_put(skb, len);
> >>>>> -		err = memcpy_from_msg(payload, info->msg, len);
> >>>>> -		if (err)
> >>>>> -			goto out;
> >>>>> +	t = vsock_core_get_transport(info->vsk);
> >>>>>  
> >>>>> -		if (msg_data_left(info->msg) == 0 &&
> >>>>> -		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >>>>> -			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >>>>> +	if (t->msgzerocopy_check_iov &&
> >>>>> +	    !t->msgzerocopy_check_iov(iov_iter))
> >>>>> +		return false;
> >>>>>  
> >>>>> -			if (info->msg->msg_flags & MSG_EOR)
> >>>>> -				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >>>>> -		}
> >>>>> +	/* Data is simple buffer. */
> >>>>> +	if (iter_is_ubuf(iov_iter))
> >>>>> +		return true;
> >>>>> +
> >>>>> +	if (!iter_is_iovec(iov_iter))
> >>>>> +		return false;
> >>>>> +
> >>>>> +	if (iov_iter->iov_offset)
> >>>>> +		return false;
> >>>>> +
> >>>>> +	/* We can't send whole iov. */
> >>>>> +	if (iov_iter->count > max_to_send)
> >>>>> +		return false;
> >>>>> +
> >>>>> +	return true;
> >>>>> +}
> >>>>> +
> >>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
> >>>>> +					   struct sk_buff *skb,
> >>>>> +					   struct msghdr *msg,
> >>>>> +					   bool zerocopy)
> >>>>> +{
> >>>>> +	struct ubuf_info *uarg;
> >>>>> +
> >>>>> +	if (msg->msg_ubuf) {
> >>>>> +		uarg = msg->msg_ubuf;
> >>>>> +		net_zcopy_get(uarg);
> >>>>> +	} else {
> >>>>> +		struct iov_iter *iter = &msg->msg_iter;
> >>>>> +		struct ubuf_info_msgzc *uarg_zc;
> >>>>> +		int len;
> >>>>> +
> >>>>> +		/* Only ITER_IOVEC or ITER_UBUF are allowed and
> >>>>> +		 * checked before.
> >>>>> +		 */
> >>>>> +		if (iter_is_iovec(iter))
> >>>>> +			len = iov_length(iter->__iov, iter->nr_segs);
> >>>>> +		else
> >>>>> +			len = iter->count;
> >>>>> +
> >>>>> +		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> >>>>> +					    len,
> >>>>> +					    NULL);
> >>>>> +		if (!uarg)
> >>>>> +			return -1;
> >>>>> +
> >>>>> +		uarg_zc = uarg_to_msgzc(uarg);
> >>>>> +		uarg_zc->zerocopy = zerocopy ? 1 : 0;
> >>>>>  	}
> >>>>>  
> >>>>> -	if (info->reply)
> >>>>> -		virtio_vsock_skb_set_reply(skb);
> >>>>> +	skb_zcopy_init(skb, uarg);
> >>>>>  
> >>>>> -	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>>>> -					 dst_cid, dst_port,
> >>>>> -					 len,
> >>>>> -					 info->type,
> >>>>> -					 info->op,
> >>>>> -					 info->flags);
> >>>>> +	return 0;
> >>>>> +}
> >>>>>  
> >>>>> -	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> >>>>> -		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> >>>>> -		goto out;
> >>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
> >>>>> +				     struct virtio_vsock_pkt_info *info,
> >>>>> +				     size_t len,
> >>>>> +				     bool zcopy)
> >>>>> +{
> >>>>> +	if (zcopy) {
> >>>>> +		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
> >>>>> +					      &info->msg->msg_iter,
> >>>>> +					      len);
> >>>>> +	} else {
> >>>>> +		void *payload;
> >>>>> +		int err;
> >>>>> +
> >>>>> +		payload = skb_put(skb, len);
> >>>>> +		err = memcpy_from_msg(payload, info->msg, len);
> >>>>> +		if (err)
> >>>>> +			return -1;
> >>>>> +
> >>>>> +		if (msg_data_left(info->msg))
> >>>>> +			return 0;
> >>>>> +
> >>>>> +		return 0;
> >>>>>  	}
> >>>>> +}
> >>>>>  
> >>>>> -	return skb;
> >>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
> >>>>> +				      struct virtio_vsock_pkt_info *info,
> >>>>> +				      u32 src_cid,
> >>>>> +				      u32 src_port,
> >>>>> +				      u32 dst_cid,
> >>>>> +				      u32 dst_port,
> >>>>> +				      size_t len)
> >>>>> +{
> >>>>> +	struct virtio_vsock_hdr *hdr;
> >>>>>  
> >>>>> -out:
> >>>>> -	kfree_skb(skb);
> >>>>> -	return NULL;
> >>>>> +	hdr = virtio_vsock_hdr(skb);
> >>>>> +	hdr->type	= cpu_to_le16(info->type);
> >>>>> +	hdr->op		= cpu_to_le16(info->op);
> >>>>> +	hdr->src_cid	= cpu_to_le64(src_cid);
> >>>>> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> >>>>> +	hdr->src_port	= cpu_to_le32(src_port);
> >>>>> +	hdr->dst_port	= cpu_to_le32(dst_port);
> >>>>> +	hdr->flags	= cpu_to_le32(info->flags);
> >>>>> +	hdr->len	= cpu_to_le32(len);
> >>>>>  }
> >>>>>  
> >>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
> >>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >>>>>  }
> >>>>>  
> >>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
> >>>>> +						  struct virtio_vsock_pkt_info *info,
> >>>>> +						  size_t payload_len,
> >>>>> +						  bool zcopy,
> >>>>> +						  u32 src_cid,
> >>>>> +						  u32 src_port,
> >>>>> +						  u32 dst_cid,
> >>>>> +						  u32 dst_port)
> >>>>> +{
> >>>>> +	struct sk_buff *skb;
> >>>>> +	size_t skb_len;
> >>>>> +
> >>>>> +	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> >>>>> +
> >>>>> +	if (!zcopy)
> >>>>> +		skb_len += payload_len;
> >>>>> +
> >>>>> +	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
> >>>>> +	if (!skb)
> >>>>> +		return NULL;
> >>>>> +
> >>>>> +	virtio_transport_init_hdr(skb, info, src_cid, src_port,
> >>>>> +				  dst_cid, dst_port,
> >>>>> +				  payload_len);
> >>>>> +
> >>>>> +	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
> >>>>> +	 * owner of skb without check to update 'sk_wmem_alloc'.
> >>>>> +	 */
> >>>>> +	if (vsk)
> >>>>> +		skb_set_owner_w(skb, sk_vsock(vsk));
> >>>>> +
> >>>>> +	if (info->msg && payload_len > 0) {
> >>>>> +		int err;
> >>>>> +
> >>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
> >>>>> +		if (err)
> >>>>> +			goto out;
> >>>>> +
> >>>>> +		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> >>>>> +			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
> >>>>> +
> >>>>> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> >>>>> +
> >>>>> +			if (info->msg->msg_flags & MSG_EOR)
> >>>>> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> >>>>> +		}
> >>>>> +	}
> >>>>> +
> >>>>> +	if (info->reply)
> >>>>> +		virtio_vsock_skb_set_reply(skb);
> >>>>> +
> >>>>> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> >>>>> +					 dst_cid, dst_port,
> >>>>> +					 payload_len,
> >>>>> +					 info->type,
> >>>>> +					 info->op,
> >>>>> +					 info->flags);
> >>>>> +
> >>>>> +	return skb;
> >>>>> +out:
> >>>>> +	kfree_skb(skb);
> >>>>> +	return NULL;
> >>>>> +}
> >>>>> +
> >>>>>  /* This function can only be used on connecting/connected sockets,
> >>>>>   * since a socket assigned to a transport is required.
> >>>>>   *
> >>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> >>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>>>  					  struct virtio_vsock_pkt_info *info)
> >>>>>  {
> >>>>> +	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>>>>  	u32 src_cid, src_port, dst_cid, dst_port;
> >>>>>  	const struct virtio_transport *t_ops;
> >>>>>  	struct virtio_vsock_sock *vvs;
> >>>>>  	u32 pkt_len = info->pkt_len;
> >>>>> +	bool can_zcopy = false;
> >>>>>  	u32 rest_len;
> >>>>>  	int ret;
> >>>>>  
> >>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>>>>  		return pkt_len;
> >>>>>  
> >>>>> +	if (info->msg) {
> >>>>> +		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> >>>>> +		 * there is no MSG_ZEROCOPY flag set.
> >>>>> +		 */
> >>>>> +		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> >>>>> +			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> >>>>> +
> >>>>> +		if (info->msg->msg_flags & MSG_ZEROCOPY)
> >>>>> +			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
> >>>>> +
> >>>>> +		if (can_zcopy)
> >>>>> +			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> >>>>> +					    (MAX_SKB_FRAGS * PAGE_SIZE));
> >>>>> +	}
> >>>>> +
> >>>>>  	rest_len = pkt_len;
> >>>>>  
> >>>>>  	do {
> >>>>>  		struct sk_buff *skb;
> >>>>>  		size_t skb_len;
> >>>>>  
> >>>>> -		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
> >>>>> +		skb_len = min(max_skb_len, rest_len);
> >>>>>  
> >>>>> -		skb = virtio_transport_alloc_skb(info, skb_len,
> >>>>> +		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
> >>>>>  						 src_cid, src_port,
> >>>>>  						 dst_cid, dst_port);
> >>>>>  		if (!skb) {
> >>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>>>  			break;
> >>>>>  		}
> >>>>>  
> >>>>> +		/* This is last skb to send this portion of data. */
> >>>>> +		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
> >>>>> +		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
> >>>>> +			if (virtio_transport_init_zcopy_skb(vsk, skb,
> >>>>> +							    info->msg,
> >>>>> +							    can_zcopy)) {
> >>>>> +				ret = -ENOMEM;
> >>>>> +				break;
> >>>>> +			}
> >>>>> +		}
> >>>>> +
> >>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
> >>>>>  
> >>>>>  		ret = t_ops->send_pkt(skb);
> >>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> >>>>>  	if (!t)
> >>>>>  		return -ENOTCONN;
> >>>>>  
> >>>>> -	reply = virtio_transport_alloc_skb(&info, 0,
> >>>>> +	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> >>>>>  					   le64_to_cpu(hdr->dst_cid),
> >>>>>  					   le32_to_cpu(hdr->dst_port),
> >>>>>  					   le64_to_cpu(hdr->src_cid),
> >>>
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-25 12:28           ` Stefano Garzarella
  2023-07-25 12:39             ` Michael S. Tsirkin
@ 2023-07-27  8:32             ` Arseniy Krasnov
  2023-07-27  8:54               ` Stefano Garzarella
  1 sibling, 1 reply; 30+ messages in thread
From: Arseniy Krasnov @ 2023-07-27  8:32 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa



On 25.07.2023 15:28, Stefano Garzarella wrote:
> On Tue, Jul 25, 2023 at 12:16:11PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 25.07.2023 11:46, Arseniy Krasnov wrote:
>>>
>>>
>>> On 25.07.2023 11:43, Stefano Garzarella wrote:
>>>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:
>>>>>
>>>>>
>>>>> On 21.07.2023 00:42, Arseniy Krasnov wrote:
>>>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>>>> flag is set and zerocopy transmission is possible (enabled in socket
>>>>>> options and transport allows zerocopy), then non-linear skb will be
>>>>>> created and filled with the pages of user's buffer. Pages of user's
>>>>>> buffer are locked in memory by 'get_user_pages()'. Second thing that
>>>>>> this patch does is replace type of skb owning: instead of calling
>>>>>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>>>>>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>>>>>> of socket, so to decrease this field correctly proper skb destructor is
>>>>>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>>>>>
>>>>>> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
>>>>>> ---
>>>>>>  Changelog:
>>>>>>  v5(big patchset) -> v1:
>>>>>>   * Refactorings of 'if' conditions.
>>>>>>   * Remove extra blank line.
>>>>>>   * Remove 'frag_off' field unneeded init.
>>>>>>   * Add function 'virtio_transport_fill_skb()' which fills both linear
>>>>>>     and non-linear skb with provided data.
>>>>>>  v1 -> v2:
>>>>>>   * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
>>>>>>  v2 -> v3:
>>>>>>   * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
>>>>>>     provided 'iov_iter' with data could be sent in a zerocopy mode.
>>>>>>     If this callback is not set in transport - transport allows to send
>>>>>>     any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
>>>>>>     then zerocopy is allowed. Reason of this callback is that in case of
>>>>>>     G2H transmission we insert whole skb to the tx virtio queue and such
>>>>>>     skb must fit to the size of the virtio queue to be sent in a single
>>>>>>     iteration (may be tx logic in 'virtio_transport.c' could be reworked
>>>>>>     as in vhost to support partial send of current skb). This callback
>>>>>>     will be enabled only for G2H path. For details pls see comment
>>>>>>     'Check that tx queue...' below.
>>>>>>
>>>>>>  include/net/af_vsock.h                  |   3 +
>>>>>>  net/vmw_vsock/virtio_transport.c        |  39 ++++
>>>>>>  net/vmw_vsock/virtio_transport_common.c | 257 ++++++++++++++++++------
>>>>>>  3 files changed, 241 insertions(+), 58 deletions(-)
>>>>>>
>>>>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>>>> index 0e7504a42925..a6b346eeeb8e 100644
>>>>>> --- a/include/net/af_vsock.h
>>>>>> +++ b/include/net/af_vsock.h
>>>>>> @@ -177,6 +177,9 @@ struct vsock_transport {
>>>>>>
>>>>>>      /* Read a single skb */
>>>>>>      int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>>>> +
>>>>>> +    /* Zero-copy. */
>>>>>> +    bool (*msgzerocopy_check_iov)(const struct iov_iter *);
>>>>>>  };
>>>>>>
>>>>>>  /**** CORE ****/
>>>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>>>> index 7bbcc8093e51..23cb8ed638c4 100644
>>>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>>>> @@ -442,6 +442,43 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>>>>>>      queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>>>>>>  }
>>>>>>
>>>>>> +static bool virtio_transport_msgzerocopy_check_iov(const struct iov_iter *iov)
>>>>>> +{
>>>>>> +    struct virtio_vsock *vsock;
>>>>>> +    bool res = false;
>>>>>> +
>>>>>> +    rcu_read_lock();
>>>>>> +
>>>>>> +    vsock = rcu_dereference(the_virtio_vsock);
>>>>>> +    if (vsock) {
> 
> Just noted, what about the following to reduce the indentation?
> 
>         if (!vsock) {
>             goto out;
>         }
>             ...
>             ...
>     out:
>         rcu_read_unlock();
>         return res;
> 
>>>>>> +        struct virtqueue *vq;
>>>>>> +        int iov_pages;
>>>>>> +
>>>>>> +        vq = vsock->vqs[VSOCK_VQ_TX];
>>>>>> +
>>>>>> +        iov_pages = round_up(iov->count, PAGE_SIZE) / PAGE_SIZE;
>>>>>> +
>>>>>> +        /* Check that tx queue is large enough to keep whole
>>>>>> +         * data to send. This is needed, because when there is
>>>>>> +         * not enough free space in the queue, current skb to
>>>>>> +         * send will be reinserted to the head of tx list of
>>>>>> +         * the socket to retry transmission later, so if skb
>>>>>> +         * is bigger than whole queue, it will be reinserted
>>>>>> +         * again and again, thus blocking other skbs to be sent.
>>>>>> +         * Each page of the user provided buffer will be added
>>>>>> +         * as a single buffer to the tx virtqueue, so compare
>>>>>> +         * number of pages against maximum capacity of the queue.
>>>>>> +         * +1 means buffer for the packet header.
>>>>>> +         */
>>>>>> +        if (iov_pages + 1 <= vq->num_max)
>>>>>
>>>>> I think this check is actual only for case one we don't have indirect buffer feature.
>>>>> With indirect mode whole data to send will be packed into one indirect buffer.
>>>>
>>>> I think so.
>>>> So, should we check also that here?
>>>>
>>>>>
>>>>> Thanks, Arseniy
>>>>>
>>>>>> +            res = true;
>>>>>> +    }
>>>>>> +
>>>>>> +    rcu_read_unlock();
>>>>>> +
>>>>>> +    return res;
>>>>>> +}
>>>>>> +
>>>>>>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>>>>>>
>>>>>>  static struct virtio_transport virtio_transport = {
>>>>>> @@ -475,6 +512,8 @@ static struct virtio_transport virtio_transport = {
>>>>>>          .seqpacket_allow          = virtio_transport_seqpacket_allow,
>>>>>>          .seqpacket_has_data       = virtio_transport_seqpacket_has_data,
>>>>>>
>>>>>> +        .msgzerocopy_check_iov      = virtio_transport_msgzerocopy_check_iov,
>>>>>> +
>>>>>>          .notify_poll_in           = virtio_transport_notify_poll_in,
>>>>>>          .notify_poll_out          = virtio_transport_notify_poll_out,
>>>>>>          .notify_recv_init         = virtio_transport_notify_recv_init,
>>>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>>>> index 26a4d10da205..e4e3d541aff4 100644
>>>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>>>> @@ -37,73 +37,122 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>>>      return container_of(t, struct virtio_transport, transport);
>>>>>>  }
>>>>>>
>>>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>>>> - *
>>>>>> - * If NULL is returned, errp is set to a negative errno.
>>>>>> - */
>>>>>> -static struct sk_buff *
>>>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>>> -               size_t len,
>>>>>> -               u32 src_cid,
>>>>>> -               u32 src_port,
>>>>>> -               u32 dst_cid,
>>>>>> -               u32 dst_port)
>>>>>> -{
>>>>>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>>>> -    struct virtio_vsock_hdr *hdr;
>>>>>> -    struct sk_buff *skb;
>>>>>> -    void *payload;
>>>>>> -    int err;
>>>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>>>> +                       size_t max_to_send)
>>>>>> +{
>>>>>> +    const struct vsock_transport *t;
>>>>>> +    struct iov_iter *iov_iter;
>>>>>>
>>>>>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>>> -    if (!skb)
>>>>>> -        return NULL;
>>>>>> +    if (!info->msg)
>>>>>> +        return false;
>>>>>>
>>>>>> -    hdr = virtio_vsock_hdr(skb);
>>>>>> -    hdr->type    = cpu_to_le16(info->type);
>>>>>> -    hdr->op        = cpu_to_le16(info->op);
>>>>>> -    hdr->src_cid    = cpu_to_le64(src_cid);
>>>>>> -    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>>>>> -    hdr->src_port    = cpu_to_le32(src_port);
>>>>>> -    hdr->dst_port    = cpu_to_le32(dst_port);
>>>>>> -    hdr->flags    = cpu_to_le32(info->flags);
>>>>>> -    hdr->len    = cpu_to_le32(len);
>>>>>> +    iov_iter = &info->msg->msg_iter;
>>>>>>
>>>>>> -    if (info->msg && len > 0) {
>>>>>> -        payload = skb_put(skb, len);
>>>>>> -        err = memcpy_from_msg(payload, info->msg, len);
>>>>>> -        if (err)
>>>>>> -            goto out;
>>>>>> +    t = vsock_core_get_transport(info->vsk);
>>>>>>
>>>>>> -        if (msg_data_left(info->msg) == 0 &&
>>>>>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>> +    if (t->msgzerocopy_check_iov &&
>>>>>> +        !t->msgzerocopy_check_iov(iov_iter))
>>>>>> +        return false;
>>>>
>>>> I'd avoid adding a new transport callback used only internally in virtio
>>>> transports.
>>>
>>> Ok, I see.
>>>
>>>>
>>>> Usually the transport callbacks are used in af_vsock.c, if we need a
>>>> callback just for virtio transports, maybe better to add it in struct
>>>> virtio_vsock_pkt_info or struct virtio_vsock_sock.
>>
>> Hm, may be I just need to move this callback from 'struct vsock_transport' to parent 'struct virtio_transport',
>> after 'send_pkt' callback. In this case:
>> 1) AF_VSOCK part is not touched.
>> 2) This callback stays in 'virtio_transport.c' and is set also in this file.
>>   vhost and loopback are unchanged - only 'send_pkt' still enabled in both
>>   files for these two transports.
> 
> Yep, this could also work!
> 
> Stefano

Great! I'll send this implementation when this patchset for MSG_PEEK will be merged
to net-next as both conflicts with each other.

https://lore.kernel.org/netdev/20230726060150-mutt-send-email-mst@kernel.org/T/#m56f3b850361a412735616145162d2d9df25f6350

Thanks, Arseniy

> 
>>
>> Thanks, Arseniy
>>
>>>>
>>>> Maybe better the last one so we don't have to allocate pointer space
>>>> for each packet and you should reach it via info.
>>>
>>> Ok, thanks, I'll try this way
>>>
>>> Thanks, Arseniy
>>>
>>>>
>>>> Thanks,
>>>> Stefano
>>>>
>>>>>>
>>>>>> -            if (info->msg->msg_flags & MSG_EOR)
>>>>>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>>> -        }
>>>>>> +    /* Data is simple buffer. */
>>>>>> +    if (iter_is_ubuf(iov_iter))
>>>>>> +        return true;
>>>>>> +
>>>>>> +    if (!iter_is_iovec(iov_iter))
>>>>>> +        return false;
>>>>>> +
>>>>>> +    if (iov_iter->iov_offset)
>>>>>> +        return false;
>>>>>> +
>>>>>> +    /* We can't send whole iov. */
>>>>>> +    if (iov_iter->count > max_to_send)
>>>>>> +        return false;
>>>>>> +
>>>>>> +    return true;
>>>>>> +}
>>>>>> +
>>>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>>>> +                       struct sk_buff *skb,
>>>>>> +                       struct msghdr *msg,
>>>>>> +                       bool zerocopy)
>>>>>> +{
>>>>>> +    struct ubuf_info *uarg;
>>>>>> +
>>>>>> +    if (msg->msg_ubuf) {
>>>>>> +        uarg = msg->msg_ubuf;
>>>>>> +        net_zcopy_get(uarg);
>>>>>> +    } else {
>>>>>> +        struct iov_iter *iter = &msg->msg_iter;
>>>>>> +        struct ubuf_info_msgzc *uarg_zc;
>>>>>> +        int len;
>>>>>> +
>>>>>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>>>> +         * checked before.
>>>>>> +         */
>>>>>> +        if (iter_is_iovec(iter))
>>>>>> +            len = iov_length(iter->__iov, iter->nr_segs);
>>>>>> +        else
>>>>>> +            len = iter->count;
>>>>>> +
>>>>>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>>>> +                        len,
>>>>>> +                        NULL);
>>>>>> +        if (!uarg)
>>>>>> +            return -1;
>>>>>> +
>>>>>> +        uarg_zc = uarg_to_msgzc(uarg);
>>>>>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>>>      }
>>>>>>
>>>>>> -    if (info->reply)
>>>>>> -        virtio_vsock_skb_set_reply(skb);
>>>>>> +    skb_zcopy_init(skb, uarg);
>>>>>>
>>>>>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>>> -                     dst_cid, dst_port,
>>>>>> -                     len,
>>>>>> -                     info->type,
>>>>>> -                     info->op,
>>>>>> -                     info->flags);
>>>>>> +    return 0;
>>>>>> +}
>>>>>>
>>>>>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>>>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>>>> -        goto out;
>>>>>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>>> +                     struct virtio_vsock_pkt_info *info,
>>>>>> +                     size_t len,
>>>>>> +                     bool zcopy)
>>>>>> +{
>>>>>> +    if (zcopy) {
>>>>>> +        return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>>>> +                          &info->msg->msg_iter,
>>>>>> +                          len);
>>>>>> +    } else {
>>>>>> +        void *payload;
>>>>>> +        int err;
>>>>>> +
>>>>>> +        payload = skb_put(skb, len);
>>>>>> +        err = memcpy_from_msg(payload, info->msg, len);
>>>>>> +        if (err)
>>>>>> +            return -1;
>>>>>> +
>>>>>> +        if (msg_data_left(info->msg))
>>>>>> +            return 0;
>>>>>> +
>>>>>> +        return 0;
>>>>>>      }
>>>>>> +}
>>>>>>
>>>>>> -    return skb;
>>>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>>>> +                      struct virtio_vsock_pkt_info *info,
>>>>>> +                      u32 src_cid,
>>>>>> +                      u32 src_port,
>>>>>> +                      u32 dst_cid,
>>>>>> +                      u32 dst_port,
>>>>>> +                      size_t len)
>>>>>> +{
>>>>>> +    struct virtio_vsock_hdr *hdr;
>>>>>>
>>>>>> -out:
>>>>>> -    kfree_skb(skb);
>>>>>> -    return NULL;
>>>>>> +    hdr = virtio_vsock_hdr(skb);
>>>>>> +    hdr->type    = cpu_to_le16(info->type);
>>>>>> +    hdr->op        = cpu_to_le16(info->op);
>>>>>> +    hdr->src_cid    = cpu_to_le64(src_cid);
>>>>>> +    hdr->dst_cid    = cpu_to_le64(dst_cid);
>>>>>> +    hdr->src_port    = cpu_to_le32(src_port);
>>>>>> +    hdr->dst_port    = cpu_to_le32(dst_port);
>>>>>> +    hdr->flags    = cpu_to_le32(info->flags);
>>>>>> +    hdr->len    = cpu_to_le32(len);
>>>>>>  }
>>>>>>
>>>>>>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>>>>>> @@ -214,6 +263,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>>          return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>>>  }
>>>>>>
>>>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>>>> +                          struct virtio_vsock_pkt_info *info,
>>>>>> +                          size_t payload_len,
>>>>>> +                          bool zcopy,
>>>>>> +                          u32 src_cid,
>>>>>> +                          u32 src_port,
>>>>>> +                          u32 dst_cid,
>>>>>> +                          u32 dst_port)
>>>>>> +{
>>>>>> +    struct sk_buff *skb;
>>>>>> +    size_t skb_len;
>>>>>> +
>>>>>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>>>> +
>>>>>> +    if (!zcopy)
>>>>>> +        skb_len += payload_len;
>>>>>> +
>>>>>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>>> +    if (!skb)
>>>>>> +        return NULL;
>>>>>> +
>>>>>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>>>> +                  dst_cid, dst_port,
>>>>>> +                  payload_len);
>>>>>> +
>>>>>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>>>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>>>>>> +     */
>>>>>> +    if (vsk)
>>>>>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>>>>>> +
>>>>>> +    if (info->msg && payload_len > 0) {
>>>>>> +        int err;
>>>>>> +
>>>>>> +        err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>>>> +        if (err)
>>>>>> +            goto out;
>>>>>> +
>>>>>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>> +            struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>>>>>> +
>>>>>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>> +
>>>>>> +            if (info->msg->msg_flags & MSG_EOR)
>>>>>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (info->reply)
>>>>>> +        virtio_vsock_skb_set_reply(skb);
>>>>>> +
>>>>>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>>> +                     dst_cid, dst_port,
>>>>>> +                     payload_len,
>>>>>> +                     info->type,
>>>>>> +                     info->op,
>>>>>> +                     info->flags);
>>>>>> +
>>>>>> +    return skb;
>>>>>> +out:
>>>>>> +    kfree_skb(skb);
>>>>>> +    return NULL;
>>>>>> +}
>>>>>> +
>>>>>>  /* This function can only be used on connecting/connected sockets,
>>>>>>   * since a socket assigned to a transport is required.
>>>>>>   *
>>>>>> @@ -222,10 +335,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>>>  static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>                        struct virtio_vsock_pkt_info *info)
>>>>>>  {
>>>>>> +    u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>>>      u32 src_cid, src_port, dst_cid, dst_port;
>>>>>>      const struct virtio_transport *t_ops;
>>>>>>      struct virtio_vsock_sock *vvs;
>>>>>>      u32 pkt_len = info->pkt_len;
>>>>>> +    bool can_zcopy = false;
>>>>>>      u32 rest_len;
>>>>>>      int ret;
>>>>>>
>>>>>> @@ -254,15 +369,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>      if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>>>          return pkt_len;
>>>>>>
>>>>>> +    if (info->msg) {
>>>>>> +        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>>>> +         * there is no MSG_ZEROCOPY flag set.
>>>>>> +         */
>>>>>> +        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>>>> +            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>> +
>>>>>> +        if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>>> +            can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>>>> +
>>>>>> +        if (can_zcopy)
>>>>>> +            max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>>> +                        (MAX_SKB_FRAGS * PAGE_SIZE));
>>>>>> +    }
>>>>>> +
>>>>>>      rest_len = pkt_len;
>>>>>>
>>>>>>      do {
>>>>>>          struct sk_buff *skb;
>>>>>>          size_t skb_len;
>>>>>>
>>>>>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>>>> +        skb_len = min(max_skb_len, rest_len);
>>>>>>
>>>>>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>>>>>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>>>                           src_cid, src_port,
>>>>>>                           dst_cid, dst_port);
>>>>>>          if (!skb) {
>>>>>> @@ -270,6 +400,17 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>>>              break;
>>>>>>          }
>>>>>>
>>>>>> +        /* This is last skb to send this portion of data. */
>>>>>> +        if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>>>>>> +            skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>>>>>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>>>> +                                info->msg,
>>>>>> +                                can_zcopy)) {
>>>>>> +                ret = -ENOMEM;
>>>>>> +                break;
>>>>>> +            }
>>>>>> +        }
>>>>>> +
>>>>>>          virtio_transport_inc_tx_pkt(vvs, skb);
>>>>>>
>>>>>>          ret = t_ops->send_pkt(skb);
>>>>>> @@ -934,7 +1075,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>>>      if (!t)
>>>>>>          return -ENOTCONN;
>>>>>>
>>>>>> -    reply = virtio_transport_alloc_skb(&info, 0,
>>>>>> +    reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>>>>>                         le64_to_cpu(hdr->dst_cid),
>>>>>>                         le32_to_cpu(hdr->dst_port),
>>>>>>                         le64_to_cpu(hdr->src_cid),
>>>>>
>>>>
>>
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support
  2023-07-27  8:32             ` Arseniy Krasnov
@ 2023-07-27  8:54               ` Stefano Garzarella
  0 siblings, 0 replies; 30+ messages in thread
From: Stefano Garzarella @ 2023-07-27  8:54 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman, kvm,
	virtualization, netdev, linux-kernel, kernel, oxffffaa

On Thu, Jul 27, 2023 at 11:32:00AM +0300, Arseniy Krasnov wrote:
>On 25.07.2023 15:28, Stefano Garzarella wrote:
>> On Tue, Jul 25, 2023 at 12:16:11PM +0300, Arseniy Krasnov wrote:
>>> On 25.07.2023 11:46, Arseniy Krasnov wrote:
>>>> On 25.07.2023 11:43, Stefano Garzarella wrote:
>>>>> On Fri, Jul 21, 2023 at 08:09:03AM +0300, Arseniy Krasnov wrote:

[...]

>>>>>>> +    t = vsock_core_get_transport(info->vsk);
>>>>>>>
>>>>>>> -        if (msg_data_left(info->msg) == 0 &&
>>>>>>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>>>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>>>> +    if (t->msgzerocopy_check_iov &&
>>>>>>> +        !t->msgzerocopy_check_iov(iov_iter))
>>>>>>> +        return false;
>>>>>
>>>>> I'd avoid adding a new transport callback used only internally in virtio
>>>>> transports.
>>>>
>>>> Ok, I see.
>>>>
>>>>>
>>>>> Usually the transport callbacks are used in af_vsock.c, if we need a
>>>>> callback just for virtio transports, maybe better to add it in struct
>>>>> virtio_vsock_pkt_info or struct virtio_vsock_sock.
>>>
>>> Hm, may be I just need to move this callback from 'struct vsock_transport' to parent 'struct virtio_transport',
>>> after 'send_pkt' callback. In this case:
>>> 1) AF_VSOCK part is not touched.
>>> 2) This callback stays in 'virtio_transport.c' and is set also in this file.
>>>   vhost and loopback are unchanged - only 'send_pkt' still enabled in both
>>>   files for these two transports.
>>
>> Yep, this could also work!
>>
>> Stefano
>
>Great! I'll send this implementation when this patchset for MSG_PEEK will be merged
>to net-next as both conflicts with each other.
>
>https://lore.kernel.org/netdev/20230726060150-mutt-send-email-mst@kernel.org/T/#m56f3b850361a412735616145162d2d9df25f6350

Ack!

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-07-27  8:55 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20 21:42 [PATCH net-next v3 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations Arseniy Krasnov
2023-07-20 21:42 ` [PATCH net-next v3 1/4] vsock/virtio/vhost: read data from non-linear skb Arseniy Krasnov
2023-07-20 21:42 ` [PATCH net-next v3 2/4] vsock/virtio: support to send " Arseniy Krasnov
2023-07-25  8:17   ` Stefano Garzarella
2023-07-20 21:42 ` [PATCH net-next v3 3/4] vsock/virtio: non-linear skb handling for tap Arseniy Krasnov
2023-07-20 21:42 ` [PATCH net-next v3 4/4] vsock/virtio: MSG_ZEROCOPY flag support Arseniy Krasnov
2023-07-21  5:09   ` Arseniy Krasnov
2023-07-25  8:43     ` Stefano Garzarella
2023-07-25  8:46       ` Arseniy Krasnov
2023-07-25  9:16         ` Arseniy Krasnov
2023-07-25 12:28           ` Stefano Garzarella
2023-07-25 12:39             ` Michael S. Tsirkin
2023-07-25 12:45               ` Stefano Garzarella
2023-07-27  8:32             ` Arseniy Krasnov
2023-07-27  8:54               ` Stefano Garzarella
2023-07-25 11:50     ` Michael S. Tsirkin
2023-07-25 12:53       ` Stefano Garzarella
2023-07-25 13:06         ` Michael S. Tsirkin
2023-07-25 13:21           ` Stefano Garzarella
2023-07-25 13:04       ` Arseniy Krasnov
2023-07-25 13:22         ` Michael S. Tsirkin
2023-07-25 13:28           ` Arseniy Krasnov
2023-07-25 13:36             ` Michael S. Tsirkin
2023-07-25 13:35               ` Arseniy Krasnov
2023-07-25  8:25   ` Michael S. Tsirkin
2023-07-25  8:39     ` Arseniy Krasnov
2023-07-25 11:59       ` Michael S. Tsirkin
2023-07-25 13:10         ` Arseniy Krasnov
2023-07-25 13:23           ` Michael S. Tsirkin
2023-07-25 13:30             ` Arseniy Krasnov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).