linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
@ 2022-08-15 17:56 Bobby Eshleman
  2022-08-15 17:56 ` [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff Bobby Eshleman
                   ` (11 more replies)
  0 siblings, 12 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

Hey everybody,

This series introduces datagrams, packet scheduling, and sk_buff usage
to virtio vsock.

The usage of struct sk_buff benefits users by a) preparing vsock to use
other related systems that require sk_buff, such as sockmap and qdisc,
b) supporting basic congestion control via sock_alloc_send_skb, and c)
reducing copying when delivering packets to TAP.

The socket layer no longer forces errors to be -ENOMEM, as typically
userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
messages are being sent with option MSG_DONTWAIT.

The datagram work is based off previous patches by Jiang Wang[1].

The introduction of datagrams creates a transport layer fairness issue
where datagrams may freely starve streams of queue access. This happens
because, unlike streams, datagrams lack the transactions necessary for
calculating credits and throttling.

Previous proposals introduce changes to the spec to add an additional
virtqueue pair for datagrams[1]. Although this solution works, using
Linux's qdisc for packet scheduling leverages already existing systems,
avoids the need to change the virtio specification, and gives additional
capabilities. The usage of SFQ or fq_codel, for example, may solve the
transport layer starvation problem. It is easy to imagine other use
cases as well. For example, services of varying importance may be
assigned different priorities, and qdisc will apply appropriate
priority-based scheduling. By default, the system default pfifo qdisc is
used. The qdisc may be bypassed and legacy queuing is resumed by simply
setting the virtio-vsock%d network device to state DOWN. This technique
still allows vsock to work with zero-configuration.

In summary, this series introduces these major changes to vsock:

- virtio vsock supports datagrams
- virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
  - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
    which applies the throttling threshold sk_sndbuf.
- The vsock socket layer supports returning errors other than -ENOMEM.
  - This is used to return -EAGAIN when the sk_sndbuf threshold is
    reached.
- virtio vsock uses a net_device, through which qdisc may be used.
 - qdisc allows scheduling policies to be applied to vsock flows.
  - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
    it may avoid datagrams from flooding out stream flows. The benefit
    to this is that additional virtqueues are not needed for datagrams.
  - The net_device and qdisc is bypassed by simply setting the
    net_device state to DOWN.

[1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/

Bobby Eshleman (5):
  vsock: replace virtio_vsock_pkt with sk_buff
  vsock: return errors other than -ENOMEM to socket
  vsock: add netdev to vhost/virtio vsock
  virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
  virtio/vsock: add support for dgram

Jiang Wang (1):
  vsock_test: add tests for vsock dgram

 drivers/vhost/vsock.c                   | 238 ++++----
 include/linux/virtio_vsock.h            |  73 ++-
 include/net/af_vsock.h                  |   2 +
 include/uapi/linux/virtio_vsock.h       |   2 +
 net/vmw_vsock/af_vsock.c                |  30 +-
 net/vmw_vsock/hyperv_transport.c        |   2 +-
 net/vmw_vsock/virtio_transport.c        | 237 +++++---
 net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
 net/vmw_vsock/vmci_transport.c          |   9 +-
 net/vmw_vsock/vsock_loopback.c          |  51 +-
 tools/testing/vsock/util.c              | 105 ++++
 tools/testing/vsock/util.h              |   4 +
 tools/testing/vsock/vsock_test.c        | 195 ++++++
 13 files changed, 1176 insertions(+), 543 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
@ 2022-08-15 17:56 ` Bobby Eshleman
  2022-08-16  2:30   ` Bobby Eshleman
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

This patch replaces virtio_vsock_pkt with sk_buff.

The benefit of this series includes:

* The bug reported @ https://bugzilla.redhat.com/show_bug.cgi?id=2009935
  does not present itself when reasonable sk_sndbuf thresholds are set.
* Using sock_alloc_send_skb() teaches VSOCK to respect
  sk_sndbuf for tunability.
* Eliminates copying for vsock_deliver_tap().
* sk_buff is required for future improvements, such as using socket map.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   | 214 +++++------
 include/linux/virtio_vsock.h            |  60 ++-
 net/vmw_vsock/af_vsock.c                |   1 +
 net/vmw_vsock/virtio_transport.c        | 212 +++++-----
 net/vmw_vsock/virtio_transport_common.c | 491 ++++++++++++------------
 net/vmw_vsock/vsock_loopback.c          |  51 +--
 6 files changed, 517 insertions(+), 512 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 368330417bde..f8601d93d94d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -51,8 +51,7 @@ struct vhost_vsock {
 	struct hlist_node hash;
 
 	struct vhost_work send_pkt_work;
-	spinlock_t send_pkt_list_lock;
-	struct list_head send_pkt_list;	/* host->guest pending packets */
+	struct sk_buff_head send_pkt_queue; /* host->guest pending packets */
 
 	atomic_t queued_replies;
 
@@ -108,7 +107,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 	vhost_disable_notify(&vsock->dev, vq);
 
 	do {
-		struct virtio_vsock_pkt *pkt;
+		struct sk_buff *skb;
+		struct virtio_vsock_hdr *hdr;
 		struct iov_iter iov_iter;
 		unsigned out, in;
 		size_t nbytes;
@@ -116,31 +116,22 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		int head;
 		u32 flags_to_restore = 0;
 
-		spin_lock_bh(&vsock->send_pkt_list_lock);
-		if (list_empty(&vsock->send_pkt_list)) {
-			spin_unlock_bh(&vsock->send_pkt_list_lock);
+		skb = skb_dequeue(&vsock->send_pkt_queue);
+
+		if (!skb) {
 			vhost_enable_notify(&vsock->dev, vq);
 			break;
 		}
 
-		pkt = list_first_entry(&vsock->send_pkt_list,
-				       struct virtio_vsock_pkt, list);
-		list_del_init(&pkt->list);
-		spin_unlock_bh(&vsock->send_pkt_list_lock);
-
 		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 					 &out, &in, NULL, NULL);
 		if (head < 0) {
-			spin_lock_bh(&vsock->send_pkt_list_lock);
-			list_add(&pkt->list, &vsock->send_pkt_list);
-			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			skb_queue_head(&vsock->send_pkt_queue, skb);
 			break;
 		}
 
 		if (head == vq->num) {
-			spin_lock_bh(&vsock->send_pkt_list_lock);
-			list_add(&pkt->list, &vsock->send_pkt_list);
-			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			skb_queue_head(&vsock->send_pkt_queue, skb);
 
 			/* We cannot finish yet if more buffers snuck in while
 			 * re-enabling notify.
@@ -153,26 +144,27 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		}
 
 		if (out) {
-			virtio_transport_free_pkt(pkt);
+			kfree_skb(skb);
 			vq_err(vq, "Expected 0 output buffers, got %u\n", out);
 			break;
 		}
 
 		iov_len = iov_length(&vq->iov[out], in);
-		if (iov_len < sizeof(pkt->hdr)) {
-			virtio_transport_free_pkt(pkt);
+		if (iov_len < sizeof(*hdr)) {
+			kfree_skb(skb);
 			vq_err(vq, "Buffer len [%zu] too small\n", iov_len);
 			break;
 		}
 
 		iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
-		payload_len = pkt->len - pkt->off;
+		payload_len = skb->len - vsock_metadata(skb)->off;
+		hdr = vsock_hdr(skb);
 
 		/* If the packet is greater than the space available in the
 		 * buffer, we split it using multiple buffers.
 		 */
-		if (payload_len > iov_len - sizeof(pkt->hdr)) {
-			payload_len = iov_len - sizeof(pkt->hdr);
+		if (payload_len > iov_len - sizeof(*hdr)) {
+			payload_len = iov_len - sizeof(*hdr);
 
 			/* As we are copying pieces of large packet's buffer to
 			 * small rx buffers, headers of packets in rx queue are
@@ -185,31 +177,31 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 			 * bits set. After initialized header will be copied to
 			 * rx buffer, these required bits will be restored.
 			 */
-			if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
-				pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
+				hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
 				flags_to_restore |= VIRTIO_VSOCK_SEQ_EOM;
 
-				if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR) {
-					pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+				if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) {
+					hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
 					flags_to_restore |= VIRTIO_VSOCK_SEQ_EOR;
 				}
 			}
 		}
 
 		/* Set the correct length in the header */
-		pkt->hdr.len = cpu_to_le32(payload_len);
+		hdr->len = cpu_to_le32(payload_len);
 
-		nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
-		if (nbytes != sizeof(pkt->hdr)) {
-			virtio_transport_free_pkt(pkt);
+		nbytes = copy_to_iter(hdr, sizeof(*hdr), &iov_iter);
+		if (nbytes != sizeof(*hdr)) {
+			kfree_skb(skb);
 			vq_err(vq, "Faulted on copying pkt hdr\n");
 			break;
 		}
 
-		nbytes = copy_to_iter(pkt->buf + pkt->off, payload_len,
+		nbytes = copy_to_iter(skb->data + vsock_metadata(skb)->off, payload_len,
 				      &iov_iter);
 		if (nbytes != payload_len) {
-			virtio_transport_free_pkt(pkt);
+			kfree_skb(skb);
 			vq_err(vq, "Faulted on copying pkt buf\n");
 			break;
 		}
@@ -217,31 +209,28 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 		/* Deliver to monitoring devices all packets that we
 		 * will transmit.
 		 */
-		virtio_transport_deliver_tap_pkt(pkt);
+		virtio_transport_deliver_tap_pkt(skb);
 
-		vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
+		vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
 		added = true;
 
-		pkt->off += payload_len;
+		vsock_metadata(skb)->off += payload_len;
 		total_len += payload_len;
 
 		/* If we didn't send all the payload we can requeue the packet
 		 * to send it with the next available buffer.
 		 */
-		if (pkt->off < pkt->len) {
-			pkt->hdr.flags |= cpu_to_le32(flags_to_restore);
+		if (vsock_metadata(skb)->off < skb->len) {
+			hdr->flags |= cpu_to_le32(flags_to_restore);
 
-			/* We are queueing the same virtio_vsock_pkt to handle
+			/* We are queueing the same skb to handle
 			 * the remaining bytes, and we want to deliver it
 			 * to monitoring devices in the next iteration.
 			 */
-			pkt->tap_delivered = false;
-
-			spin_lock_bh(&vsock->send_pkt_list_lock);
-			list_add(&pkt->list, &vsock->send_pkt_list);
-			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			vsock_metadata(skb)->flags &= ~VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
+			skb_queue_head(&vsock->send_pkt_queue, skb);
 		} else {
-			if (pkt->reply) {
+			if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY) {
 				int val;
 
 				val = atomic_dec_return(&vsock->queued_replies);
@@ -253,7 +242,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 					restart_tx = true;
 			}
 
-			virtio_transport_free_pkt(pkt);
+			consume_skb(skb);
 		}
 	} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
 	if (added)
@@ -278,28 +267,26 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
 }
 
 static int
-vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
+vhost_transport_send_pkt(struct sk_buff *skb)
 {
 	struct vhost_vsock *vsock;
-	int len = pkt->len;
+	int len = skb->len;
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 
 	rcu_read_lock();
 
 	/* Find the vhost_vsock according to guest context id  */
-	vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
+	vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid));
 	if (!vsock) {
 		rcu_read_unlock();
-		virtio_transport_free_pkt(pkt);
+		kfree_skb(skb);
 		return -ENODEV;
 	}
 
-	if (pkt->reply)
+	if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
 		atomic_inc(&vsock->queued_replies);
 
-	spin_lock_bh(&vsock->send_pkt_list_lock);
-	list_add_tail(&pkt->list, &vsock->send_pkt_list);
-	spin_unlock_bh(&vsock->send_pkt_list_lock);
-
+	skb_queue_tail(&vsock->send_pkt_queue, skb);
 	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
 
 	rcu_read_unlock();
@@ -310,10 +297,8 @@ static int
 vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 {
 	struct vhost_vsock *vsock;
-	struct virtio_vsock_pkt *pkt, *n;
 	int cnt = 0;
 	int ret = -ENODEV;
-	LIST_HEAD(freeme);
 
 	rcu_read_lock();
 
@@ -322,20 +307,7 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 	if (!vsock)
 		goto out;
 
-	spin_lock_bh(&vsock->send_pkt_list_lock);
-	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
-		if (pkt->vsk != vsk)
-			continue;
-		list_move(&pkt->list, &freeme);
-	}
-	spin_unlock_bh(&vsock->send_pkt_list_lock);
-
-	list_for_each_entry_safe(pkt, n, &freeme, list) {
-		if (pkt->reply)
-			cnt++;
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
+	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
 
 	if (cnt) {
 		struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
@@ -352,11 +324,12 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 	return ret;
 }
 
-static struct virtio_vsock_pkt *
-vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
+static struct sk_buff *
+vhost_vsock_alloc_skb(struct vhost_virtqueue *vq,
 		      unsigned int out, unsigned int in)
 {
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
+	struct virtio_vsock_hdr *hdr;
 	struct iov_iter iov_iter;
 	size_t nbytes;
 	size_t len;
@@ -366,50 +339,49 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
 		return NULL;
 	}
 
-	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
-	if (!pkt)
+	len = iov_length(vq->iov, out);
+
+	/* len contains both payload and hdr, so only add additional space for metadata */
+	skb = alloc_skb(len + sizeof(struct virtio_vsock_metadata), GFP_KERNEL);
+	if (!skb)
 		return NULL;
 
-	len = iov_length(vq->iov, out);
+	memset(skb->head, 0, sizeof(struct virtio_vsock_metadata));
+	virtio_vsock_skb_reserve(skb);
 	iov_iter_init(&iov_iter, WRITE, vq->iov, out, len);
 
-	nbytes = copy_from_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
-	if (nbytes != sizeof(pkt->hdr)) {
+	hdr = vsock_hdr(skb);
+	nbytes = copy_from_iter(hdr, sizeof(*hdr), &iov_iter);
+	if (nbytes != sizeof(*hdr)) {
 		vq_err(vq, "Expected %zu bytes for pkt->hdr, got %zu bytes\n",
-		       sizeof(pkt->hdr), nbytes);
-		kfree(pkt);
+		       sizeof(*hdr), nbytes);
+		kfree_skb(skb);
 		return NULL;
 	}
 
-	pkt->len = le32_to_cpu(pkt->hdr.len);
+	len = le32_to_cpu(hdr->len);
 
 	/* No payload */
-	if (!pkt->len)
-		return pkt;
+	if (!len)
+		return skb;
 
 	/* The pkt is too big */
-	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
-		kfree(pkt);
+	if (len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+		kfree_skb(skb);
 		return NULL;
 	}
 
-	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
-	if (!pkt->buf) {
-		kfree(pkt);
-		return NULL;
-	}
+	virtio_vsock_skb_rx_put(skb);
 
-	pkt->buf_len = pkt->len;
-
-	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
-	if (nbytes != pkt->len) {
-		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
-		       pkt->len, nbytes);
-		virtio_transport_free_pkt(pkt);
+	nbytes = copy_from_iter(skb->data, len, &iov_iter);
+	if (nbytes != len) {
+		vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
+		       len, nbytes);
+		kfree_skb(skb);
 		return NULL;
 	}
 
-	return pkt;
+	return skb;
 }
 
 /* Is there space left for replies to rx packets? */
@@ -496,7 +468,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 						  poll.work);
 	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
 						 dev);
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
 	int head, pkts = 0, total_len = 0;
 	unsigned int out, in;
 	bool added = false;
@@ -511,6 +483,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 
 	vhost_disable_notify(&vsock->dev, vq);
 	do {
+		struct virtio_vsock_hdr *hdr;
+		u32 len;
+
 		if (!vhost_vsock_more_replies(vsock)) {
 			/* Stop tx until the device processes already
 			 * pending replies.  Leave tx virtqueue
@@ -532,26 +507,29 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 			break;
 		}
 
-		pkt = vhost_vsock_alloc_pkt(vq, out, in);
-		if (!pkt) {
-			vq_err(vq, "Faulted on pkt\n");
+		skb = vhost_vsock_alloc_skb(vq, out, in);
+		if (!skb)
 			continue;
-		}
 
-		total_len += sizeof(pkt->hdr) + pkt->len;
+		len = skb->len;
 
 		/* Deliver to monitoring devices all received packets */
-		virtio_transport_deliver_tap_pkt(pkt);
+		virtio_transport_deliver_tap_pkt(skb);
+
+		hdr = vsock_hdr(skb);
 
 		/* Only accept correctly addressed packets */
-		if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
-		    le64_to_cpu(pkt->hdr.dst_cid) ==
+		if (le64_to_cpu(hdr->src_cid) == vsock->guest_cid &&
+		    le64_to_cpu(hdr->dst_cid) ==
 		    vhost_transport_get_local_cid())
-			virtio_transport_recv_pkt(&vhost_transport, pkt);
+			virtio_transport_recv_pkt(&vhost_transport, skb);
 		else
-			virtio_transport_free_pkt(pkt);
+			kfree_skb(skb);
+
 
-		vhost_add_used(vq, head, 0);
+		len += sizeof(*hdr);
+		vhost_add_used(vq, head, len);
+		total_len += len;
 		added = true;
 	} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
 
@@ -693,8 +671,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 		       VHOST_VSOCK_WEIGHT, true, NULL);
 
 	file->private_data = vsock;
-	spin_lock_init(&vsock->send_pkt_list_lock);
-	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	skb_queue_head_init(&vsock->send_pkt_queue);
 	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
 	return 0;
 
@@ -760,16 +737,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
 	vhost_vsock_flush(vsock);
 	vhost_dev_stop(&vsock->dev);
 
-	spin_lock_bh(&vsock->send_pkt_list_lock);
-	while (!list_empty(&vsock->send_pkt_list)) {
-		struct virtio_vsock_pkt *pkt;
-
-		pkt = list_first_entry(&vsock->send_pkt_list,
-				struct virtio_vsock_pkt, list);
-		list_del_init(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
-	spin_unlock_bh(&vsock->send_pkt_list_lock);
+	skb_queue_purge(&vsock->send_pkt_queue);
 
 	vhost_dev_cleanup(&vsock->dev);
 	kfree(vsock->dev.vqs);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 35d7eedb5e8e..17ed01466875 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -4,9 +4,43 @@
 
 #include <uapi/linux/virtio_vsock.h>
 #include <linux/socket.h>
+#include <vdso/bits.h>
 #include <net/sock.h>
 #include <net/af_vsock.h>
 
+enum virtio_vsock_metadata_flags {
+	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
+	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
+};
+
+/* Used only by the virtio/vhost vsock drivers, not related to protocol */
+struct virtio_vsock_metadata {
+	size_t off;
+	enum virtio_vsock_metadata_flags flags;
+};
+
+#define vsock_hdr(skb) \
+	((struct virtio_vsock_hdr *) \
+	 ((void *)skb->head + sizeof(struct virtio_vsock_metadata)))
+
+#define vsock_metadata(skb) \
+	((struct virtio_vsock_metadata *)skb->head)
+
+#define virtio_vsock_skb_reserve(skb)	\
+	skb_reserve(skb,	\
+		sizeof(struct virtio_vsock_metadata) + \
+		sizeof(struct virtio_vsock_hdr))
+
+static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
+{
+	u32 len;
+
+	len = le32_to_cpu(vsock_hdr(skb)->len);
+
+	if (len > 0)
+		skb_put(skb, len);
+}
+
 #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
 #define VIRTIO_VSOCK_MAX_BUF_SIZE		0xFFFFFFFFUL
 #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
@@ -35,23 +69,10 @@ struct virtio_vsock_sock {
 	u32 last_fwd_cnt;
 	u32 rx_bytes;
 	u32 buf_alloc;
-	struct list_head rx_queue;
+	struct sk_buff_head rx_queue;
 	u32 msg_count;
 };
 
-struct virtio_vsock_pkt {
-	struct virtio_vsock_hdr	hdr;
-	struct list_head list;
-	/* socket refcnt not held, only use for cancellation */
-	struct vsock_sock *vsk;
-	void *buf;
-	u32 buf_len;
-	u32 len;
-	u32 off;
-	bool reply;
-	bool tap_delivered;
-};
-
 struct virtio_vsock_pkt_info {
 	u32 remote_cid, remote_port;
 	struct vsock_sock *vsk;
@@ -68,7 +89,7 @@ struct virtio_transport {
 	struct vsock_transport transport;
 
 	/* Takes ownership of the packet */
-	int (*send_pkt)(struct virtio_vsock_pkt *pkt);
+	int (*send_pkt)(struct sk_buff *skb);
 };
 
 ssize_t
@@ -149,11 +170,10 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
 void virtio_transport_destruct(struct vsock_sock *vsk);
 
 void virtio_transport_recv_pkt(struct virtio_transport *t,
-			       struct virtio_vsock_pkt *pkt);
-void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
-void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
+			       struct sk_buff *skb);
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb);
 u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
 void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
-void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt);
-
+void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
+int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue);
 #endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f04abf662ec6..e348b2d09eac 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -748,6 +748,7 @@ static struct sock *__vsock_create(struct net *net,
 	vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
 	vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
 
+	sk->sk_allocation = GFP_KERNEL;
 	sk->sk_destruct = vsock_sk_destruct;
 	sk->sk_backlog_rcv = vsock_queue_rcv_skb;
 	sock_reset_flag(sk, SOCK_DONE);
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index ad64f403536a..3bb293fd8607 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -21,6 +21,12 @@
 #include <linux/mutex.h>
 #include <net/af_vsock.h>
 
+#define VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE	\
+	(VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE \
+		 - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) \
+		 - sizeof(struct virtio_vsock_hdr) \
+		 - sizeof(struct virtio_vsock_metadata))
+
 static struct workqueue_struct *virtio_vsock_workqueue;
 static struct virtio_vsock __rcu *the_virtio_vsock;
 static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
@@ -42,8 +48,7 @@ struct virtio_vsock {
 	bool tx_run;
 
 	struct work_struct send_pkt_work;
-	spinlock_t send_pkt_list_lock;
-	struct list_head send_pkt_list;
+	struct sk_buff_head send_pkt_queue;
 
 	atomic_t queued_replies;
 
@@ -101,41 +106,32 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 	vq = vsock->vqs[VSOCK_VQ_TX];
 
 	for (;;) {
-		struct virtio_vsock_pkt *pkt;
+		struct sk_buff *skb;
 		struct scatterlist hdr, buf, *sgs[2];
 		int ret, in_sg = 0, out_sg = 0;
 		bool reply;
 
-		spin_lock_bh(&vsock->send_pkt_list_lock);
-		if (list_empty(&vsock->send_pkt_list)) {
-			spin_unlock_bh(&vsock->send_pkt_list_lock);
-			break;
-		}
+		skb = skb_dequeue(&vsock->send_pkt_queue);
 
-		pkt = list_first_entry(&vsock->send_pkt_list,
-				       struct virtio_vsock_pkt, list);
-		list_del_init(&pkt->list);
-		spin_unlock_bh(&vsock->send_pkt_list_lock);
-
-		virtio_transport_deliver_tap_pkt(pkt);
+		if (!skb)
+			break;
 
-		reply = pkt->reply;
+		virtio_transport_deliver_tap_pkt(skb);
+		reply = vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
 
-		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+		sg_init_one(&hdr, vsock_hdr(skb), sizeof(*vsock_hdr(skb)));
 		sgs[out_sg++] = &hdr;
-		if (pkt->buf) {
-			sg_init_one(&buf, pkt->buf, pkt->len);
+		if (skb->len > 0) {
+			sg_init_one(&buf, skb->data, skb->len);
 			sgs[out_sg++] = &buf;
 		}
 
-		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
+		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
 		/* Usually this means that there is no more space available in
 		 * the vq
 		 */
 		if (ret < 0) {
-			spin_lock_bh(&vsock->send_pkt_list_lock);
-			list_add(&pkt->list, &vsock->send_pkt_list);
-			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			skb_queue_head(&vsock->send_pkt_queue, skb);
 			break;
 		}
 
@@ -163,33 +159,84 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
+static inline bool
+virtio_transport_skbs_can_merge(struct sk_buff *old, struct sk_buff *new)
+{
+	return (new->len < GOOD_COPY_LEN &&
+		skb_tailroom(old) >= new->len &&
+		vsock_hdr(new)->src_cid == vsock_hdr(old)->src_cid &&
+		vsock_hdr(new)->dst_cid == vsock_hdr(old)->dst_cid &&
+		vsock_hdr(new)->src_port == vsock_hdr(old)->src_port &&
+		vsock_hdr(new)->dst_port == vsock_hdr(old)->dst_port &&
+		vsock_hdr(new)->type == vsock_hdr(old)->type &&
+		vsock_hdr(new)->flags == vsock_hdr(old)->flags &&
+		vsock_hdr(old)->op == VIRTIO_VSOCK_OP_RW &&
+		vsock_hdr(new)->op == VIRTIO_VSOCK_OP_RW);
+}
+
+/**
+ * Merge the two most recent skbs together if possible.
+ *
+ * Caller must hold the queue lock.
+ */
+static void
+virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
+{
+	struct sk_buff *old;
+
+	spin_lock_bh(&queue->lock);
+	/* In order to reduce skb memory overhead, we merge new packets with
+	 * older packets if they pass virtio_transport_skbs_can_merge().
+	 */
+	if (skb_queue_empty_lockless(queue)) {
+		__skb_queue_tail(queue, new);
+		goto out;
+	}
+
+	old = skb_peek_tail(queue);
+
+	if (!virtio_transport_skbs_can_merge(old, new)) {
+		__skb_queue_tail(queue, new);
+		goto out;
+	}
+
+	memcpy(skb_put(old, new->len), new->data, new->len);
+	vsock_hdr(old)->len = cpu_to_le32(old->len);
+	vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
+	vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
+	dev_kfree_skb_any(new);
+
+out:
+	spin_unlock_bh(&queue->lock);
+}
+
 static int
-virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
+virtio_transport_send_pkt(struct sk_buff *skb)
 {
+	struct virtio_vsock_hdr *hdr;
 	struct virtio_vsock *vsock;
-	int len = pkt->len;
+	int len = skb->len;
+
+	hdr = vsock_hdr(skb);
 
 	rcu_read_lock();
 	vsock = rcu_dereference(the_virtio_vsock);
 	if (!vsock) {
-		virtio_transport_free_pkt(pkt);
+		kfree_skb(skb);
 		len = -ENODEV;
 		goto out_rcu;
 	}
 
-	if (le64_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
-		virtio_transport_free_pkt(pkt);
+	if (le64_to_cpu(hdr->dst_cid) == vsock->guest_cid) {
+		kfree_skb(skb);
 		len = -ENODEV;
 		goto out_rcu;
 	}
 
-	if (pkt->reply)
+	if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
 		atomic_inc(&vsock->queued_replies);
 
-	spin_lock_bh(&vsock->send_pkt_list_lock);
-	list_add_tail(&pkt->list, &vsock->send_pkt_list);
-	spin_unlock_bh(&vsock->send_pkt_list_lock);
-
+	virtio_transport_add_to_queue(&vsock->send_pkt_queue, skb);
 	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
 
 out_rcu:
@@ -201,9 +248,7 @@ static int
 virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 {
 	struct virtio_vsock *vsock;
-	struct virtio_vsock_pkt *pkt, *n;
 	int cnt = 0, ret;
-	LIST_HEAD(freeme);
 
 	rcu_read_lock();
 	vsock = rcu_dereference(the_virtio_vsock);
@@ -212,20 +257,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 		goto out_rcu;
 	}
 
-	spin_lock_bh(&vsock->send_pkt_list_lock);
-	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
-		if (pkt->vsk != vsk)
-			continue;
-		list_move(&pkt->list, &freeme);
-	}
-	spin_unlock_bh(&vsock->send_pkt_list_lock);
-
-	list_for_each_entry_safe(pkt, n, &freeme, list) {
-		if (pkt->reply)
-			cnt++;
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
+	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
 
 	if (cnt) {
 		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
@@ -246,38 +278,34 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
 
 static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
 {
-	int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
-	struct virtio_vsock_pkt *pkt;
-	struct scatterlist hdr, buf, *sgs[2];
+	struct scatterlist pkt, *sgs[1];
 	struct virtqueue *vq;
 	int ret;
 
 	vq = vsock->vqs[VSOCK_VQ_RX];
 
 	do {
-		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
-		if (!pkt)
-			break;
+		struct sk_buff *skb;
+		const size_t len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE -
+				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
-		if (!pkt->buf) {
-			virtio_transport_free_pkt(pkt);
+		skb = alloc_skb(len, GFP_KERNEL);
+		if (!skb)
 			break;
-		}
 
-		pkt->buf_len = buf_len;
-		pkt->len = buf_len;
+		memset(skb->head, 0,
+		       sizeof(struct virtio_vsock_metadata) + sizeof(struct virtio_vsock_hdr));
 
-		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
-		sgs[0] = &hdr;
+		sg_init_one(&pkt, skb->head + sizeof(struct virtio_vsock_metadata),
+			    VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE);
+		sgs[0] = &pkt;
 
-		sg_init_one(&buf, pkt->buf, buf_len);
-		sgs[1] = &buf;
-		ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
-		if (ret) {
-			virtio_transport_free_pkt(pkt);
+		ret = virtqueue_add_sgs(vq, sgs, 0, 1, skb, GFP_KERNEL);
+		if (ret < 0) {
+			kfree_skb(skb);
 			break;
 		}
+
 		vsock->rx_buf_nr++;
 	} while (vq->num_free);
 	if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
@@ -299,12 +327,12 @@ static void virtio_transport_tx_work(struct work_struct *work)
 		goto out;
 
 	do {
-		struct virtio_vsock_pkt *pkt;
+		struct sk_buff *skb;
 		unsigned int len;
 
 		virtqueue_disable_cb(vq);
-		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
-			virtio_transport_free_pkt(pkt);
+		while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
+			consume_skb(skb);
 			added = true;
 		}
 	} while (!virtqueue_enable_cb(vq));
@@ -529,7 +557,8 @@ static void virtio_transport_rx_work(struct work_struct *work)
 	do {
 		virtqueue_disable_cb(vq);
 		for (;;) {
-			struct virtio_vsock_pkt *pkt;
+			struct virtio_vsock_hdr *hdr;
+			struct sk_buff *skb;
 			unsigned int len;
 
 			if (!virtio_transport_more_replies(vsock)) {
@@ -540,23 +569,24 @@ static void virtio_transport_rx_work(struct work_struct *work)
 				goto out;
 			}
 
-			pkt = virtqueue_get_buf(vq, &len);
-			if (!pkt) {
+			skb = virtqueue_get_buf(vq, &len);
+			if (!skb)
 				break;
-			}
 
 			vsock->rx_buf_nr--;
 
 			/* Drop short/long packets */
-			if (unlikely(len < sizeof(pkt->hdr) ||
-				     len > sizeof(pkt->hdr) + pkt->len)) {
-				virtio_transport_free_pkt(pkt);
+			if (unlikely(len < sizeof(*hdr) ||
+				     len > VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE)) {
+				kfree_skb(skb);
 				continue;
 			}
 
-			pkt->len = len - sizeof(pkt->hdr);
-			virtio_transport_deliver_tap_pkt(pkt);
-			virtio_transport_recv_pkt(&virtio_transport, pkt);
+			hdr = vsock_hdr(skb);
+			virtio_vsock_skb_reserve(skb);
+			virtio_vsock_skb_rx_put(skb);
+			virtio_transport_deliver_tap_pkt(skb);
+			virtio_transport_recv_pkt(&virtio_transport, skb);
 		}
 	} while (!virtqueue_enable_cb(vq));
 
@@ -610,7 +640,7 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
 static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
 {
 	struct virtio_device *vdev = vsock->vdev;
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
 
 	/* Reset all connected sockets when the VQs disappear */
 	vsock_for_each_connected_socket(&virtio_transport.transport,
@@ -637,23 +667,16 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
 	virtio_reset_device(vdev);
 
 	mutex_lock(&vsock->rx_lock);
-	while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
-		virtio_transport_free_pkt(pkt);
+	while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
+		kfree_skb(skb);
 	mutex_unlock(&vsock->rx_lock);
 
 	mutex_lock(&vsock->tx_lock);
-	while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
-		virtio_transport_free_pkt(pkt);
+	while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
+		kfree_skb(skb);
 	mutex_unlock(&vsock->tx_lock);
 
-	spin_lock_bh(&vsock->send_pkt_list_lock);
-	while (!list_empty(&vsock->send_pkt_list)) {
-		pkt = list_first_entry(&vsock->send_pkt_list,
-				       struct virtio_vsock_pkt, list);
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
-	spin_unlock_bh(&vsock->send_pkt_list_lock);
+	skb_queue_purge(&vsock->send_pkt_queue);
 
 	/* Delete virtqueues and flush outstanding callbacks if any */
 	vdev->config->del_vqs(vdev);
@@ -690,8 +713,7 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 	mutex_init(&vsock->tx_lock);
 	mutex_init(&vsock->rx_lock);
 	mutex_init(&vsock->event_lock);
-	spin_lock_init(&vsock->send_pkt_list_lock);
-	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	skb_queue_head_init(&vsock->send_pkt_queue);
 	INIT_WORK(&vsock->rx_work, virtio_transport_rx_work);
 	INIT_WORK(&vsock->tx_work, virtio_transport_tx_work);
 	INIT_WORK(&vsock->event_work, virtio_transport_event_work);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index ec2c2afbf0d0..920578597bb9 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,53 +37,81 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
 	return container_of(t, struct virtio_transport, transport);
 }
 
-static struct virtio_vsock_pkt *
-virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
+/* Returns a new packet on success, otherwise returns NULL.
+ *
+ * If NULL is returned, errp is set to a negative errno.
+ */
+static struct sk_buff *
+virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
 			   size_t len,
 			   u32 src_cid,
 			   u32 src_port,
 			   u32 dst_cid,
-			   u32 dst_port)
+			   u32 dst_port,
+			   int *errp)
 {
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
+	struct virtio_vsock_hdr *hdr;
+	void *payload;
+	const size_t skb_len = sizeof(*hdr) + sizeof(struct virtio_vsock_metadata) + len;
 	int err;
 
-	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
-	if (!pkt)
-		return NULL;
+	if (info->vsk) {
+		unsigned int msg_flags = info->msg ? info->msg->msg_flags : 0;
+		struct sock *sk;
 
-	pkt->hdr.type		= cpu_to_le16(info->type);
-	pkt->hdr.op		= cpu_to_le16(info->op);
-	pkt->hdr.src_cid	= cpu_to_le64(src_cid);
-	pkt->hdr.dst_cid	= cpu_to_le64(dst_cid);
-	pkt->hdr.src_port	= cpu_to_le32(src_port);
-	pkt->hdr.dst_port	= cpu_to_le32(dst_port);
-	pkt->hdr.flags		= cpu_to_le32(info->flags);
-	pkt->len		= len;
-	pkt->hdr.len		= cpu_to_le32(len);
-	pkt->reply		= info->reply;
-	pkt->vsk		= info->vsk;
+		sk = sk_vsock(info->vsk);
+		skb = sock_alloc_send_skb(sk, skb_len,
+					  msg_flags & MSG_DONTWAIT, errp);
 
-	if (info->msg && len > 0) {
-		pkt->buf = kmalloc(len, GFP_KERNEL);
-		if (!pkt->buf)
-			goto out_pkt;
+		if (skb)
+			skb->priority = sk->sk_priority;
+	} else {
+		skb = alloc_skb(skb_len, GFP_KERNEL);
+	}
+
+	if (!skb) {
+		/* If using alloc_skb(), the skb is NULL due to lacking memory.
+		 * Otherwise, errp is set by sock_alloc_send_skb().
+		 */
+		if (!info->vsk)
+			*errp = -ENOMEM;
+		return NULL;
+	}
 
-		pkt->buf_len = len;
+	memset(skb->head, 0, sizeof(*hdr) + sizeof(struct virtio_vsock_metadata));
+	virtio_vsock_skb_reserve(skb);
+	payload = skb_put(skb, len);
 
-		err = memcpy_from_msg(pkt->buf, info->msg, len);
-		if (err)
+	hdr = vsock_hdr(skb);
+	hdr->type	= cpu_to_le16(info->type);
+	hdr->op		= cpu_to_le16(info->op);
+	hdr->src_cid	= cpu_to_le64(src_cid);
+	hdr->dst_cid	= cpu_to_le64(dst_cid);
+	hdr->src_port	= cpu_to_le32(src_port);
+	hdr->dst_port	= cpu_to_le32(dst_port);
+	hdr->flags	= cpu_to_le32(info->flags);
+	hdr->len	= cpu_to_le32(len);
+
+	if (info->msg && len > 0) {
+		err = memcpy_from_msg(payload, info->msg, len);
+		if (err) {
+			*errp = -ENOMEM;
 			goto out;
+		}
 
 		if (msg_data_left(info->msg) == 0 &&
 		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
-			pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
 
 			if (info->msg->msg_flags & MSG_EOR)
-				pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
 		}
 	}
 
+	if (info->reply)
+		vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
+
 	trace_virtio_transport_alloc_pkt(src_cid, src_port,
 					 dst_cid, dst_port,
 					 len,
@@ -91,85 +119,26 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
 					 info->op,
 					 info->flags);
 
-	return pkt;
+	return skb;
 
 out:
-	kfree(pkt->buf);
-out_pkt:
-	kfree(pkt);
+	kfree_skb(skb);
 	return NULL;
 }
 
 /* Packet capture */
 static struct sk_buff *virtio_transport_build_skb(void *opaque)
 {
-	struct virtio_vsock_pkt *pkt = opaque;
-	struct af_vsockmon_hdr *hdr;
-	struct sk_buff *skb;
-	size_t payload_len;
-	void *payload_buf;
-
-	/* A packet could be split to fit the RX buffer, so we can retrieve
-	 * the payload length from the header and the buffer pointer taking
-	 * care of the offset in the original packet.
-	 */
-	payload_len = le32_to_cpu(pkt->hdr.len);
-	payload_buf = pkt->buf + pkt->off;
-
-	skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + payload_len,
-			GFP_ATOMIC);
-	if (!skb)
-		return NULL;
-
-	hdr = skb_put(skb, sizeof(*hdr));
-
-	/* pkt->hdr is little-endian so no need to byteswap here */
-	hdr->src_cid = pkt->hdr.src_cid;
-	hdr->src_port = pkt->hdr.src_port;
-	hdr->dst_cid = pkt->hdr.dst_cid;
-	hdr->dst_port = pkt->hdr.dst_port;
-
-	hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
-	hdr->len = cpu_to_le16(sizeof(pkt->hdr));
-	memset(hdr->reserved, 0, sizeof(hdr->reserved));
-
-	switch (le16_to_cpu(pkt->hdr.op)) {
-	case VIRTIO_VSOCK_OP_REQUEST:
-	case VIRTIO_VSOCK_OP_RESPONSE:
-		hdr->op = cpu_to_le16(AF_VSOCK_OP_CONNECT);
-		break;
-	case VIRTIO_VSOCK_OP_RST:
-	case VIRTIO_VSOCK_OP_SHUTDOWN:
-		hdr->op = cpu_to_le16(AF_VSOCK_OP_DISCONNECT);
-		break;
-	case VIRTIO_VSOCK_OP_RW:
-		hdr->op = cpu_to_le16(AF_VSOCK_OP_PAYLOAD);
-		break;
-	case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
-	case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
-		hdr->op = cpu_to_le16(AF_VSOCK_OP_CONTROL);
-		break;
-	default:
-		hdr->op = cpu_to_le16(AF_VSOCK_OP_UNKNOWN);
-		break;
-	}
-
-	skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
-
-	if (payload_len) {
-		skb_put_data(skb, payload_buf, payload_len);
-	}
-
-	return skb;
+	return (struct sk_buff *)opaque;
 }
 
-void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
+void virtio_transport_deliver_tap_pkt(struct sk_buff *skb)
 {
-	if (pkt->tap_delivered)
+	if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED)
 		return;
 
-	vsock_deliver_tap(virtio_transport_build_skb, pkt);
-	pkt->tap_delivered = true;
+	vsock_deliver_tap(virtio_transport_build_skb, skb);
+	vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
 }
 EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
 
@@ -192,8 +161,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	u32 src_cid, src_port, dst_cid, dst_port;
 	const struct virtio_transport *t_ops;
 	struct virtio_vsock_sock *vvs;
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
 	u32 pkt_len = info->pkt_len;
+	int err;
 
 	info->type = virtio_transport_get_type(sk_vsock(vsk));
 
@@ -224,42 +194,47 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
 		return pkt_len;
 
-	pkt = virtio_transport_alloc_pkt(info, pkt_len,
+	skb = virtio_transport_alloc_skb(info, pkt_len,
 					 src_cid, src_port,
-					 dst_cid, dst_port);
-	if (!pkt) {
+					 dst_cid, dst_port,
+					 &err);
+	if (!skb) {
 		virtio_transport_put_credit(vvs, pkt_len);
-		return -ENOMEM;
+		return err;
 	}
 
-	virtio_transport_inc_tx_pkt(vvs, pkt);
+	virtio_transport_inc_tx_pkt(vvs, skb);
+
+	err = t_ops->send_pkt(skb);
 
-	return t_ops->send_pkt(pkt);
+	return err < 0 ? -ENOMEM : err;
 }
 
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
-					struct virtio_vsock_pkt *pkt)
+					struct sk_buff *skb)
 {
-	if (vvs->rx_bytes + pkt->len > vvs->buf_alloc)
+	if (vvs->rx_bytes + skb->len > vvs->buf_alloc)
 		return false;
 
-	vvs->rx_bytes += pkt->len;
+	vvs->rx_bytes += skb->len;
 	return true;
 }
 
 static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
-					struct virtio_vsock_pkt *pkt)
+					struct sk_buff *skb)
 {
-	vvs->rx_bytes -= pkt->len;
-	vvs->fwd_cnt += pkt->len;
+	vvs->rx_bytes -= skb->len;
+	vvs->fwd_cnt += skb->len;
 }
 
-void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb)
 {
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
+
 	spin_lock_bh(&vvs->rx_lock);
 	vvs->last_fwd_cnt = vvs->fwd_cnt;
-	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
-	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
+	hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
+	hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
 	spin_unlock_bh(&vvs->rx_lock);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
@@ -303,29 +278,29 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
 				size_t len)
 {
 	struct virtio_vsock_sock *vvs = vsk->trans;
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb, *tmp;
 	size_t bytes, total = 0, off;
 	int err = -EFAULT;
 
 	spin_lock_bh(&vvs->rx_lock);
 
-	list_for_each_entry(pkt, &vvs->rx_queue, list) {
-		off = pkt->off;
+	skb_queue_walk_safe(&vvs->rx_queue, skb,  tmp) {
+		off = vsock_metadata(skb)->off;
 
 		if (total == len)
 			break;
 
-		while (total < len && off < pkt->len) {
+		while (total < len && off < skb->len) {
 			bytes = len - total;
-			if (bytes > pkt->len - off)
-				bytes = pkt->len - off;
+			if (bytes > skb->len - off)
+				bytes = skb->len - off;
 
 			/* sk_lock is held by caller so no one else can dequeue.
 			 * Unlock rx_lock since memcpy_to_msg() may sleep.
 			 */
 			spin_unlock_bh(&vvs->rx_lock);
 
-			err = memcpy_to_msg(msg, pkt->buf + off, bytes);
+			err = memcpy_to_msg(msg, skb->data + off, bytes);
 			if (err)
 				goto out;
 
@@ -352,37 +327,40 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 				   size_t len)
 {
 	struct virtio_vsock_sock *vvs = vsk->trans;
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
 	size_t bytes, total = 0;
 	u32 free_space;
 	int err = -EFAULT;
 
 	spin_lock_bh(&vvs->rx_lock);
-	while (total < len && !list_empty(&vvs->rx_queue)) {
-		pkt = list_first_entry(&vvs->rx_queue,
-				       struct virtio_vsock_pkt, list);
+	while (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
+		skb = __skb_dequeue(&vvs->rx_queue);
 
 		bytes = len - total;
-		if (bytes > pkt->len - pkt->off)
-			bytes = pkt->len - pkt->off;
+		if (bytes > skb->len - vsock_metadata(skb)->off)
+			bytes = skb->len - vsock_metadata(skb)->off;
 
 		/* sk_lock is held by caller so no one else can dequeue.
 		 * Unlock rx_lock since memcpy_to_msg() may sleep.
 		 */
 		spin_unlock_bh(&vvs->rx_lock);
 
-		err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
+		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, bytes);
 		if (err)
 			goto out;
 
 		spin_lock_bh(&vvs->rx_lock);
 
 		total += bytes;
-		pkt->off += bytes;
-		if (pkt->off == pkt->len) {
-			virtio_transport_dec_rx_pkt(vvs, pkt);
-			list_del(&pkt->list);
-			virtio_transport_free_pkt(pkt);
+		vsock_metadata(skb)->off += bytes;
+
+		WARN_ON(vsock_metadata(skb)->off > skb->len);
+
+		if (vsock_metadata(skb)->off == skb->len) {
+			virtio_transport_dec_rx_pkt(vvs, skb);
+			consume_skb(skb);
+		} else {
+			__skb_queue_head(&vvs->rx_queue, skb);
 		}
 	}
 
@@ -414,7 +392,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
 						 int flags)
 {
 	struct virtio_vsock_sock *vvs = vsk->trans;
-	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
 	int dequeued_len = 0;
 	size_t user_buf_len = msg_data_left(msg);
 	bool msg_ready = false;
@@ -427,13 +405,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
 	}
 
 	while (!msg_ready) {
-		pkt = list_first_entry(&vvs->rx_queue, struct virtio_vsock_pkt, list);
+		struct virtio_vsock_hdr *hdr;
+
+		skb = __skb_dequeue(&vvs->rx_queue);
+		hdr = vsock_hdr(skb);
 
 		if (dequeued_len >= 0) {
 			size_t pkt_len;
 			size_t bytes_to_copy;
 
-			pkt_len = (size_t)le32_to_cpu(pkt->hdr.len);
+			pkt_len = (size_t)le32_to_cpu(hdr->len);
 			bytes_to_copy = min(user_buf_len, pkt_len);
 
 			if (bytes_to_copy) {
@@ -444,7 +425,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
 				 */
 				spin_unlock_bh(&vvs->rx_lock);
 
-				err = memcpy_to_msg(msg, pkt->buf, bytes_to_copy);
+				err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
 				if (err) {
 					/* Copy of message failed. Rest of
 					 * fragments will be freed without copy.
@@ -461,17 +442,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
 				dequeued_len += pkt_len;
 		}
 
-		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
+		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
 			msg_ready = true;
 			vvs->msg_count--;
 
-			if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR)
+			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
 				msg->msg_flags |= MSG_EOR;
 		}
 
-		virtio_transport_dec_rx_pkt(vvs, pkt);
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
+		virtio_transport_dec_rx_pkt(vvs, skb);
+		kfree_skb(skb);
 	}
 
 	spin_unlock_bh(&vvs->rx_lock);
@@ -609,7 +589,7 @@ int virtio_transport_do_socket_init(struct vsock_sock *vsk,
 
 	spin_lock_init(&vvs->rx_lock);
 	spin_lock_init(&vvs->tx_lock);
-	INIT_LIST_HEAD(&vvs->rx_queue);
+	skb_queue_head_init(&vvs->rx_queue);
 
 	return 0;
 }
@@ -809,16 +789,16 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
 EXPORT_SYMBOL_GPL(virtio_transport_destruct);
 
 static int virtio_transport_reset(struct vsock_sock *vsk,
-				  struct virtio_vsock_pkt *pkt)
+				  struct sk_buff *skb)
 {
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_RST,
-		.reply = !!pkt,
+		.reply = !!skb,
 		.vsk = vsk,
 	};
 
 	/* Send RST only if the original pkt is not a RST pkt */
-	if (pkt && le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+	if (skb && le16_to_cpu(vsock_hdr(skb)->op) == VIRTIO_VSOCK_OP_RST)
 		return 0;
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -828,29 +808,32 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
  * attempt was made to connect to a socket that does not exist.
  */
 static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
-					  struct virtio_vsock_pkt *pkt)
+					  struct sk_buff *skb)
 {
-	struct virtio_vsock_pkt *reply;
+	struct sk_buff *reply;
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_RST,
-		.type = le16_to_cpu(pkt->hdr.type),
+		.type = le16_to_cpu(hdr->type),
 		.reply = true,
 	};
+	int err;
 
 	/* Send RST only if the original pkt is not a RST pkt */
-	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+	if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
 		return 0;
 
-	reply = virtio_transport_alloc_pkt(&info, 0,
-					   le64_to_cpu(pkt->hdr.dst_cid),
-					   le32_to_cpu(pkt->hdr.dst_port),
-					   le64_to_cpu(pkt->hdr.src_cid),
-					   le32_to_cpu(pkt->hdr.src_port));
+	reply = virtio_transport_alloc_skb(&info, 0,
+					   le64_to_cpu(hdr->dst_cid),
+					   le32_to_cpu(hdr->dst_port),
+					   le64_to_cpu(hdr->src_cid),
+					   le32_to_cpu(hdr->src_port),
+					   &err);
 	if (!reply)
-		return -ENOMEM;
+		return err;
 
 	if (!t) {
-		virtio_transport_free_pkt(reply);
+		kfree_skb(reply);
 		return -ENOTCONN;
 	}
 
@@ -861,16 +844,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 static void virtio_transport_remove_sock(struct vsock_sock *vsk)
 {
 	struct virtio_vsock_sock *vvs = vsk->trans;
-	struct virtio_vsock_pkt *pkt, *tmp;
 
 	/* We don't need to take rx_lock, as the socket is closing and we are
 	 * removing it.
 	 */
-	list_for_each_entry_safe(pkt, tmp, &vvs->rx_queue, list) {
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
-
+	__skb_queue_purge(&vvs->rx_queue);
 	vsock_remove_sock(vsk);
 }
 
@@ -984,13 +962,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_release);
 
 static int
 virtio_transport_recv_connecting(struct sock *sk,
-				 struct virtio_vsock_pkt *pkt)
+				 struct sk_buff *skb)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	int err;
 	int skerr;
 
-	switch (le16_to_cpu(pkt->hdr.op)) {
+	switch (le16_to_cpu(hdr->op)) {
 	case VIRTIO_VSOCK_OP_RESPONSE:
 		sk->sk_state = TCP_ESTABLISHED;
 		sk->sk_socket->state = SS_CONNECTED;
@@ -1011,7 +990,7 @@ virtio_transport_recv_connecting(struct sock *sk,
 	return 0;
 
 destroy:
-	virtio_transport_reset(vsk, pkt);
+	virtio_transport_reset(vsk, skb);
 	sk->sk_state = TCP_CLOSE;
 	sk->sk_err = skerr;
 	sk_error_report(sk);
@@ -1020,34 +999,38 @@ virtio_transport_recv_connecting(struct sock *sk,
 
 static void
 virtio_transport_recv_enqueue(struct vsock_sock *vsk,
-			      struct virtio_vsock_pkt *pkt)
+			      struct sk_buff *skb)
 {
 	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct virtio_vsock_hdr *hdr;
 	bool can_enqueue, free_pkt = false;
+	u32 len;
 
-	pkt->len = le32_to_cpu(pkt->hdr.len);
-	pkt->off = 0;
+	hdr = vsock_hdr(skb);
+	len = le32_to_cpu(hdr->len);
+	vsock_metadata(skb)->off = 0;
 
 	spin_lock_bh(&vvs->rx_lock);
 
-	can_enqueue = virtio_transport_inc_rx_pkt(vvs, pkt);
+	can_enqueue = virtio_transport_inc_rx_pkt(vvs, skb);
 	if (!can_enqueue) {
 		free_pkt = true;
 		goto out;
 	}
 
-	if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)
+	if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)
 		vvs->msg_count++;
 
 	/* Try to copy small packets into the buffer of last packet queued,
 	 * to avoid wasting memory queueing the entire buffer with a small
 	 * payload.
 	 */
-	if (pkt->len <= GOOD_COPY_LEN && !list_empty(&vvs->rx_queue)) {
-		struct virtio_vsock_pkt *last_pkt;
+	if (len <= GOOD_COPY_LEN && !skb_queue_empty_lockless(&vvs->rx_queue)) {
+		struct virtio_vsock_hdr *last_hdr;
+		struct sk_buff *last_skb;
 
-		last_pkt = list_last_entry(&vvs->rx_queue,
-					   struct virtio_vsock_pkt, list);
+		last_skb = skb_peek_tail(&vvs->rx_queue);
+		last_hdr = vsock_hdr(last_skb);
 
 		/* If there is space in the last packet queued, we copy the
 		 * new packet in its buffer. We avoid this if the last packet
@@ -1055,35 +1038,35 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
 		 * delimiter of SEQPACKET message, so 'pkt' is the first packet
 		 * of a new message.
 		 */
-		if ((pkt->len <= last_pkt->buf_len - last_pkt->len) &&
-		    !(le32_to_cpu(last_pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)) {
-			memcpy(last_pkt->buf + last_pkt->len, pkt->buf,
-			       pkt->len);
-			last_pkt->len += pkt->len;
+		if (skb->len < skb_tailroom(last_skb) &&
+		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) &&
+		    (vsock_hdr(skb)->type != VIRTIO_VSOCK_TYPE_DGRAM)) {
+			memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
 			free_pkt = true;
-			last_pkt->hdr.flags |= pkt->hdr.flags;
+			last_hdr->flags |= hdr->flags;
 			goto out;
 		}
 	}
 
-	list_add_tail(&pkt->list, &vvs->rx_queue);
+	__skb_queue_tail(&vvs->rx_queue, skb);
 
 out:
 	spin_unlock_bh(&vvs->rx_lock);
 	if (free_pkt)
-		virtio_transport_free_pkt(pkt);
+		kfree_skb(skb);
 }
 
 static int
 virtio_transport_recv_connected(struct sock *sk,
-				struct virtio_vsock_pkt *pkt)
+				struct sk_buff *skb)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	int err = 0;
 
-	switch (le16_to_cpu(pkt->hdr.op)) {
+	switch (le16_to_cpu(hdr->op)) {
 	case VIRTIO_VSOCK_OP_RW:
-		virtio_transport_recv_enqueue(vsk, pkt);
+		virtio_transport_recv_enqueue(vsk, skb);
 		sk->sk_data_ready(sk);
 		return err;
 	case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
@@ -1093,18 +1076,17 @@ virtio_transport_recv_connected(struct sock *sk,
 		sk->sk_write_space(sk);
 		break;
 	case VIRTIO_VSOCK_OP_SHUTDOWN:
-		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
+		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
 			vsk->peer_shutdown |= RCV_SHUTDOWN;
-		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
+		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
 			vsk->peer_shutdown |= SEND_SHUTDOWN;
 		if (vsk->peer_shutdown == SHUTDOWN_MASK &&
 		    vsock_stream_has_data(vsk) <= 0 &&
 		    !sock_flag(sk, SOCK_DONE)) {
 			(void)virtio_transport_reset(vsk, NULL);
-
 			virtio_transport_do_close(vsk, true);
 		}
-		if (le32_to_cpu(pkt->hdr.flags))
+		if (le32_to_cpu(vsock_hdr(skb)->flags))
 			sk->sk_state_change(sk);
 		break;
 	case VIRTIO_VSOCK_OP_RST:
@@ -1115,28 +1097,30 @@ virtio_transport_recv_connected(struct sock *sk,
 		break;
 	}
 
-	virtio_transport_free_pkt(pkt);
+	kfree_skb(skb);
 	return err;
 }
 
 static void
 virtio_transport_recv_disconnecting(struct sock *sk,
-				    struct virtio_vsock_pkt *pkt)
+				    struct sk_buff *skb)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 
-	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+	if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
 		virtio_transport_do_close(vsk, true);
 }
 
 static int
 virtio_transport_send_response(struct vsock_sock *vsk,
-			       struct virtio_vsock_pkt *pkt)
+			       struct sk_buff *skb)
 {
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_RESPONSE,
-		.remote_cid = le64_to_cpu(pkt->hdr.src_cid),
-		.remote_port = le32_to_cpu(pkt->hdr.src_port),
+		.remote_cid = le64_to_cpu(hdr->src_cid),
+		.remote_port = le32_to_cpu(hdr->src_port),
 		.reply = true,
 		.vsk = vsk,
 	};
@@ -1145,10 +1129,11 @@ virtio_transport_send_response(struct vsock_sock *vsk,
 }
 
 static bool virtio_transport_space_update(struct sock *sk,
-					  struct virtio_vsock_pkt *pkt)
+					  struct sk_buff *skb)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
 	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	bool space_available;
 
 	/* Listener sockets are not associated with any transport, so we are
@@ -1161,8 +1146,8 @@ static bool virtio_transport_space_update(struct sock *sk,
 
 	/* buf_alloc and fwd_cnt is always included in the hdr */
 	spin_lock_bh(&vvs->tx_lock);
-	vvs->peer_buf_alloc = le32_to_cpu(pkt->hdr.buf_alloc);
-	vvs->peer_fwd_cnt = le32_to_cpu(pkt->hdr.fwd_cnt);
+	vvs->peer_buf_alloc = le32_to_cpu(hdr->buf_alloc);
+	vvs->peer_fwd_cnt = le32_to_cpu(hdr->fwd_cnt);
 	space_available = virtio_transport_has_space(vsk);
 	spin_unlock_bh(&vvs->tx_lock);
 	return space_available;
@@ -1170,27 +1155,28 @@ static bool virtio_transport_space_update(struct sock *sk,
 
 /* Handle server socket */
 static int
-virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
+virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 			     struct virtio_transport *t)
 {
 	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	struct vsock_sock *vchild;
 	struct sock *child;
 	int ret;
 
-	if (le16_to_cpu(pkt->hdr.op) != VIRTIO_VSOCK_OP_REQUEST) {
-		virtio_transport_reset_no_sock(t, pkt);
+	if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
+		virtio_transport_reset_no_sock(t, skb);
 		return -EINVAL;
 	}
 
 	if (sk_acceptq_is_full(sk)) {
-		virtio_transport_reset_no_sock(t, pkt);
+		virtio_transport_reset_no_sock(t, skb);
 		return -ENOMEM;
 	}
 
 	child = vsock_create_connected(sk);
 	if (!child) {
-		virtio_transport_reset_no_sock(t, pkt);
+		virtio_transport_reset_no_sock(t, skb);
 		return -ENOMEM;
 	}
 
@@ -1201,10 +1187,10 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
 	child->sk_state = TCP_ESTABLISHED;
 
 	vchild = vsock_sk(child);
-	vsock_addr_init(&vchild->local_addr, le64_to_cpu(pkt->hdr.dst_cid),
-			le32_to_cpu(pkt->hdr.dst_port));
-	vsock_addr_init(&vchild->remote_addr, le64_to_cpu(pkt->hdr.src_cid),
-			le32_to_cpu(pkt->hdr.src_port));
+	vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
+			le32_to_cpu(hdr->dst_port));
+	vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
+			le32_to_cpu(hdr->src_port));
 
 	ret = vsock_assign_transport(vchild, vsk);
 	/* Transport assigned (looking at remote_addr) must be the same
@@ -1212,17 +1198,17 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
 	 */
 	if (ret || vchild->transport != &t->transport) {
 		release_sock(child);
-		virtio_transport_reset_no_sock(t, pkt);
+		virtio_transport_reset_no_sock(t, skb);
 		sock_put(child);
 		return ret;
 	}
 
-	if (virtio_transport_space_update(child, pkt))
+	if (virtio_transport_space_update(child, skb))
 		child->sk_write_space(child);
 
 	vsock_insert_connected(vchild);
 	vsock_enqueue_accept(sk, child);
-	virtio_transport_send_response(vchild, pkt);
+	virtio_transport_send_response(vchild, skb);
 
 	release_sock(child);
 
@@ -1240,29 +1226,30 @@ static bool virtio_transport_valid_type(u16 type)
  * lock.
  */
 void virtio_transport_recv_pkt(struct virtio_transport *t,
-			       struct virtio_vsock_pkt *pkt)
+			       struct sk_buff *skb)
 {
 	struct sockaddr_vm src, dst;
 	struct vsock_sock *vsk;
 	struct sock *sk;
 	bool space_available;
+	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 
-	vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
-			le32_to_cpu(pkt->hdr.src_port));
-	vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
-			le32_to_cpu(pkt->hdr.dst_port));
+	vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
+			le32_to_cpu(hdr->src_port));
+	vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
+			le32_to_cpu(hdr->dst_port));
 
 	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
 					dst.svm_cid, dst.svm_port,
-					le32_to_cpu(pkt->hdr.len),
-					le16_to_cpu(pkt->hdr.type),
-					le16_to_cpu(pkt->hdr.op),
-					le32_to_cpu(pkt->hdr.flags),
-					le32_to_cpu(pkt->hdr.buf_alloc),
-					le32_to_cpu(pkt->hdr.fwd_cnt));
-
-	if (!virtio_transport_valid_type(le16_to_cpu(pkt->hdr.type))) {
-		(void)virtio_transport_reset_no_sock(t, pkt);
+					le32_to_cpu(hdr->len),
+					le16_to_cpu(hdr->type),
+					le16_to_cpu(hdr->op),
+					le32_to_cpu(hdr->flags),
+					le32_to_cpu(hdr->buf_alloc),
+					le32_to_cpu(hdr->fwd_cnt));
+
+	if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
+		(void)virtio_transport_reset_no_sock(t, skb);
 		goto free_pkt;
 	}
 
@@ -1273,13 +1260,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 	if (!sk) {
 		sk = vsock_find_bound_socket(&dst);
 		if (!sk) {
-			(void)virtio_transport_reset_no_sock(t, pkt);
+			(void)virtio_transport_reset_no_sock(t, skb);
 			goto free_pkt;
 		}
 	}
 
-	if (virtio_transport_get_type(sk) != le16_to_cpu(pkt->hdr.type)) {
-		(void)virtio_transport_reset_no_sock(t, pkt);
+	if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
+		(void)virtio_transport_reset_no_sock(t, skb);
 		sock_put(sk);
 		goto free_pkt;
 	}
@@ -1290,13 +1277,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 
 	/* Check if sk has been closed before lock_sock */
 	if (sock_flag(sk, SOCK_DONE)) {
-		(void)virtio_transport_reset_no_sock(t, pkt);
+		(void)virtio_transport_reset_no_sock(t, skb);
 		release_sock(sk);
 		sock_put(sk);
 		goto free_pkt;
 	}
 
-	space_available = virtio_transport_space_update(sk, pkt);
+	space_available = virtio_transport_space_update(sk, skb);
 
 	/* Update CID in case it has changed after a transport reset event */
 	if (vsk->local_addr.svm_cid != VMADDR_CID_ANY)
@@ -1307,23 +1294,23 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 
 	switch (sk->sk_state) {
 	case TCP_LISTEN:
-		virtio_transport_recv_listen(sk, pkt, t);
-		virtio_transport_free_pkt(pkt);
+		virtio_transport_recv_listen(sk, skb, t);
+		kfree_skb(skb);
 		break;
 	case TCP_SYN_SENT:
-		virtio_transport_recv_connecting(sk, pkt);
-		virtio_transport_free_pkt(pkt);
+		virtio_transport_recv_connecting(sk, skb);
+		kfree_skb(skb);
 		break;
 	case TCP_ESTABLISHED:
-		virtio_transport_recv_connected(sk, pkt);
+		virtio_transport_recv_connected(sk, skb);
 		break;
 	case TCP_CLOSING:
-		virtio_transport_recv_disconnecting(sk, pkt);
-		virtio_transport_free_pkt(pkt);
+		virtio_transport_recv_disconnecting(sk, skb);
+		kfree_skb(skb);
 		break;
 	default:
-		(void)virtio_transport_reset_no_sock(t, pkt);
-		virtio_transport_free_pkt(pkt);
+		(void)virtio_transport_reset_no_sock(t, skb);
+		kfree_skb(skb);
 		break;
 	}
 
@@ -1336,16 +1323,42 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 	return;
 
 free_pkt:
-	virtio_transport_free_pkt(pkt);
+	kfree(skb);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
 
-void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
+/* Remove skbs found in a queue that have a vsk that matches.
+ *
+ * Each skb is freed.
+ *
+ * Returns the count of skbs that were reply packets.
+ */
+int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue)
 {
-	kfree(pkt->buf);
-	kfree(pkt);
+	int cnt = 0;
+	struct sk_buff *skb, *tmp;
+	struct sk_buff_head freeme;
+
+	skb_queue_head_init(&freeme);
+
+	spin_lock_bh(&queue->lock);
+	skb_queue_walk_safe(queue, skb, tmp) {
+		if (vsock_sk(skb->sk) != vsk)
+			continue;
+
+		__skb_unlink(skb, queue);
+		skb_queue_tail(&freeme, skb);
+
+		if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
+			cnt++;
+	}
+	spin_unlock_bh(&queue->lock);
+
+	skb_queue_purge(&freeme);
+
+	return cnt;
 }
-EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
+EXPORT_SYMBOL_GPL(virtio_transport_purge_skbs);
 
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Asias He");
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 169a8cf65b39..906f7cdff65e 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -16,7 +16,7 @@ struct vsock_loopback {
 	struct workqueue_struct *workqueue;
 
 	spinlock_t pkt_list_lock; /* protects pkt_list */
-	struct list_head pkt_list;
+	struct sk_buff_head pkt_queue;
 	struct work_struct pkt_work;
 };
 
@@ -27,13 +27,13 @@ static u32 vsock_loopback_get_local_cid(void)
 	return VMADDR_CID_LOCAL;
 }
 
-static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
+static int vsock_loopback_send_pkt(struct sk_buff *skb)
 {
 	struct vsock_loopback *vsock = &the_vsock_loopback;
-	int len = pkt->len;
+	int len = skb->len;
 
 	spin_lock_bh(&vsock->pkt_list_lock);
-	list_add_tail(&pkt->list, &vsock->pkt_list);
+	skb_queue_tail(&vsock->pkt_queue, skb);
 	spin_unlock_bh(&vsock->pkt_list_lock);
 
 	queue_work(vsock->workqueue, &vsock->pkt_work);
@@ -44,21 +44,8 @@ static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
 static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
 {
 	struct vsock_loopback *vsock = &the_vsock_loopback;
-	struct virtio_vsock_pkt *pkt, *n;
-	LIST_HEAD(freeme);
 
-	spin_lock_bh(&vsock->pkt_list_lock);
-	list_for_each_entry_safe(pkt, n, &vsock->pkt_list, list) {
-		if (pkt->vsk != vsk)
-			continue;
-		list_move(&pkt->list, &freeme);
-	}
-	spin_unlock_bh(&vsock->pkt_list_lock);
-
-	list_for_each_entry_safe(pkt, n, &freeme, list) {
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
+	virtio_transport_purge_skbs(vsk, &vsock->pkt_queue);
 
 	return 0;
 }
@@ -121,20 +108,20 @@ static void vsock_loopback_work(struct work_struct *work)
 {
 	struct vsock_loopback *vsock =
 		container_of(work, struct vsock_loopback, pkt_work);
-	LIST_HEAD(pkts);
+	struct sk_buff_head pkts;
+
+	skb_queue_head_init(&pkts);
 
 	spin_lock_bh(&vsock->pkt_list_lock);
-	list_splice_init(&vsock->pkt_list, &pkts);
+	skb_queue_splice_init(&vsock->pkt_queue, &pkts);
 	spin_unlock_bh(&vsock->pkt_list_lock);
 
-	while (!list_empty(&pkts)) {
-		struct virtio_vsock_pkt *pkt;
+	while (!skb_queue_empty(&pkts)) {
+		struct sk_buff *skb;
 
-		pkt = list_first_entry(&pkts, struct virtio_vsock_pkt, list);
-		list_del_init(&pkt->list);
-
-		virtio_transport_deliver_tap_pkt(pkt);
-		virtio_transport_recv_pkt(&loopback_transport, pkt);
+		skb = skb_dequeue(&pkts);
+		virtio_transport_deliver_tap_pkt(skb);
+		virtio_transport_recv_pkt(&loopback_transport, skb);
 	}
 }
 
@@ -148,7 +135,7 @@ static int __init vsock_loopback_init(void)
 		return -ENOMEM;
 
 	spin_lock_init(&vsock->pkt_list_lock);
-	INIT_LIST_HEAD(&vsock->pkt_list);
+	skb_queue_head_init(&vsock->pkt_queue);
 	INIT_WORK(&vsock->pkt_work, vsock_loopback_work);
 
 	ret = vsock_core_register(&loopback_transport.transport,
@@ -166,19 +153,13 @@ static int __init vsock_loopback_init(void)
 static void __exit vsock_loopback_exit(void)
 {
 	struct vsock_loopback *vsock = &the_vsock_loopback;
-	struct virtio_vsock_pkt *pkt;
 
 	vsock_core_unregister(&loopback_transport.transport);
 
 	flush_work(&vsock->pkt_work);
 
 	spin_lock_bh(&vsock->pkt_list_lock);
-	while (!list_empty(&vsock->pkt_list)) {
-		pkt = list_first_entry(&vsock->pkt_list,
-				       struct virtio_vsock_pkt, list);
-		list_del(&pkt->list);
-		virtio_transport_free_pkt(pkt);
-	}
+	skb_queue_purge(&vsock->pkt_queue);
 	spin_unlock_bh(&vsock->pkt_list_lock);
 
 	destroy_workqueue(vsock->workqueue);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
  2022-08-15 17:56 ` [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff Bobby Eshleman
@ 2022-08-15 17:56 ` Bobby Eshleman
  2022-08-15 20:01   ` kernel test robot
                     ` (4 more replies)
  2022-08-15 17:56 ` [PATCH 3/6] vsock: add netdev to vhost/virtio vsock Bobby Eshleman
                   ` (9 subsequent siblings)
  11 siblings, 5 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

This commit allows vsock implementations to return errors
to the socket layer other than -ENOMEM. One immediate effect
of this is that upon the sk_sndbuf threshold being reached -EAGAIN
will be returned and userspace may throttle appropriately.

Resultingly, a known issue with uperf is resolved[1].

Additionally, to preserve legacy behavior for non-virtio
implementations, hyperv/vmci force errors to be -ENOMEM so that behavior
is unchanged.

[1]: https://gitlab.com/vsock/vsock/-/issues/1

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 include/linux/virtio_vsock.h            | 3 +++
 net/vmw_vsock/af_vsock.c                | 3 ++-
 net/vmw_vsock/hyperv_transport.c        | 2 +-
 net/vmw_vsock/virtio_transport_common.c | 3 ---
 net/vmw_vsock/vmci_transport.c          | 9 ++++++++-
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 17ed01466875..9a37eddbb87a 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -8,6 +8,9 @@
 #include <net/sock.h>
 #include <net/af_vsock.h>
 
+/* Threshold for detecting small packets to copy */
+#define GOOD_COPY_LEN  128
+
 enum virtio_vsock_metadata_flags {
 	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
 	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index e348b2d09eac..1893f8aafa48 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1844,8 +1844,9 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
 			written = transport->stream_enqueue(vsk,
 					msg, len - total_written);
 		}
+
 		if (written < 0) {
-			err = -ENOMEM;
+			err = written;
 			goto out_err;
 		}
 
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index fd98229e3db3..e99aea571f6f 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -687,7 +687,7 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
 	if (bytes_written)
 		ret = bytes_written;
 	kfree(send_buf);
-	return ret;
+	return ret < 0 ? -ENOMEM : ret;
 }
 
 static s64 hvs_stream_has_data(struct vsock_sock *vsk)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 920578597bb9..d5780599fe93 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -23,9 +23,6 @@
 /* How long to wait for graceful shutdown of a connection */
 #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
 
-/* Threshold for detecting small packets to copy */
-#define GOOD_COPY_LEN  128
-
 static const struct virtio_transport *
 virtio_transport_get_ops(struct vsock_sock *vsk)
 {
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index b14f0ed7427b..c927a90dc859 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1838,7 +1838,14 @@ static ssize_t vmci_transport_stream_enqueue(
 	struct msghdr *msg,
 	size_t len)
 {
-	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
+	int err;
+
+	err = vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
+
+	if (err < 0)
+		err = -ENOMEM;
+
+	return err;
 }
 
 static s64 vmci_transport_stream_has_data(struct vsock_sock *vsk)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
  2022-08-15 17:56 ` [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff Bobby Eshleman
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
@ 2022-08-15 17:56 ` Bobby Eshleman
  2022-08-16  2:31   ` Bobby Eshleman
                     ` (2 more replies)
  2022-08-15 17:56 ` [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit Bobby Eshleman
                   ` (8 subsequent siblings)
  11 siblings, 3 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

In order to support usage of qdisc on vsock traffic, this commit
introduces a struct net_device to vhost and virtio vsock.

Two new devices are created, vhost-vsock for vhost and virtio-vsock
for virtio. The devices are attached to the respective transports.

To bypass the usage of the device, the user may "down" the associated
network interface using common tools. For example, "ip link set dev
virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
simply using the FIFO logic of the prior implementation.

For both hosts and guests, there is one device for all G2H vsock sockets
and one device for all H2G vsock sockets. This makes sense for guests
because the driver only supports a single vsock channel (one pair of
TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
seem ideal for some workloads. However, it is possible to use a
multi-queue qdisc, where a given queue is responsible for a range of
sockets. This seems to be a better solution than having one device per
socket, which may yield a very large number of devices and qdiscs, all
of which are dynamically being created and destroyed. Because of this
dynamism, it would also require a complex policy management daemon, as
devices would constantly be spun up and down as sockets were created and
destroyed. To avoid this, one device and qdisc also applies to all H2G
sockets.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   |  19 +++-
 include/linux/virtio_vsock.h            |  10 +++
 net/vmw_vsock/virtio_transport.c        |  19 +++-
 net/vmw_vsock/virtio_transport_common.c | 112 +++++++++++++++++++++++-
 4 files changed, 152 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index f8601d93d94d..b20ddec2664b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -927,13 +927,30 @@ static int __init vhost_vsock_init(void)
 				  VSOCK_TRANSPORT_F_H2G);
 	if (ret < 0)
 		return ret;
-	return misc_register(&vhost_vsock_misc);
+
+	ret = virtio_transport_init(&vhost_transport, "vhost-vsock");
+	if (ret < 0)
+		goto out_unregister;
+
+	ret = misc_register(&vhost_vsock_misc);
+	if (ret < 0)
+		goto out_transport_exit;
+	return ret;
+
+out_transport_exit:
+	virtio_transport_exit(&vhost_transport);
+
+out_unregister:
+	vsock_core_unregister(&vhost_transport.transport);
+	return ret;
+
 };
 
 static void __exit vhost_vsock_exit(void)
 {
 	misc_deregister(&vhost_vsock_misc);
 	vsock_core_unregister(&vhost_transport.transport);
+	virtio_transport_exit(&vhost_transport);
 };
 
 module_init(vhost_vsock_init);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 9a37eddbb87a..5d7e7fbd75f8 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -91,10 +91,20 @@ struct virtio_transport {
 	/* This must be the first field */
 	struct vsock_transport transport;
 
+	/* Used almost exclusively for qdisc */
+	struct net_device *dev;
+
 	/* Takes ownership of the packet */
 	int (*send_pkt)(struct sk_buff *skb);
 };
 
+int
+virtio_transport_init(struct virtio_transport *t,
+		      const char *name);
+
+void
+virtio_transport_exit(struct virtio_transport *t);
+
 ssize_t
 virtio_transport_stream_dequeue(struct vsock_sock *vsk,
 				struct msghdr *msg,
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 3bb293fd8607..c6212eb38d3c 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -131,7 +131,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 		 * the vq
 		 */
 		if (ret < 0) {
-			skb_queue_head(&vsock->send_pkt_queue, skb);
+			spin_lock_bh(&vsock->send_pkt_queue.lock);
+			__skb_queue_head(&vsock->send_pkt_queue, skb);
+			spin_unlock_bh(&vsock->send_pkt_queue.lock);
 			break;
 		}
 
@@ -676,7 +678,9 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
 		kfree_skb(skb);
 	mutex_unlock(&vsock->tx_lock);
 
-	skb_queue_purge(&vsock->send_pkt_queue);
+	spin_lock_bh(&vsock->send_pkt_queue.lock);
+	__skb_queue_purge(&vsock->send_pkt_queue);
+	spin_unlock_bh(&vsock->send_pkt_queue.lock);
 
 	/* Delete virtqueues and flush outstanding callbacks if any */
 	vdev->config->del_vqs(vdev);
@@ -760,6 +764,8 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
 	flush_work(&vsock->event_work);
 	flush_work(&vsock->send_pkt_work);
 
+	virtio_transport_exit(&virtio_transport);
+
 	mutex_unlock(&the_virtio_vsock_mutex);
 
 	kfree(vsock);
@@ -844,12 +850,18 @@ static int __init virtio_vsock_init(void)
 	if (ret)
 		goto out_wq;
 
-	ret = register_virtio_driver(&virtio_vsock_driver);
+	ret = virtio_transport_init(&virtio_transport, "virtio-vsock");
 	if (ret)
 		goto out_vci;
 
+	ret = register_virtio_driver(&virtio_vsock_driver);
+	if (ret)
+		goto out_transport;
+
 	return 0;
 
+out_transport:
+	virtio_transport_exit(&virtio_transport);
 out_vci:
 	vsock_core_unregister(&virtio_transport.transport);
 out_wq:
@@ -861,6 +873,7 @@ static void __exit virtio_vsock_exit(void)
 {
 	unregister_virtio_driver(&virtio_vsock_driver);
 	vsock_core_unregister(&virtio_transport.transport);
+	virtio_transport_exit(&virtio_transport);
 	destroy_workqueue(virtio_vsock_workqueue);
 }
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index d5780599fe93..bdf16fff054f 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -16,6 +16,7 @@
 
 #include <net/sock.h>
 #include <net/af_vsock.h>
+#include <net/pkt_sched.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/vsock_virtio_transport_common.h>
@@ -23,6 +24,93 @@
 /* How long to wait for graceful shutdown of a connection */
 #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
 
+struct virtio_transport_priv {
+	struct virtio_transport *trans;
+};
+
+static netdev_tx_t virtio_transport_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtio_transport *t =
+		((struct virtio_transport_priv *)netdev_priv(dev))->trans;
+	int ret;
+
+	ret = t->send_pkt(skb);
+	if (unlikely(ret == -ENODEV))
+		return NETDEV_TX_BUSY;
+
+	return NETDEV_TX_OK;
+}
+
+const struct net_device_ops virtio_transport_netdev_ops = {
+	.ndo_start_xmit = virtio_transport_start_xmit,
+};
+
+static void virtio_transport_setup(struct net_device *dev)
+{
+	dev->netdev_ops = &virtio_transport_netdev_ops;
+	dev->needs_free_netdev = true;
+	dev->flags = IFF_NOARP;
+	dev->mtu = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+	dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
+}
+
+static int ifup(struct net_device *dev)
+{
+	int ret;
+
+	rtnl_lock();
+	ret = dev_open(dev, NULL) ? -ENOMEM : 0;
+	rtnl_unlock();
+
+	return ret;
+}
+
+/* virtio_transport_init - initialize a virtio vsock transport layer
+ *
+ * @t: ptr to the virtio transport struct to initialize
+ * @name: the name of the net_device to be created.
+ *
+ * Return 0 on success, otherwise negative errno.
+ */
+int virtio_transport_init(struct virtio_transport *t, const char *name)
+{
+	struct virtio_transport_priv *priv;
+	int ret;
+
+	t->dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN, virtio_transport_setup);
+	if (!t->dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(t->dev);
+	priv->trans = t;
+
+	ret = register_netdev(t->dev);
+	if (ret < 0)
+		goto out_free_netdev;
+
+	ret = ifup(t->dev);
+	if (ret < 0)
+		goto out_unregister_netdev;
+
+	return 0;
+
+out_unregister_netdev:
+	unregister_netdev(t->dev);
+
+out_free_netdev:
+	free_netdev(t->dev);
+
+	return ret;
+}
+
+void virtio_transport_exit(struct virtio_transport *t)
+{
+	if (t->dev) {
+		unregister_netdev(t->dev);
+		free_netdev(t->dev);
+	}
+}
+
 static const struct virtio_transport *
 virtio_transport_get_ops(struct vsock_sock *vsk)
 {
@@ -147,6 +235,24 @@ static u16 virtio_transport_get_type(struct sock *sk)
 		return VIRTIO_VSOCK_TYPE_SEQPACKET;
 }
 
+/* Return pkt->len on success, otherwise negative errno */
+static int virtio_transport_send_pkt(const struct virtio_transport *t, struct sk_buff *skb)
+{
+	int ret;
+	int len = skb->len;
+
+	if (unlikely(!t->dev || !(t->dev->flags & IFF_UP)))
+		return t->send_pkt(skb);
+
+	skb->dev = t->dev;
+	ret = dev_queue_xmit(skb);
+
+	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN))
+		return len;
+
+	return -ENOMEM;
+}
+
 /* This function can only be used on connecting/connected sockets,
  * since a socket assigned to a transport is required.
  *
@@ -202,9 +308,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 
 	virtio_transport_inc_tx_pkt(vvs, skb);
 
-	err = t_ops->send_pkt(skb);
-
-	return err < 0 ? -ENOMEM : err;
+	return virtio_transport_send_pkt(t_ops, skb);
 }
 
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
@@ -834,7 +938,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 		return -ENOTCONN;
 	}
 
-	return t->send_pkt(reply);
+	return virtio_transport_send_pkt(t, reply);
 }
 
 /* This function should be called with sk_lock held and SOCK_DONE set */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (2 preceding siblings ...)
  2022-08-15 17:56 ` [PATCH 3/6] vsock: add netdev to vhost/virtio vsock Bobby Eshleman
@ 2022-08-15 17:56 ` Bobby Eshleman
  2022-08-16  2:31   ` Bobby Eshleman
  2022-09-26 13:17   ` Stefano Garzarella
  2022-08-15 17:56 ` [PATCH 5/6] virtio/vsock: add support for dgram Bobby Eshleman
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

This commit adds a feature bit for virtio vsock to support datagrams.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c             | 3 ++-
 include/uapi/linux/virtio_vsock.h | 1 +
 net/vmw_vsock/virtio_transport.c  | 8 ++++++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index b20ddec2664b..a5d1bdb786fe 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -32,7 +32,8 @@
 enum {
 	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
 			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
-			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
+			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
+			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
 };
 
 enum {
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..857df3a3a70d 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -40,6 +40,7 @@
 
 /* The feature bitmap for virtio vsock */
 #define VIRTIO_VSOCK_F_SEQPACKET	1	/* SOCK_SEQPACKET supported */
+#define VIRTIO_VSOCK_F_DGRAM		2	/* Host support dgram vsock */
 
 struct virtio_vsock_config {
 	__le64 guest_cid;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c6212eb38d3c..073314312683 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
 struct virtio_vsock {
 	struct virtio_device *vdev;
 	struct virtqueue *vqs[VSOCK_VQ_MAX];
+	bool has_dgram;
 
 	/* Virtqueue processing is deferred to a workqueue */
 	struct work_struct tx_work;
@@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 	}
 
 	vsock->vdev = vdev;
-
 	vsock->rx_buf_nr = 0;
 	vsock->rx_buf_max_nr = 0;
 	atomic_set(&vsock->queued_replies, 0);
@@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
 		vsock->seqpacket_allow = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
+		vsock->has_dgram = true;
+
 	vdev->priv = vsock;
 
 	ret = virtio_vsock_vqs_init(vsock);
@@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
 };
 
 static unsigned int features[] = {
-	VIRTIO_VSOCK_F_SEQPACKET
+	VIRTIO_VSOCK_F_SEQPACKET,
+	VIRTIO_VSOCK_F_DGRAM
 };
 
 static struct virtio_driver virtio_vsock_driver = {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (3 preceding siblings ...)
  2022-08-15 17:56 ` [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit Bobby Eshleman
@ 2022-08-15 17:56 ` Bobby Eshleman
  2022-08-15 21:02   ` kernel test robot
  2022-08-16  2:32   ` Bobby Eshleman
  2022-08-15 17:56 ` [PATCH 6/6] vsock_test: add tests for vsock dgram Bobby Eshleman
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

This patch supports dgram in virtio and on the vhost side.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
 drivers/vhost/vsock.c                   |   2 +-
 include/net/af_vsock.h                  |   2 +
 include/uapi/linux/virtio_vsock.h       |   1 +
 net/vmw_vsock/af_vsock.c                |  26 +++-
 net/vmw_vsock/virtio_transport.c        |   2 +-
 net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
 6 files changed, 186 insertions(+), 20 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index a5d1bdb786fe..3dc72a5647ca 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
 	int ret;
 
 	ret = vsock_core_register(&vhost_transport.transport,
-				  VSOCK_TRANSPORT_F_H2G);
+				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
 	if (ret < 0)
 		return ret;
 
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 1c53c4c4d88f..37e55c81e4df 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -78,6 +78,8 @@ struct vsock_sock {
 s64 vsock_stream_has_data(struct vsock_sock *vsk);
 s64 vsock_stream_has_space(struct vsock_sock *vsk);
 struct sock *vsock_create_connected(struct sock *parent);
+int vsock_bind_stream(struct vsock_sock *vsk,
+		      struct sockaddr_vm *addr);
 
 /**** TRANSPORT ****/
 
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 857df3a3a70d..0975b9c88292 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
 enum virtio_vsock_type {
 	VIRTIO_VSOCK_TYPE_STREAM = 1,
 	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
+	VIRTIO_VSOCK_TYPE_DGRAM = 3,
 };
 
 enum virtio_vsock_op {
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 1893f8aafa48..87e4ae1866d3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
 	return 0;
 }
 
+int vsock_bind_stream(struct vsock_sock *vsk,
+		      struct sockaddr_vm *addr)
+{
+	int retval;
+
+	spin_lock_bh(&vsock_table_lock);
+	retval = __vsock_bind_connectible(vsk, addr);
+	spin_unlock_bh(&vsock_table_lock);
+
+	return retval;
+}
+EXPORT_SYMBOL(vsock_bind_stream);
+
 static int __vsock_bind_dgram(struct vsock_sock *vsk,
 			      struct sockaddr_vm *addr)
 {
@@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
 	}
 
 	if (features & VSOCK_TRANSPORT_F_DGRAM) {
-		if (t_dgram) {
-			err = -EBUSY;
-			goto err_busy;
+		/* TODO: always chose the G2H variant over others, support nesting later */
+		if (features & VSOCK_TRANSPORT_F_G2H) {
+			if (t_dgram)
+				pr_warn("virtio_vsock: t_dgram already set\n");
+			t_dgram = t;
+		}
+
+		if (!t_dgram) {
+			t_dgram = t;
 		}
-		t_dgram = t;
 	}
 
 	if (features & VSOCK_TRANSPORT_F_LOCAL) {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 073314312683..d4526ca462d2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
 		return -ENOMEM;
 
 	ret = vsock_core_register(&virtio_transport.transport,
-				  VSOCK_TRANSPORT_F_G2H);
+				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
 	if (ret)
 		goto out_wq;
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index bdf16fff054f..aedb48728677 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
 
 static u16 virtio_transport_get_type(struct sock *sk)
 {
-	if (sk->sk_type == SOCK_STREAM)
+	if (sk->sk_type == SOCK_DGRAM)
+		return VIRTIO_VSOCK_TYPE_DGRAM;
+	else if (sk->sk_type == SOCK_STREAM)
 		return VIRTIO_VSOCK_TYPE_STREAM;
 	else
 		return VIRTIO_VSOCK_TYPE_SEQPACKET;
@@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	vvs = vsk->trans;
 
 	/* we can send less than pkt_len bytes */
-	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
-		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
+			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+		else
+			return 0;
+	}
 
-	/* virtio_transport_get_credit might return less than pkt_len credit */
-	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
+	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
+		/* virtio_transport_get_credit might return less than pkt_len credit */
+		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
 
-	/* Do not send zero length OP_RW pkt */
-	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
-		return pkt_len;
+		/* Do not send zero length OP_RW pkt */
+		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+			return pkt_len;
+	}
 
 	skb = virtio_transport_alloc_skb(info, pkt_len,
 					 src_cid, src_port,
 					 dst_cid, dst_port,
 					 &err);
 	if (!skb) {
-		virtio_transport_put_credit(vvs, pkt_len);
+		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
+			virtio_transport_put_credit(vvs, pkt_len);
 		return err;
 	}
 
@@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
 }
 EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
 
+static ssize_t
+virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
+				  struct msghdr *msg, size_t len)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct sk_buff *skb;
+	size_t total = 0;
+	u32 free_space;
+	int err = -EFAULT;
+
+	spin_lock_bh(&vvs->rx_lock);
+	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
+		skb = __skb_dequeue(&vvs->rx_queue);
+
+		total = len;
+		if (total > skb->len - vsock_metadata(skb)->off)
+			total = skb->len - vsock_metadata(skb)->off;
+		else if (total < skb->len - vsock_metadata(skb)->off)
+			msg->msg_flags |= MSG_TRUNC;
+
+		/* sk_lock is held by caller so no one else can dequeue.
+		 * Unlock rx_lock since memcpy_to_msg() may sleep.
+		 */
+		spin_unlock_bh(&vvs->rx_lock);
+
+		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
+		if (err)
+			return err;
+
+		spin_lock_bh(&vvs->rx_lock);
+
+		virtio_transport_dec_rx_pkt(vvs, skb);
+		consume_skb(skb);
+	}
+
+	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
+
+	spin_unlock_bh(&vvs->rx_lock);
+
+	if (total > 0 && msg->msg_name) {
+		/* Provide the address of the sender. */
+		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
+
+		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
+				le32_to_cpu(vsock_hdr(skb)->src_port));
+		msg->msg_namelen = sizeof(*vm_addr);
+	}
+	return total;
+}
+
+static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
+{
+	return virtio_transport_stream_has_data(vsk);
+}
+
 int
 virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
 				   struct msghdr *msg,
@@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
 			       struct msghdr *msg,
 			       size_t len, int flags)
 {
-	return -EOPNOTSUPP;
+	struct sock *sk;
+	size_t err = 0;
+	long timeout;
+
+	DEFINE_WAIT(wait);
+
+	sk = &vsk->sk;
+	err = 0;
+
+	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
+		return -EOPNOTSUPP;
+
+	lock_sock(sk);
+
+	if (!len)
+		goto out;
+
+	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+	while (1) {
+		s64 ready;
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		ready = virtio_transport_dgram_has_data(vsk);
+
+		if (ready == 0) {
+			if (timeout == 0) {
+				err = -EAGAIN;
+				finish_wait(sk_sleep(sk), &wait);
+				break;
+			}
+
+			release_sock(sk);
+			timeout = schedule_timeout(timeout);
+			lock_sock(sk);
+
+			if (signal_pending(current)) {
+				err = sock_intr_errno(timeout);
+				finish_wait(sk_sleep(sk), &wait);
+				break;
+			} else if (timeout == 0) {
+				err = -EAGAIN;
+				finish_wait(sk_sleep(sk), &wait);
+				break;
+			}
+		} else {
+			finish_wait(sk_sleep(sk), &wait);
+
+			if (ready < 0) {
+				err = -ENOMEM;
+				goto out;
+			}
+
+			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
+			break;
+		}
+	}
+out:
+	release_sock(sk);
+	return err;
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
 
@@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
 int virtio_transport_dgram_bind(struct vsock_sock *vsk,
 				struct sockaddr_vm *addr)
 {
-	return -EOPNOTSUPP;
+	return vsock_bind_stream(vsk, addr);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
 
 bool virtio_transport_dgram_allow(u32 cid, u32 port)
 {
-	return false;
+	return true;
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
 
@@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
 			       struct msghdr *msg,
 			       size_t dgram_len)
 {
-	return -EOPNOTSUPP;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.msg = msg,
+		.pkt_len = dgram_len,
+		.vsk = vsk,
+		.remote_cid = remote_addr->svm_cid,
+		.remote_port = remote_addr->svm_port,
+	};
+
+	return virtio_transport_send_pkt_info(vsk, &info);
 }
 EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
 
@@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
 	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
 	int err = 0;
 
+	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
+		virtio_transport_recv_enqueue(vsk, skb);
+		sk->sk_data_ready(sk);
+		return err;
+	}
+
 	switch (le16_to_cpu(hdr->op)) {
 	case VIRTIO_VSOCK_OP_RW:
 		virtio_transport_recv_enqueue(vsk, skb);
@@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 static bool virtio_transport_valid_type(u16 type)
 {
 	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
-	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
+	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
+	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
 }
 
 /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
@@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 		goto free_pkt;
 	}
 
+	if (sk->sk_type == SOCK_DGRAM) {
+		virtio_transport_recv_connected(sk, skb);
+		goto out;
+	}
+
 	space_available = virtio_transport_space_update(sk, skb);
 
 	/* Update CID in case it has changed after a transport reset event */
@@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 		break;
 	}
 
+out:
 	release_sock(sk);
 
 	/* Release refcnt obtained when we fetched this socket out of the
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 6/6] vsock_test: add tests for vsock dgram
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (4 preceding siblings ...)
  2022-08-15 17:56 ` [PATCH 5/6] virtio/vsock: add support for dgram Bobby Eshleman
@ 2022-08-15 17:56 ` Bobby Eshleman
  2022-08-16  2:32   ` Bobby Eshleman
  2022-08-15 20:39 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Michael S. Tsirkin
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-15 17:56 UTC (permalink / raw)
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefano Garzarella, virtualization, netdev, linux-kernel

From: Jiang Wang <jiang.wang@bytedance.com>

Added test cases for vsock dgram types.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
---
 tools/testing/vsock/util.c       | 105 +++++++++++++++++
 tools/testing/vsock/util.h       |   4 +
 tools/testing/vsock/vsock_test.c | 195 +++++++++++++++++++++++++++++++
 3 files changed, 304 insertions(+)

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 2acbb7703c6a..d2f5b223bf85 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -260,6 +260,57 @@ void send_byte(int fd, int expected_ret, int flags)
 	}
 }
 
+/* Transmit one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
+				int flags)
+{
+	const uint8_t byte = 'A';
+	ssize_t nwritten;
+
+	timeout_begin(TIMEOUT);
+	do {
+		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
+						len);
+		timeout_check("write");
+	} while (nwritten < 0 && errno == EINTR);
+	timeout_end();
+
+	if (expected_ret < 0) {
+		if (nwritten != -1) {
+			fprintf(stderr, "bogus sendto(2) return value %zd\n",
+				nwritten);
+			exit(EXIT_FAILURE);
+		}
+		if (errno != -expected_ret) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+
+	if (nwritten < 0) {
+		perror("write");
+		exit(EXIT_FAILURE);
+	}
+	if (nwritten == 0) {
+		if (expected_ret == 0)
+			return;
+
+		fprintf(stderr, "unexpected EOF while sending byte\n");
+		exit(EXIT_FAILURE);
+	}
+	if (nwritten != sizeof(byte)) {
+		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
+		exit(EXIT_FAILURE);
+	}
+}
+
 /* Receive one byte and check the return value.
  *
  * expected_ret:
@@ -313,6 +364,60 @@ void recv_byte(int fd, int expected_ret, int flags)
 	}
 }
 
+/* Receive one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
+				int expected_ret, int flags)
+{
+	uint8_t byte;
+	ssize_t nread;
+
+	timeout_begin(TIMEOUT);
+	do {
+		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
+		timeout_check("read");
+	} while (nread < 0 && errno == EINTR);
+	timeout_end();
+
+	if (expected_ret < 0) {
+		if (nread != -1) {
+			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
+				nread);
+			exit(EXIT_FAILURE);
+		}
+		if (errno != -expected_ret) {
+			perror("read");
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+
+	if (nread < 0) {
+		perror("read");
+		exit(EXIT_FAILURE);
+	}
+	if (nread == 0) {
+		if (expected_ret == 0)
+			return;
+
+		fprintf(stderr, "unexpected EOF while receiving byte\n");
+		exit(EXIT_FAILURE);
+	}
+	if (nread != sizeof(byte)) {
+		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
+		exit(EXIT_FAILURE);
+	}
+	if (byte != 'A') {
+		fprintf(stderr, "unexpected byte read %c\n", byte);
+		exit(EXIT_FAILURE);
+	}
+}
+
 /* Run test cases.  The program terminates if a failure occurs. */
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts)
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index a3375ad2fb7f..7213f2a51c1e 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -43,7 +43,11 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
 			   struct sockaddr_vm *clientaddrp);
 void vsock_wait_remote_close(int fd);
 void send_byte(int fd, int expected_ret, int flags);
+void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
+				int flags);
 void recv_byte(int fd, int expected_ret, int flags);
+void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
+				int expected_ret, int flags);
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts);
 void list_tests(const struct test_case *test_cases);
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index dc577461afc2..640379f1b462 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -201,6 +201,115 @@ static void test_stream_server_close_server(const struct test_opts *opts)
 	close(fd);
 }
 
+static void test_dgram_sendto_client(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+	int fd;
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_sendto_server(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	int fd;
+	int len = sizeof(addr.sa);
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Notify the client that the server is ready */
+	control_writeln("BIND");
+
+	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
+	printf("got message from cid:%d, port %u ", addr.svm.svm_cid,
+			addr.svm.svm_port);
+
+	/* Wait for the client to finish */
+	control_expectln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_connect_client(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+	int fd;
+	int ret;
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = connect(fd, &addr.sa, sizeof(addr.svm));
+	if (ret < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	send_byte(fd, 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	close(fd);
+}
+
+static void test_dgram_connect_server(const struct test_opts *opts)
+{
+	test_dgram_sendto_server(opts);
+}
+
 /* With the standard socket sizes, VMCI is able to support about 100
  * concurrent stream connections.
  */
@@ -254,6 +363,77 @@ static void test_stream_multiconn_server(const struct test_opts *opts)
 		close(fds[i]);
 }
 
+static void test_dgram_multiconn_client(const struct test_opts *opts)
+{
+	int fds[MULTICONN_NFDS];
+	int i;
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+
+	/* Wait for the server to be ready */
+	control_expectln("BIND");
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
+		if (fds[i] < 0) {
+			perror("socket");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
+
+	/* Notify the server that the client has finished */
+	control_writeln("DONE");
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		close(fds[i]);
+}
+
+static void test_dgram_multiconn_server(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	int fd;
+	int len = sizeof(addr.sa);
+	int i;
+
+	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Notify the client that the server is ready */
+	control_writeln("BIND");
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
+
+	/* Wait for the client to finish */
+	control_expectln("DONE");
+
+	close(fd);
+}
+
 static void test_stream_msg_peek_client(const struct test_opts *opts)
 {
 	int fd;
@@ -646,6 +826,21 @@ static struct test_case test_cases[] = {
 		.run_client = test_seqpacket_invalid_rec_buffer_client,
 		.run_server = test_seqpacket_invalid_rec_buffer_server,
 	},
+	{
+		.name = "SOCK_DGRAM client close",
+		.run_client = test_dgram_sendto_client,
+		.run_server = test_dgram_sendto_server,
+	},
+	{
+		.name = "SOCK_DGRAM client connect",
+		.run_client = test_dgram_connect_client,
+		.run_server = test_dgram_connect_server,
+	},
+	{
+		.name = "SOCK_DGRAM multiple connections",
+		.run_client = test_dgram_multiconn_client,
+		.run_server = test_dgram_multiconn_server,
+	},
 	{},
 };
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
@ 2022-08-15 20:01   ` kernel test robot
  2022-08-15 23:13   ` kernel test robot
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2022-08-15 20:01 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: kbuild-all, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

Hi Bobby,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.0-rc1 next-20220815]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20220816/202208160300.HYFUSTbF-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/68c9c8216a573cdfe2170cad677854e2f4a34634
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
        git checkout 68c9c8216a573cdfe2170cad677854e2f4a34634
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash net/vmw_vsock/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/vmw_vsock/virtio_transport.c:178: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * Merge the two most recent skbs together if possible.


vim +178 net/vmw_vsock/virtio_transport.c

93afaf2cdefaa9 Bobby Eshleman 2022-08-15  176  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  177  /**
93afaf2cdefaa9 Bobby Eshleman 2022-08-15 @178   * Merge the two most recent skbs together if possible.
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  179   *
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  180   * Caller must hold the queue lock.
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  181   */
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  182  static void
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  183  virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  184  {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  185  	struct sk_buff *old;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  186  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  187  	spin_lock_bh(&queue->lock);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  188  	/* In order to reduce skb memory overhead, we merge new packets with
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  189  	 * older packets if they pass virtio_transport_skbs_can_merge().
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  190  	 */
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  191  	if (skb_queue_empty_lockless(queue)) {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  192  		__skb_queue_tail(queue, new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  193  		goto out;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  194  	}
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  195  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  196  	old = skb_peek_tail(queue);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  197  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  198  	if (!virtio_transport_skbs_can_merge(old, new)) {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  199  		__skb_queue_tail(queue, new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  200  		goto out;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  201  	}
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  202  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  203  	memcpy(skb_put(old, new->len), new->data, new->len);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  204  	vsock_hdr(old)->len = cpu_to_le32(old->len);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  205  	vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  206  	vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  207  	dev_kfree_skb_any(new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  208  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  209  out:
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  210  	spin_unlock_bh(&queue->lock);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  211  }
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  212  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (5 preceding siblings ...)
  2022-08-15 17:56 ` [PATCH 6/6] vsock_test: add tests for vsock dgram Bobby Eshleman
@ 2022-08-15 20:39 ` Michael S. Tsirkin
  2022-08-16  1:55   ` Bobby Eshleman
  2022-08-16  2:29 ` Bobby Eshleman
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-08-15 20:39 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> Hey everybody,
> 
> This series introduces datagrams, packet scheduling, and sk_buff usage
> to virtio vsock.
> 
> The usage of struct sk_buff benefits users by a) preparing vsock to use
> other related systems that require sk_buff, such as sockmap and qdisc,
> b) supporting basic congestion control via sock_alloc_send_skb, and c)
> reducing copying when delivering packets to TAP.
> 
> The socket layer no longer forces errors to be -ENOMEM, as typically
> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> messages are being sent with option MSG_DONTWAIT.
> 
> The datagram work is based off previous patches by Jiang Wang[1].
> 
> The introduction of datagrams creates a transport layer fairness issue
> where datagrams may freely starve streams of queue access. This happens
> because, unlike streams, datagrams lack the transactions necessary for
> calculating credits and throttling.
> 
> Previous proposals introduce changes to the spec to add an additional
> virtqueue pair for datagrams[1]. Although this solution works, using
> Linux's qdisc for packet scheduling leverages already existing systems,
> avoids the need to change the virtio specification, and gives additional
> capabilities. The usage of SFQ or fq_codel, for example, may solve the
> transport layer starvation problem. It is easy to imagine other use
> cases as well. For example, services of varying importance may be
> assigned different priorities, and qdisc will apply appropriate
> priority-based scheduling. By default, the system default pfifo qdisc is
> used. The qdisc may be bypassed and legacy queuing is resumed by simply
> setting the virtio-vsock%d network device to state DOWN. This technique
> still allows vsock to work with zero-configuration.
> 
> In summary, this series introduces these major changes to vsock:
> 
> - virtio vsock supports datagrams
> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
>   - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
>     which applies the throttling threshold sk_sndbuf.
> - The vsock socket layer supports returning errors other than -ENOMEM.
>   - This is used to return -EAGAIN when the sk_sndbuf threshold is
>     reached.
> - virtio vsock uses a net_device, through which qdisc may be used.
>  - qdisc allows scheduling policies to be applied to vsock flows.
>   - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
>     it may avoid datagrams from flooding out stream flows. The benefit
>     to this is that additional virtqueues are not needed for datagrams.
>   - The net_device and qdisc is bypassed by simply setting the
>     net_device state to DOWN.
> 
> [1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/

Given this affects the driver/device interface I'd like to
ask you to please copy virtio-dev mailing list on these patches.
Subscriber only I'm afraid you will need to subscribe :(

> Bobby Eshleman (5):
>   vsock: replace virtio_vsock_pkt with sk_buff
>   vsock: return errors other than -ENOMEM to socket
>   vsock: add netdev to vhost/virtio vsock
>   virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>   virtio/vsock: add support for dgram
> 
> Jiang Wang (1):
>   vsock_test: add tests for vsock dgram
> 
>  drivers/vhost/vsock.c                   | 238 ++++----
>  include/linux/virtio_vsock.h            |  73 ++-
>  include/net/af_vsock.h                  |   2 +
>  include/uapi/linux/virtio_vsock.h       |   2 +
>  net/vmw_vsock/af_vsock.c                |  30 +-
>  net/vmw_vsock/hyperv_transport.c        |   2 +-
>  net/vmw_vsock/virtio_transport.c        | 237 +++++---
>  net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
>  net/vmw_vsock/vmci_transport.c          |   9 +-
>  net/vmw_vsock/vsock_loopback.c          |  51 +-
>  tools/testing/vsock/util.c              | 105 ++++
>  tools/testing/vsock/util.h              |   4 +
>  tools/testing/vsock/vsock_test.c        | 195 ++++++
>  13 files changed, 1176 insertions(+), 543 deletions(-)
> 
> -- 
> 2.35.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-15 17:56 ` [PATCH 5/6] virtio/vsock: add support for dgram Bobby Eshleman
@ 2022-08-15 21:02   ` kernel test robot
  2022-08-16  2:32   ` Bobby Eshleman
  1 sibling, 0 replies; 67+ messages in thread
From: kernel test robot @ 2022-08-15 21:02 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: kbuild-all, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm,
	virtualization, netdev, linux-kernel

Hi Bobby,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.0-rc1 next-20220815]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20220816/202208160405.cG02E3MZ-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/cbb332da78c86ac574688831ed6f404d04d506db
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
        git checkout cbb332da78c86ac574688831ed6f404d04d506db
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash net/vmw_vsock/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   net/vmw_vsock/virtio_transport_common.c: In function 'virtio_transport_dgram_do_dequeue':
>> net/vmw_vsock/virtio_transport_common.c:605:13: warning: variable 'free_space' set but not used [-Wunused-but-set-variable]
     605 |         u32 free_space;
         |             ^~~~~~~~~~


vim +/free_space +605 net/vmw_vsock/virtio_transport_common.c

   597	
   598	static ssize_t
   599	virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
   600					  struct msghdr *msg, size_t len)
   601	{
   602		struct virtio_vsock_sock *vvs = vsk->trans;
   603		struct sk_buff *skb;
   604		size_t total = 0;
 > 605		u32 free_space;
   606		int err = -EFAULT;
   607	
   608		spin_lock_bh(&vvs->rx_lock);
   609		if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
   610			skb = __skb_dequeue(&vvs->rx_queue);
   611	
   612			total = len;
   613			if (total > skb->len - vsock_metadata(skb)->off)
   614				total = skb->len - vsock_metadata(skb)->off;
   615			else if (total < skb->len - vsock_metadata(skb)->off)
   616				msg->msg_flags |= MSG_TRUNC;
   617	
   618			/* sk_lock is held by caller so no one else can dequeue.
   619			 * Unlock rx_lock since memcpy_to_msg() may sleep.
   620			 */
   621			spin_unlock_bh(&vvs->rx_lock);
   622	
   623			err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
   624			if (err)
   625				return err;
   626	
   627			spin_lock_bh(&vvs->rx_lock);
   628	
   629			virtio_transport_dec_rx_pkt(vvs, skb);
   630			consume_skb(skb);
   631		}
   632	
   633		free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
   634	
   635		spin_unlock_bh(&vvs->rx_lock);
   636	
   637		if (total > 0 && msg->msg_name) {
   638			/* Provide the address of the sender. */
   639			DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
   640	
   641			vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
   642					le32_to_cpu(vsock_hdr(skb)->src_port));
   643			msg->msg_namelen = sizeof(*vm_addr);
   644		}
   645		return total;
   646	}
   647	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
  2022-08-15 20:01   ` kernel test robot
@ 2022-08-15 23:13   ` kernel test robot
  2022-08-16  2:16   ` kernel test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2022-08-15 23:13 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: llvm, kbuild-all, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

Hi Bobby,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.0-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-randconfig-a014-20220815 (https://download.01.org/0day-ci/archive/20220816/202208160737.gXXFmPbY-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6afcc4a459ead8809a0d6d9b4bf7b64bcc13582b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/68c9c8216a573cdfe2170cad677854e2f4a34634
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
        git checkout 68c9c8216a573cdfe2170cad677854e2f4a34634
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash net/vmw_vsock/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/vmw_vsock/virtio_transport.c:178: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * Merge the two most recent skbs together if possible.


vim +178 net/vmw_vsock/virtio_transport.c

93afaf2cdefaa9 Bobby Eshleman 2022-08-15  176  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  177  /**
93afaf2cdefaa9 Bobby Eshleman 2022-08-15 @178   * Merge the two most recent skbs together if possible.
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  179   *
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  180   * Caller must hold the queue lock.
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  181   */
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  182  static void
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  183  virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  184  {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  185  	struct sk_buff *old;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  186  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  187  	spin_lock_bh(&queue->lock);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  188  	/* In order to reduce skb memory overhead, we merge new packets with
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  189  	 * older packets if they pass virtio_transport_skbs_can_merge().
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  190  	 */
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  191  	if (skb_queue_empty_lockless(queue)) {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  192  		__skb_queue_tail(queue, new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  193  		goto out;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  194  	}
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  195  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  196  	old = skb_peek_tail(queue);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  197  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  198  	if (!virtio_transport_skbs_can_merge(old, new)) {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  199  		__skb_queue_tail(queue, new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  200  		goto out;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  201  	}
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  202  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  203  	memcpy(skb_put(old, new->len), new->data, new->len);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  204  	vsock_hdr(old)->len = cpu_to_le32(old->len);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  205  	vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  206  	vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  207  	dev_kfree_skb_any(new);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  208  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  209  out:
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  210  	spin_unlock_bh(&queue->lock);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  211  }
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  212  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-15 20:39 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Michael S. Tsirkin
@ 2022-08-16  1:55   ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  1:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, Wei Liu, Cong Wang, Stephen Hemminger,
	Bobby Eshleman, Jiang Wang, Dexuan Cui, Haiyang Zhang,
	linux-kernel, virtualization, Eric Dumazet, netdev,
	Stefan Hajnoczi, kvm, Jakub Kicinski, Paolo Abeni, linux-hyperv,
	David S. Miller

On Mon, Aug 15, 2022 at 04:39:08PM -0400, Michael S. Tsirkin wrote:
> 
> Given this affects the driver/device interface I'd like to
> ask you to please copy virtio-dev mailing list on these patches.
> Subscriber only I'm afraid you will need to subscribe :(
> 

Ah makes sense, will do!

Best,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
  2022-08-15 20:01   ` kernel test robot
  2022-08-15 23:13   ` kernel test robot
@ 2022-08-16  2:16   ` kernel test robot
  2022-08-16  2:30   ` Bobby Eshleman
  2022-09-26 13:21   ` Stefano Garzarella
  4 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2022-08-16  2:16 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: kbuild-all, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

Hi Bobby,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.0-rc1 next-20220815]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-randconfig-s001-20220815 (https://download.01.org/0day-ci/archive/20220816/202208161059.GPIlPpvd-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-5) 11.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/68c9c8216a573cdfe2170cad677854e2f4a34634
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Bobby-Eshleman/virtio-vsock-introduce-dgrams-sk_buff-and-qdisc/20220816-015812
        git checkout 68c9c8216a573cdfe2170cad677854e2f4a34634
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash fs/nilfs2/ net/vmw_vsock/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> net/vmw_vsock/virtio_transport.c:173:31: sparse: sparse: restricted __le16 degrades to integer
   net/vmw_vsock/virtio_transport.c:174:31: sparse: sparse: restricted __le16 degrades to integer

vim +173 net/vmw_vsock/virtio_transport.c

0ea9e1d3a9e3ef Asias He       2016-07-28  161  
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  162  static inline bool
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  163  virtio_transport_skbs_can_merge(struct sk_buff *old, struct sk_buff *new)
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  164  {
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  165  	return (new->len < GOOD_COPY_LEN &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  166  		skb_tailroom(old) >= new->len &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  167  		vsock_hdr(new)->src_cid == vsock_hdr(old)->src_cid &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  168  		vsock_hdr(new)->dst_cid == vsock_hdr(old)->dst_cid &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  169  		vsock_hdr(new)->src_port == vsock_hdr(old)->src_port &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  170  		vsock_hdr(new)->dst_port == vsock_hdr(old)->dst_port &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  171  		vsock_hdr(new)->type == vsock_hdr(old)->type &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  172  		vsock_hdr(new)->flags == vsock_hdr(old)->flags &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15 @173  		vsock_hdr(old)->op == VIRTIO_VSOCK_OP_RW &&
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  174  		vsock_hdr(new)->op == VIRTIO_VSOCK_OP_RW);
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  175  }
93afaf2cdefaa9 Bobby Eshleman 2022-08-15  176  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (6 preceding siblings ...)
  2022-08-15 20:39 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Michael S. Tsirkin
@ 2022-08-16  2:29 ` Bobby Eshleman
       [not found] ` <CAGxU2F7+L-UiNPtUm4EukOgTVJ1J6Orqs1LMvhWWvfL9zWb23g@mail.gmail.com>
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:29 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> Hey everybody,
> 
> This series introduces datagrams, packet scheduling, and sk_buff usage
> to virtio vsock.
> 
> The usage of struct sk_buff benefits users by a) preparing vsock to use
> other related systems that require sk_buff, such as sockmap and qdisc,
> b) supporting basic congestion control via sock_alloc_send_skb, and c)
> reducing copying when delivering packets to TAP.
> 
> The socket layer no longer forces errors to be -ENOMEM, as typically
> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> messages are being sent with option MSG_DONTWAIT.
> 
> The datagram work is based off previous patches by Jiang Wang[1].
> 
> The introduction of datagrams creates a transport layer fairness issue
> where datagrams may freely starve streams of queue access. This happens
> because, unlike streams, datagrams lack the transactions necessary for
> calculating credits and throttling.
> 
> Previous proposals introduce changes to the spec to add an additional
> virtqueue pair for datagrams[1]. Although this solution works, using
> Linux's qdisc for packet scheduling leverages already existing systems,
> avoids the need to change the virtio specification, and gives additional
> capabilities. The usage of SFQ or fq_codel, for example, may solve the
> transport layer starvation problem. It is easy to imagine other use
> cases as well. For example, services of varying importance may be
> assigned different priorities, and qdisc will apply appropriate
> priority-based scheduling. By default, the system default pfifo qdisc is
> used. The qdisc may be bypassed and legacy queuing is resumed by simply
> setting the virtio-vsock%d network device to state DOWN. This technique
> still allows vsock to work with zero-configuration.
> 
> In summary, this series introduces these major changes to vsock:
> 
> - virtio vsock supports datagrams
> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
>   - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
>     which applies the throttling threshold sk_sndbuf.
> - The vsock socket layer supports returning errors other than -ENOMEM.
>   - This is used to return -EAGAIN when the sk_sndbuf threshold is
>     reached.
> - virtio vsock uses a net_device, through which qdisc may be used.
>  - qdisc allows scheduling policies to be applied to vsock flows.
>   - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
>     it may avoid datagrams from flooding out stream flows. The benefit
>     to this is that additional virtqueues are not needed for datagrams.
>   - The net_device and qdisc is bypassed by simply setting the
>     net_device state to DOWN.
> 
> [1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/
> 
> Bobby Eshleman (5):
>   vsock: replace virtio_vsock_pkt with sk_buff
>   vsock: return errors other than -ENOMEM to socket
>   vsock: add netdev to vhost/virtio vsock
>   virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>   virtio/vsock: add support for dgram
> 
> Jiang Wang (1):
>   vsock_test: add tests for vsock dgram
> 
>  drivers/vhost/vsock.c                   | 238 ++++----
>  include/linux/virtio_vsock.h            |  73 ++-
>  include/net/af_vsock.h                  |   2 +
>  include/uapi/linux/virtio_vsock.h       |   2 +
>  net/vmw_vsock/af_vsock.c                |  30 +-
>  net/vmw_vsock/hyperv_transport.c        |   2 +-
>  net/vmw_vsock/virtio_transport.c        | 237 +++++---
>  net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
>  net/vmw_vsock/vmci_transport.c          |   9 +-
>  net/vmw_vsock/vsock_loopback.c          |  51 +-
>  tools/testing/vsock/util.c              | 105 ++++
>  tools/testing/vsock/util.h              |   4 +
>  tools/testing/vsock/vsock_test.c        | 195 ++++++
>  13 files changed, 1176 insertions(+), 543 deletions(-)
> 
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff
  2022-08-15 17:56 ` [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff Bobby Eshleman
@ 2022-08-16  2:30   ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:30 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:04AM -0700, Bobby Eshleman wrote:
> This patch replaces virtio_vsock_pkt with sk_buff.
> 
> The benefit of this series includes:
> 
> * The bug reported @ https://bugzilla.redhat.com/show_bug.cgi?id=2009935
>   does not present itself when reasonable sk_sndbuf thresholds are set.
> * Using sock_alloc_send_skb() teaches VSOCK to respect
>   sk_sndbuf for tunability.
> * Eliminates copying for vsock_deliver_tap().
> * sk_buff is required for future improvements, such as using socket map.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   | 214 +++++------
>  include/linux/virtio_vsock.h            |  60 ++-
>  net/vmw_vsock/af_vsock.c                |   1 +
>  net/vmw_vsock/virtio_transport.c        | 212 +++++-----
>  net/vmw_vsock/virtio_transport_common.c | 491 ++++++++++++------------
>  net/vmw_vsock/vsock_loopback.c          |  51 +--
>  6 files changed, 517 insertions(+), 512 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 368330417bde..f8601d93d94d 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -51,8 +51,7 @@ struct vhost_vsock {
>  	struct hlist_node hash;
>  
>  	struct vhost_work send_pkt_work;
> -	spinlock_t send_pkt_list_lock;
> -	struct list_head send_pkt_list;	/* host->guest pending packets */
> +	struct sk_buff_head send_pkt_queue; /* host->guest pending packets */
>  
>  	atomic_t queued_replies;
>  
> @@ -108,7 +107,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  	vhost_disable_notify(&vsock->dev, vq);
>  
>  	do {
> -		struct virtio_vsock_pkt *pkt;
> +		struct sk_buff *skb;
> +		struct virtio_vsock_hdr *hdr;
>  		struct iov_iter iov_iter;
>  		unsigned out, in;
>  		size_t nbytes;
> @@ -116,31 +116,22 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  		int head;
>  		u32 flags_to_restore = 0;
>  
> -		spin_lock_bh(&vsock->send_pkt_list_lock);
> -		if (list_empty(&vsock->send_pkt_list)) {
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> +		skb = skb_dequeue(&vsock->send_pkt_queue);
> +
> +		if (!skb) {
>  			vhost_enable_notify(&vsock->dev, vq);
>  			break;
>  		}
>  
> -		pkt = list_first_entry(&vsock->send_pkt_list,
> -				       struct virtio_vsock_pkt, list);
> -		list_del_init(&pkt->list);
> -		spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
>  		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>  					 &out, &in, NULL, NULL);
>  		if (head < 0) {
> -			spin_lock_bh(&vsock->send_pkt_list_lock);
> -			list_add(&pkt->list, &vsock->send_pkt_list);
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> +			skb_queue_head(&vsock->send_pkt_queue, skb);
>  			break;
>  		}
>  
>  		if (head == vq->num) {
> -			spin_lock_bh(&vsock->send_pkt_list_lock);
> -			list_add(&pkt->list, &vsock->send_pkt_list);
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> +			skb_queue_head(&vsock->send_pkt_queue, skb);
>  
>  			/* We cannot finish yet if more buffers snuck in while
>  			 * re-enabling notify.
> @@ -153,26 +144,27 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  		}
>  
>  		if (out) {
> -			virtio_transport_free_pkt(pkt);
> +			kfree_skb(skb);
>  			vq_err(vq, "Expected 0 output buffers, got %u\n", out);
>  			break;
>  		}
>  
>  		iov_len = iov_length(&vq->iov[out], in);
> -		if (iov_len < sizeof(pkt->hdr)) {
> -			virtio_transport_free_pkt(pkt);
> +		if (iov_len < sizeof(*hdr)) {
> +			kfree_skb(skb);
>  			vq_err(vq, "Buffer len [%zu] too small\n", iov_len);
>  			break;
>  		}
>  
>  		iov_iter_init(&iov_iter, READ, &vq->iov[out], in, iov_len);
> -		payload_len = pkt->len - pkt->off;
> +		payload_len = skb->len - vsock_metadata(skb)->off;
> +		hdr = vsock_hdr(skb);
>  
>  		/* If the packet is greater than the space available in the
>  		 * buffer, we split it using multiple buffers.
>  		 */
> -		if (payload_len > iov_len - sizeof(pkt->hdr)) {
> -			payload_len = iov_len - sizeof(pkt->hdr);
> +		if (payload_len > iov_len - sizeof(*hdr)) {
> +			payload_len = iov_len - sizeof(*hdr);
>  
>  			/* As we are copying pieces of large packet's buffer to
>  			 * small rx buffers, headers of packets in rx queue are
> @@ -185,31 +177,31 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  			 * bits set. After initialized header will be copied to
>  			 * rx buffer, these required bits will be restored.
>  			 */
> -			if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
> -				pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
> +				hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>  				flags_to_restore |= VIRTIO_VSOCK_SEQ_EOM;
>  
> -				if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR) {
> -					pkt->hdr.flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> +				if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) {
> +					hdr->flags &= ~cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>  					flags_to_restore |= VIRTIO_VSOCK_SEQ_EOR;
>  				}
>  			}
>  		}
>  
>  		/* Set the correct length in the header */
> -		pkt->hdr.len = cpu_to_le32(payload_len);
> +		hdr->len = cpu_to_le32(payload_len);
>  
> -		nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
> -		if (nbytes != sizeof(pkt->hdr)) {
> -			virtio_transport_free_pkt(pkt);
> +		nbytes = copy_to_iter(hdr, sizeof(*hdr), &iov_iter);
> +		if (nbytes != sizeof(*hdr)) {
> +			kfree_skb(skb);
>  			vq_err(vq, "Faulted on copying pkt hdr\n");
>  			break;
>  		}
>  
> -		nbytes = copy_to_iter(pkt->buf + pkt->off, payload_len,
> +		nbytes = copy_to_iter(skb->data + vsock_metadata(skb)->off, payload_len,
>  				      &iov_iter);
>  		if (nbytes != payload_len) {
> -			virtio_transport_free_pkt(pkt);
> +			kfree_skb(skb);
>  			vq_err(vq, "Faulted on copying pkt buf\n");
>  			break;
>  		}
> @@ -217,31 +209,28 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  		/* Deliver to monitoring devices all packets that we
>  		 * will transmit.
>  		 */
> -		virtio_transport_deliver_tap_pkt(pkt);
> +		virtio_transport_deliver_tap_pkt(skb);
>  
> -		vhost_add_used(vq, head, sizeof(pkt->hdr) + payload_len);
> +		vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
>  		added = true;
>  
> -		pkt->off += payload_len;
> +		vsock_metadata(skb)->off += payload_len;
>  		total_len += payload_len;
>  
>  		/* If we didn't send all the payload we can requeue the packet
>  		 * to send it with the next available buffer.
>  		 */
> -		if (pkt->off < pkt->len) {
> -			pkt->hdr.flags |= cpu_to_le32(flags_to_restore);
> +		if (vsock_metadata(skb)->off < skb->len) {
> +			hdr->flags |= cpu_to_le32(flags_to_restore);
>  
> -			/* We are queueing the same virtio_vsock_pkt to handle
> +			/* We are queueing the same skb to handle
>  			 * the remaining bytes, and we want to deliver it
>  			 * to monitoring devices in the next iteration.
>  			 */
> -			pkt->tap_delivered = false;
> -
> -			spin_lock_bh(&vsock->send_pkt_list_lock);
> -			list_add(&pkt->list, &vsock->send_pkt_list);
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> +			vsock_metadata(skb)->flags &= ~VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
> +			skb_queue_head(&vsock->send_pkt_queue, skb);
>  		} else {
> -			if (pkt->reply) {
> +			if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY) {
>  				int val;
>  
>  				val = atomic_dec_return(&vsock->queued_replies);
> @@ -253,7 +242,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  					restart_tx = true;
>  			}
>  
> -			virtio_transport_free_pkt(pkt);
> +			consume_skb(skb);
>  		}
>  	} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
>  	if (added)
> @@ -278,28 +267,26 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>  }
>  
>  static int
> -vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
> +vhost_transport_send_pkt(struct sk_buff *skb)
>  {
>  	struct vhost_vsock *vsock;
> -	int len = pkt->len;
> +	int len = skb->len;
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  
>  	rcu_read_lock();
>  
>  	/* Find the vhost_vsock according to guest context id  */
> -	vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
> +	vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid));
>  	if (!vsock) {
>  		rcu_read_unlock();
> -		virtio_transport_free_pkt(pkt);
> +		kfree_skb(skb);
>  		return -ENODEV;
>  	}
>  
> -	if (pkt->reply)
> +	if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
>  		atomic_inc(&vsock->queued_replies);
>  
> -	spin_lock_bh(&vsock->send_pkt_list_lock);
> -	list_add_tail(&pkt->list, &vsock->send_pkt_list);
> -	spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> +	skb_queue_tail(&vsock->send_pkt_queue, skb);
>  	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
>  
>  	rcu_read_unlock();
> @@ -310,10 +297,8 @@ static int
>  vhost_transport_cancel_pkt(struct vsock_sock *vsk)
>  {
>  	struct vhost_vsock *vsock;
> -	struct virtio_vsock_pkt *pkt, *n;
>  	int cnt = 0;
>  	int ret = -ENODEV;
> -	LIST_HEAD(freeme);
>  
>  	rcu_read_lock();
>  
> @@ -322,20 +307,7 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
>  	if (!vsock)
>  		goto out;
>  
> -	spin_lock_bh(&vsock->send_pkt_list_lock);
> -	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
> -		if (pkt->vsk != vsk)
> -			continue;
> -		list_move(&pkt->list, &freeme);
> -	}
> -	spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> -	list_for_each_entry_safe(pkt, n, &freeme, list) {
> -		if (pkt->reply)
> -			cnt++;
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> +	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>  
>  	if (cnt) {
>  		struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
> @@ -352,11 +324,12 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
>  	return ret;
>  }
>  
> -static struct virtio_vsock_pkt *
> -vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> +static struct sk_buff *
> +vhost_vsock_alloc_skb(struct vhost_virtqueue *vq,
>  		      unsigned int out, unsigned int in)
>  {
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
> +	struct virtio_vsock_hdr *hdr;
>  	struct iov_iter iov_iter;
>  	size_t nbytes;
>  	size_t len;
> @@ -366,50 +339,49 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
>  		return NULL;
>  	}
>  
> -	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> -	if (!pkt)
> +	len = iov_length(vq->iov, out);
> +
> +	/* len contains both payload and hdr, so only add additional space for metadata */
> +	skb = alloc_skb(len + sizeof(struct virtio_vsock_metadata), GFP_KERNEL);
> +	if (!skb)
>  		return NULL;
>  
> -	len = iov_length(vq->iov, out);
> +	memset(skb->head, 0, sizeof(struct virtio_vsock_metadata));
> +	virtio_vsock_skb_reserve(skb);
>  	iov_iter_init(&iov_iter, WRITE, vq->iov, out, len);
>  
> -	nbytes = copy_from_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
> -	if (nbytes != sizeof(pkt->hdr)) {
> +	hdr = vsock_hdr(skb);
> +	nbytes = copy_from_iter(hdr, sizeof(*hdr), &iov_iter);
> +	if (nbytes != sizeof(*hdr)) {
>  		vq_err(vq, "Expected %zu bytes for pkt->hdr, got %zu bytes\n",
> -		       sizeof(pkt->hdr), nbytes);
> -		kfree(pkt);
> +		       sizeof(*hdr), nbytes);
> +		kfree_skb(skb);
>  		return NULL;
>  	}
>  
> -	pkt->len = le32_to_cpu(pkt->hdr.len);
> +	len = le32_to_cpu(hdr->len);
>  
>  	/* No payload */
> -	if (!pkt->len)
> -		return pkt;
> +	if (!len)
> +		return skb;
>  
>  	/* The pkt is too big */
> -	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> -		kfree(pkt);
> +	if (len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> +		kfree_skb(skb);
>  		return NULL;
>  	}
>  
> -	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
> -	if (!pkt->buf) {
> -		kfree(pkt);
> -		return NULL;
> -	}
> +	virtio_vsock_skb_rx_put(skb);
>  
> -	pkt->buf_len = pkt->len;
> -
> -	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
> -	if (nbytes != pkt->len) {
> -		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
> -		       pkt->len, nbytes);
> -		virtio_transport_free_pkt(pkt);
> +	nbytes = copy_from_iter(skb->data, len, &iov_iter);
> +	if (nbytes != len) {
> +		vq_err(vq, "Expected %zu byte payload, got %zu bytes\n",
> +		       len, nbytes);
> +		kfree_skb(skb);
>  		return NULL;
>  	}
>  
> -	return pkt;
> +	return skb;
>  }
>  
>  /* Is there space left for replies to rx packets? */
> @@ -496,7 +468,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
>  						  poll.work);
>  	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
>  						 dev);
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
>  	int head, pkts = 0, total_len = 0;
>  	unsigned int out, in;
>  	bool added = false;
> @@ -511,6 +483,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
>  
>  	vhost_disable_notify(&vsock->dev, vq);
>  	do {
> +		struct virtio_vsock_hdr *hdr;
> +		u32 len;
> +
>  		if (!vhost_vsock_more_replies(vsock)) {
>  			/* Stop tx until the device processes already
>  			 * pending replies.  Leave tx virtqueue
> @@ -532,26 +507,29 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
>  			break;
>  		}
>  
> -		pkt = vhost_vsock_alloc_pkt(vq, out, in);
> -		if (!pkt) {
> -			vq_err(vq, "Faulted on pkt\n");
> +		skb = vhost_vsock_alloc_skb(vq, out, in);
> +		if (!skb)
>  			continue;
> -		}
>  
> -		total_len += sizeof(pkt->hdr) + pkt->len;
> +		len = skb->len;
>  
>  		/* Deliver to monitoring devices all received packets */
> -		virtio_transport_deliver_tap_pkt(pkt);
> +		virtio_transport_deliver_tap_pkt(skb);
> +
> +		hdr = vsock_hdr(skb);
>  
>  		/* Only accept correctly addressed packets */
> -		if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
> -		    le64_to_cpu(pkt->hdr.dst_cid) ==
> +		if (le64_to_cpu(hdr->src_cid) == vsock->guest_cid &&
> +		    le64_to_cpu(hdr->dst_cid) ==
>  		    vhost_transport_get_local_cid())
> -			virtio_transport_recv_pkt(&vhost_transport, pkt);
> +			virtio_transport_recv_pkt(&vhost_transport, skb);
>  		else
> -			virtio_transport_free_pkt(pkt);
> +			kfree_skb(skb);
> +
>  
> -		vhost_add_used(vq, head, 0);
> +		len += sizeof(*hdr);
> +		vhost_add_used(vq, head, len);
> +		total_len += len;
>  		added = true;
>  	} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
>  
> @@ -693,8 +671,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>  		       VHOST_VSOCK_WEIGHT, true, NULL);
>  
>  	file->private_data = vsock;
> -	spin_lock_init(&vsock->send_pkt_list_lock);
> -	INIT_LIST_HEAD(&vsock->send_pkt_list);
> +	skb_queue_head_init(&vsock->send_pkt_queue);
>  	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
>  	return 0;
>  
> @@ -760,16 +737,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
>  	vhost_vsock_flush(vsock);
>  	vhost_dev_stop(&vsock->dev);
>  
> -	spin_lock_bh(&vsock->send_pkt_list_lock);
> -	while (!list_empty(&vsock->send_pkt_list)) {
> -		struct virtio_vsock_pkt *pkt;
> -
> -		pkt = list_first_entry(&vsock->send_pkt_list,
> -				struct virtio_vsock_pkt, list);
> -		list_del_init(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> -	spin_unlock_bh(&vsock->send_pkt_list_lock);
> +	skb_queue_purge(&vsock->send_pkt_queue);
>  
>  	vhost_dev_cleanup(&vsock->dev);
>  	kfree(vsock->dev.vqs);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 35d7eedb5e8e..17ed01466875 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -4,9 +4,43 @@
>  
>  #include <uapi/linux/virtio_vsock.h>
>  #include <linux/socket.h>
> +#include <vdso/bits.h>
>  #include <net/sock.h>
>  #include <net/af_vsock.h>
>  
> +enum virtio_vsock_metadata_flags {
> +	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
> +	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
> +};
> +
> +/* Used only by the virtio/vhost vsock drivers, not related to protocol */
> +struct virtio_vsock_metadata {
> +	size_t off;
> +	enum virtio_vsock_metadata_flags flags;
> +};
> +
> +#define vsock_hdr(skb) \
> +	((struct virtio_vsock_hdr *) \
> +	 ((void *)skb->head + sizeof(struct virtio_vsock_metadata)))
> +
> +#define vsock_metadata(skb) \
> +	((struct virtio_vsock_metadata *)skb->head)
> +
> +#define virtio_vsock_skb_reserve(skb)	\
> +	skb_reserve(skb,	\
> +		sizeof(struct virtio_vsock_metadata) + \
> +		sizeof(struct virtio_vsock_hdr))
> +
> +static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
> +{
> +	u32 len;
> +
> +	len = le32_to_cpu(vsock_hdr(skb)->len);
> +
> +	if (len > 0)
> +		skb_put(skb, len);
> +}
> +
>  #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
>  #define VIRTIO_VSOCK_MAX_BUF_SIZE		0xFFFFFFFFUL
>  #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
> @@ -35,23 +69,10 @@ struct virtio_vsock_sock {
>  	u32 last_fwd_cnt;
>  	u32 rx_bytes;
>  	u32 buf_alloc;
> -	struct list_head rx_queue;
> +	struct sk_buff_head rx_queue;
>  	u32 msg_count;
>  };
>  
> -struct virtio_vsock_pkt {
> -	struct virtio_vsock_hdr	hdr;
> -	struct list_head list;
> -	/* socket refcnt not held, only use for cancellation */
> -	struct vsock_sock *vsk;
> -	void *buf;
> -	u32 buf_len;
> -	u32 len;
> -	u32 off;
> -	bool reply;
> -	bool tap_delivered;
> -};
> -
>  struct virtio_vsock_pkt_info {
>  	u32 remote_cid, remote_port;
>  	struct vsock_sock *vsk;
> @@ -68,7 +89,7 @@ struct virtio_transport {
>  	struct vsock_transport transport;
>  
>  	/* Takes ownership of the packet */
> -	int (*send_pkt)(struct virtio_vsock_pkt *pkt);
> +	int (*send_pkt)(struct sk_buff *skb);
>  };
>  
>  ssize_t
> @@ -149,11 +170,10 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>  void virtio_transport_destruct(struct vsock_sock *vsk);
>  
>  void virtio_transport_recv_pkt(struct virtio_transport *t,
> -			       struct virtio_vsock_pkt *pkt);
> -void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
> -void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
> +			       struct sk_buff *skb);
> +void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb);
>  u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
>  void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
> -void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt);
> -
> +void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
> +int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue);
>  #endif /* _LINUX_VIRTIO_VSOCK_H */
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index f04abf662ec6..e348b2d09eac 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -748,6 +748,7 @@ static struct sock *__vsock_create(struct net *net,
>  	vsock_addr_init(&vsk->local_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>  	vsock_addr_init(&vsk->remote_addr, VMADDR_CID_ANY, VMADDR_PORT_ANY);
>  
> +	sk->sk_allocation = GFP_KERNEL;
>  	sk->sk_destruct = vsock_sk_destruct;
>  	sk->sk_backlog_rcv = vsock_queue_rcv_skb;
>  	sock_reset_flag(sk, SOCK_DONE);
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index ad64f403536a..3bb293fd8607 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -21,6 +21,12 @@
>  #include <linux/mutex.h>
>  #include <net/af_vsock.h>
>  
> +#define VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE	\
> +	(VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE \
> +		 - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) \
> +		 - sizeof(struct virtio_vsock_hdr) \
> +		 - sizeof(struct virtio_vsock_metadata))
> +
>  static struct workqueue_struct *virtio_vsock_workqueue;
>  static struct virtio_vsock __rcu *the_virtio_vsock;
>  static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
> @@ -42,8 +48,7 @@ struct virtio_vsock {
>  	bool tx_run;
>  
>  	struct work_struct send_pkt_work;
> -	spinlock_t send_pkt_list_lock;
> -	struct list_head send_pkt_list;
> +	struct sk_buff_head send_pkt_queue;
>  
>  	atomic_t queued_replies;
>  
> @@ -101,41 +106,32 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  	vq = vsock->vqs[VSOCK_VQ_TX];
>  
>  	for (;;) {
> -		struct virtio_vsock_pkt *pkt;
> +		struct sk_buff *skb;
>  		struct scatterlist hdr, buf, *sgs[2];
>  		int ret, in_sg = 0, out_sg = 0;
>  		bool reply;
>  
> -		spin_lock_bh(&vsock->send_pkt_list_lock);
> -		if (list_empty(&vsock->send_pkt_list)) {
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> -			break;
> -		}
> +		skb = skb_dequeue(&vsock->send_pkt_queue);
>  
> -		pkt = list_first_entry(&vsock->send_pkt_list,
> -				       struct virtio_vsock_pkt, list);
> -		list_del_init(&pkt->list);
> -		spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> -		virtio_transport_deliver_tap_pkt(pkt);
> +		if (!skb)
> +			break;
>  
> -		reply = pkt->reply;
> +		virtio_transport_deliver_tap_pkt(skb);
> +		reply = vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
>  
> -		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
> +		sg_init_one(&hdr, vsock_hdr(skb), sizeof(*vsock_hdr(skb)));
>  		sgs[out_sg++] = &hdr;
> -		if (pkt->buf) {
> -			sg_init_one(&buf, pkt->buf, pkt->len);
> +		if (skb->len > 0) {
> +			sg_init_one(&buf, skb->data, skb->len);
>  			sgs[out_sg++] = &buf;
>  		}
>  
> -		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
> +		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
>  		/* Usually this means that there is no more space available in
>  		 * the vq
>  		 */
>  		if (ret < 0) {
> -			spin_lock_bh(&vsock->send_pkt_list_lock);
> -			list_add(&pkt->list, &vsock->send_pkt_list);
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> +			skb_queue_head(&vsock->send_pkt_queue, skb);
>  			break;
>  		}
>  
> @@ -163,33 +159,84 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
> +static inline bool
> +virtio_transport_skbs_can_merge(struct sk_buff *old, struct sk_buff *new)
> +{
> +	return (new->len < GOOD_COPY_LEN &&
> +		skb_tailroom(old) >= new->len &&
> +		vsock_hdr(new)->src_cid == vsock_hdr(old)->src_cid &&
> +		vsock_hdr(new)->dst_cid == vsock_hdr(old)->dst_cid &&
> +		vsock_hdr(new)->src_port == vsock_hdr(old)->src_port &&
> +		vsock_hdr(new)->dst_port == vsock_hdr(old)->dst_port &&
> +		vsock_hdr(new)->type == vsock_hdr(old)->type &&
> +		vsock_hdr(new)->flags == vsock_hdr(old)->flags &&
> +		vsock_hdr(old)->op == VIRTIO_VSOCK_OP_RW &&
> +		vsock_hdr(new)->op == VIRTIO_VSOCK_OP_RW);
> +}
> +
> +/**
> + * Merge the two most recent skbs together if possible.
> + *
> + * Caller must hold the queue lock.
> + */
> +static void
> +virtio_transport_add_to_queue(struct sk_buff_head *queue, struct sk_buff *new)
> +{
> +	struct sk_buff *old;
> +
> +	spin_lock_bh(&queue->lock);
> +	/* In order to reduce skb memory overhead, we merge new packets with
> +	 * older packets if they pass virtio_transport_skbs_can_merge().
> +	 */
> +	if (skb_queue_empty_lockless(queue)) {
> +		__skb_queue_tail(queue, new);
> +		goto out;
> +	}
> +
> +	old = skb_peek_tail(queue);
> +
> +	if (!virtio_transport_skbs_can_merge(old, new)) {
> +		__skb_queue_tail(queue, new);
> +		goto out;
> +	}
> +
> +	memcpy(skb_put(old, new->len), new->data, new->len);
> +	vsock_hdr(old)->len = cpu_to_le32(old->len);
> +	vsock_hdr(old)->buf_alloc = vsock_hdr(new)->buf_alloc;
> +	vsock_hdr(old)->fwd_cnt = vsock_hdr(new)->fwd_cnt;
> +	dev_kfree_skb_any(new);
> +
> +out:
> +	spin_unlock_bh(&queue->lock);
> +}
> +
>  static int
> -virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
> +virtio_transport_send_pkt(struct sk_buff *skb)
>  {
> +	struct virtio_vsock_hdr *hdr;
>  	struct virtio_vsock *vsock;
> -	int len = pkt->len;
> +	int len = skb->len;
> +
> +	hdr = vsock_hdr(skb);
>  
>  	rcu_read_lock();
>  	vsock = rcu_dereference(the_virtio_vsock);
>  	if (!vsock) {
> -		virtio_transport_free_pkt(pkt);
> +		kfree_skb(skb);
>  		len = -ENODEV;
>  		goto out_rcu;
>  	}
>  
> -	if (le64_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
> -		virtio_transport_free_pkt(pkt);
> +	if (le64_to_cpu(hdr->dst_cid) == vsock->guest_cid) {
> +		kfree_skb(skb);
>  		len = -ENODEV;
>  		goto out_rcu;
>  	}
>  
> -	if (pkt->reply)
> +	if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
>  		atomic_inc(&vsock->queued_replies);
>  
> -	spin_lock_bh(&vsock->send_pkt_list_lock);
> -	list_add_tail(&pkt->list, &vsock->send_pkt_list);
> -	spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> +	virtio_transport_add_to_queue(&vsock->send_pkt_queue, skb);
>  	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>  
>  out_rcu:
> @@ -201,9 +248,7 @@ static int
>  virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>  {
>  	struct virtio_vsock *vsock;
> -	struct virtio_vsock_pkt *pkt, *n;
>  	int cnt = 0, ret;
> -	LIST_HEAD(freeme);
>  
>  	rcu_read_lock();
>  	vsock = rcu_dereference(the_virtio_vsock);
> @@ -212,20 +257,7 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>  		goto out_rcu;
>  	}
>  
> -	spin_lock_bh(&vsock->send_pkt_list_lock);
> -	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
> -		if (pkt->vsk != vsk)
> -			continue;
> -		list_move(&pkt->list, &freeme);
> -	}
> -	spin_unlock_bh(&vsock->send_pkt_list_lock);
> -
> -	list_for_each_entry_safe(pkt, n, &freeme, list) {
> -		if (pkt->reply)
> -			cnt++;
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> +	cnt = virtio_transport_purge_skbs(vsk, &vsock->send_pkt_queue);
>  
>  	if (cnt) {
>  		struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
> @@ -246,38 +278,34 @@ virtio_transport_cancel_pkt(struct vsock_sock *vsk)
>  
>  static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
>  {
> -	int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> -	struct virtio_vsock_pkt *pkt;
> -	struct scatterlist hdr, buf, *sgs[2];
> +	struct scatterlist pkt, *sgs[1];
>  	struct virtqueue *vq;
>  	int ret;
>  
>  	vq = vsock->vqs[VSOCK_VQ_RX];
>  
>  	do {
> -		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> -		if (!pkt)
> -			break;
> +		struct sk_buff *skb;
> +		const size_t len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE -
> +				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> -		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> -		if (!pkt->buf) {
> -			virtio_transport_free_pkt(pkt);
> +		skb = alloc_skb(len, GFP_KERNEL);
> +		if (!skb)
>  			break;
> -		}
>  
> -		pkt->buf_len = buf_len;
> -		pkt->len = buf_len;
> +		memset(skb->head, 0,
> +		       sizeof(struct virtio_vsock_metadata) + sizeof(struct virtio_vsock_hdr));
>  
> -		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
> -		sgs[0] = &hdr;
> +		sg_init_one(&pkt, skb->head + sizeof(struct virtio_vsock_metadata),
> +			    VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE);
> +		sgs[0] = &pkt;
>  
> -		sg_init_one(&buf, pkt->buf, buf_len);
> -		sgs[1] = &buf;
> -		ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
> -		if (ret) {
> -			virtio_transport_free_pkt(pkt);
> +		ret = virtqueue_add_sgs(vq, sgs, 0, 1, skb, GFP_KERNEL);
> +		if (ret < 0) {
> +			kfree_skb(skb);
>  			break;
>  		}
> +
>  		vsock->rx_buf_nr++;
>  	} while (vq->num_free);
>  	if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
> @@ -299,12 +327,12 @@ static void virtio_transport_tx_work(struct work_struct *work)
>  		goto out;
>  
>  	do {
> -		struct virtio_vsock_pkt *pkt;
> +		struct sk_buff *skb;
>  		unsigned int len;
>  
>  		virtqueue_disable_cb(vq);
> -		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
> -			virtio_transport_free_pkt(pkt);
> +		while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
> +			consume_skb(skb);
>  			added = true;
>  		}
>  	} while (!virtqueue_enable_cb(vq));
> @@ -529,7 +557,8 @@ static void virtio_transport_rx_work(struct work_struct *work)
>  	do {
>  		virtqueue_disable_cb(vq);
>  		for (;;) {
> -			struct virtio_vsock_pkt *pkt;
> +			struct virtio_vsock_hdr *hdr;
> +			struct sk_buff *skb;
>  			unsigned int len;
>  
>  			if (!virtio_transport_more_replies(vsock)) {
> @@ -540,23 +569,24 @@ static void virtio_transport_rx_work(struct work_struct *work)
>  				goto out;
>  			}
>  
> -			pkt = virtqueue_get_buf(vq, &len);
> -			if (!pkt) {
> +			skb = virtqueue_get_buf(vq, &len);
> +			if (!skb)
>  				break;
> -			}
>  
>  			vsock->rx_buf_nr--;
>  
>  			/* Drop short/long packets */
> -			if (unlikely(len < sizeof(pkt->hdr) ||
> -				     len > sizeof(pkt->hdr) + pkt->len)) {
> -				virtio_transport_free_pkt(pkt);
> +			if (unlikely(len < sizeof(*hdr) ||
> +				     len > VIRTIO_VSOCK_MAX_RX_HDR_PAYLOAD_SIZE)) {
> +				kfree_skb(skb);
>  				continue;
>  			}
>  
> -			pkt->len = len - sizeof(pkt->hdr);
> -			virtio_transport_deliver_tap_pkt(pkt);
> -			virtio_transport_recv_pkt(&virtio_transport, pkt);
> +			hdr = vsock_hdr(skb);
> +			virtio_vsock_skb_reserve(skb);
> +			virtio_vsock_skb_rx_put(skb);
> +			virtio_transport_deliver_tap_pkt(skb);
> +			virtio_transport_recv_pkt(&virtio_transport, skb);
>  		}
>  	} while (!virtqueue_enable_cb(vq));
>  
> @@ -610,7 +640,7 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
>  static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
>  {
>  	struct virtio_device *vdev = vsock->vdev;
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
>  
>  	/* Reset all connected sockets when the VQs disappear */
>  	vsock_for_each_connected_socket(&virtio_transport.transport,
> @@ -637,23 +667,16 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
>  	virtio_reset_device(vdev);
>  
>  	mutex_lock(&vsock->rx_lock);
> -	while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
> -		virtio_transport_free_pkt(pkt);
> +	while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
> +		kfree_skb(skb);
>  	mutex_unlock(&vsock->rx_lock);
>  
>  	mutex_lock(&vsock->tx_lock);
> -	while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
> -		virtio_transport_free_pkt(pkt);
> +	while ((skb = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
> +		kfree_skb(skb);
>  	mutex_unlock(&vsock->tx_lock);
>  
> -	spin_lock_bh(&vsock->send_pkt_list_lock);
> -	while (!list_empty(&vsock->send_pkt_list)) {
> -		pkt = list_first_entry(&vsock->send_pkt_list,
> -				       struct virtio_vsock_pkt, list);
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> -	spin_unlock_bh(&vsock->send_pkt_list_lock);
> +	skb_queue_purge(&vsock->send_pkt_queue);
>  
>  	/* Delete virtqueues and flush outstanding callbacks if any */
>  	vdev->config->del_vqs(vdev);
> @@ -690,8 +713,7 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>  	mutex_init(&vsock->tx_lock);
>  	mutex_init(&vsock->rx_lock);
>  	mutex_init(&vsock->event_lock);
> -	spin_lock_init(&vsock->send_pkt_list_lock);
> -	INIT_LIST_HEAD(&vsock->send_pkt_list);
> +	skb_queue_head_init(&vsock->send_pkt_queue);
>  	INIT_WORK(&vsock->rx_work, virtio_transport_rx_work);
>  	INIT_WORK(&vsock->tx_work, virtio_transport_tx_work);
>  	INIT_WORK(&vsock->event_work, virtio_transport_event_work);
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index ec2c2afbf0d0..920578597bb9 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -37,53 +37,81 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>  	return container_of(t, struct virtio_transport, transport);
>  }
>  
> -static struct virtio_vsock_pkt *
> -virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
> +/* Returns a new packet on success, otherwise returns NULL.
> + *
> + * If NULL is returned, errp is set to a negative errno.
> + */
> +static struct sk_buff *
> +virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>  			   size_t len,
>  			   u32 src_cid,
>  			   u32 src_port,
>  			   u32 dst_cid,
> -			   u32 dst_port)
> +			   u32 dst_port,
> +			   int *errp)
>  {
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
> +	struct virtio_vsock_hdr *hdr;
> +	void *payload;
> +	const size_t skb_len = sizeof(*hdr) + sizeof(struct virtio_vsock_metadata) + len;
>  	int err;
>  
> -	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> -	if (!pkt)
> -		return NULL;
> +	if (info->vsk) {
> +		unsigned int msg_flags = info->msg ? info->msg->msg_flags : 0;
> +		struct sock *sk;
>  
> -	pkt->hdr.type		= cpu_to_le16(info->type);
> -	pkt->hdr.op		= cpu_to_le16(info->op);
> -	pkt->hdr.src_cid	= cpu_to_le64(src_cid);
> -	pkt->hdr.dst_cid	= cpu_to_le64(dst_cid);
> -	pkt->hdr.src_port	= cpu_to_le32(src_port);
> -	pkt->hdr.dst_port	= cpu_to_le32(dst_port);
> -	pkt->hdr.flags		= cpu_to_le32(info->flags);
> -	pkt->len		= len;
> -	pkt->hdr.len		= cpu_to_le32(len);
> -	pkt->reply		= info->reply;
> -	pkt->vsk		= info->vsk;
> +		sk = sk_vsock(info->vsk);
> +		skb = sock_alloc_send_skb(sk, skb_len,
> +					  msg_flags & MSG_DONTWAIT, errp);
>  
> -	if (info->msg && len > 0) {
> -		pkt->buf = kmalloc(len, GFP_KERNEL);
> -		if (!pkt->buf)
> -			goto out_pkt;
> +		if (skb)
> +			skb->priority = sk->sk_priority;
> +	} else {
> +		skb = alloc_skb(skb_len, GFP_KERNEL);
> +	}
> +
> +	if (!skb) {
> +		/* If using alloc_skb(), the skb is NULL due to lacking memory.
> +		 * Otherwise, errp is set by sock_alloc_send_skb().
> +		 */
> +		if (!info->vsk)
> +			*errp = -ENOMEM;
> +		return NULL;
> +	}
>  
> -		pkt->buf_len = len;
> +	memset(skb->head, 0, sizeof(*hdr) + sizeof(struct virtio_vsock_metadata));
> +	virtio_vsock_skb_reserve(skb);
> +	payload = skb_put(skb, len);
>  
> -		err = memcpy_from_msg(pkt->buf, info->msg, len);
> -		if (err)
> +	hdr = vsock_hdr(skb);
> +	hdr->type	= cpu_to_le16(info->type);
> +	hdr->op		= cpu_to_le16(info->op);
> +	hdr->src_cid	= cpu_to_le64(src_cid);
> +	hdr->dst_cid	= cpu_to_le64(dst_cid);
> +	hdr->src_port	= cpu_to_le32(src_port);
> +	hdr->dst_port	= cpu_to_le32(dst_port);
> +	hdr->flags	= cpu_to_le32(info->flags);
> +	hdr->len	= cpu_to_le32(len);
> +
> +	if (info->msg && len > 0) {
> +		err = memcpy_from_msg(payload, info->msg, len);
> +		if (err) {
> +			*errp = -ENOMEM;
>  			goto out;
> +		}
>  
>  		if (msg_data_left(info->msg) == 0 &&
>  		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
> -			pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
> +			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>  
>  			if (info->msg->msg_flags & MSG_EOR)
> -				pkt->hdr.flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
> +				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>  		}
>  	}
>  
> +	if (info->reply)
> +		vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_REPLY;
> +
>  	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>  					 dst_cid, dst_port,
>  					 len,
> @@ -91,85 +119,26 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
>  					 info->op,
>  					 info->flags);
>  
> -	return pkt;
> +	return skb;
>  
>  out:
> -	kfree(pkt->buf);
> -out_pkt:
> -	kfree(pkt);
> +	kfree_skb(skb);
>  	return NULL;
>  }
>  
>  /* Packet capture */
>  static struct sk_buff *virtio_transport_build_skb(void *opaque)
>  {
> -	struct virtio_vsock_pkt *pkt = opaque;
> -	struct af_vsockmon_hdr *hdr;
> -	struct sk_buff *skb;
> -	size_t payload_len;
> -	void *payload_buf;
> -
> -	/* A packet could be split to fit the RX buffer, so we can retrieve
> -	 * the payload length from the header and the buffer pointer taking
> -	 * care of the offset in the original packet.
> -	 */
> -	payload_len = le32_to_cpu(pkt->hdr.len);
> -	payload_buf = pkt->buf + pkt->off;
> -
> -	skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + payload_len,
> -			GFP_ATOMIC);
> -	if (!skb)
> -		return NULL;
> -
> -	hdr = skb_put(skb, sizeof(*hdr));
> -
> -	/* pkt->hdr is little-endian so no need to byteswap here */
> -	hdr->src_cid = pkt->hdr.src_cid;
> -	hdr->src_port = pkt->hdr.src_port;
> -	hdr->dst_cid = pkt->hdr.dst_cid;
> -	hdr->dst_port = pkt->hdr.dst_port;
> -
> -	hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
> -	hdr->len = cpu_to_le16(sizeof(pkt->hdr));
> -	memset(hdr->reserved, 0, sizeof(hdr->reserved));
> -
> -	switch (le16_to_cpu(pkt->hdr.op)) {
> -	case VIRTIO_VSOCK_OP_REQUEST:
> -	case VIRTIO_VSOCK_OP_RESPONSE:
> -		hdr->op = cpu_to_le16(AF_VSOCK_OP_CONNECT);
> -		break;
> -	case VIRTIO_VSOCK_OP_RST:
> -	case VIRTIO_VSOCK_OP_SHUTDOWN:
> -		hdr->op = cpu_to_le16(AF_VSOCK_OP_DISCONNECT);
> -		break;
> -	case VIRTIO_VSOCK_OP_RW:
> -		hdr->op = cpu_to_le16(AF_VSOCK_OP_PAYLOAD);
> -		break;
> -	case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
> -	case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
> -		hdr->op = cpu_to_le16(AF_VSOCK_OP_CONTROL);
> -		break;
> -	default:
> -		hdr->op = cpu_to_le16(AF_VSOCK_OP_UNKNOWN);
> -		break;
> -	}
> -
> -	skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
> -
> -	if (payload_len) {
> -		skb_put_data(skb, payload_buf, payload_len);
> -	}
> -
> -	return skb;
> +	return (struct sk_buff *)opaque;
>  }
>  
> -void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
> +void virtio_transport_deliver_tap_pkt(struct sk_buff *skb)
>  {
> -	if (pkt->tap_delivered)
> +	if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED)
>  		return;
>  
> -	vsock_deliver_tap(virtio_transport_build_skb, pkt);
> -	pkt->tap_delivered = true;
> +	vsock_deliver_tap(virtio_transport_build_skb, skb);
> +	vsock_metadata(skb)->flags |= VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED;
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>  
> @@ -192,8 +161,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	u32 src_cid, src_port, dst_cid, dst_port;
>  	const struct virtio_transport *t_ops;
>  	struct virtio_vsock_sock *vvs;
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
>  	u32 pkt_len = info->pkt_len;
> +	int err;
>  
>  	info->type = virtio_transport_get_type(sk_vsock(vsk));
>  
> @@ -224,42 +194,47 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>  		return pkt_len;
>  
> -	pkt = virtio_transport_alloc_pkt(info, pkt_len,
> +	skb = virtio_transport_alloc_skb(info, pkt_len,
>  					 src_cid, src_port,
> -					 dst_cid, dst_port);
> -	if (!pkt) {
> +					 dst_cid, dst_port,
> +					 &err);
> +	if (!skb) {
>  		virtio_transport_put_credit(vvs, pkt_len);
> -		return -ENOMEM;
> +		return err;
>  	}
>  
> -	virtio_transport_inc_tx_pkt(vvs, pkt);
> +	virtio_transport_inc_tx_pkt(vvs, skb);
> +
> +	err = t_ops->send_pkt(skb);
>  
> -	return t_ops->send_pkt(pkt);
> +	return err < 0 ? -ENOMEM : err;
>  }
>  
>  static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> -					struct virtio_vsock_pkt *pkt)
> +					struct sk_buff *skb)
>  {
> -	if (vvs->rx_bytes + pkt->len > vvs->buf_alloc)
> +	if (vvs->rx_bytes + skb->len > vvs->buf_alloc)
>  		return false;
>  
> -	vvs->rx_bytes += pkt->len;
> +	vvs->rx_bytes += skb->len;
>  	return true;
>  }
>  
>  static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
> -					struct virtio_vsock_pkt *pkt)
> +					struct sk_buff *skb)
>  {
> -	vvs->rx_bytes -= pkt->len;
> -	vvs->fwd_cnt += pkt->len;
> +	vvs->rx_bytes -= skb->len;
> +	vvs->fwd_cnt += skb->len;
>  }
>  
> -void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
> +void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb)
>  {
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> +
>  	spin_lock_bh(&vvs->rx_lock);
>  	vvs->last_fwd_cnt = vvs->fwd_cnt;
> -	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> -	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
> +	hdr->fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> +	hdr->buf_alloc = cpu_to_le32(vvs->buf_alloc);
>  	spin_unlock_bh(&vvs->rx_lock);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
> @@ -303,29 +278,29 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
>  				size_t len)
>  {
>  	struct virtio_vsock_sock *vvs = vsk->trans;
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb, *tmp;
>  	size_t bytes, total = 0, off;
>  	int err = -EFAULT;
>  
>  	spin_lock_bh(&vvs->rx_lock);
>  
> -	list_for_each_entry(pkt, &vvs->rx_queue, list) {
> -		off = pkt->off;
> +	skb_queue_walk_safe(&vvs->rx_queue, skb,  tmp) {
> +		off = vsock_metadata(skb)->off;
>  
>  		if (total == len)
>  			break;
>  
> -		while (total < len && off < pkt->len) {
> +		while (total < len && off < skb->len) {
>  			bytes = len - total;
> -			if (bytes > pkt->len - off)
> -				bytes = pkt->len - off;
> +			if (bytes > skb->len - off)
> +				bytes = skb->len - off;
>  
>  			/* sk_lock is held by caller so no one else can dequeue.
>  			 * Unlock rx_lock since memcpy_to_msg() may sleep.
>  			 */
>  			spin_unlock_bh(&vvs->rx_lock);
>  
> -			err = memcpy_to_msg(msg, pkt->buf + off, bytes);
> +			err = memcpy_to_msg(msg, skb->data + off, bytes);
>  			if (err)
>  				goto out;
>  
> @@ -352,37 +327,40 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>  				   size_t len)
>  {
>  	struct virtio_vsock_sock *vvs = vsk->trans;
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
>  	size_t bytes, total = 0;
>  	u32 free_space;
>  	int err = -EFAULT;
>  
>  	spin_lock_bh(&vvs->rx_lock);
> -	while (total < len && !list_empty(&vvs->rx_queue)) {
> -		pkt = list_first_entry(&vvs->rx_queue,
> -				       struct virtio_vsock_pkt, list);
> +	while (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> +		skb = __skb_dequeue(&vvs->rx_queue);
>  
>  		bytes = len - total;
> -		if (bytes > pkt->len - pkt->off)
> -			bytes = pkt->len - pkt->off;
> +		if (bytes > skb->len - vsock_metadata(skb)->off)
> +			bytes = skb->len - vsock_metadata(skb)->off;
>  
>  		/* sk_lock is held by caller so no one else can dequeue.
>  		 * Unlock rx_lock since memcpy_to_msg() may sleep.
>  		 */
>  		spin_unlock_bh(&vvs->rx_lock);
>  
> -		err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, bytes);
>  		if (err)
>  			goto out;
>  
>  		spin_lock_bh(&vvs->rx_lock);
>  
>  		total += bytes;
> -		pkt->off += bytes;
> -		if (pkt->off == pkt->len) {
> -			virtio_transport_dec_rx_pkt(vvs, pkt);
> -			list_del(&pkt->list);
> -			virtio_transport_free_pkt(pkt);
> +		vsock_metadata(skb)->off += bytes;
> +
> +		WARN_ON(vsock_metadata(skb)->off > skb->len);
> +
> +		if (vsock_metadata(skb)->off == skb->len) {
> +			virtio_transport_dec_rx_pkt(vvs, skb);
> +			consume_skb(skb);
> +		} else {
> +			__skb_queue_head(&vvs->rx_queue, skb);
>  		}
>  	}
>  
> @@ -414,7 +392,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
>  						 int flags)
>  {
>  	struct virtio_vsock_sock *vvs = vsk->trans;
> -	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
>  	int dequeued_len = 0;
>  	size_t user_buf_len = msg_data_left(msg);
>  	bool msg_ready = false;
> @@ -427,13 +405,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
>  	}
>  
>  	while (!msg_ready) {
> -		pkt = list_first_entry(&vvs->rx_queue, struct virtio_vsock_pkt, list);
> +		struct virtio_vsock_hdr *hdr;
> +
> +		skb = __skb_dequeue(&vvs->rx_queue);
> +		hdr = vsock_hdr(skb);
>  
>  		if (dequeued_len >= 0) {
>  			size_t pkt_len;
>  			size_t bytes_to_copy;
>  
> -			pkt_len = (size_t)le32_to_cpu(pkt->hdr.len);
> +			pkt_len = (size_t)le32_to_cpu(hdr->len);
>  			bytes_to_copy = min(user_buf_len, pkt_len);
>  
>  			if (bytes_to_copy) {
> @@ -444,7 +425,7 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
>  				 */
>  				spin_unlock_bh(&vvs->rx_lock);
>  
> -				err = memcpy_to_msg(msg, pkt->buf, bytes_to_copy);
> +				err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
>  				if (err) {
>  					/* Copy of message failed. Rest of
>  					 * fragments will be freed without copy.
> @@ -461,17 +442,16 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
>  				dequeued_len += pkt_len;
>  		}
>  
> -		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM) {
> +		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM) {
>  			msg_ready = true;
>  			vvs->msg_count--;
>  
> -			if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOR)
> +			if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOR)
>  				msg->msg_flags |= MSG_EOR;
>  		}
>  
> -		virtio_transport_dec_rx_pkt(vvs, pkt);
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> +		virtio_transport_dec_rx_pkt(vvs, skb);
> +		kfree_skb(skb);
>  	}
>  
>  	spin_unlock_bh(&vvs->rx_lock);
> @@ -609,7 +589,7 @@ int virtio_transport_do_socket_init(struct vsock_sock *vsk,
>  
>  	spin_lock_init(&vvs->rx_lock);
>  	spin_lock_init(&vvs->tx_lock);
> -	INIT_LIST_HEAD(&vvs->rx_queue);
> +	skb_queue_head_init(&vvs->rx_queue);
>  
>  	return 0;
>  }
> @@ -809,16 +789,16 @@ void virtio_transport_destruct(struct vsock_sock *vsk)
>  EXPORT_SYMBOL_GPL(virtio_transport_destruct);
>  
>  static int virtio_transport_reset(struct vsock_sock *vsk,
> -				  struct virtio_vsock_pkt *pkt)
> +				  struct sk_buff *skb)
>  {
>  	struct virtio_vsock_pkt_info info = {
>  		.op = VIRTIO_VSOCK_OP_RST,
> -		.reply = !!pkt,
> +		.reply = !!skb,
>  		.vsk = vsk,
>  	};
>  
>  	/* Send RST only if the original pkt is not a RST pkt */
> -	if (pkt && le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> +	if (skb && le16_to_cpu(vsock_hdr(skb)->op) == VIRTIO_VSOCK_OP_RST)
>  		return 0;
>  
>  	return virtio_transport_send_pkt_info(vsk, &info);
> @@ -828,29 +808,32 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
>   * attempt was made to connect to a socket that does not exist.
>   */
>  static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> -					  struct virtio_vsock_pkt *pkt)
> +					  struct sk_buff *skb)
>  {
> -	struct virtio_vsock_pkt *reply;
> +	struct sk_buff *reply;
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	struct virtio_vsock_pkt_info info = {
>  		.op = VIRTIO_VSOCK_OP_RST,
> -		.type = le16_to_cpu(pkt->hdr.type),
> +		.type = le16_to_cpu(hdr->type),
>  		.reply = true,
>  	};
> +	int err;
>  
>  	/* Send RST only if the original pkt is not a RST pkt */
> -	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> +	if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
>  		return 0;
>  
> -	reply = virtio_transport_alloc_pkt(&info, 0,
> -					   le64_to_cpu(pkt->hdr.dst_cid),
> -					   le32_to_cpu(pkt->hdr.dst_port),
> -					   le64_to_cpu(pkt->hdr.src_cid),
> -					   le32_to_cpu(pkt->hdr.src_port));
> +	reply = virtio_transport_alloc_skb(&info, 0,
> +					   le64_to_cpu(hdr->dst_cid),
> +					   le32_to_cpu(hdr->dst_port),
> +					   le64_to_cpu(hdr->src_cid),
> +					   le32_to_cpu(hdr->src_port),
> +					   &err);
>  	if (!reply)
> -		return -ENOMEM;
> +		return err;
>  
>  	if (!t) {
> -		virtio_transport_free_pkt(reply);
> +		kfree_skb(reply);
>  		return -ENOTCONN;
>  	}
>  
> @@ -861,16 +844,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  static void virtio_transport_remove_sock(struct vsock_sock *vsk)
>  {
>  	struct virtio_vsock_sock *vvs = vsk->trans;
> -	struct virtio_vsock_pkt *pkt, *tmp;
>  
>  	/* We don't need to take rx_lock, as the socket is closing and we are
>  	 * removing it.
>  	 */
> -	list_for_each_entry_safe(pkt, tmp, &vvs->rx_queue, list) {
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> -
> +	__skb_queue_purge(&vvs->rx_queue);
>  	vsock_remove_sock(vsk);
>  }
>  
> @@ -984,13 +962,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_release);
>  
>  static int
>  virtio_transport_recv_connecting(struct sock *sk,
> -				 struct virtio_vsock_pkt *pkt)
> +				 struct sk_buff *skb)
>  {
>  	struct vsock_sock *vsk = vsock_sk(sk);
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	int err;
>  	int skerr;
>  
> -	switch (le16_to_cpu(pkt->hdr.op)) {
> +	switch (le16_to_cpu(hdr->op)) {
>  	case VIRTIO_VSOCK_OP_RESPONSE:
>  		sk->sk_state = TCP_ESTABLISHED;
>  		sk->sk_socket->state = SS_CONNECTED;
> @@ -1011,7 +990,7 @@ virtio_transport_recv_connecting(struct sock *sk,
>  	return 0;
>  
>  destroy:
> -	virtio_transport_reset(vsk, pkt);
> +	virtio_transport_reset(vsk, skb);
>  	sk->sk_state = TCP_CLOSE;
>  	sk->sk_err = skerr;
>  	sk_error_report(sk);
> @@ -1020,34 +999,38 @@ virtio_transport_recv_connecting(struct sock *sk,
>  
>  static void
>  virtio_transport_recv_enqueue(struct vsock_sock *vsk,
> -			      struct virtio_vsock_pkt *pkt)
> +			      struct sk_buff *skb)
>  {
>  	struct virtio_vsock_sock *vvs = vsk->trans;
> +	struct virtio_vsock_hdr *hdr;
>  	bool can_enqueue, free_pkt = false;
> +	u32 len;
>  
> -	pkt->len = le32_to_cpu(pkt->hdr.len);
> -	pkt->off = 0;
> +	hdr = vsock_hdr(skb);
> +	len = le32_to_cpu(hdr->len);
> +	vsock_metadata(skb)->off = 0;
>  
>  	spin_lock_bh(&vvs->rx_lock);
>  
> -	can_enqueue = virtio_transport_inc_rx_pkt(vvs, pkt);
> +	can_enqueue = virtio_transport_inc_rx_pkt(vvs, skb);
>  	if (!can_enqueue) {
>  		free_pkt = true;
>  		goto out;
>  	}
>  
> -	if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)
> +	if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SEQ_EOM)
>  		vvs->msg_count++;
>  
>  	/* Try to copy small packets into the buffer of last packet queued,
>  	 * to avoid wasting memory queueing the entire buffer with a small
>  	 * payload.
>  	 */
> -	if (pkt->len <= GOOD_COPY_LEN && !list_empty(&vvs->rx_queue)) {
> -		struct virtio_vsock_pkt *last_pkt;
> +	if (len <= GOOD_COPY_LEN && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> +		struct virtio_vsock_hdr *last_hdr;
> +		struct sk_buff *last_skb;
>  
> -		last_pkt = list_last_entry(&vvs->rx_queue,
> -					   struct virtio_vsock_pkt, list);
> +		last_skb = skb_peek_tail(&vvs->rx_queue);
> +		last_hdr = vsock_hdr(last_skb);
>  
>  		/* If there is space in the last packet queued, we copy the
>  		 * new packet in its buffer. We avoid this if the last packet
> @@ -1055,35 +1038,35 @@ virtio_transport_recv_enqueue(struct vsock_sock *vsk,
>  		 * delimiter of SEQPACKET message, so 'pkt' is the first packet
>  		 * of a new message.
>  		 */
> -		if ((pkt->len <= last_pkt->buf_len - last_pkt->len) &&
> -		    !(le32_to_cpu(last_pkt->hdr.flags) & VIRTIO_VSOCK_SEQ_EOM)) {
> -			memcpy(last_pkt->buf + last_pkt->len, pkt->buf,
> -			       pkt->len);
> -			last_pkt->len += pkt->len;
> +		if (skb->len < skb_tailroom(last_skb) &&
> +		    !(le32_to_cpu(last_hdr->flags) & VIRTIO_VSOCK_SEQ_EOR) &&
> +		    (vsock_hdr(skb)->type != VIRTIO_VSOCK_TYPE_DGRAM)) {
> +			memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
>  			free_pkt = true;
> -			last_pkt->hdr.flags |= pkt->hdr.flags;
> +			last_hdr->flags |= hdr->flags;
>  			goto out;
>  		}
>  	}
>  
> -	list_add_tail(&pkt->list, &vvs->rx_queue);
> +	__skb_queue_tail(&vvs->rx_queue, skb);
>  
>  out:
>  	spin_unlock_bh(&vvs->rx_lock);
>  	if (free_pkt)
> -		virtio_transport_free_pkt(pkt);
> +		kfree_skb(skb);
>  }
>  
>  static int
>  virtio_transport_recv_connected(struct sock *sk,
> -				struct virtio_vsock_pkt *pkt)
> +				struct sk_buff *skb)
>  {
>  	struct vsock_sock *vsk = vsock_sk(sk);
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	int err = 0;
>  
> -	switch (le16_to_cpu(pkt->hdr.op)) {
> +	switch (le16_to_cpu(hdr->op)) {
>  	case VIRTIO_VSOCK_OP_RW:
> -		virtio_transport_recv_enqueue(vsk, pkt);
> +		virtio_transport_recv_enqueue(vsk, skb);
>  		sk->sk_data_ready(sk);
>  		return err;
>  	case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
> @@ -1093,18 +1076,17 @@ virtio_transport_recv_connected(struct sock *sk,
>  		sk->sk_write_space(sk);
>  		break;
>  	case VIRTIO_VSOCK_OP_SHUTDOWN:
> -		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
> +		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
>  			vsk->peer_shutdown |= RCV_SHUTDOWN;
> -		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
> +		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
>  			vsk->peer_shutdown |= SEND_SHUTDOWN;
>  		if (vsk->peer_shutdown == SHUTDOWN_MASK &&
>  		    vsock_stream_has_data(vsk) <= 0 &&
>  		    !sock_flag(sk, SOCK_DONE)) {
>  			(void)virtio_transport_reset(vsk, NULL);
> -
>  			virtio_transport_do_close(vsk, true);
>  		}
> -		if (le32_to_cpu(pkt->hdr.flags))
> +		if (le32_to_cpu(vsock_hdr(skb)->flags))
>  			sk->sk_state_change(sk);
>  		break;
>  	case VIRTIO_VSOCK_OP_RST:
> @@ -1115,28 +1097,30 @@ virtio_transport_recv_connected(struct sock *sk,
>  		break;
>  	}
>  
> -	virtio_transport_free_pkt(pkt);
> +	kfree_skb(skb);
>  	return err;
>  }
>  
>  static void
>  virtio_transport_recv_disconnecting(struct sock *sk,
> -				    struct virtio_vsock_pkt *pkt)
> +				    struct sk_buff *skb)
>  {
>  	struct vsock_sock *vsk = vsock_sk(sk);
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  
> -	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
> +	if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
>  		virtio_transport_do_close(vsk, true);
>  }
>  
>  static int
>  virtio_transport_send_response(struct vsock_sock *vsk,
> -			       struct virtio_vsock_pkt *pkt)
> +			       struct sk_buff *skb)
>  {
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	struct virtio_vsock_pkt_info info = {
>  		.op = VIRTIO_VSOCK_OP_RESPONSE,
> -		.remote_cid = le64_to_cpu(pkt->hdr.src_cid),
> -		.remote_port = le32_to_cpu(pkt->hdr.src_port),
> +		.remote_cid = le64_to_cpu(hdr->src_cid),
> +		.remote_port = le32_to_cpu(hdr->src_port),
>  		.reply = true,
>  		.vsk = vsk,
>  	};
> @@ -1145,10 +1129,11 @@ virtio_transport_send_response(struct vsock_sock *vsk,
>  }
>  
>  static bool virtio_transport_space_update(struct sock *sk,
> -					  struct virtio_vsock_pkt *pkt)
> +					  struct sk_buff *skb)
>  {
>  	struct vsock_sock *vsk = vsock_sk(sk);
>  	struct virtio_vsock_sock *vvs = vsk->trans;
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	bool space_available;
>  
>  	/* Listener sockets are not associated with any transport, so we are
> @@ -1161,8 +1146,8 @@ static bool virtio_transport_space_update(struct sock *sk,
>  
>  	/* buf_alloc and fwd_cnt is always included in the hdr */
>  	spin_lock_bh(&vvs->tx_lock);
> -	vvs->peer_buf_alloc = le32_to_cpu(pkt->hdr.buf_alloc);
> -	vvs->peer_fwd_cnt = le32_to_cpu(pkt->hdr.fwd_cnt);
> +	vvs->peer_buf_alloc = le32_to_cpu(hdr->buf_alloc);
> +	vvs->peer_fwd_cnt = le32_to_cpu(hdr->fwd_cnt);
>  	space_available = virtio_transport_has_space(vsk);
>  	spin_unlock_bh(&vvs->tx_lock);
>  	return space_available;
> @@ -1170,27 +1155,28 @@ static bool virtio_transport_space_update(struct sock *sk,
>  
>  /* Handle server socket */
>  static int
> -virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
> +virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>  			     struct virtio_transport *t)
>  {
>  	struct vsock_sock *vsk = vsock_sk(sk);
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	struct vsock_sock *vchild;
>  	struct sock *child;
>  	int ret;
>  
> -	if (le16_to_cpu(pkt->hdr.op) != VIRTIO_VSOCK_OP_REQUEST) {
> -		virtio_transport_reset_no_sock(t, pkt);
> +	if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
> +		virtio_transport_reset_no_sock(t, skb);
>  		return -EINVAL;
>  	}
>  
>  	if (sk_acceptq_is_full(sk)) {
> -		virtio_transport_reset_no_sock(t, pkt);
> +		virtio_transport_reset_no_sock(t, skb);
>  		return -ENOMEM;
>  	}
>  
>  	child = vsock_create_connected(sk);
>  	if (!child) {
> -		virtio_transport_reset_no_sock(t, pkt);
> +		virtio_transport_reset_no_sock(t, skb);
>  		return -ENOMEM;
>  	}
>  
> @@ -1201,10 +1187,10 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
>  	child->sk_state = TCP_ESTABLISHED;
>  
>  	vchild = vsock_sk(child);
> -	vsock_addr_init(&vchild->local_addr, le64_to_cpu(pkt->hdr.dst_cid),
> -			le32_to_cpu(pkt->hdr.dst_port));
> -	vsock_addr_init(&vchild->remote_addr, le64_to_cpu(pkt->hdr.src_cid),
> -			le32_to_cpu(pkt->hdr.src_port));
> +	vsock_addr_init(&vchild->local_addr, le64_to_cpu(hdr->dst_cid),
> +			le32_to_cpu(hdr->dst_port));
> +	vsock_addr_init(&vchild->remote_addr, le64_to_cpu(hdr->src_cid),
> +			le32_to_cpu(hdr->src_port));
>  
>  	ret = vsock_assign_transport(vchild, vsk);
>  	/* Transport assigned (looking at remote_addr) must be the same
> @@ -1212,17 +1198,17 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt,
>  	 */
>  	if (ret || vchild->transport != &t->transport) {
>  		release_sock(child);
> -		virtio_transport_reset_no_sock(t, pkt);
> +		virtio_transport_reset_no_sock(t, skb);
>  		sock_put(child);
>  		return ret;
>  	}
>  
> -	if (virtio_transport_space_update(child, pkt))
> +	if (virtio_transport_space_update(child, skb))
>  		child->sk_write_space(child);
>  
>  	vsock_insert_connected(vchild);
>  	vsock_enqueue_accept(sk, child);
> -	virtio_transport_send_response(vchild, pkt);
> +	virtio_transport_send_response(vchild, skb);
>  
>  	release_sock(child);
>  
> @@ -1240,29 +1226,30 @@ static bool virtio_transport_valid_type(u16 type)
>   * lock.
>   */
>  void virtio_transport_recv_pkt(struct virtio_transport *t,
> -			       struct virtio_vsock_pkt *pkt)
> +			       struct sk_buff *skb)
>  {
>  	struct sockaddr_vm src, dst;
>  	struct vsock_sock *vsk;
>  	struct sock *sk;
>  	bool space_available;
> +	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  
> -	vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
> -			le32_to_cpu(pkt->hdr.src_port));
> -	vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
> -			le32_to_cpu(pkt->hdr.dst_port));
> +	vsock_addr_init(&src, le64_to_cpu(hdr->src_cid),
> +			le32_to_cpu(hdr->src_port));
> +	vsock_addr_init(&dst, le64_to_cpu(hdr->dst_cid),
> +			le32_to_cpu(hdr->dst_port));
>  
>  	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
>  					dst.svm_cid, dst.svm_port,
> -					le32_to_cpu(pkt->hdr.len),
> -					le16_to_cpu(pkt->hdr.type),
> -					le16_to_cpu(pkt->hdr.op),
> -					le32_to_cpu(pkt->hdr.flags),
> -					le32_to_cpu(pkt->hdr.buf_alloc),
> -					le32_to_cpu(pkt->hdr.fwd_cnt));
> -
> -	if (!virtio_transport_valid_type(le16_to_cpu(pkt->hdr.type))) {
> -		(void)virtio_transport_reset_no_sock(t, pkt);
> +					le32_to_cpu(hdr->len),
> +					le16_to_cpu(hdr->type),
> +					le16_to_cpu(hdr->op),
> +					le32_to_cpu(hdr->flags),
> +					le32_to_cpu(hdr->buf_alloc),
> +					le32_to_cpu(hdr->fwd_cnt));
> +
> +	if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
> +		(void)virtio_transport_reset_no_sock(t, skb);
>  		goto free_pkt;
>  	}
>  
> @@ -1273,13 +1260,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  	if (!sk) {
>  		sk = vsock_find_bound_socket(&dst);
>  		if (!sk) {
> -			(void)virtio_transport_reset_no_sock(t, pkt);
> +			(void)virtio_transport_reset_no_sock(t, skb);
>  			goto free_pkt;
>  		}
>  	}
>  
> -	if (virtio_transport_get_type(sk) != le16_to_cpu(pkt->hdr.type)) {
> -		(void)virtio_transport_reset_no_sock(t, pkt);
> +	if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
> +		(void)virtio_transport_reset_no_sock(t, skb);
>  		sock_put(sk);
>  		goto free_pkt;
>  	}
> @@ -1290,13 +1277,13 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  
>  	/* Check if sk has been closed before lock_sock */
>  	if (sock_flag(sk, SOCK_DONE)) {
> -		(void)virtio_transport_reset_no_sock(t, pkt);
> +		(void)virtio_transport_reset_no_sock(t, skb);
>  		release_sock(sk);
>  		sock_put(sk);
>  		goto free_pkt;
>  	}
>  
> -	space_available = virtio_transport_space_update(sk, pkt);
> +	space_available = virtio_transport_space_update(sk, skb);
>  
>  	/* Update CID in case it has changed after a transport reset event */
>  	if (vsk->local_addr.svm_cid != VMADDR_CID_ANY)
> @@ -1307,23 +1294,23 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  
>  	switch (sk->sk_state) {
>  	case TCP_LISTEN:
> -		virtio_transport_recv_listen(sk, pkt, t);
> -		virtio_transport_free_pkt(pkt);
> +		virtio_transport_recv_listen(sk, skb, t);
> +		kfree_skb(skb);
>  		break;
>  	case TCP_SYN_SENT:
> -		virtio_transport_recv_connecting(sk, pkt);
> -		virtio_transport_free_pkt(pkt);
> +		virtio_transport_recv_connecting(sk, skb);
> +		kfree_skb(skb);
>  		break;
>  	case TCP_ESTABLISHED:
> -		virtio_transport_recv_connected(sk, pkt);
> +		virtio_transport_recv_connected(sk, skb);
>  		break;
>  	case TCP_CLOSING:
> -		virtio_transport_recv_disconnecting(sk, pkt);
> -		virtio_transport_free_pkt(pkt);
> +		virtio_transport_recv_disconnecting(sk, skb);
> +		kfree_skb(skb);
>  		break;
>  	default:
> -		(void)virtio_transport_reset_no_sock(t, pkt);
> -		virtio_transport_free_pkt(pkt);
> +		(void)virtio_transport_reset_no_sock(t, skb);
> +		kfree_skb(skb);
>  		break;
>  	}
>  
> @@ -1336,16 +1323,42 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  	return;
>  
>  free_pkt:
> -	virtio_transport_free_pkt(pkt);
> +	kfree(skb);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>  
> -void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> +/* Remove skbs found in a queue that have a vsk that matches.
> + *
> + * Each skb is freed.
> + *
> + * Returns the count of skbs that were reply packets.
> + */
> +int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *queue)
>  {
> -	kfree(pkt->buf);
> -	kfree(pkt);
> +	int cnt = 0;
> +	struct sk_buff *skb, *tmp;
> +	struct sk_buff_head freeme;
> +
> +	skb_queue_head_init(&freeme);
> +
> +	spin_lock_bh(&queue->lock);
> +	skb_queue_walk_safe(queue, skb, tmp) {
> +		if (vsock_sk(skb->sk) != vsk)
> +			continue;
> +
> +		__skb_unlink(skb, queue);
> +		skb_queue_tail(&freeme, skb);
> +
> +		if (vsock_metadata(skb)->flags & VIRTIO_VSOCK_METADATA_FLAGS_REPLY)
> +			cnt++;
> +	}
> +	spin_unlock_bh(&queue->lock);
> +
> +	skb_queue_purge(&freeme);
> +
> +	return cnt;
>  }
> -EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> +EXPORT_SYMBOL_GPL(virtio_transport_purge_skbs);
>  
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR("Asias He");
> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> index 169a8cf65b39..906f7cdff65e 100644
> --- a/net/vmw_vsock/vsock_loopback.c
> +++ b/net/vmw_vsock/vsock_loopback.c
> @@ -16,7 +16,7 @@ struct vsock_loopback {
>  	struct workqueue_struct *workqueue;
>  
>  	spinlock_t pkt_list_lock; /* protects pkt_list */
> -	struct list_head pkt_list;
> +	struct sk_buff_head pkt_queue;
>  	struct work_struct pkt_work;
>  };
>  
> @@ -27,13 +27,13 @@ static u32 vsock_loopback_get_local_cid(void)
>  	return VMADDR_CID_LOCAL;
>  }
>  
> -static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
> +static int vsock_loopback_send_pkt(struct sk_buff *skb)
>  {
>  	struct vsock_loopback *vsock = &the_vsock_loopback;
> -	int len = pkt->len;
> +	int len = skb->len;
>  
>  	spin_lock_bh(&vsock->pkt_list_lock);
> -	list_add_tail(&pkt->list, &vsock->pkt_list);
> +	skb_queue_tail(&vsock->pkt_queue, skb);
>  	spin_unlock_bh(&vsock->pkt_list_lock);
>  
>  	queue_work(vsock->workqueue, &vsock->pkt_work);
> @@ -44,21 +44,8 @@ static int vsock_loopback_send_pkt(struct virtio_vsock_pkt *pkt)
>  static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
>  {
>  	struct vsock_loopback *vsock = &the_vsock_loopback;
> -	struct virtio_vsock_pkt *pkt, *n;
> -	LIST_HEAD(freeme);
>  
> -	spin_lock_bh(&vsock->pkt_list_lock);
> -	list_for_each_entry_safe(pkt, n, &vsock->pkt_list, list) {
> -		if (pkt->vsk != vsk)
> -			continue;
> -		list_move(&pkt->list, &freeme);
> -	}
> -	spin_unlock_bh(&vsock->pkt_list_lock);
> -
> -	list_for_each_entry_safe(pkt, n, &freeme, list) {
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> +	virtio_transport_purge_skbs(vsk, &vsock->pkt_queue);
>  
>  	return 0;
>  }
> @@ -121,20 +108,20 @@ static void vsock_loopback_work(struct work_struct *work)
>  {
>  	struct vsock_loopback *vsock =
>  		container_of(work, struct vsock_loopback, pkt_work);
> -	LIST_HEAD(pkts);
> +	struct sk_buff_head pkts;
> +
> +	skb_queue_head_init(&pkts);
>  
>  	spin_lock_bh(&vsock->pkt_list_lock);
> -	list_splice_init(&vsock->pkt_list, &pkts);
> +	skb_queue_splice_init(&vsock->pkt_queue, &pkts);
>  	spin_unlock_bh(&vsock->pkt_list_lock);
>  
> -	while (!list_empty(&pkts)) {
> -		struct virtio_vsock_pkt *pkt;
> +	while (!skb_queue_empty(&pkts)) {
> +		struct sk_buff *skb;
>  
> -		pkt = list_first_entry(&pkts, struct virtio_vsock_pkt, list);
> -		list_del_init(&pkt->list);
> -
> -		virtio_transport_deliver_tap_pkt(pkt);
> -		virtio_transport_recv_pkt(&loopback_transport, pkt);
> +		skb = skb_dequeue(&pkts);
> +		virtio_transport_deliver_tap_pkt(skb);
> +		virtio_transport_recv_pkt(&loopback_transport, skb);
>  	}
>  }
>  
> @@ -148,7 +135,7 @@ static int __init vsock_loopback_init(void)
>  		return -ENOMEM;
>  
>  	spin_lock_init(&vsock->pkt_list_lock);
> -	INIT_LIST_HEAD(&vsock->pkt_list);
> +	skb_queue_head_init(&vsock->pkt_queue);
>  	INIT_WORK(&vsock->pkt_work, vsock_loopback_work);
>  
>  	ret = vsock_core_register(&loopback_transport.transport,
> @@ -166,19 +153,13 @@ static int __init vsock_loopback_init(void)
>  static void __exit vsock_loopback_exit(void)
>  {
>  	struct vsock_loopback *vsock = &the_vsock_loopback;
> -	struct virtio_vsock_pkt *pkt;
>  
>  	vsock_core_unregister(&loopback_transport.transport);
>  
>  	flush_work(&vsock->pkt_work);
>  
>  	spin_lock_bh(&vsock->pkt_list_lock);
> -	while (!list_empty(&vsock->pkt_list)) {
> -		pkt = list_first_entry(&vsock->pkt_list,
> -				       struct virtio_vsock_pkt, list);
> -		list_del(&pkt->list);
> -		virtio_transport_free_pkt(pkt);
> -	}
> +	skb_queue_purge(&vsock->pkt_queue);
>  	spin_unlock_bh(&vsock->pkt_list_lock);
>  
>  	destroy_workqueue(vsock->workqueue);
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
                     ` (2 preceding siblings ...)
  2022-08-16  2:16   ` kernel test robot
@ 2022-08-16  2:30   ` Bobby Eshleman
  2022-08-17  5:28     ` [virtio-dev] " Arseniy Krasnov
  2022-09-26 13:21   ` Stefano Garzarella
  4 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:30 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:05AM -0700, Bobby Eshleman wrote:
> This commit allows vsock implementations to return errors
> to the socket layer other than -ENOMEM. One immediate effect
> of this is that upon the sk_sndbuf threshold being reached -EAGAIN
> will be returned and userspace may throttle appropriately.
> 
> Resultingly, a known issue with uperf is resolved[1].
> 
> Additionally, to preserve legacy behavior for non-virtio
> implementations, hyperv/vmci force errors to be -ENOMEM so that behavior
> is unchanged.
> 
> [1]: https://gitlab.com/vsock/vsock/-/issues/1
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  include/linux/virtio_vsock.h            | 3 +++
>  net/vmw_vsock/af_vsock.c                | 3 ++-
>  net/vmw_vsock/hyperv_transport.c        | 2 +-
>  net/vmw_vsock/virtio_transport_common.c | 3 ---
>  net/vmw_vsock/vmci_transport.c          | 9 ++++++++-
>  5 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 17ed01466875..9a37eddbb87a 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -8,6 +8,9 @@
>  #include <net/sock.h>
>  #include <net/af_vsock.h>
>  
> +/* Threshold for detecting small packets to copy */
> +#define GOOD_COPY_LEN  128
> +
>  enum virtio_vsock_metadata_flags {
>  	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
>  	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index e348b2d09eac..1893f8aafa48 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -1844,8 +1844,9 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
>  			written = transport->stream_enqueue(vsk,
>  					msg, len - total_written);
>  		}
> +
>  		if (written < 0) {
> -			err = -ENOMEM;
> +			err = written;
>  			goto out_err;
>  		}
>  
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index fd98229e3db3..e99aea571f6f 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -687,7 +687,7 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
>  	if (bytes_written)
>  		ret = bytes_written;
>  	kfree(send_buf);
> -	return ret;
> +	return ret < 0 ? -ENOMEM : ret;
>  }
>  
>  static s64 hvs_stream_has_data(struct vsock_sock *vsk)
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 920578597bb9..d5780599fe93 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -23,9 +23,6 @@
>  /* How long to wait for graceful shutdown of a connection */
>  #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
>  
> -/* Threshold for detecting small packets to copy */
> -#define GOOD_COPY_LEN  128
> -
>  static const struct virtio_transport *
>  virtio_transport_get_ops(struct vsock_sock *vsk)
>  {
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index b14f0ed7427b..c927a90dc859 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -1838,7 +1838,14 @@ static ssize_t vmci_transport_stream_enqueue(
>  	struct msghdr *msg,
>  	size_t len)
>  {
> -	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
> +	int err;
> +
> +	err = vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
> +
> +	if (err < 0)
> +		err = -ENOMEM;
> +
> +	return err;
>  }
>  
>  static s64 vmci_transport_stream_has_data(struct vsock_sock *vsk)
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-15 17:56 ` [PATCH 3/6] vsock: add netdev to vhost/virtio vsock Bobby Eshleman
@ 2022-08-16  2:31   ` Bobby Eshleman
  2022-08-16 16:38   ` Michael S. Tsirkin
  2022-09-06 10:58   ` Michael S. Tsirkin
  2 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:31 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> In order to support usage of qdisc on vsock traffic, this commit
> introduces a struct net_device to vhost and virtio vsock.
> 
> Two new devices are created, vhost-vsock for vhost and virtio-vsock
> for virtio. The devices are attached to the respective transports.
> 
> To bypass the usage of the device, the user may "down" the associated
> network interface using common tools. For example, "ip link set dev
> virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> simply using the FIFO logic of the prior implementation.
> 
> For both hosts and guests, there is one device for all G2H vsock sockets
> and one device for all H2G vsock sockets. This makes sense for guests
> because the driver only supports a single vsock channel (one pair of
> TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> seem ideal for some workloads. However, it is possible to use a
> multi-queue qdisc, where a given queue is responsible for a range of
> sockets. This seems to be a better solution than having one device per
> socket, which may yield a very large number of devices and qdiscs, all
> of which are dynamically being created and destroyed. Because of this
> dynamism, it would also require a complex policy management daemon, as
> devices would constantly be spun up and down as sockets were created and
> destroyed. To avoid this, one device and qdisc also applies to all H2G
> sockets.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   |  19 +++-
>  include/linux/virtio_vsock.h            |  10 +++
>  net/vmw_vsock/virtio_transport.c        |  19 +++-
>  net/vmw_vsock/virtio_transport_common.c | 112 +++++++++++++++++++++++-
>  4 files changed, 152 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index f8601d93d94d..b20ddec2664b 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -927,13 +927,30 @@ static int __init vhost_vsock_init(void)
>  				  VSOCK_TRANSPORT_F_H2G);
>  	if (ret < 0)
>  		return ret;
> -	return misc_register(&vhost_vsock_misc);
> +
> +	ret = virtio_transport_init(&vhost_transport, "vhost-vsock");
> +	if (ret < 0)
> +		goto out_unregister;
> +
> +	ret = misc_register(&vhost_vsock_misc);
> +	if (ret < 0)
> +		goto out_transport_exit;
> +	return ret;
> +
> +out_transport_exit:
> +	virtio_transport_exit(&vhost_transport);
> +
> +out_unregister:
> +	vsock_core_unregister(&vhost_transport.transport);
> +	return ret;
> +
>  };
>  
>  static void __exit vhost_vsock_exit(void)
>  {
>  	misc_deregister(&vhost_vsock_misc);
>  	vsock_core_unregister(&vhost_transport.transport);
> +	virtio_transport_exit(&vhost_transport);
>  };
>  
>  module_init(vhost_vsock_init);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 9a37eddbb87a..5d7e7fbd75f8 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -91,10 +91,20 @@ struct virtio_transport {
>  	/* This must be the first field */
>  	struct vsock_transport transport;
>  
> +	/* Used almost exclusively for qdisc */
> +	struct net_device *dev;
> +
>  	/* Takes ownership of the packet */
>  	int (*send_pkt)(struct sk_buff *skb);
>  };
>  
> +int
> +virtio_transport_init(struct virtio_transport *t,
> +		      const char *name);
> +
> +void
> +virtio_transport_exit(struct virtio_transport *t);
> +
>  ssize_t
>  virtio_transport_stream_dequeue(struct vsock_sock *vsk,
>  				struct msghdr *msg,
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 3bb293fd8607..c6212eb38d3c 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -131,7 +131,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		 * the vq
>  		 */
>  		if (ret < 0) {
> -			skb_queue_head(&vsock->send_pkt_queue, skb);
> +			spin_lock_bh(&vsock->send_pkt_queue.lock);
> +			__skb_queue_head(&vsock->send_pkt_queue, skb);
> +			spin_unlock_bh(&vsock->send_pkt_queue.lock);
>  			break;
>  		}
>  
> @@ -676,7 +678,9 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
>  		kfree_skb(skb);
>  	mutex_unlock(&vsock->tx_lock);
>  
> -	skb_queue_purge(&vsock->send_pkt_queue);
> +	spin_lock_bh(&vsock->send_pkt_queue.lock);
> +	__skb_queue_purge(&vsock->send_pkt_queue);
> +	spin_unlock_bh(&vsock->send_pkt_queue.lock);
>  
>  	/* Delete virtqueues and flush outstanding callbacks if any */
>  	vdev->config->del_vqs(vdev);
> @@ -760,6 +764,8 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
>  	flush_work(&vsock->event_work);
>  	flush_work(&vsock->send_pkt_work);
>  
> +	virtio_transport_exit(&virtio_transport);
> +
>  	mutex_unlock(&the_virtio_vsock_mutex);
>  
>  	kfree(vsock);
> @@ -844,12 +850,18 @@ static int __init virtio_vsock_init(void)
>  	if (ret)
>  		goto out_wq;
>  
> -	ret = register_virtio_driver(&virtio_vsock_driver);
> +	ret = virtio_transport_init(&virtio_transport, "virtio-vsock");
>  	if (ret)
>  		goto out_vci;
>  
> +	ret = register_virtio_driver(&virtio_vsock_driver);
> +	if (ret)
> +		goto out_transport;
> +
>  	return 0;
>  
> +out_transport:
> +	virtio_transport_exit(&virtio_transport);
>  out_vci:
>  	vsock_core_unregister(&virtio_transport.transport);
>  out_wq:
> @@ -861,6 +873,7 @@ static void __exit virtio_vsock_exit(void)
>  {
>  	unregister_virtio_driver(&virtio_vsock_driver);
>  	vsock_core_unregister(&virtio_transport.transport);
> +	virtio_transport_exit(&virtio_transport);
>  	destroy_workqueue(virtio_vsock_workqueue);
>  }
>  
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index d5780599fe93..bdf16fff054f 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -16,6 +16,7 @@
>  
>  #include <net/sock.h>
>  #include <net/af_vsock.h>
> +#include <net/pkt_sched.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vsock_virtio_transport_common.h>
> @@ -23,6 +24,93 @@
>  /* How long to wait for graceful shutdown of a connection */
>  #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
>  
> +struct virtio_transport_priv {
> +	struct virtio_transport *trans;
> +};
> +
> +static netdev_tx_t virtio_transport_start_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct virtio_transport *t =
> +		((struct virtio_transport_priv *)netdev_priv(dev))->trans;
> +	int ret;
> +
> +	ret = t->send_pkt(skb);
> +	if (unlikely(ret == -ENODEV))
> +		return NETDEV_TX_BUSY;
> +
> +	return NETDEV_TX_OK;
> +}
> +
> +const struct net_device_ops virtio_transport_netdev_ops = {
> +	.ndo_start_xmit = virtio_transport_start_xmit,
> +};
> +
> +static void virtio_transport_setup(struct net_device *dev)
> +{
> +	dev->netdev_ops = &virtio_transport_netdev_ops;
> +	dev->needs_free_netdev = true;
> +	dev->flags = IFF_NOARP;
> +	dev->mtu = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> +	dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> +}
> +
> +static int ifup(struct net_device *dev)
> +{
> +	int ret;
> +
> +	rtnl_lock();
> +	ret = dev_open(dev, NULL) ? -ENOMEM : 0;
> +	rtnl_unlock();
> +
> +	return ret;
> +}
> +
> +/* virtio_transport_init - initialize a virtio vsock transport layer
> + *
> + * @t: ptr to the virtio transport struct to initialize
> + * @name: the name of the net_device to be created.
> + *
> + * Return 0 on success, otherwise negative errno.
> + */
> +int virtio_transport_init(struct virtio_transport *t, const char *name)
> +{
> +	struct virtio_transport_priv *priv;
> +	int ret;
> +
> +	t->dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN, virtio_transport_setup);
> +	if (!t->dev)
> +		return -ENOMEM;
> +
> +	priv = netdev_priv(t->dev);
> +	priv->trans = t;
> +
> +	ret = register_netdev(t->dev);
> +	if (ret < 0)
> +		goto out_free_netdev;
> +
> +	ret = ifup(t->dev);
> +	if (ret < 0)
> +		goto out_unregister_netdev;
> +
> +	return 0;
> +
> +out_unregister_netdev:
> +	unregister_netdev(t->dev);
> +
> +out_free_netdev:
> +	free_netdev(t->dev);
> +
> +	return ret;
> +}
> +
> +void virtio_transport_exit(struct virtio_transport *t)
> +{
> +	if (t->dev) {
> +		unregister_netdev(t->dev);
> +		free_netdev(t->dev);
> +	}
> +}
> +
>  static const struct virtio_transport *
>  virtio_transport_get_ops(struct vsock_sock *vsk)
>  {
> @@ -147,6 +235,24 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>  }
>  
> +/* Return pkt->len on success, otherwise negative errno */
> +static int virtio_transport_send_pkt(const struct virtio_transport *t, struct sk_buff *skb)
> +{
> +	int ret;
> +	int len = skb->len;
> +
> +	if (unlikely(!t->dev || !(t->dev->flags & IFF_UP)))
> +		return t->send_pkt(skb);
> +
> +	skb->dev = t->dev;
> +	ret = dev_queue_xmit(skb);
> +
> +	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN))
> +		return len;
> +
> +	return -ENOMEM;
> +}
> +
>  /* This function can only be used on connecting/connected sockets,
>   * since a socket assigned to a transport is required.
>   *
> @@ -202,9 +308,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  
>  	virtio_transport_inc_tx_pkt(vvs, skb);
>  
> -	err = t_ops->send_pkt(skb);
> -
> -	return err < 0 ? -ENOMEM : err;
> +	return virtio_transport_send_pkt(t_ops, skb);
>  }
>  
>  static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> @@ -834,7 +938,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  		return -ENOTCONN;
>  	}
>  
> -	return t->send_pkt(reply);
> +	return virtio_transport_send_pkt(t, reply);
>  }
>  
>  /* This function should be called with sk_lock held and SOCK_DONE set */
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
  2022-08-15 17:56 ` [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit Bobby Eshleman
@ 2022-08-16  2:31   ` Bobby Eshleman
  2022-09-26 13:17   ` Stefano Garzarella
  1 sibling, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:31 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:07AM -0700, Bobby Eshleman wrote:
> This commit adds a feature bit for virtio vsock to support datagrams.
> 
> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c             | 3 ++-
>  include/uapi/linux/virtio_vsock.h | 1 +
>  net/vmw_vsock/virtio_transport.c  | 8 ++++++--
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index b20ddec2664b..a5d1bdb786fe 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -32,7 +32,8 @@
>  enum {
>  	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>  			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> -			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> +			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> +			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
>  };
>  
>  enum {
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 64738838bee5..857df3a3a70d 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -40,6 +40,7 @@
>  
>  /* The feature bitmap for virtio vsock */
>  #define VIRTIO_VSOCK_F_SEQPACKET	1	/* SOCK_SEQPACKET supported */
> +#define VIRTIO_VSOCK_F_DGRAM		2	/* Host support dgram vsock */
>  
>  struct virtio_vsock_config {
>  	__le64 guest_cid;
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index c6212eb38d3c..073314312683 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /* forward declaration */
>  struct virtio_vsock {
>  	struct virtio_device *vdev;
>  	struct virtqueue *vqs[VSOCK_VQ_MAX];
> +	bool has_dgram;
>  
>  	/* Virtqueue processing is deferred to a workqueue */
>  	struct work_struct tx_work;
> @@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>  	}
>  
>  	vsock->vdev = vdev;
> -
>  	vsock->rx_buf_nr = 0;
>  	vsock->rx_buf_max_nr = 0;
>  	atomic_set(&vsock->queued_replies, 0);
> @@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
>  		vsock->seqpacket_allow = true;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> +		vsock->has_dgram = true;
> +
>  	vdev->priv = vsock;
>  
>  	ret = virtio_vsock_vqs_init(vsock);
> @@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
>  };
>  
>  static unsigned int features[] = {
> -	VIRTIO_VSOCK_F_SEQPACKET
> +	VIRTIO_VSOCK_F_SEQPACKET,
> +	VIRTIO_VSOCK_F_DGRAM
>  };
>  
>  static struct virtio_driver virtio_vsock_driver = {
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-15 17:56 ` [PATCH 5/6] virtio/vsock: add support for dgram Bobby Eshleman
  2022-08-15 21:02   ` kernel test robot
@ 2022-08-16  2:32   ` Bobby Eshleman
  2022-08-17  5:01     ` [virtio-dev] " Arseniy Krasnov
  1 sibling, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:32 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> This patch supports dgram in virtio and on the vhost side.
> 
> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   |   2 +-
>  include/net/af_vsock.h                  |   2 +
>  include/uapi/linux/virtio_vsock.h       |   1 +
>  net/vmw_vsock/af_vsock.c                |  26 +++-
>  net/vmw_vsock/virtio_transport.c        |   2 +-
>  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
>  6 files changed, 186 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index a5d1bdb786fe..3dc72a5647ca 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
>  	int ret;
>  
>  	ret = vsock_core_register(&vhost_transport.transport,
> -				  VSOCK_TRANSPORT_F_H2G);
> +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
>  	if (ret < 0)
>  		return ret;
>  
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 1c53c4c4d88f..37e55c81e4df 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -78,6 +78,8 @@ struct vsock_sock {
>  s64 vsock_stream_has_data(struct vsock_sock *vsk);
>  s64 vsock_stream_has_space(struct vsock_sock *vsk);
>  struct sock *vsock_create_connected(struct sock *parent);
> +int vsock_bind_stream(struct vsock_sock *vsk,
> +		      struct sockaddr_vm *addr);
>  
>  /**** TRANSPORT ****/
>  
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 857df3a3a70d..0975b9c88292 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
>  enum virtio_vsock_type {
>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
>  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
>  };
>  
>  enum virtio_vsock_op {
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 1893f8aafa48..87e4ae1866d3 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>  	return 0;
>  }
>  
> +int vsock_bind_stream(struct vsock_sock *vsk,
> +		      struct sockaddr_vm *addr)
> +{
> +	int retval;
> +
> +	spin_lock_bh(&vsock_table_lock);
> +	retval = __vsock_bind_connectible(vsk, addr);
> +	spin_unlock_bh(&vsock_table_lock);
> +
> +	return retval;
> +}
> +EXPORT_SYMBOL(vsock_bind_stream);
> +
>  static int __vsock_bind_dgram(struct vsock_sock *vsk,
>  			      struct sockaddr_vm *addr)
>  {
> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>  	}
>  
>  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> -		if (t_dgram) {
> -			err = -EBUSY;
> -			goto err_busy;
> +		/* TODO: always chose the G2H variant over others, support nesting later */
> +		if (features & VSOCK_TRANSPORT_F_G2H) {
> +			if (t_dgram)
> +				pr_warn("virtio_vsock: t_dgram already set\n");
> +			t_dgram = t;
> +		}
> +
> +		if (!t_dgram) {
> +			t_dgram = t;
>  		}
> -		t_dgram = t;
>  	}
>  
>  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 073314312683..d4526ca462d2 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
>  		return -ENOMEM;
>  
>  	ret = vsock_core_register(&virtio_transport.transport,
> -				  VSOCK_TRANSPORT_F_G2H);
> +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
>  	if (ret)
>  		goto out_wq;
>  
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index bdf16fff054f..aedb48728677 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>  
>  static u16 virtio_transport_get_type(struct sock *sk)
>  {
> -	if (sk->sk_type == SOCK_STREAM)
> +	if (sk->sk_type == SOCK_DGRAM)
> +		return VIRTIO_VSOCK_TYPE_DGRAM;
> +	else if (sk->sk_type == SOCK_STREAM)
>  		return VIRTIO_VSOCK_TYPE_STREAM;
>  	else
>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	vvs = vsk->trans;
>  
>  	/* we can send less than pkt_len bytes */
> -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> +			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> +		else
> +			return 0;
> +	}
>  
> -	/* virtio_transport_get_credit might return less than pkt_len credit */
> -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> +		/* virtio_transport_get_credit might return less than pkt_len credit */
> +		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>  
> -	/* Do not send zero length OP_RW pkt */
> -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> -		return pkt_len;
> +		/* Do not send zero length OP_RW pkt */
> +		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> +			return pkt_len;
> +	}
>  
>  	skb = virtio_transport_alloc_skb(info, pkt_len,
>  					 src_cid, src_port,
>  					 dst_cid, dst_port,
>  					 &err);
>  	if (!skb) {
> -		virtio_transport_put_credit(vvs, pkt_len);
> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> +			virtio_transport_put_credit(vvs, pkt_len);
>  		return err;
>  	}
>  
> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
>  
> +static ssize_t
> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> +				  struct msghdr *msg, size_t len)
> +{
> +	struct virtio_vsock_sock *vvs = vsk->trans;
> +	struct sk_buff *skb;
> +	size_t total = 0;
> +	u32 free_space;
> +	int err = -EFAULT;
> +
> +	spin_lock_bh(&vvs->rx_lock);
> +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> +		skb = __skb_dequeue(&vvs->rx_queue);
> +
> +		total = len;
> +		if (total > skb->len - vsock_metadata(skb)->off)
> +			total = skb->len - vsock_metadata(skb)->off;
> +		else if (total < skb->len - vsock_metadata(skb)->off)
> +			msg->msg_flags |= MSG_TRUNC;
> +
> +		/* sk_lock is held by caller so no one else can dequeue.
> +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
> +		 */
> +		spin_unlock_bh(&vvs->rx_lock);
> +
> +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> +		if (err)
> +			return err;
> +
> +		spin_lock_bh(&vvs->rx_lock);
> +
> +		virtio_transport_dec_rx_pkt(vvs, skb);
> +		consume_skb(skb);
> +	}
> +
> +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> +
> +	spin_unlock_bh(&vvs->rx_lock);
> +
> +	if (total > 0 && msg->msg_name) {
> +		/* Provide the address of the sender. */
> +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> +
> +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> +				le32_to_cpu(vsock_hdr(skb)->src_port));
> +		msg->msg_namelen = sizeof(*vm_addr);
> +	}
> +	return total;
> +}
> +
> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> +{
> +	return virtio_transport_stream_has_data(vsk);
> +}
> +
>  int
>  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
>  				   struct msghdr *msg,
> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
>  			       struct msghdr *msg,
>  			       size_t len, int flags)
>  {
> -	return -EOPNOTSUPP;
> +	struct sock *sk;
> +	size_t err = 0;
> +	long timeout;
> +
> +	DEFINE_WAIT(wait);
> +
> +	sk = &vsk->sk;
> +	err = 0;
> +
> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> +		return -EOPNOTSUPP;
> +
> +	lock_sock(sk);
> +
> +	if (!len)
> +		goto out;
> +
> +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> +
> +	while (1) {
> +		s64 ready;
> +
> +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> +		ready = virtio_transport_dgram_has_data(vsk);
> +
> +		if (ready == 0) {
> +			if (timeout == 0) {
> +				err = -EAGAIN;
> +				finish_wait(sk_sleep(sk), &wait);
> +				break;
> +			}
> +
> +			release_sock(sk);
> +			timeout = schedule_timeout(timeout);
> +			lock_sock(sk);
> +
> +			if (signal_pending(current)) {
> +				err = sock_intr_errno(timeout);
> +				finish_wait(sk_sleep(sk), &wait);
> +				break;
> +			} else if (timeout == 0) {
> +				err = -EAGAIN;
> +				finish_wait(sk_sleep(sk), &wait);
> +				break;
> +			}
> +		} else {
> +			finish_wait(sk_sleep(sk), &wait);
> +
> +			if (ready < 0) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +
> +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> +			break;
> +		}
> +	}
> +out:
> +	release_sock(sk);
> +	return err;
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
>  
> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>  				struct sockaddr_vm *addr)
>  {
> -	return -EOPNOTSUPP;
> +	return vsock_bind_stream(vsk, addr);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>  
>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
>  {
> -	return false;
> +	return true;
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
>  
> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>  			       struct msghdr *msg,
>  			       size_t dgram_len)
>  {
> -	return -EOPNOTSUPP;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_RW,
> +		.msg = msg,
> +		.pkt_len = dgram_len,
> +		.vsk = vsk,
> +		.remote_cid = remote_addr->svm_cid,
> +		.remote_port = remote_addr->svm_port,
> +	};
> +
> +	return virtio_transport_send_pkt_info(vsk, &info);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>  
> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
>  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>  	int err = 0;
>  
> +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> +		virtio_transport_recv_enqueue(vsk, skb);
> +		sk->sk_data_ready(sk);
> +		return err;
> +	}
> +
>  	switch (le16_to_cpu(hdr->op)) {
>  	case VIRTIO_VSOCK_OP_RW:
>  		virtio_transport_recv_enqueue(vsk, skb);
> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>  static bool virtio_transport_valid_type(u16 type)
>  {
>  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
>  }
>  
>  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  		goto free_pkt;
>  	}
>  
> +	if (sk->sk_type == SOCK_DGRAM) {
> +		virtio_transport_recv_connected(sk, skb);
> +		goto out;
> +	}
> +
>  	space_available = virtio_transport_space_update(sk, skb);
>  
>  	/* Update CID in case it has changed after a transport reset event */
> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  		break;
>  	}
>  
> +out:
>  	release_sock(sk);
>  
>  	/* Release refcnt obtained when we fetched this socket out of the
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 6/6] vsock_test: add tests for vsock dgram
  2022-08-15 17:56 ` [PATCH 6/6] vsock_test: add tests for vsock dgram Bobby Eshleman
@ 2022-08-16  2:32   ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:32 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefano Garzarella, virtualization, netdev, linux-kernel

CC'ing virtio-dev@lists.oasis-open.org

On Mon, Aug 15, 2022 at 10:56:09AM -0700, Bobby Eshleman wrote:
> From: Jiang Wang <jiang.wang@bytedance.com>
> 
> Added test cases for vsock dgram types.
> 
> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> ---
>  tools/testing/vsock/util.c       | 105 +++++++++++++++++
>  tools/testing/vsock/util.h       |   4 +
>  tools/testing/vsock/vsock_test.c | 195 +++++++++++++++++++++++++++++++
>  3 files changed, 304 insertions(+)
> 
> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> index 2acbb7703c6a..d2f5b223bf85 100644
> --- a/tools/testing/vsock/util.c
> +++ b/tools/testing/vsock/util.c
> @@ -260,6 +260,57 @@ void send_byte(int fd, int expected_ret, int flags)
>  	}
>  }
>  
> +/* Transmit one byte and check the return value.
> + *
> + * expected_ret:
> + *  <0 Negative errno (for testing errors)
> + *   0 End-of-file
> + *   1 Success
> + */
> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> +				int flags)
> +{
> +	const uint8_t byte = 'A';
> +	ssize_t nwritten;
> +
> +	timeout_begin(TIMEOUT);
> +	do {
> +		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
> +						len);
> +		timeout_check("write");
> +	} while (nwritten < 0 && errno == EINTR);
> +	timeout_end();
> +
> +	if (expected_ret < 0) {
> +		if (nwritten != -1) {
> +			fprintf(stderr, "bogus sendto(2) return value %zd\n",
> +				nwritten);
> +			exit(EXIT_FAILURE);
> +		}
> +		if (errno != -expected_ret) {
> +			perror("write");
> +			exit(EXIT_FAILURE);
> +		}
> +		return;
> +	}
> +
> +	if (nwritten < 0) {
> +		perror("write");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nwritten == 0) {
> +		if (expected_ret == 0)
> +			return;
> +
> +		fprintf(stderr, "unexpected EOF while sending byte\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nwritten != sizeof(byte)) {
> +		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
> +		exit(EXIT_FAILURE);
> +	}
> +}
> +
>  /* Receive one byte and check the return value.
>   *
>   * expected_ret:
> @@ -313,6 +364,60 @@ void recv_byte(int fd, int expected_ret, int flags)
>  	}
>  }
>  
> +/* Receive one byte and check the return value.
> + *
> + * expected_ret:
> + *  <0 Negative errno (for testing errors)
> + *   0 End-of-file
> + *   1 Success
> + */
> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> +				int expected_ret, int flags)
> +{
> +	uint8_t byte;
> +	ssize_t nread;
> +
> +	timeout_begin(TIMEOUT);
> +	do {
> +		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
> +		timeout_check("read");
> +	} while (nread < 0 && errno == EINTR);
> +	timeout_end();
> +
> +	if (expected_ret < 0) {
> +		if (nread != -1) {
> +			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
> +				nread);
> +			exit(EXIT_FAILURE);
> +		}
> +		if (errno != -expected_ret) {
> +			perror("read");
> +			exit(EXIT_FAILURE);
> +		}
> +		return;
> +	}
> +
> +	if (nread < 0) {
> +		perror("read");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nread == 0) {
> +		if (expected_ret == 0)
> +			return;
> +
> +		fprintf(stderr, "unexpected EOF while receiving byte\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nread != sizeof(byte)) {
> +		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
> +		exit(EXIT_FAILURE);
> +	}
> +	if (byte != 'A') {
> +		fprintf(stderr, "unexpected byte read %c\n", byte);
> +		exit(EXIT_FAILURE);
> +	}
> +}
> +
>  /* Run test cases.  The program terminates if a failure occurs. */
>  void run_tests(const struct test_case *test_cases,
>  	       const struct test_opts *opts)
> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> index a3375ad2fb7f..7213f2a51c1e 100644
> --- a/tools/testing/vsock/util.h
> +++ b/tools/testing/vsock/util.h
> @@ -43,7 +43,11 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
>  			   struct sockaddr_vm *clientaddrp);
>  void vsock_wait_remote_close(int fd);
>  void send_byte(int fd, int expected_ret, int flags);
> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> +				int flags);
>  void recv_byte(int fd, int expected_ret, int flags);
> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> +				int expected_ret, int flags);
>  void run_tests(const struct test_case *test_cases,
>  	       const struct test_opts *opts);
>  void list_tests(const struct test_case *test_cases);
> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> index dc577461afc2..640379f1b462 100644
> --- a/tools/testing/vsock/vsock_test.c
> +++ b/tools/testing/vsock/vsock_test.c
> @@ -201,6 +201,115 @@ static void test_stream_server_close_server(const struct test_opts *opts)
>  	close(fd);
>  }
>  
> +static void test_dgram_sendto_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int fd;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_sendto_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int fd;
> +	int len = sizeof(addr.sa);
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> +	printf("got message from cid:%d, port %u ", addr.svm.svm_cid,
> +			addr.svm.svm_port);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_connect_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int fd;
> +	int ret;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ret = connect(fd, &addr.sa, sizeof(addr.svm));
> +	if (ret < 0) {
> +		perror("connect");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	send_byte(fd, 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_connect_server(const struct test_opts *opts)
> +{
> +	test_dgram_sendto_server(opts);
> +}
> +
>  /* With the standard socket sizes, VMCI is able to support about 100
>   * concurrent stream connections.
>   */
> @@ -254,6 +363,77 @@ static void test_stream_multiconn_server(const struct test_opts *opts)
>  		close(fds[i]);
>  }
>  
> +static void test_dgram_multiconn_client(const struct test_opts *opts)
> +{
> +	int fds[MULTICONN_NFDS];
> +	int i;
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++) {
> +		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +		if (fds[i] < 0) {
> +			perror("socket");
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		close(fds[i]);
> +}
> +
> +static void test_dgram_multiconn_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int fd;
> +	int len = sizeof(addr.sa);
> +	int i;
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
>  static void test_stream_msg_peek_client(const struct test_opts *opts)
>  {
>  	int fd;
> @@ -646,6 +826,21 @@ static struct test_case test_cases[] = {
>  		.run_client = test_seqpacket_invalid_rec_buffer_client,
>  		.run_server = test_seqpacket_invalid_rec_buffer_server,
>  	},
> +	{
> +		.name = "SOCK_DGRAM client close",
> +		.run_client = test_dgram_sendto_client,
> +		.run_server = test_dgram_sendto_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM client connect",
> +		.run_client = test_dgram_connect_client,
> +		.run_server = test_dgram_connect_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM multiple connections",
> +		.run_client = test_dgram_multiconn_client,
> +		.run_server = test_dgram_multiconn_server,
> +	},
>  	{},
>  };
>  
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
       [not found] ` <CAGxU2F7+L-UiNPtUm4EukOgTVJ1J6Orqs1LMvhWWvfL9zWb23g@mail.gmail.com>
@ 2022-08-16  2:35   ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  2:35 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, Wei Liu, Cong Wang, Stephen Hemminger,
	Bobby Eshleman, Jiang Wang, Michael S. Tsirkin, Dexuan Cui,
	Haiyang Zhang, linux-kernel, virtualization, Eric Dumazet,
	netdev, Stefan Hajnoczi, kvm, Jakub Kicinski, Paolo Abeni,
	linux-hyperv, David S. Miller

On Tue, Aug 16, 2022 at 09:00:45AM +0200, Stefano Garzarella wrote:
> Hi Bobby,

..

> 
> Please send next versions of this series as RFC until we have at least an
> agreement on the spec changes.
> 
> I think will be better to agree on the spec before merge Linux changes.
> 
> Thanks,
> Stefano
> 

Duly noted, I'll tag it as RFC on the next send.


Best,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16 16:38   ` Michael S. Tsirkin
@ 2022-08-16  6:18     ` Bobby Eshleman
  2022-08-16 18:07     ` Jakub Kicinski
  1 sibling, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  6:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Tue, Aug 16, 2022 at 12:38:52PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > In order to support usage of qdisc on vsock traffic, this commit
> > introduces a struct net_device to vhost and virtio vsock.
> > 
> > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > for virtio. The devices are attached to the respective transports.
> > 
> > To bypass the usage of the device, the user may "down" the associated
> > network interface using common tools. For example, "ip link set dev
> > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > simply using the FIFO logic of the prior implementation.
> 
> Ugh. That's quite a hack. Mark my words, at some point we will decide to
> have down mean "down".  Besides, multiple net devices with link up tend
> to confuse userspace. So might want to keep it down at all times
> even short term.
> 

I have to admit, this choice was born more of perceived necessity than
of a love for the design... but I can explain the pain points that led
to the current state, which I hope sparks some discussion.

When the state is down, dev_queue_xmit() will fail. To avoid this and
preserve the "zero-configuration" guarantee of vsock, I chose to make
transmission work regardless of device state by implementing this
"ignore up/down state" hack.

This is unfortunate because what we are really after here is just packet
scheduling, i.e., qdisc. We don't really need the rest of the
net_device, and I don't think up/down buys us anything of value. The
relationship between qdisc and net_device is so tightly knit together
though, that using qdisc without a net_device doesn't look very
practical (and maybe impossible).

Some alternative routes might be:

1) Default the state to up, and let users disable vsock by downing the
   device if they'd like. It still works out-of-the-box, but if users
   really want to disable vsock they may.

2) vsock may simply turn the device to state up when a socket is first
   used. For instance, the HCI device in net/bluetooth/hci_* uses a
   technique where the net_device is turned to up when bind() is called on
   any HCI socket (BTPROTO_HCI). It can also be turned up/down via
   ioctl().

3) Modify net_device registration to allow us to have an invisible
   device that is only known by the kernel. It may default to up and remain
   unchanged. The qdisc controls alone may be exposed to userspace,
   hopefully via netlink to still work with tc. This is not
   currently supported by register_netdevice(), but a series from 2007 was
   sent to the ML, tentatively approved in concept, but was never merged[1].

4) Currently NETDEV_UP/NETDEV_DOWN commands can't be vetoed.
   NETDEV_PRE_UP, however, is used to effectively veto NETDEV_UP
   commands[2]. We could introduce NETDEV_PRE_DOWN to support vetoing of
   NETDEV_DOWN. This would allow us to install a hook to determine if
   we actually want to allow the device to be downed.

In an ideal world, we could simply pass a set of netdev queues, a
packet, and maybe a blob of state to qdisc and let it work its
scheduling magic...

Any thoughts?

[1]: https://lore.kernel.org/netdev/20070129140958.0cf6880f@freekitty/
[2]: https://lore.kernel.org/all/20090529.220906.243061042.davem@davemloft.net/

Thanks,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16 18:07     ` Jakub Kicinski
@ 2022-08-16  7:02       ` Bobby Eshleman
  2022-08-16 23:07         ` Jakub Kicinski
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  7:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bobby Eshleman, Michael S. Tsirkin, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Tue, Aug 16, 2022 at 11:07:17AM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 12:38:52 -0400 Michael S. Tsirkin wrote:
> > On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > > In order to support usage of qdisc on vsock traffic, this commit
> > > introduces a struct net_device to vhost and virtio vsock.
> > > 
> > > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > > for virtio. The devices are attached to the respective transports.
> > > 
> > > To bypass the usage of the device, the user may "down" the associated
> > > network interface using common tools. For example, "ip link set dev
> > > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > > simply using the FIFO logic of the prior implementation.  
> > 
> > Ugh. That's quite a hack. Mark my words, at some point we will decide to
> > have down mean "down".  Besides, multiple net devices with link up tend
> > to confuse userspace. So might want to keep it down at all times
> > even short term.
> 
> Agreed!
> 
> From a cursory look (and Documentation/ would be nice..) it feels
> very wrong to me. Do you know of any uses of a netdev which would 
> be semantically similar to what you're doing? Treating netdevs as
> buildings blocks for arbitrary message passing solutions is something 
> I dislike quite strongly.

The big difference between vsock and "arbitrary message passing" is that
vsock is actually constrained by the virtio device that backs it (made
up of virtqueues and the underlying protocol). That virtqueue pair is
acting like the queues on a physical NIC, so it actually makes sense to
manage the queuing of vsock's device like we would manage the queueing
of a real device.

Still, I concede that ignoring the netdev state is a probably bad idea.

That said, I also think that using packet scheduling in vsock is a good
idea, and that ideally we can reuse Linux's already robust library of
packet scheduling algorithms by introducing qdisc somehow.

> 
> Could you recommend where I can learn more about vsocks?

I think the spec is probably the best place to start[1].

[1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html

Best,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16 23:07         ` Jakub Kicinski
@ 2022-08-16  8:29           ` Bobby Eshleman
  2022-08-17  1:15             ` Jakub Kicinski
  2022-08-17  1:23           ` [External] " Cong Wang .
  1 sibling, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  8:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bobby Eshleman, Michael S. Tsirkin, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel, Toke Høiland-Jørgensen

On Tue, Aug 16, 2022 at 04:07:55PM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 07:02:33 +0000 Bobby Eshleman wrote:
> > > From a cursory look (and Documentation/ would be nice..) it feels
> > > very wrong to me. Do you know of any uses of a netdev which would 
> > > be semantically similar to what you're doing? Treating netdevs as
> > > buildings blocks for arbitrary message passing solutions is something 
> > > I dislike quite strongly.  
> > 
> > The big difference between vsock and "arbitrary message passing" is that
> > vsock is actually constrained by the virtio device that backs it (made
> > up of virtqueues and the underlying protocol). That virtqueue pair is
> > acting like the queues on a physical NIC, so it actually makes sense to
> > manage the queuing of vsock's device like we would manage the queueing
> > of a real device.
> > 
> > Still, I concede that ignoring the netdev state is a probably bad idea.
> > 
> > That said, I also think that using packet scheduling in vsock is a good
> > idea, and that ideally we can reuse Linux's already robust library of
> > packet scheduling algorithms by introducing qdisc somehow.
> 
> We've been burnt in the past by people doing the "let me just pick
> these useful pieces out of netdev" thing. Makes life hard both for
> maintainers and users trying to make sense of the interfaces.
> 
> What comes to mind if you're just after queuing is that we already
> bastardized the CoDel implementation (include/net/codel_impl.h).
> If CoDel is good enough for you maybe that's the easiest way?
> Although I suspect that you're after fairness not early drops.
> Wireless folks use CoDel as a second layer queuing. (CC: Toke)
> 

That is certainly interesting to me. Sitting next to "codel_impl.h" is
"include/net/fq_impl.h", and it looks like it may solve the datagram
flooding issue. The downside to this approach is the baking of a
specific policy into vsock... which I don't exactly love either.

I'm not seeing too many other of these qdisc bastardizations in
include/net, are there any others that you are aware of?

> > > Could you recommend where I can learn more about vsocks?  
> > 
> > I think the spec is probably the best place to start[1].
> > 
> > [1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html
> 
> Eh, I was hoping it was a side channel of an existing virtio_net 
> which is not the case. Given the zero-config requirement IDK if 
> we'll be able to fit this into netdev semantics :(

It's certainly possible that it may not fit :/ I feel that it partially
depends on what we mean by zero-config. Is it "no config required to
have a working socket" or is it "no config required, but also no
tuning/policy/etc... supported"?

Best,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-17  6:54 ` Michael S. Tsirkin
@ 2022-08-16  9:42   ` Bobby Eshleman
  2022-08-17 17:02     ` Michael S. Tsirkin
  2022-08-18  4:28   ` Jason Wang
  1 sibling, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  9:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Wed, Aug 17, 2022 at 02:54:33AM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> > Hey everybody,
> > 
> > This series introduces datagrams, packet scheduling, and sk_buff usage
> > to virtio vsock.
> > 
> > The usage of struct sk_buff benefits users by a) preparing vsock to use
> > other related systems that require sk_buff, such as sockmap and qdisc,
> > b) supporting basic congestion control via sock_alloc_send_skb, and c)
> > reducing copying when delivering packets to TAP.
> > 
> > The socket layer no longer forces errors to be -ENOMEM, as typically
> > userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> > messages are being sent with option MSG_DONTWAIT.
> > 
> > The datagram work is based off previous patches by Jiang Wang[1].
> > 
> > The introduction of datagrams creates a transport layer fairness issue
> > where datagrams may freely starve streams of queue access. This happens
> > because, unlike streams, datagrams lack the transactions necessary for
> > calculating credits and throttling.
> > 
> > Previous proposals introduce changes to the spec to add an additional
> > virtqueue pair for datagrams[1]. Although this solution works, using
> > Linux's qdisc for packet scheduling leverages already existing systems,
> > avoids the need to change the virtio specification, and gives additional
> > capabilities. The usage of SFQ or fq_codel, for example, may solve the
> > transport layer starvation problem. It is easy to imagine other use
> > cases as well. For example, services of varying importance may be
> > assigned different priorities, and qdisc will apply appropriate
> > priority-based scheduling. By default, the system default pfifo qdisc is
> > used. The qdisc may be bypassed and legacy queuing is resumed by simply
> > setting the virtio-vsock%d network device to state DOWN. This technique
> > still allows vsock to work with zero-configuration.
> 
> The basic question to answer then is this: with a net device qdisc
> etc in the picture, how is this different from virtio net then?
> Why do you still want to use vsock?
> 

When using virtio-net, users looking for inter-VM communication are
required to setup bridges, TAPs, allocate IP addresses or setup DNS,
etc... and then finally when you have a network, you can open a socket
on an IP address and port. This is the configuration that vsock avoids.
For vsock, we just need a CID and a port, but no network configuration.

This benefit still exists after introducing a netdev to vsock. The major
added benefit is that when you have many different vsock flows in
parallel and you are observing issues like starvation and tail latency
that are caused by pure FIFO queuing, now there is a mechanism to fix
those issues. You might recall such an issue discussed here[1].

[1]: https://gitlab.com/vsock/vsock/-/issues/1

> > In summary, this series introduces these major changes to vsock:
> > 
> > - virtio vsock supports datagrams
> > - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
> >   - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
> >     which applies the throttling threshold sk_sndbuf.
> > - The vsock socket layer supports returning errors other than -ENOMEM.
> >   - This is used to return -EAGAIN when the sk_sndbuf threshold is
> >     reached.
> > - virtio vsock uses a net_device, through which qdisc may be used.
> >  - qdisc allows scheduling policies to be applied to vsock flows.
> >   - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
> >     it may avoid datagrams from flooding out stream flows. The benefit
> >     to this is that additional virtqueues are not needed for datagrams.
> >   - The net_device and qdisc is bypassed by simply setting the
> >     net_device state to DOWN.
> > 
> > [1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/
> > 
> > Bobby Eshleman (5):
> >   vsock: replace virtio_vsock_pkt with sk_buff
> >   vsock: return errors other than -ENOMEM to socket
> >   vsock: add netdev to vhost/virtio vsock
> >   virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
> >   virtio/vsock: add support for dgram
> > 
> > Jiang Wang (1):
> >   vsock_test: add tests for vsock dgram
> > 
> >  drivers/vhost/vsock.c                   | 238 ++++----
> >  include/linux/virtio_vsock.h            |  73 ++-
> >  include/net/af_vsock.h                  |   2 +
> >  include/uapi/linux/virtio_vsock.h       |   2 +
> >  net/vmw_vsock/af_vsock.c                |  30 +-
> >  net/vmw_vsock/hyperv_transport.c        |   2 +-
> >  net/vmw_vsock/virtio_transport.c        | 237 +++++---
> >  net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> >  net/vmw_vsock/vmci_transport.c          |   9 +-
> >  net/vmw_vsock/vsock_loopback.c          |  51 +-
> >  tools/testing/vsock/util.c              | 105 ++++
> >  tools/testing/vsock/util.h              |   4 +
> >  tools/testing/vsock/vsock_test.c        | 195 ++++++
> >  13 files changed, 1176 insertions(+), 543 deletions(-)
> > 
> > -- 
> > 2.35.1
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-17  5:01     ` [virtio-dev] " Arseniy Krasnov
@ 2022-08-16  9:57       ` Bobby Eshleman
  2022-08-18  8:24         ` Arseniy Krasnov
  2022-08-17  5:42       ` Arseniy Krasnov
  1 sibling, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  9:57 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, virtio-dev, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, kvm, virtualization, netdev,
	linux-kernel

On Wed, Aug 17, 2022 at 05:01:00AM +0000, Arseniy Krasnov wrote:
> On 16.08.2022 05:32, Bobby Eshleman wrote:
> > CC'ing virtio-dev@lists.oasis-open.org
> > 
> > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> >> This patch supports dgram in virtio and on the vhost side.
> Hello,
> 
> sorry, i don't understand, how this maintains message boundaries? Or it
> is unnecessary for SOCK_DGRAM?
> 
> Thanks

If I understand your question, the length is included in the header, so
receivers always know that header start + header length + payload length
marks the message boundary.

> >>
> >> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> >> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> >> ---
> >>  drivers/vhost/vsock.c                   |   2 +-
> >>  include/net/af_vsock.h                  |   2 +
> >>  include/uapi/linux/virtio_vsock.h       |   1 +
> >>  net/vmw_vsock/af_vsock.c                |  26 +++-
> >>  net/vmw_vsock/virtio_transport.c        |   2 +-
> >>  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> >>  6 files changed, 186 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >> index a5d1bdb786fe..3dc72a5647ca 100644
> >> --- a/drivers/vhost/vsock.c
> >> +++ b/drivers/vhost/vsock.c
> >> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> >>  	int ret;
> >>  
> >>  	ret = vsock_core_register(&vhost_transport.transport,
> >> -				  VSOCK_TRANSPORT_F_H2G);
> >> +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> >>  	if (ret < 0)
> >>  		return ret;
> >>  
> >> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >> index 1c53c4c4d88f..37e55c81e4df 100644
> >> --- a/include/net/af_vsock.h
> >> +++ b/include/net/af_vsock.h
> >> @@ -78,6 +78,8 @@ struct vsock_sock {
> >>  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> >>  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> >>  struct sock *vsock_create_connected(struct sock *parent);
> >> +int vsock_bind_stream(struct vsock_sock *vsk,
> >> +		      struct sockaddr_vm *addr);
> >>  
> >>  /**** TRANSPORT ****/
> >>  
> >> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> >> index 857df3a3a70d..0975b9c88292 100644
> >> --- a/include/uapi/linux/virtio_vsock.h
> >> +++ b/include/uapi/linux/virtio_vsock.h
> >> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> >>  enum virtio_vsock_type {
> >>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> >>  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> >> +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> >>  };
> >>  
> >>  enum virtio_vsock_op {
> >> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >> index 1893f8aafa48..87e4ae1866d3 100644
> >> --- a/net/vmw_vsock/af_vsock.c
> >> +++ b/net/vmw_vsock/af_vsock.c
> >> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> >>  	return 0;
> >>  }
> >>  
> >> +int vsock_bind_stream(struct vsock_sock *vsk,
> >> +		      struct sockaddr_vm *addr)
> >> +{
> >> +	int retval;
> >> +
> >> +	spin_lock_bh(&vsock_table_lock);
> >> +	retval = __vsock_bind_connectible(vsk, addr);
> >> +	spin_unlock_bh(&vsock_table_lock);
> >> +
> >> +	return retval;
> >> +}
> >> +EXPORT_SYMBOL(vsock_bind_stream);
> >> +
> >>  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> >>  			      struct sockaddr_vm *addr)
> >>  {
> >> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> >>  	}
> >>  
> >>  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> >> -		if (t_dgram) {
> >> -			err = -EBUSY;
> >> -			goto err_busy;
> >> +		/* TODO: always chose the G2H variant over others, support nesting later */
> >> +		if (features & VSOCK_TRANSPORT_F_G2H) {
> >> +			if (t_dgram)
> >> +				pr_warn("virtio_vsock: t_dgram already set\n");
> >> +			t_dgram = t;
> >> +		}
> >> +
> >> +		if (!t_dgram) {
> >> +			t_dgram = t;
> >>  		}
> >> -		t_dgram = t;
> >>  	}
> >>  
> >>  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> >> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >> index 073314312683..d4526ca462d2 100644
> >> --- a/net/vmw_vsock/virtio_transport.c
> >> +++ b/net/vmw_vsock/virtio_transport.c
> >> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> >>  		return -ENOMEM;
> >>  
> >>  	ret = vsock_core_register(&virtio_transport.transport,
> >> -				  VSOCK_TRANSPORT_F_G2H);
> >> +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> >>  	if (ret)
> >>  		goto out_wq;
> >>  
> >> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >> index bdf16fff054f..aedb48728677 100644
> >> --- a/net/vmw_vsock/virtio_transport_common.c
> >> +++ b/net/vmw_vsock/virtio_transport_common.c
> >> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> >>  
> >>  static u16 virtio_transport_get_type(struct sock *sk)
> >>  {
> >> -	if (sk->sk_type == SOCK_STREAM)
> >> +	if (sk->sk_type == SOCK_DGRAM)
> >> +		return VIRTIO_VSOCK_TYPE_DGRAM;
> >> +	else if (sk->sk_type == SOCK_STREAM)
> >>  		return VIRTIO_VSOCK_TYPE_STREAM;
> >>  	else
> >>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>  	vvs = vsk->trans;
> >>  
> >>  	/* we can send less than pkt_len bytes */
> >> -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >> -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >> +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> >> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >> +			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >> +		else
> >> +			return 0;
> >> +	}
> >>  
> >> -	/* virtio_transport_get_credit might return less than pkt_len credit */
> >> -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >> +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> >> +		/* virtio_transport_get_credit might return less than pkt_len credit */
> >> +		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >>  
> >> -	/* Do not send zero length OP_RW pkt */
> >> -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >> -		return pkt_len;
> >> +		/* Do not send zero length OP_RW pkt */
> >> +		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >> +			return pkt_len;
> >> +	}
> >>  
> >>  	skb = virtio_transport_alloc_skb(info, pkt_len,
> >>  					 src_cid, src_port,
> >>  					 dst_cid, dst_port,
> >>  					 &err);
> >>  	if (!skb) {
> >> -		virtio_transport_put_credit(vvs, pkt_len);
> >> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >> +			virtio_transport_put_credit(vvs, pkt_len);
> >>  		return err;
> >>  	}
> >>  
> >> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> >>  
> >> +static ssize_t
> >> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> >> +				  struct msghdr *msg, size_t len)
> >> +{
> >> +	struct virtio_vsock_sock *vvs = vsk->trans;
> >> +	struct sk_buff *skb;
> >> +	size_t total = 0;
> >> +	u32 free_space;
> >> +	int err = -EFAULT;
> >> +
> >> +	spin_lock_bh(&vvs->rx_lock);
> >> +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> >> +		skb = __skb_dequeue(&vvs->rx_queue);
> >> +
> >> +		total = len;
> >> +		if (total > skb->len - vsock_metadata(skb)->off)
> >> +			total = skb->len - vsock_metadata(skb)->off;
> >> +		else if (total < skb->len - vsock_metadata(skb)->off)
> >> +			msg->msg_flags |= MSG_TRUNC;
> >> +
> >> +		/* sk_lock is held by caller so no one else can dequeue.
> >> +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
> >> +		 */
> >> +		spin_unlock_bh(&vvs->rx_lock);
> >> +
> >> +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> >> +		if (err)
> >> +			return err;
> >> +
> >> +		spin_lock_bh(&vvs->rx_lock);
> >> +
> >> +		virtio_transport_dec_rx_pkt(vvs, skb);
> >> +		consume_skb(skb);
> >> +	}
> >> +
> >> +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> >> +
> >> +	spin_unlock_bh(&vvs->rx_lock);
> >> +
> >> +	if (total > 0 && msg->msg_name) {
> >> +		/* Provide the address of the sender. */
> >> +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> >> +
> >> +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> >> +				le32_to_cpu(vsock_hdr(skb)->src_port));
> >> +		msg->msg_namelen = sizeof(*vm_addr);
> >> +	}
> >> +	return total;
> >> +}
> >> +
> >> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> >> +{
> >> +	return virtio_transport_stream_has_data(vsk);
> >> +}
> >> +
> >>  int
> >>  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> >>  				   struct msghdr *msg,
> >> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> >>  			       struct msghdr *msg,
> >>  			       size_t len, int flags)
> >>  {
> >> -	return -EOPNOTSUPP;
> >> +	struct sock *sk;
> >> +	size_t err = 0;
> >> +	long timeout;
> >> +
> >> +	DEFINE_WAIT(wait);
> >> +
> >> +	sk = &vsk->sk;
> >> +	err = 0;
> >> +
> >> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> >> +		return -EOPNOTSUPP;
> >> +
> >> +	lock_sock(sk);
> >> +
> >> +	if (!len)
> >> +		goto out;
> >> +
> >> +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> >> +
> >> +	while (1) {
> >> +		s64 ready;
> >> +
> >> +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> >> +		ready = virtio_transport_dgram_has_data(vsk);
> >> +
> >> +		if (ready == 0) {
> >> +			if (timeout == 0) {
> >> +				err = -EAGAIN;
> >> +				finish_wait(sk_sleep(sk), &wait);
> >> +				break;
> >> +			}
> >> +
> >> +			release_sock(sk);
> >> +			timeout = schedule_timeout(timeout);
> >> +			lock_sock(sk);
> >> +
> >> +			if (signal_pending(current)) {
> >> +				err = sock_intr_errno(timeout);
> >> +				finish_wait(sk_sleep(sk), &wait);
> >> +				break;
> >> +			} else if (timeout == 0) {
> >> +				err = -EAGAIN;
> >> +				finish_wait(sk_sleep(sk), &wait);
> >> +				break;
> >> +			}
> >> +		} else {
> >> +			finish_wait(sk_sleep(sk), &wait);
> >> +
> >> +			if (ready < 0) {
> >> +				err = -ENOMEM;
> >> +				goto out;
> >> +			}
> >> +
> >> +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> >> +			break;
> >> +		}
> >> +	}
> >> +out:
> >> +	release_sock(sk);
> >> +	return err;
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> >>  
> >> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> >>  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> >>  				struct sockaddr_vm *addr)
> >>  {
> >> -	return -EOPNOTSUPP;
> >> +	return vsock_bind_stream(vsk, addr);
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> >>  
> >>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> >>  {
> >> -	return false;
> >> +	return true;
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> >>  
> >> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> >>  			       struct msghdr *msg,
> >>  			       size_t dgram_len)
> >>  {
> >> -	return -EOPNOTSUPP;
> >> +	struct virtio_vsock_pkt_info info = {
> >> +		.op = VIRTIO_VSOCK_OP_RW,
> >> +		.msg = msg,
> >> +		.pkt_len = dgram_len,
> >> +		.vsk = vsk,
> >> +		.remote_cid = remote_addr->svm_cid,
> >> +		.remote_port = remote_addr->svm_port,
> >> +	};
> >> +
> >> +	return virtio_transport_send_pkt_info(vsk, &info);
> >>  }
> >>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> >>  
> >> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> >>  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> >>  	int err = 0;
> >>  
> >> +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> >> +		virtio_transport_recv_enqueue(vsk, skb);
> >> +		sk->sk_data_ready(sk);
> >> +		return err;
> >> +	}
> >> +
> >>  	switch (le16_to_cpu(hdr->op)) {
> >>  	case VIRTIO_VSOCK_OP_RW:
> >>  		virtio_transport_recv_enqueue(vsk, skb);
> >> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> >>  static bool virtio_transport_valid_type(u16 type)
> >>  {
> >>  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> >> -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> >> +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> >> +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> >>  }
> >>  
> >>  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> >> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >>  		goto free_pkt;
> >>  	}
> >>  
> >> +	if (sk->sk_type == SOCK_DGRAM) {
> >> +		virtio_transport_recv_connected(sk, skb);
> >> +		goto out;
> >> +	}
> >> +
> >>  	space_available = virtio_transport_space_update(sk, skb);
> >>  
> >>  	/* Update CID in case it has changed after a transport reset event */
> >> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >>  		break;
> >>  	}
> >>  
> >> +out:
> >>  	release_sock(sk);
> >>  
> >>  	/* Release refcnt obtained when we fetched this socket out of the
> >> -- 
> >> 2.35.1
> >>
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-17  5:42       ` Arseniy Krasnov
@ 2022-08-16  9:58         ` Bobby Eshleman
  2022-08-18  8:35           ` Arseniy Krasnov
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16  9:58 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, virtio-dev, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, kvm, virtualization, netdev,
	linux-kernel

On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > On 16.08.2022 05:32, Bobby Eshleman wrote:
> >> CC'ing virtio-dev@lists.oasis-open.org
> >>
> >> On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> >>> This patch supports dgram in virtio and on the vhost side.
> > Hello,
> > 
> > sorry, i don't understand, how this maintains message boundaries? Or it
> > is unnecessary for SOCK_DGRAM?
> > 
> > Thanks
> >>>
> >>> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> >>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> >>> ---
> >>>  drivers/vhost/vsock.c                   |   2 +-
> >>>  include/net/af_vsock.h                  |   2 +
> >>>  include/uapi/linux/virtio_vsock.h       |   1 +
> >>>  net/vmw_vsock/af_vsock.c                |  26 +++-
> >>>  net/vmw_vsock/virtio_transport.c        |   2 +-
> >>>  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
> >>>  6 files changed, 186 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >>> index a5d1bdb786fe..3dc72a5647ca 100644
> >>> --- a/drivers/vhost/vsock.c
> >>> +++ b/drivers/vhost/vsock.c
> >>> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> >>>  	int ret;
> >>>  
> >>>  	ret = vsock_core_register(&vhost_transport.transport,
> >>> -				  VSOCK_TRANSPORT_F_H2G);
> >>> +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
> >>>  	if (ret < 0)
> >>>  		return ret;
> >>>  
> >>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> >>> index 1c53c4c4d88f..37e55c81e4df 100644
> >>> --- a/include/net/af_vsock.h
> >>> +++ b/include/net/af_vsock.h
> >>> @@ -78,6 +78,8 @@ struct vsock_sock {
> >>>  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> >>>  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> >>>  struct sock *vsock_create_connected(struct sock *parent);
> >>> +int vsock_bind_stream(struct vsock_sock *vsk,
> >>> +		      struct sockaddr_vm *addr);
> >>>  
> >>>  /**** TRANSPORT ****/
> >>>  
> >>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> >>> index 857df3a3a70d..0975b9c88292 100644
> >>> --- a/include/uapi/linux/virtio_vsock.h
> >>> +++ b/include/uapi/linux/virtio_vsock.h
> >>> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> >>>  enum virtio_vsock_type {
> >>>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> >>>  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> >>> +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> >>>  };
> >>>  
> >>>  enum virtio_vsock_op {
> >>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >>> index 1893f8aafa48..87e4ae1866d3 100644
> >>> --- a/net/vmw_vsock/af_vsock.c
> >>> +++ b/net/vmw_vsock/af_vsock.c
> >>> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> >>>  	return 0;
> >>>  }
> >>>  
> >>> +int vsock_bind_stream(struct vsock_sock *vsk,
> >>> +		      struct sockaddr_vm *addr)
> >>> +{
> >>> +	int retval;
> >>> +
> >>> +	spin_lock_bh(&vsock_table_lock);
> >>> +	retval = __vsock_bind_connectible(vsk, addr);
> >>> +	spin_unlock_bh(&vsock_table_lock);
> >>> +
> >>> +	return retval;
> >>> +}
> >>> +EXPORT_SYMBOL(vsock_bind_stream);
> >>> +
> >>>  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> >>>  			      struct sockaddr_vm *addr)
> >>>  {
> >>> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
> >>>  	}
> >>>  
> >>>  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> >>> -		if (t_dgram) {
> >>> -			err = -EBUSY;
> >>> -			goto err_busy;
> >>> +		/* TODO: always chose the G2H variant over others, support nesting later */
> >>> +		if (features & VSOCK_TRANSPORT_F_G2H) {
> >>> +			if (t_dgram)
> >>> +				pr_warn("virtio_vsock: t_dgram already set\n");
> >>> +			t_dgram = t;
> >>> +		}
> >>> +
> >>> +		if (!t_dgram) {
> >>> +			t_dgram = t;
> >>>  		}
> >>> -		t_dgram = t;
> >>>  	}
> >>>  
> >>>  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> >>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> >>> index 073314312683..d4526ca462d2 100644
> >>> --- a/net/vmw_vsock/virtio_transport.c
> >>> +++ b/net/vmw_vsock/virtio_transport.c
> >>> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> >>>  		return -ENOMEM;
> >>>  
> >>>  	ret = vsock_core_register(&virtio_transport.transport,
> >>> -				  VSOCK_TRANSPORT_F_G2H);
> >>> +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
> >>>  	if (ret)
> >>>  		goto out_wq;
> >>>  
> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> >>> index bdf16fff054f..aedb48728677 100644
> >>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> >>>  
> >>>  static u16 virtio_transport_get_type(struct sock *sk)
> >>>  {
> >>> -	if (sk->sk_type == SOCK_STREAM)
> >>> +	if (sk->sk_type == SOCK_DGRAM)
> >>> +		return VIRTIO_VSOCK_TYPE_DGRAM;
> >>> +	else if (sk->sk_type == SOCK_STREAM)
> >>>  		return VIRTIO_VSOCK_TYPE_STREAM;
> >>>  	else
> >>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> >>> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> >>>  	vvs = vsk->trans;
> >>>  
> >>>  	/* we can send less than pkt_len bytes */
> >>> -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >>> -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>> +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> >>> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >>> +			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> >>> +		else
> >>> +			return 0;
> >>> +	}
> >>>  
> >>> -	/* virtio_transport_get_credit might return less than pkt_len credit */
> >>> -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >>> +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> >>> +		/* virtio_transport_get_credit might return less than pkt_len credit */
> >>> +		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> >>>  
> >>> -	/* Do not send zero length OP_RW pkt */
> >>> -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>> -		return pkt_len;
> >>> +		/* Do not send zero length OP_RW pkt */
> >>> +		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> >>> +			return pkt_len;
> >>> +	}
> >>>  
> >>>  	skb = virtio_transport_alloc_skb(info, pkt_len,
> >>>  					 src_cid, src_port,
> >>>  					 dst_cid, dst_port,
> >>>  					 &err);
> >>>  	if (!skb) {
> >>> -		virtio_transport_put_credit(vvs, pkt_len);
> >>> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> >>> +			virtio_transport_put_credit(vvs, pkt_len);
> >>>  		return err;
> >>>  	}
> >>>  
> >>> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> >>>  
> >>> +static ssize_t
> >>> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> >>> +				  struct msghdr *msg, size_t len)
> >>> +{
> >>> +	struct virtio_vsock_sock *vvs = vsk->trans;
> >>> +	struct sk_buff *skb;
> >>> +	size_t total = 0;
> >>> +	u32 free_space;
> >>> +	int err = -EFAULT;
> >>> +
> >>> +	spin_lock_bh(&vvs->rx_lock);
> >>> +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
> >>> +		skb = __skb_dequeue(&vvs->rx_queue);
> >>> +
> >>> +		total = len;
> >>> +		if (total > skb->len - vsock_metadata(skb)->off)
> >>> +			total = skb->len - vsock_metadata(skb)->off;
> >>> +		else if (total < skb->len - vsock_metadata(skb)->off)
> >>> +			msg->msg_flags |= MSG_TRUNC;
> >>> +
> >>> +		/* sk_lock is held by caller so no one else can dequeue.
> >>> +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
> >>> +		 */
> >>> +		spin_unlock_bh(&vvs->rx_lock);
> >>> +
> >>> +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
> >>> +		if (err)
> >>> +			return err;
> >>> +
> >>> +		spin_lock_bh(&vvs->rx_lock);
> >>> +
> >>> +		virtio_transport_dec_rx_pkt(vvs, skb);
> >>> +		consume_skb(skb);
> >>> +	}
> >>> +
> >>> +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> >>> +
> >>> +	spin_unlock_bh(&vvs->rx_lock);
> >>> +
> >>> +	if (total > 0 && msg->msg_name) {
> >>> +		/* Provide the address of the sender. */
> >>> +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> >>> +
> >>> +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
> >>> +				le32_to_cpu(vsock_hdr(skb)->src_port));
> >>> +		msg->msg_namelen = sizeof(*vm_addr);
> >>> +	}
> >>> +	return total;
> >>> +}
> >>> +
> >>> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
> >>> +{
> >>> +	return virtio_transport_stream_has_data(vsk);
> >>> +}
> >>> +
> >>>  int
> >>>  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> >>>  				   struct msghdr *msg,
> >>> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
> >>>  			       struct msghdr *msg,
> >>>  			       size_t len, int flags)
> >>>  {
> >>> -	return -EOPNOTSUPP;
> >>> +	struct sock *sk;
> >>> +	size_t err = 0;
> >>> +	long timeout;
> >>> +
> >>> +	DEFINE_WAIT(wait);
> >>> +
> >>> +	sk = &vsk->sk;
> >>> +	err = 0;
> >>> +
> >>> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
> >>> +		return -EOPNOTSUPP;
> >>> +
> >>> +	lock_sock(sk);
> >>> +
> >>> +	if (!len)
> >>> +		goto out;
> >>> +
> >>> +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> >>> +
> >>> +	while (1) {
> >>> +		s64 ready;
> >>> +
> >>> +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> >>> +		ready = virtio_transport_dgram_has_data(vsk);
> >>> +
> >>> +		if (ready == 0) {
> >>> +			if (timeout == 0) {
> >>> +				err = -EAGAIN;
> >>> +				finish_wait(sk_sleep(sk), &wait);
> >>> +				break;
> >>> +			}
> >>> +
> >>> +			release_sock(sk);
> >>> +			timeout = schedule_timeout(timeout);
> >>> +			lock_sock(sk);
> >>> +
> >>> +			if (signal_pending(current)) {
> >>> +				err = sock_intr_errno(timeout);
> >>> +				finish_wait(sk_sleep(sk), &wait);
> >>> +				break;
> >>> +			} else if (timeout == 0) {
> >>> +				err = -EAGAIN;
> >>> +				finish_wait(sk_sleep(sk), &wait);
> >>> +				break;
> >>> +			}
> >>> +		} else {
> >>> +			finish_wait(sk_sleep(sk), &wait);
> >>> +
> >>> +			if (ready < 0) {
> >>> +				err = -ENOMEM;
> >>> +				goto out;
> >>> +			}
> >>> +
> >>> +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
> >>> +			break;
> >>> +		}
> >>> +	}
> >>> +out:
> >>> +	release_sock(sk);
> >>> +	return err;
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> ^^^
> May be, this generic data waiting logic should be in af_vsock.c, as for stream/seqpacket?
> In this way, another transport which supports SOCK_DGRAM could reuse it.

I think that is a great idea. I'll test that change for v2.

Thanks.

> >>>  
> >>> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> >>>  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> >>>  				struct sockaddr_vm *addr)
> >>>  {
> >>> -	return -EOPNOTSUPP;
> >>> +	return vsock_bind_stream(vsk, addr);
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> >>>  
> >>>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> >>>  {
> >>> -	return false;
> >>> +	return true;
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> >>>  
> >>> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> >>>  			       struct msghdr *msg,
> >>>  			       size_t dgram_len)
> >>>  {
> >>> -	return -EOPNOTSUPP;
> >>> +	struct virtio_vsock_pkt_info info = {
> >>> +		.op = VIRTIO_VSOCK_OP_RW,
> >>> +		.msg = msg,
> >>> +		.pkt_len = dgram_len,
> >>> +		.vsk = vsk,
> >>> +		.remote_cid = remote_addr->svm_cid,
> >>> +		.remote_port = remote_addr->svm_port,
> >>> +	};
> >>> +
> >>> +	return virtio_transport_send_pkt_info(vsk, &info);
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> >>>  
> >>> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
> >>>  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> >>>  	int err = 0;
> >>>  
> >>> +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> >>> +		virtio_transport_recv_enqueue(vsk, skb);
> >>> +		sk->sk_data_ready(sk);
> >>> +		return err;
> >>> +	}
> >>> +
> >>>  	switch (le16_to_cpu(hdr->op)) {
> >>>  	case VIRTIO_VSOCK_OP_RW:
> >>>  		virtio_transport_recv_enqueue(vsk, skb);
> >>> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> >>>  static bool virtio_transport_valid_type(u16 type)
> >>>  {
> >>>  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> >>> -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> >>> +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> >>> +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> >>>  }
> >>>  
> >>>  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
> >>> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >>>  		goto free_pkt;
> >>>  	}
> >>>  
> >>> +	if (sk->sk_type == SOCK_DGRAM) {
> >>> +		virtio_transport_recv_connected(sk, skb);
> >>> +		goto out;
> >>> +	}
> >>> +
> >>>  	space_available = virtio_transport_space_update(sk, skb);
> >>>  
> >>>  	/* Update CID in case it has changed after a transport reset event */
> >>> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
> >>>  		break;
> >>>  	}
> >>>  
> >>> +out:
> >>>  	release_sock(sk);
> >>>  
> >>>  	/* Release refcnt obtained when we fetched this socket out of the
> >>> -- 
> >>> 2.35.1
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> >> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >>
> > 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-17  1:15             ` Jakub Kicinski
@ 2022-08-16 10:50               ` Bobby Eshleman
  2022-08-17 17:20                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16 10:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bobby Eshleman, Michael S. Tsirkin, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel, Toke Høiland-Jørgensen

On Tue, Aug 16, 2022 at 06:15:28PM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 08:29:04 +0000 Bobby Eshleman wrote:
> > > We've been burnt in the past by people doing the "let me just pick
> > > these useful pieces out of netdev" thing. Makes life hard both for
> > > maintainers and users trying to make sense of the interfaces.
> > > 
> > > What comes to mind if you're just after queuing is that we already
> > > bastardized the CoDel implementation (include/net/codel_impl.h).
> > > If CoDel is good enough for you maybe that's the easiest way?
> > > Although I suspect that you're after fairness not early drops.
> > > Wireless folks use CoDel as a second layer queuing. (CC: Toke)
> > 
> > That is certainly interesting to me. Sitting next to "codel_impl.h" is
> > "include/net/fq_impl.h", and it looks like it may solve the datagram
> > flooding issue. The downside to this approach is the baking of a
> > specific policy into vsock... which I don't exactly love either.
> > 
> > I'm not seeing too many other of these qdisc bastardizations in
> > include/net, are there any others that you are aware of?
> 
> Just what wireless uses (so codel and fq as you found out), nothing
> else comes to mind.
> 
> > > Eh, I was hoping it was a side channel of an existing virtio_net 
> > > which is not the case. Given the zero-config requirement IDK if 
> > > we'll be able to fit this into netdev semantics :(  
> > 
> > It's certainly possible that it may not fit :/ I feel that it partially
> > depends on what we mean by zero-config. Is it "no config required to
> > have a working socket" or is it "no config required, but also no
> > tuning/policy/etc... supported"?
> 
> The value of tuning vs confusion of a strange netdev floating around
> in the system is hard to estimate upfront. 

I think "a strange netdev floating around" is a total
mischaracterization... vsock is a networking device and it supports
vsock networks. Sure, it is a virtual device and the routing is done in
host software, but the same is true for virtio-net and VM-to-VM vlan.

This patch actually uses netdev for its intended purpose: to support and
manage the transmission of packets via a network device to a network.

Furthermore, it actually prepares vsock to eliminate a "strange" use of
a netdev. The netdev in vsockmon isn't even used to transmit
packets, it's "floating around" for no other reason than it is needed to
support packet capture, which vsock couldn't support because it didn't
have a netdev.

Something smells when we are required to build workaround kernel modules
that use netdev for ciphoning packets off to userspace, when we could
instead be using netdev for its intended purpose and get the same and
more benefit.

> 
> The nice thing about using a built-in fq with no user visible knobs is
> that there's no extra uAPI. We can always rip it out and replace later.
> And it shouldn't be controversial, making the path to upstream smoother.

The issue is that after pulling in fq for one kind of flow management,
then as users observe other flow issues, we will need to re-implement
pfifo, and then TBF, and then we need to build an interface to let users
select one, and to choose queue sizes... and then after awhile we've
needlessly re-implemented huge chunks of the tc system.

I don't see any good reason to restrict vsock users to using suboptimal
and rigid queuing.

Thanks.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-17 17:02     ` Michael S. Tsirkin
@ 2022-08-16 11:08       ` Bobby Eshleman
  2022-08-17 17:53         ` Michael S. Tsirkin
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16 11:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Wed, Aug 17, 2022 at 01:02:52PM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > > The basic question to answer then is this: with a net device qdisc
> > > etc in the picture, how is this different from virtio net then?
> > > Why do you still want to use vsock?
> > > 
> > 
> > When using virtio-net, users looking for inter-VM communication are
> > required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> > etc... and then finally when you have a network, you can open a socket
> > on an IP address and port. This is the configuration that vsock avoids.
> > For vsock, we just need a CID and a port, but no network configuration.
> 
> Surely when you mention DNS you are going overboard? vsock doesn't
> remove the need for DNS as much as it does not support it.
> 

Oops, s/DNS/dhcp.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-17 17:53         ` Michael S. Tsirkin
@ 2022-08-16 12:10           ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16 12:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Wed, Aug 17, 2022 at 01:53:32PM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 16, 2022 at 11:08:26AM +0000, Bobby Eshleman wrote:
> > On Wed, Aug 17, 2022 at 01:02:52PM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > > > > The basic question to answer then is this: with a net device qdisc
> > > > > etc in the picture, how is this different from virtio net then?
> > > > > Why do you still want to use vsock?
> > > > > 
> > > > 
> > > > When using virtio-net, users looking for inter-VM communication are
> > > > required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> > > > etc... and then finally when you have a network, you can open a socket
> > > > on an IP address and port. This is the configuration that vsock avoids.
> > > > For vsock, we just need a CID and a port, but no network configuration.
> > > 
> > > Surely when you mention DNS you are going overboard? vsock doesn't
> > > remove the need for DNS as much as it does not support it.
> > > 
> > 
> > Oops, s/DNS/dhcp.
> 
> That too.
> 

Sure, setting up dhcp would be overboard for just inter-VM comms.

It is fair to mention that vsock CIDs also need to be managed /
allocated somehow.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-15 17:56 ` [PATCH 3/6] vsock: add netdev to vhost/virtio vsock Bobby Eshleman
  2022-08-16  2:31   ` Bobby Eshleman
@ 2022-08-16 16:38   ` Michael S. Tsirkin
  2022-08-16  6:18     ` Bobby Eshleman
  2022-08-16 18:07     ` Jakub Kicinski
  2022-09-06 10:58   ` Michael S. Tsirkin
  2 siblings, 2 replies; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-08-16 16:38 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> In order to support usage of qdisc on vsock traffic, this commit
> introduces a struct net_device to vhost and virtio vsock.
> 
> Two new devices are created, vhost-vsock for vhost and virtio-vsock
> for virtio. The devices are attached to the respective transports.
> 
> To bypass the usage of the device, the user may "down" the associated
> network interface using common tools. For example, "ip link set dev
> virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> simply using the FIFO logic of the prior implementation.

Ugh. That's quite a hack. Mark my words, at some point we will decide to
have down mean "down".  Besides, multiple net devices with link up tend
to confuse userspace. So might want to keep it down at all times
even short term.

> For both hosts and guests, there is one device for all G2H vsock sockets
> and one device for all H2G vsock sockets. This makes sense for guests
> because the driver only supports a single vsock channel (one pair of
> TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> seem ideal for some workloads. However, it is possible to use a
> multi-queue qdisc, where a given queue is responsible for a range of
> sockets. This seems to be a better solution than having one device per
> socket, which may yield a very large number of devices and qdiscs, all
> of which are dynamically being created and destroyed. Because of this
> dynamism, it would also require a complex policy management daemon, as
> devices would constantly be spun up and down as sockets were created and
> destroyed. To avoid this, one device and qdisc also applies to all H2G
> sockets.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   |  19 +++-
>  include/linux/virtio_vsock.h            |  10 +++
>  net/vmw_vsock/virtio_transport.c        |  19 +++-
>  net/vmw_vsock/virtio_transport_common.c | 112 +++++++++++++++++++++++-
>  4 files changed, 152 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index f8601d93d94d..b20ddec2664b 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -927,13 +927,30 @@ static int __init vhost_vsock_init(void)
>  				  VSOCK_TRANSPORT_F_H2G);
>  	if (ret < 0)
>  		return ret;
> -	return misc_register(&vhost_vsock_misc);
> +
> +	ret = virtio_transport_init(&vhost_transport, "vhost-vsock");
> +	if (ret < 0)
> +		goto out_unregister;
> +
> +	ret = misc_register(&vhost_vsock_misc);
> +	if (ret < 0)
> +		goto out_transport_exit;
> +	return ret;
> +
> +out_transport_exit:
> +	virtio_transport_exit(&vhost_transport);
> +
> +out_unregister:
> +	vsock_core_unregister(&vhost_transport.transport);
> +	return ret;
> +
>  };
>  
>  static void __exit vhost_vsock_exit(void)
>  {
>  	misc_deregister(&vhost_vsock_misc);
>  	vsock_core_unregister(&vhost_transport.transport);
> +	virtio_transport_exit(&vhost_transport);
>  };
>  
>  module_init(vhost_vsock_init);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 9a37eddbb87a..5d7e7fbd75f8 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -91,10 +91,20 @@ struct virtio_transport {
>  	/* This must be the first field */
>  	struct vsock_transport transport;
>  
> +	/* Used almost exclusively for qdisc */
> +	struct net_device *dev;
> +
>  	/* Takes ownership of the packet */
>  	int (*send_pkt)(struct sk_buff *skb);
>  };
>  
> +int
> +virtio_transport_init(struct virtio_transport *t,
> +		      const char *name);
> +
> +void
> +virtio_transport_exit(struct virtio_transport *t);
> +
>  ssize_t
>  virtio_transport_stream_dequeue(struct vsock_sock *vsk,
>  				struct msghdr *msg,
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 3bb293fd8607..c6212eb38d3c 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -131,7 +131,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>  		 * the vq
>  		 */
>  		if (ret < 0) {
> -			skb_queue_head(&vsock->send_pkt_queue, skb);
> +			spin_lock_bh(&vsock->send_pkt_queue.lock);
> +			__skb_queue_head(&vsock->send_pkt_queue, skb);
> +			spin_unlock_bh(&vsock->send_pkt_queue.lock);
>  			break;
>  		}
>  
> @@ -676,7 +678,9 @@ static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
>  		kfree_skb(skb);
>  	mutex_unlock(&vsock->tx_lock);
>  
> -	skb_queue_purge(&vsock->send_pkt_queue);
> +	spin_lock_bh(&vsock->send_pkt_queue.lock);
> +	__skb_queue_purge(&vsock->send_pkt_queue);
> +	spin_unlock_bh(&vsock->send_pkt_queue.lock);
>  
>  	/* Delete virtqueues and flush outstanding callbacks if any */
>  	vdev->config->del_vqs(vdev);
> @@ -760,6 +764,8 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
>  	flush_work(&vsock->event_work);
>  	flush_work(&vsock->send_pkt_work);
>  
> +	virtio_transport_exit(&virtio_transport);
> +
>  	mutex_unlock(&the_virtio_vsock_mutex);
>  
>  	kfree(vsock);
> @@ -844,12 +850,18 @@ static int __init virtio_vsock_init(void)
>  	if (ret)
>  		goto out_wq;
>  
> -	ret = register_virtio_driver(&virtio_vsock_driver);
> +	ret = virtio_transport_init(&virtio_transport, "virtio-vsock");
>  	if (ret)
>  		goto out_vci;
>  
> +	ret = register_virtio_driver(&virtio_vsock_driver);
> +	if (ret)
> +		goto out_transport;
> +
>  	return 0;
>  
> +out_transport:
> +	virtio_transport_exit(&virtio_transport);
>  out_vci:
>  	vsock_core_unregister(&virtio_transport.transport);
>  out_wq:
> @@ -861,6 +873,7 @@ static void __exit virtio_vsock_exit(void)
>  {
>  	unregister_virtio_driver(&virtio_vsock_driver);
>  	vsock_core_unregister(&virtio_transport.transport);
> +	virtio_transport_exit(&virtio_transport);
>  	destroy_workqueue(virtio_vsock_workqueue);
>  }
>  
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index d5780599fe93..bdf16fff054f 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -16,6 +16,7 @@
>  
>  #include <net/sock.h>
>  #include <net/af_vsock.h>
> +#include <net/pkt_sched.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vsock_virtio_transport_common.h>
> @@ -23,6 +24,93 @@
>  /* How long to wait for graceful shutdown of a connection */
>  #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
>  
> +struct virtio_transport_priv {
> +	struct virtio_transport *trans;
> +};
> +
> +static netdev_tx_t virtio_transport_start_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct virtio_transport *t =
> +		((struct virtio_transport_priv *)netdev_priv(dev))->trans;
> +	int ret;
> +
> +	ret = t->send_pkt(skb);
> +	if (unlikely(ret == -ENODEV))
> +		return NETDEV_TX_BUSY;
> +
> +	return NETDEV_TX_OK;
> +}
> +
> +const struct net_device_ops virtio_transport_netdev_ops = {
> +	.ndo_start_xmit = virtio_transport_start_xmit,
> +};
> +
> +static void virtio_transport_setup(struct net_device *dev)
> +{
> +	dev->netdev_ops = &virtio_transport_netdev_ops;
> +	dev->needs_free_netdev = true;
> +	dev->flags = IFF_NOARP;
> +	dev->mtu = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> +	dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> +}
> +
> +static int ifup(struct net_device *dev)
> +{
> +	int ret;
> +
> +	rtnl_lock();
> +	ret = dev_open(dev, NULL) ? -ENOMEM : 0;
> +	rtnl_unlock();
> +
> +	return ret;
> +}
> +
> +/* virtio_transport_init - initialize a virtio vsock transport layer
> + *
> + * @t: ptr to the virtio transport struct to initialize
> + * @name: the name of the net_device to be created.
> + *
> + * Return 0 on success, otherwise negative errno.
> + */
> +int virtio_transport_init(struct virtio_transport *t, const char *name)
> +{
> +	struct virtio_transport_priv *priv;
> +	int ret;
> +
> +	t->dev = alloc_netdev(sizeof(*priv), name, NET_NAME_UNKNOWN, virtio_transport_setup);
> +	if (!t->dev)
> +		return -ENOMEM;
> +
> +	priv = netdev_priv(t->dev);
> +	priv->trans = t;
> +
> +	ret = register_netdev(t->dev);
> +	if (ret < 0)
> +		goto out_free_netdev;
> +
> +	ret = ifup(t->dev);
> +	if (ret < 0)
> +		goto out_unregister_netdev;
> +
> +	return 0;
> +
> +out_unregister_netdev:
> +	unregister_netdev(t->dev);
> +
> +out_free_netdev:
> +	free_netdev(t->dev);
> +
> +	return ret;
> +}
> +
> +void virtio_transport_exit(struct virtio_transport *t)
> +{
> +	if (t->dev) {
> +		unregister_netdev(t->dev);
> +		free_netdev(t->dev);
> +	}
> +}
> +
>  static const struct virtio_transport *
>  virtio_transport_get_ops(struct vsock_sock *vsk)
>  {
> @@ -147,6 +235,24 @@ static u16 virtio_transport_get_type(struct sock *sk)
>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>  }
>  
> +/* Return pkt->len on success, otherwise negative errno */
> +static int virtio_transport_send_pkt(const struct virtio_transport *t, struct sk_buff *skb)
> +{
> +	int ret;
> +	int len = skb->len;
> +
> +	if (unlikely(!t->dev || !(t->dev->flags & IFF_UP)))
> +		return t->send_pkt(skb);
> +
> +	skb->dev = t->dev;
> +	ret = dev_queue_xmit(skb);
> +
> +	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN))
> +		return len;
> +
> +	return -ENOMEM;
> +}
> +
>  /* This function can only be used on connecting/connected sockets,
>   * since a socket assigned to a transport is required.
>   *
> @@ -202,9 +308,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  
>  	virtio_transport_inc_tx_pkt(vvs, skb);
>  
> -	err = t_ops->send_pkt(skb);
> -
> -	return err < 0 ? -ENOMEM : err;
> +	return virtio_transport_send_pkt(t_ops, skb);
>  }
>  
>  static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> @@ -834,7 +938,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  		return -ENOTCONN;
>  	}
>  
> -	return t->send_pkt(reply);
> +	return virtio_transport_send_pkt(t, reply);
>  }
>  
>  /* This function should be called with sk_lock held and SOCK_DONE set */
> -- 
> 2.35.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16 16:38   ` Michael S. Tsirkin
  2022-08-16  6:18     ` Bobby Eshleman
@ 2022-08-16 18:07     ` Jakub Kicinski
  2022-08-16  7:02       ` Bobby Eshleman
  1 sibling, 1 reply; 67+ messages in thread
From: Jakub Kicinski @ 2022-08-16 18:07 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Michael S. Tsirkin, Bobby Eshleman, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Tue, 16 Aug 2022 12:38:52 -0400 Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > In order to support usage of qdisc on vsock traffic, this commit
> > introduces a struct net_device to vhost and virtio vsock.
> > 
> > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > for virtio. The devices are attached to the respective transports.
> > 
> > To bypass the usage of the device, the user may "down" the associated
> > network interface using common tools. For example, "ip link set dev
> > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > simply using the FIFO logic of the prior implementation.  
> 
> Ugh. That's quite a hack. Mark my words, at some point we will decide to
> have down mean "down".  Besides, multiple net devices with link up tend
> to confuse userspace. So might want to keep it down at all times
> even short term.

Agreed!

From a cursory look (and Documentation/ would be nice..) it feels
very wrong to me. Do you know of any uses of a netdev which would 
be semantically similar to what you're doing? Treating netdevs as
buildings blocks for arbitrary message passing solutions is something 
I dislike quite strongly.

Could you recommend where I can learn more about vsocks?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-18  8:35           ` Arseniy Krasnov
@ 2022-08-16 20:52             ` Bobby Eshleman
  2022-08-19  4:30               ` Arseniy Krasnov
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-16 20:52 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: kvm, jasowang, bobby.eshleman, davem, virtio-dev, stefanha,
	bobby.eshleman, linux-kernel, pabeni, edumazet, jiang.wang,
	sgarzare, kuba, cong.wang, netdev, virtualization, mst

On Thu, Aug 18, 2022 at 08:35:48AM +0000, Arseniy Krasnov wrote:
> On Tue, 2022-08-16 at 09:58 +0000, Bobby Eshleman wrote:
> > On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> > > On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > > > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > > > CC'ing virtio-dev@lists.oasis-open.org
> > > > > 
> > > > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> > > > > > This patch supports dgram in virtio and on the vhost side.
> > > > Hello,
> > > > 
> > > > sorry, i don't understand, how this maintains message boundaries?
> > > > Or it
> > > > is unnecessary for SOCK_DGRAM?
> > > > 
> > > > Thanks
> > > > > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > > > > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > > > > > ---
> > > > > >  drivers/vhost/vsock.c                   |   2 +-
> > > > > >  include/net/af_vsock.h                  |   2 +
> > > > > >  include/uapi/linux/virtio_vsock.h       |   1 +
> > > > > >  net/vmw_vsock/af_vsock.c                |  26 +++-
> > > > > >  net/vmw_vsock/virtio_transport.c        |   2 +-
> > > > > >  net/vmw_vsock/virtio_transport_common.c | 173
> > > > > > ++++++++++++++++++++++--
> > > > > >  6 files changed, 186 insertions(+), 20 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > > > --- a/drivers/vhost/vsock.c
> > > > > > +++ b/drivers/vhost/vsock.c
> > > > > > @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> > > > > >  	int ret;
> > > > > >  
> > > > > >  	ret = vsock_core_register(&vhost_transport.transport,
> > > > > > -				  VSOCK_TRANSPORT_F_H2G);
> > > > > > +				  VSOCK_TRANSPORT_F_H2G |
> > > > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > > >  	if (ret < 0)
> > > > > >  		return ret;
> > > > > >  
> > > > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > > > --- a/include/net/af_vsock.h
> > > > > > +++ b/include/net/af_vsock.h
> > > > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > > > >  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > > > >  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > > > >  struct sock *vsock_create_connected(struct sock *parent);
> > > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > > +		      struct sockaddr_vm *addr);
> > > > > >  
> > > > > >  /**** TRANSPORT ****/
> > > > > >  
> > > > > > diff --git a/include/uapi/linux/virtio_vsock.h
> > > > > > b/include/uapi/linux/virtio_vsock.h
> > > > > > index 857df3a3a70d..0975b9c88292 100644
> > > > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > > > >  enum virtio_vsock_type {
> > > > > >  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > > > >  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > > > +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > > > >  };
> > > > > >  
> > > > > >  enum virtio_vsock_op {
> > > > > > diff --git a/net/vmw_vsock/af_vsock.c
> > > > > > b/net/vmw_vsock/af_vsock.c
> > > > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > > > --- a/net/vmw_vsock/af_vsock.c
> > > > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > > > @@ -675,6 +675,19 @@ static int
> > > > > > __vsock_bind_connectible(struct vsock_sock *vsk,
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  
> > > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > > +		      struct sockaddr_vm *addr)
> > > > > > +{
> > > > > > +	int retval;
> > > > > > +
> > > > > > +	spin_lock_bh(&vsock_table_lock);
> > > > > > +	retval = __vsock_bind_connectible(vsk, addr);
> > > > > > +	spin_unlock_bh(&vsock_table_lock);
> > > > > > +
> > > > > > +	return retval;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > > > +
> > > > > >  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > > > >  			      struct sockaddr_vm *addr)
> > > > > >  {
> > > > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct
> > > > > > vsock_transport *t, int features)
> > > > > >  	}
> > > > > >  
> > > > > >  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > > > -		if (t_dgram) {
> > > > > > -			err = -EBUSY;
> > > > > > -			goto err_busy;
> > > > > > +		/* TODO: always chose the G2H variant over
> > > > > > others, support nesting later */
> > > > > > +		if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > > > +			if (t_dgram)
> > > > > > +				pr_warn("virtio_vsock: t_dgram
> > > > > > already set\n");
> > > > > > +			t_dgram = t;
> > > > > > +		}
> > > > > > +
> > > > > > +		if (!t_dgram) {
> > > > > > +			t_dgram = t;
> > > > > >  		}
> > > > > > -		t_dgram = t;
> > > > > >  	}
> > > > > >  
> > > > > >  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > > > diff --git a/net/vmw_vsock/virtio_transport.c
> > > > > > b/net/vmw_vsock/virtio_transport.c
> > > > > > index 073314312683..d4526ca462d2 100644
> > > > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > > > @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> > > > > >  		return -ENOMEM;
> > > > > >  
> > > > > >  	ret = vsock_core_register(&virtio_transport.transport,
> > > > > > -				  VSOCK_TRANSPORT_F_G2H);
> > > > > > +				  VSOCK_TRANSPORT_F_G2H |
> > > > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > > >  	if (ret)
> > > > > >  		goto out_wq;
> > > > > >  
> > > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c
> > > > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > > > index bdf16fff054f..aedb48728677 100644
> > > > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > > > @@ -229,7 +229,9 @@
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > > > >  
> > > > > >  static u16 virtio_transport_get_type(struct sock *sk)
> > > > > >  {
> > > > > > -	if (sk->sk_type == SOCK_STREAM)
> > > > > > +	if (sk->sk_type == SOCK_DGRAM)
> > > > > > +		return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > > > +	else if (sk->sk_type == SOCK_STREAM)
> > > > > >  		return VIRTIO_VSOCK_TYPE_STREAM;
> > > > > >  	else
> > > > > >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > > > @@ -287,22 +289,29 @@ static int
> > > > > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > > > >  	vvs = vsk->trans;
> > > > > >  
> > > > > >  	/* we can send less than pkt_len bytes */
> > > > > > -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > > > -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > > +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > > +			pkt_len =
> > > > > > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > > +		else
> > > > > > +			return 0;
> > > > > > +	}
> > > > > >  
> > > > > > -	/* virtio_transport_get_credit might return less than
> > > > > > pkt_len credit */
> > > > > > -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > > > +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > +		/* virtio_transport_get_credit might return
> > > > > > less than pkt_len credit */
> > > > > > +		pkt_len = virtio_transport_get_credit(vvs,
> > > > > > pkt_len);
> > > > > >  
> > > > > > -	/* Do not send zero length OP_RW pkt */
> > > > > > -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > > > -		return pkt_len;
> > > > > > +		/* Do not send zero length OP_RW pkt */
> > > > > > +		if (pkt_len == 0 && info->op ==
> > > > > > VIRTIO_VSOCK_OP_RW)
> > > > > > +			return pkt_len;
> > > > > > +	}
> > > > > >  
> > > > > >  	skb = virtio_transport_alloc_skb(info, pkt_len,
> > > > > >  					 src_cid, src_port,
> > > > > >  					 dst_cid, dst_port,
> > > > > >  					 &err);
> > > > > >  	if (!skb) {
> > > > > > -		virtio_transport_put_credit(vvs, pkt_len);
> > > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > > +			virtio_transport_put_credit(vvs,
> > > > > > pkt_len);
> > > > > >  		return err;
> > > > > >  	}
> > > > > >  
> > > > > > @@ -586,6 +595,61 @@
> > > > > > virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> > > > > >  
> > > > > > +static ssize_t
> > > > > > +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> > > > > > +				  struct msghdr *msg, size_t
> > > > > > len)
> > > > > > +{
> > > > > > +	struct virtio_vsock_sock *vvs = vsk->trans;
> > > > > > +	struct sk_buff *skb;
> > > > > > +	size_t total = 0;
> > > > > > +	u32 free_space;
> > > > > > +	int err = -EFAULT;
> > > > > > +
> > > > > > +	spin_lock_bh(&vvs->rx_lock);
> > > > > > +	if (total < len && !skb_queue_empty_lockless(&vvs-
> > > > > > >rx_queue)) {
> > > > > > +		skb = __skb_dequeue(&vvs->rx_queue);
> > > > > > +
> > > > > > +		total = len;
> > > > > > +		if (total > skb->len - vsock_metadata(skb)-
> > > > > > >off)
> > > > > > +			total = skb->len - vsock_metadata(skb)-
> > > > > > >off;
> > > > > > +		else if (total < skb->len -
> > > > > > vsock_metadata(skb)->off)
> > > > > > +			msg->msg_flags |= MSG_TRUNC;
> > > > > > +
> > > > > > +		/* sk_lock is held by caller so no one else can
> > > > > > dequeue.
> > > > > > +		 * Unlock rx_lock since memcpy_to_msg() may
> > > > > > sleep.
> > > > > > +		 */
> > > > > > +		spin_unlock_bh(&vvs->rx_lock);
> > > > > > +
> > > > > > +		err = memcpy_to_msg(msg, skb->data +
> > > > > > vsock_metadata(skb)->off, total);
> > > > > > +		if (err)
> > > > > > +			return err;
> > > > > > +
> > > > > > +		spin_lock_bh(&vvs->rx_lock);
> > > > > > +
> > > > > > +		virtio_transport_dec_rx_pkt(vvs, skb);
> > > > > > +		consume_skb(skb);
> > > > > > +	}
> > > > > > +
> > > > > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs-
> > > > > > >last_fwd_cnt);
> > > > > > +
> > > > > > +	spin_unlock_bh(&vvs->rx_lock);
> > > > > > +
> > > > > > +	if (total > 0 && msg->msg_name) {
> > > > > > +		/* Provide the address of the sender. */
> > > > > > +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr,
> > > > > > msg->msg_name);
> > > > > > +
> > > > > > +		vsock_addr_init(vm_addr,
> > > > > > le64_to_cpu(vsock_hdr(skb)->src_cid),
> > > > > > +				le32_to_cpu(vsock_hdr(skb)-
> > > > > > >src_port));
> > > > > > +		msg->msg_namelen = sizeof(*vm_addr);
> > > > > > +	}
> > > > > > +	return total;
> > > > > > +}
> > > > > > +
> > > > > > +static s64 virtio_transport_dgram_has_data(struct vsock_sock
> > > > > > *vsk)
> > > > > > +{
> > > > > > +	return virtio_transport_stream_has_data(vsk);
> > > > > > +}
> > > > > > +
> > > > > >  int
> > > > > >  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> > > > > >  				   struct msghdr *msg,
> > > > > > @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct
> > > > > > vsock_sock *vsk,
> > > > > >  			       struct msghdr *msg,
> > > > > >  			       size_t len, int flags)
> > > > > >  {
> > > > > > -	return -EOPNOTSUPP;
> > > > > > +	struct sock *sk;
> > > > > > +	size_t err = 0;
> > > > > > +	long timeout;
> > > > > > +
> > > > > > +	DEFINE_WAIT(wait);
> > > > > > +
> > > > > > +	sk = &vsk->sk;
> > > > > > +	err = 0;
> > > > > > +
> > > > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags &
> > > > > > MSG_PEEK)
> > > > > > +		return -EOPNOTSUPP;
> > > > > > +
> > > > > > +	lock_sock(sk);
> > > > > > +
> > > > > > +	if (!len)
> > > > > > +		goto out;
> > > > > > +
> > > > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > > > +
> > > > > > +	while (1) {
> > > > > > +		s64 ready;
> > > > > > +
> > > > > > +		prepare_to_wait(sk_sleep(sk), &wait,
> > > > > > TASK_INTERRUPTIBLE);
> > > > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > > > +
> > > > > > +		if (ready == 0) {
> > > > > > +			if (timeout == 0) {
> > > > > > +				err = -EAGAIN;
> > > > > > +				finish_wait(sk_sleep(sk),
> > > > > > &wait);
> > > > > > +				break;
> > > > > > +			}
> > > > > > +
> > > > > > +			release_sock(sk);
> > > > > > +			timeout = schedule_timeout(timeout);
> > > > > > +			lock_sock(sk);
> > > > > > +
> > > > > > +			if (signal_pending(current)) {
> > > > > > +				err = sock_intr_errno(timeout);
> > > > > > +				finish_wait(sk_sleep(sk),
> > > > > > &wait);
> > > > > > +				break;
> > > > > > +			} else if (timeout == 0) {
> > > > > > +				err = -EAGAIN;
> > > > > > +				finish_wait(sk_sleep(sk),
> > > > > > &wait);
> > > > > > +				break;
> > > > > > +			}
> > > > > > +		} else {
> > > > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > > > +
> > > > > > +			if (ready < 0) {
> > > > > > +				err = -ENOMEM;
> > > > > > +				goto out;
> > > > > > +			}
> > > > > > +
> > > > > > +			err =
> > > > > > virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > > > +out:
> > > > > > +	release_sock(sk);
> > > > > > +	return err;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > > ^^^
> > > May be, this generic data waiting logic should be in af_vsock.c, as
> > > for stream/seqpacket?
> > > In this way, another transport which supports SOCK_DGRAM could
> > > reuse it.
> > 
> > I think that is a great idea. I'll test that change for v2.
> > 
> > Thanks.
> 
> Also for v2, i tested Your patchset a little bit(write here to not
> spread over all mails):
> 1) seqpacket test in vsock_test.c fails(seems MSG_EOR flag issue)

I will investigate.

> 2) i can't do rmmod with the following config(after testing):
>    CONFIG_VSOCKETS=m
>    CONFIG_VIRTIO_VSOCKETS=m
>    CONFIG_VIRTIO_VSOCKETS_COMMON=m
>    CONFIG_VHOST=m
>    CONFIG_VHOST_VSOCK=m
>    Guest is shutdown, but rmmod fails.
> 3) virtio_transport_init + virtio_transport_exit seems must be
>    under EXPORT_SYMBOL_GPL(), because both used in another module.

Definitely, will fix.

> 4) I tried to send 5kb(or 20kb not matter) piece of data, but got      
>    kernel panic both in guest and later in host.
> 

Thanks for catching that. I can reproduce it intermittently, but only
for seqpacket. Did you happen to see this for other socket types as
well?

Thanks

> Thank You
> > 
> > > > > >  
> > > > > > @@ -819,13 +942,13 @@
> > > > > > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > > > >  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > > > >  				struct sockaddr_vm *addr)
> > > > > >  {
> > > > > > -	return -EOPNOTSUPP;
> > > > > > +	return vsock_bind_stream(vsk, addr);
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > > > >  
> > > > > >  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > > > >  {
> > > > > > -	return false;
> > > > > > +	return true;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > > > >  
> > > > > > @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct
> > > > > > vsock_sock *vsk,
> > > > > >  			       struct msghdr *msg,
> > > > > >  			       size_t dgram_len)
> > > > > >  {
> > > > > > -	return -EOPNOTSUPP;
> > > > > > +	struct virtio_vsock_pkt_info info = {
> > > > > > +		.op = VIRTIO_VSOCK_OP_RW,
> > > > > > +		.msg = msg,
> > > > > > +		.pkt_len = dgram_len,
> > > > > > +		.vsk = vsk,
> > > > > > +		.remote_cid = remote_addr->svm_cid,
> > > > > > +		.remote_port = remote_addr->svm_port,
> > > > > > +	};
> > > > > > +
> > > > > > +	return virtio_transport_send_pkt_info(vsk, &info);
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > > > > >  
> > > > > > @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct
> > > > > > sock *sk,
> > > > > >  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > > > >  	int err = 0;
> > > > > >  
> > > > > > +	if (le16_to_cpu(vsock_hdr(skb)->type) ==
> > > > > > VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > +		virtio_transport_recv_enqueue(vsk, skb);
> > > > > > +		sk->sk_data_ready(sk);
> > > > > > +		return err;
> > > > > > +	}
> > > > > > +
> > > > > >  	switch (le16_to_cpu(hdr->op)) {
> > > > > >  	case VIRTIO_VSOCK_OP_RW:
> > > > > >  		virtio_transport_recv_enqueue(vsk, skb);
> > > > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct
> > > > > > sock *sk, struct sk_buff *skb,
> > > > > >  static bool virtio_transport_valid_type(u16 type)
> > > > > >  {
> > > > > >  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > > > -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > > > +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > > > +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > > > >  }
> > > > > >  
> > > > > >  /* We are under the virtio-vsock's vsock->rx_lock or vhost-
> > > > > > vsock's vq->mutex
> > > > > > @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct
> > > > > > virtio_transport *t,
> > > > > >  		goto free_pkt;
> > > > > >  	}
> > > > > >  
> > > > > > +	if (sk->sk_type == SOCK_DGRAM) {
> > > > > > +		virtio_transport_recv_connected(sk, skb);
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > >  	space_available = virtio_transport_space_update(sk,
> > > > > > skb);
> > > > > >  
> > > > > >  	/* Update CID in case it has changed after a transport
> > > > > > reset event */
> > > > > > @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct
> > > > > > virtio_transport *t,
> > > > > >  		break;
> > > > > >  	}
> > > > > >  
> > > > > > +out:
> > > > > >  	release_sock(sk);
> > > > > >  
> > > > > >  	/* Release refcnt obtained when we fetched this socket
> > > > > > out of the
> > > > > > -- 
> > > > > > 2.35.1
> > > > > > 
> > > > > 
> > > > > -------------------------------------------------------------
> > > > > --------
> > > > > To unsubscribe, e-mail: 
> > > > > virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > For additional commands, e-mail: 
> > > > > virtio-dev-help@lists.oasis-open.org
> > > > > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16  7:02       ` Bobby Eshleman
@ 2022-08-16 23:07         ` Jakub Kicinski
  2022-08-16  8:29           ` Bobby Eshleman
  2022-08-17  1:23           ` [External] " Cong Wang .
  0 siblings, 2 replies; 67+ messages in thread
From: Jakub Kicinski @ 2022-08-16 23:07 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Michael S. Tsirkin, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel, Toke Høiland-Jørgensen

On Tue, 16 Aug 2022 07:02:33 +0000 Bobby Eshleman wrote:
> > From a cursory look (and Documentation/ would be nice..) it feels
> > very wrong to me. Do you know of any uses of a netdev which would 
> > be semantically similar to what you're doing? Treating netdevs as
> > buildings blocks for arbitrary message passing solutions is something 
> > I dislike quite strongly.  
> 
> The big difference between vsock and "arbitrary message passing" is that
> vsock is actually constrained by the virtio device that backs it (made
> up of virtqueues and the underlying protocol). That virtqueue pair is
> acting like the queues on a physical NIC, so it actually makes sense to
> manage the queuing of vsock's device like we would manage the queueing
> of a real device.
> 
> Still, I concede that ignoring the netdev state is a probably bad idea.
> 
> That said, I also think that using packet scheduling in vsock is a good
> idea, and that ideally we can reuse Linux's already robust library of
> packet scheduling algorithms by introducing qdisc somehow.

We've been burnt in the past by people doing the "let me just pick
these useful pieces out of netdev" thing. Makes life hard both for
maintainers and users trying to make sense of the interfaces.

What comes to mind if you're just after queuing is that we already
bastardized the CoDel implementation (include/net/codel_impl.h).
If CoDel is good enough for you maybe that's the easiest way?
Although I suspect that you're after fairness not early drops.
Wireless folks use CoDel as a second layer queuing. (CC: Toke)

> > Could you recommend where I can learn more about vsocks?  
> 
> I think the spec is probably the best place to start[1].
> 
> [1]: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html

Eh, I was hoping it was a side channel of an existing virtio_net 
which is not the case. Given the zero-config requirement IDK if 
we'll be able to fit this into netdev semantics :(

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16  8:29           ` Bobby Eshleman
@ 2022-08-17  1:15             ` Jakub Kicinski
  2022-08-16 10:50               ` Bobby Eshleman
  0 siblings, 1 reply; 67+ messages in thread
From: Jakub Kicinski @ 2022-08-17  1:15 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Michael S. Tsirkin, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel, Toke Høiland-Jørgensen

On Tue, 16 Aug 2022 08:29:04 +0000 Bobby Eshleman wrote:
> > We've been burnt in the past by people doing the "let me just pick
> > these useful pieces out of netdev" thing. Makes life hard both for
> > maintainers and users trying to make sense of the interfaces.
> > 
> > What comes to mind if you're just after queuing is that we already
> > bastardized the CoDel implementation (include/net/codel_impl.h).
> > If CoDel is good enough for you maybe that's the easiest way?
> > Although I suspect that you're after fairness not early drops.
> > Wireless folks use CoDel as a second layer queuing. (CC: Toke)
> 
> That is certainly interesting to me. Sitting next to "codel_impl.h" is
> "include/net/fq_impl.h", and it looks like it may solve the datagram
> flooding issue. The downside to this approach is the baking of a
> specific policy into vsock... which I don't exactly love either.
> 
> I'm not seeing too many other of these qdisc bastardizations in
> include/net, are there any others that you are aware of?

Just what wireless uses (so codel and fq as you found out), nothing
else comes to mind.

> > Eh, I was hoping it was a side channel of an existing virtio_net 
> > which is not the case. Given the zero-config requirement IDK if 
> > we'll be able to fit this into netdev semantics :(  
> 
> It's certainly possible that it may not fit :/ I feel that it partially
> depends on what we mean by zero-config. Is it "no config required to
> have a working socket" or is it "no config required, but also no
> tuning/policy/etc... supported"?

The value of tuning vs confusion of a strange netdev floating around
in the system is hard to estimate upfront. 

The nice thing about using a built-in fq with no user visible knobs is
that there's no extra uAPI. We can always rip it out and replace later.
And it shouldn't be controversial, making the path to upstream smoother.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [External] Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16 23:07         ` Jakub Kicinski
  2022-08-16  8:29           ` Bobby Eshleman
@ 2022-08-17  1:23           ` Cong Wang .
  1 sibling, 0 replies; 67+ messages in thread
From: Cong Wang . @ 2022-08-17  1:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bobby Eshleman, Bobby Eshleman, Michael S. Tsirkin,
	Bobby Eshleman, Jiang Wang, Stefan Hajnoczi, Stefano Garzarella,
	Jason Wang, David S. Miller, Eric Dumazet, Paolo Abeni, kvm,
	virtualization, netdev, linux-kernel,
	Toke Høiland-Jørgensen

On Tue, Aug 16, 2022 at 4:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 16 Aug 2022 07:02:33 +0000 Bobby Eshleman wrote:
> > > From a cursory look (and Documentation/ would be nice..) it feels
> > > very wrong to me. Do you know of any uses of a netdev which would
> > > be semantically similar to what you're doing? Treating netdevs as
> > > buildings blocks for arbitrary message passing solutions is something
> > > I dislike quite strongly.
> >
> > The big difference between vsock and "arbitrary message passing" is that
> > vsock is actually constrained by the virtio device that backs it (made
> > up of virtqueues and the underlying protocol). That virtqueue pair is
> > acting like the queues on a physical NIC, so it actually makes sense to
> > manage the queuing of vsock's device like we would manage the queueing
> > of a real device.
> >
> > Still, I concede that ignoring the netdev state is a probably bad idea.
> >
> > That said, I also think that using packet scheduling in vsock is a good
> > idea, and that ideally we can reuse Linux's already robust library of
> > packet scheduling algorithms by introducing qdisc somehow.
>
> We've been burnt in the past by people doing the "let me just pick
> these useful pieces out of netdev" thing. Makes life hard both for
> maintainers and users trying to make sense of the interfaces.

I interpret this in a different way: we just believe "one size does
not fit all",
as most Linux kernel developers do. I am very surprised you don't.

Feel free to suggest any other ways, eventually you will need to
reimplement TC one way or the other.

If you think about it in another way, vsock is networking too, its name
contains a "sock", do I need to say more? :)

>
> What comes to mind if you're just after queuing is that we already
> bastardized the CoDel implementation (include/net/codel_impl.h).
> If CoDel is good enough for you maybe that's the easiest way?
> Although I suspect that you're after fairness not early drops.
> Wireless folks use CoDel as a second layer queuing. (CC: Toke)

What makes you believe CoDel fits all cases? If it really does, you
probably have to convince Toke to give up his idea on XDP map
as it would no longer make any sense. I don't see you raise such
an argument there... What makes you treat this differently with XDP
map? I am very curious about your thought process here. ;-)

Thanks.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-16  2:32   ` Bobby Eshleman
@ 2022-08-17  5:01     ` Arseniy Krasnov
  2022-08-16  9:57       ` Bobby Eshleman
  2022-08-17  5:42       ` Arseniy Krasnov
  0 siblings, 2 replies; 67+ messages in thread
From: Arseniy Krasnov @ 2022-08-17  5:01 UTC (permalink / raw)
  To: Bobby Eshleman, Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

On 16.08.2022 05:32, Bobby Eshleman wrote:
> CC'ing virtio-dev@lists.oasis-open.org
> 
> On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
>> This patch supports dgram in virtio and on the vhost side.
Hello,

sorry, i don't understand, how this maintains message boundaries? Or it
is unnecessary for SOCK_DGRAM?

Thanks
>>
>> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>> ---
>>  drivers/vhost/vsock.c                   |   2 +-
>>  include/net/af_vsock.h                  |   2 +
>>  include/uapi/linux/virtio_vsock.h       |   1 +
>>  net/vmw_vsock/af_vsock.c                |  26 +++-
>>  net/vmw_vsock/virtio_transport.c        |   2 +-
>>  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
>>  6 files changed, 186 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index a5d1bdb786fe..3dc72a5647ca 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
>>  	int ret;
>>  
>>  	ret = vsock_core_register(&vhost_transport.transport,
>> -				  VSOCK_TRANSPORT_F_H2G);
>> +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
>>  	if (ret < 0)
>>  		return ret;
>>  
>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>> index 1c53c4c4d88f..37e55c81e4df 100644
>> --- a/include/net/af_vsock.h
>> +++ b/include/net/af_vsock.h
>> @@ -78,6 +78,8 @@ struct vsock_sock {
>>  s64 vsock_stream_has_data(struct vsock_sock *vsk);
>>  s64 vsock_stream_has_space(struct vsock_sock *vsk);
>>  struct sock *vsock_create_connected(struct sock *parent);
>> +int vsock_bind_stream(struct vsock_sock *vsk,
>> +		      struct sockaddr_vm *addr);
>>  
>>  /**** TRANSPORT ****/
>>  
>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> index 857df3a3a70d..0975b9c88292 100644
>> --- a/include/uapi/linux/virtio_vsock.h
>> +++ b/include/uapi/linux/virtio_vsock.h
>> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
>>  enum virtio_vsock_type {
>>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
>>  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
>> +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
>>  };
>>  
>>  enum virtio_vsock_op {
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index 1893f8aafa48..87e4ae1866d3 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>>  	return 0;
>>  }
>>  
>> +int vsock_bind_stream(struct vsock_sock *vsk,
>> +		      struct sockaddr_vm *addr)
>> +{
>> +	int retval;
>> +
>> +	spin_lock_bh(&vsock_table_lock);
>> +	retval = __vsock_bind_connectible(vsk, addr);
>> +	spin_unlock_bh(&vsock_table_lock);
>> +
>> +	return retval;
>> +}
>> +EXPORT_SYMBOL(vsock_bind_stream);
>> +
>>  static int __vsock_bind_dgram(struct vsock_sock *vsk,
>>  			      struct sockaddr_vm *addr)
>>  {
>> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>>  	}
>>  
>>  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
>> -		if (t_dgram) {
>> -			err = -EBUSY;
>> -			goto err_busy;
>> +		/* TODO: always chose the G2H variant over others, support nesting later */
>> +		if (features & VSOCK_TRANSPORT_F_G2H) {
>> +			if (t_dgram)
>> +				pr_warn("virtio_vsock: t_dgram already set\n");
>> +			t_dgram = t;
>> +		}
>> +
>> +		if (!t_dgram) {
>> +			t_dgram = t;
>>  		}
>> -		t_dgram = t;
>>  	}
>>  
>>  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 073314312683..d4526ca462d2 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
>>  		return -ENOMEM;
>>  
>>  	ret = vsock_core_register(&virtio_transport.transport,
>> -				  VSOCK_TRANSPORT_F_G2H);
>> +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
>>  	if (ret)
>>  		goto out_wq;
>>  
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index bdf16fff054f..aedb48728677 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>>  
>>  static u16 virtio_transport_get_type(struct sock *sk)
>>  {
>> -	if (sk->sk_type == SOCK_STREAM)
>> +	if (sk->sk_type == SOCK_DGRAM)
>> +		return VIRTIO_VSOCK_TYPE_DGRAM;
>> +	else if (sk->sk_type == SOCK_STREAM)
>>  		return VIRTIO_VSOCK_TYPE_STREAM;
>>  	else
>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  	vvs = vsk->trans;
>>  
>>  	/* we can send less than pkt_len bytes */
>> -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
>> -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>> +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
>> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
>> +			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>> +		else
>> +			return 0;
>> +	}
>>  
>> -	/* virtio_transport_get_credit might return less than pkt_len credit */
>> -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>> +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
>> +		/* virtio_transport_get_credit might return less than pkt_len credit */
>> +		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>>  
>> -	/* Do not send zero length OP_RW pkt */
>> -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>> -		return pkt_len;
>> +		/* Do not send zero length OP_RW pkt */
>> +		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>> +			return pkt_len;
>> +	}
>>  
>>  	skb = virtio_transport_alloc_skb(info, pkt_len,
>>  					 src_cid, src_port,
>>  					 dst_cid, dst_port,
>>  					 &err);
>>  	if (!skb) {
>> -		virtio_transport_put_credit(vvs, pkt_len);
>> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
>> +			virtio_transport_put_credit(vvs, pkt_len);
>>  		return err;
>>  	}
>>  
>> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
>>  
>> +static ssize_t
>> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
>> +				  struct msghdr *msg, size_t len)
>> +{
>> +	struct virtio_vsock_sock *vvs = vsk->trans;
>> +	struct sk_buff *skb;
>> +	size_t total = 0;
>> +	u32 free_space;
>> +	int err = -EFAULT;
>> +
>> +	spin_lock_bh(&vvs->rx_lock);
>> +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
>> +		skb = __skb_dequeue(&vvs->rx_queue);
>> +
>> +		total = len;
>> +		if (total > skb->len - vsock_metadata(skb)->off)
>> +			total = skb->len - vsock_metadata(skb)->off;
>> +		else if (total < skb->len - vsock_metadata(skb)->off)
>> +			msg->msg_flags |= MSG_TRUNC;
>> +
>> +		/* sk_lock is held by caller so no one else can dequeue.
>> +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
>> +		 */
>> +		spin_unlock_bh(&vvs->rx_lock);
>> +
>> +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
>> +		if (err)
>> +			return err;
>> +
>> +		spin_lock_bh(&vvs->rx_lock);
>> +
>> +		virtio_transport_dec_rx_pkt(vvs, skb);
>> +		consume_skb(skb);
>> +	}
>> +
>> +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
>> +
>> +	spin_unlock_bh(&vvs->rx_lock);
>> +
>> +	if (total > 0 && msg->msg_name) {
>> +		/* Provide the address of the sender. */
>> +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
>> +
>> +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
>> +				le32_to_cpu(vsock_hdr(skb)->src_port));
>> +		msg->msg_namelen = sizeof(*vm_addr);
>> +	}
>> +	return total;
>> +}
>> +
>> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
>> +{
>> +	return virtio_transport_stream_has_data(vsk);
>> +}
>> +
>>  int
>>  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
>>  				   struct msghdr *msg,
>> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
>>  			       struct msghdr *msg,
>>  			       size_t len, int flags)
>>  {
>> -	return -EOPNOTSUPP;
>> +	struct sock *sk;
>> +	size_t err = 0;
>> +	long timeout;
>> +
>> +	DEFINE_WAIT(wait);
>> +
>> +	sk = &vsk->sk;
>> +	err = 0;
>> +
>> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
>> +		return -EOPNOTSUPP;
>> +
>> +	lock_sock(sk);
>> +
>> +	if (!len)
>> +		goto out;
>> +
>> +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
>> +
>> +	while (1) {
>> +		s64 ready;
>> +
>> +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>> +		ready = virtio_transport_dgram_has_data(vsk);
>> +
>> +		if (ready == 0) {
>> +			if (timeout == 0) {
>> +				err = -EAGAIN;
>> +				finish_wait(sk_sleep(sk), &wait);
>> +				break;
>> +			}
>> +
>> +			release_sock(sk);
>> +			timeout = schedule_timeout(timeout);
>> +			lock_sock(sk);
>> +
>> +			if (signal_pending(current)) {
>> +				err = sock_intr_errno(timeout);
>> +				finish_wait(sk_sleep(sk), &wait);
>> +				break;
>> +			} else if (timeout == 0) {
>> +				err = -EAGAIN;
>> +				finish_wait(sk_sleep(sk), &wait);
>> +				break;
>> +			}
>> +		} else {
>> +			finish_wait(sk_sleep(sk), &wait);
>> +
>> +			if (ready < 0) {
>> +				err = -ENOMEM;
>> +				goto out;
>> +			}
>> +
>> +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
>> +			break;
>> +		}
>> +	}
>> +out:
>> +	release_sock(sk);
>> +	return err;
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
>>  
>> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>>  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>>  				struct sockaddr_vm *addr)
>>  {
>> -	return -EOPNOTSUPP;
>> +	return vsock_bind_stream(vsk, addr);
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>>  
>>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
>>  {
>> -	return false;
>> +	return true;
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
>>  
>> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>>  			       struct msghdr *msg,
>>  			       size_t dgram_len)
>>  {
>> -	return -EOPNOTSUPP;
>> +	struct virtio_vsock_pkt_info info = {
>> +		.op = VIRTIO_VSOCK_OP_RW,
>> +		.msg = msg,
>> +		.pkt_len = dgram_len,
>> +		.vsk = vsk,
>> +		.remote_cid = remote_addr->svm_cid,
>> +		.remote_port = remote_addr->svm_port,
>> +	};
>> +
>> +	return virtio_transport_send_pkt_info(vsk, &info);
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>>  
>> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
>>  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>>  	int err = 0;
>>  
>> +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
>> +		virtio_transport_recv_enqueue(vsk, skb);
>> +		sk->sk_data_ready(sk);
>> +		return err;
>> +	}
>> +
>>  	switch (le16_to_cpu(hdr->op)) {
>>  	case VIRTIO_VSOCK_OP_RW:
>>  		virtio_transport_recv_enqueue(vsk, skb);
>> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>>  static bool virtio_transport_valid_type(u16 type)
>>  {
>>  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
>> -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
>> +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
>> +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
>>  }
>>  
>>  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
>> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>>  		goto free_pkt;
>>  	}
>>  
>> +	if (sk->sk_type == SOCK_DGRAM) {
>> +		virtio_transport_recv_connected(sk, skb);
>> +		goto out;
>> +	}
>> +
>>  	space_available = virtio_transport_space_update(sk, skb);
>>  
>>  	/* Update CID in case it has changed after a transport reset event */
>> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>>  		break;
>>  	}
>>  
>> +out:
>>  	release_sock(sk);
>>  
>>  	/* Release refcnt obtained when we fetched this socket out of the
>> -- 
>> 2.35.1
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-16  2:30   ` Bobby Eshleman
@ 2022-08-17  5:28     ` Arseniy Krasnov
  0 siblings, 0 replies; 67+ messages in thread
From: Arseniy Krasnov @ 2022-08-17  5:28 UTC (permalink / raw)
  To: Bobby Eshleman, Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

On 16.08.2022 05:30, Bobby Eshleman wrote:
> CC'ing virtio-dev@lists.oasis-open.org
> 
> On Mon, Aug 15, 2022 at 10:56:05AM -0700, Bobby Eshleman wrote:
>> This commit allows vsock implementations to return errors
>> to the socket layer other than -ENOMEM. One immediate effect
>> of this is that upon the sk_sndbuf threshold being reached -EAGAIN
>> will be returned and userspace may throttle appropriately.
>>
>> Resultingly, a known issue with uperf is resolved[1].
>>
>> Additionally, to preserve legacy behavior for non-virtio
>> implementations, hyperv/vmci force errors to be -ENOMEM so that behavior
>> is unchanged.
>>
>> [1]: https://gitlab.com/vsock/vsock/-/issues/1
>>
>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>> ---
>>  include/linux/virtio_vsock.h            | 3 +++
>>  net/vmw_vsock/af_vsock.c                | 3 ++-
>>  net/vmw_vsock/hyperv_transport.c        | 2 +-
>>  net/vmw_vsock/virtio_transport_common.c | 3 ---
>>  net/vmw_vsock/vmci_transport.c          | 9 ++++++++-
>>  5 files changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index 17ed01466875..9a37eddbb87a 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -8,6 +8,9 @@
>>  #include <net/sock.h>
>>  #include <net/af_vsock.h>
>>  
>> +/* Threshold for detecting small packets to copy */
>> +#define GOOD_COPY_LEN  128
>> +
>>  enum virtio_vsock_metadata_flags {
>>  	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
>>  	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index e348b2d09eac..1893f8aafa48 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -1844,8 +1844,9 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
>>  			written = transport->stream_enqueue(vsk,
>>  					msg, len - total_written);
>>  		}
>> +
>>  		if (written < 0) {
>> -			err = -ENOMEM;
>> +			err = written;
>>  			goto out_err;
>>  		}
IIUC, for stream, this thing will have effect, only one first transport access fails. In this
case 'total_written' will be 0, so 'err' == 'written' will be returned. But when 'total_written > 0',
'err' will be overwritten by 'total_written' below, preserving current behaviour. Is it what You
supposed?

Thanks
>>  
>> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
>> index fd98229e3db3..e99aea571f6f 100644
>> --- a/net/vmw_vsock/hyperv_transport.c
>> +++ b/net/vmw_vsock/hyperv_transport.c
>> @@ -687,7 +687,7 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
>>  	if (bytes_written)
>>  		ret = bytes_written;
>>  	kfree(send_buf);
>> -	return ret;
>> +	return ret < 0 ? -ENOMEM : ret;
>>  }
>>  
>>  static s64 hvs_stream_has_data(struct vsock_sock *vsk)
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 920578597bb9..d5780599fe93 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -23,9 +23,6 @@
>>  /* How long to wait for graceful shutdown of a connection */
>>  #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
>>  
>> -/* Threshold for detecting small packets to copy */
>> -#define GOOD_COPY_LEN  128
>> -
>>  static const struct virtio_transport *
>>  virtio_transport_get_ops(struct vsock_sock *vsk)
>>  {
>> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>> index b14f0ed7427b..c927a90dc859 100644
>> --- a/net/vmw_vsock/vmci_transport.c
>> +++ b/net/vmw_vsock/vmci_transport.c
>> @@ -1838,7 +1838,14 @@ static ssize_t vmci_transport_stream_enqueue(
>>  	struct msghdr *msg,
>>  	size_t len)
>>  {
>> -	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
>> +	int err;
>> +
>> +	err = vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
>> +
>> +	if (err < 0)
>> +		err = -ENOMEM;
>> +
>> +	return err;
>>  }
>>  
>>  static s64 vmci_transport_stream_has_data(struct vsock_sock *vsk)
>> -- 
>> 2.35.1
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-17  5:01     ` [virtio-dev] " Arseniy Krasnov
  2022-08-16  9:57       ` Bobby Eshleman
@ 2022-08-17  5:42       ` Arseniy Krasnov
  2022-08-16  9:58         ` Bobby Eshleman
  1 sibling, 1 reply; 67+ messages in thread
From: Arseniy Krasnov @ 2022-08-17  5:42 UTC (permalink / raw)
  To: Bobby Eshleman, Bobby Eshleman
  Cc: virtio-dev, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, kvm, virtualization, netdev, linux-kernel

On 17.08.2022 08:01, Arseniy Krasnov wrote:
> On 16.08.2022 05:32, Bobby Eshleman wrote:
>> CC'ing virtio-dev@lists.oasis-open.org
>>
>> On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
>>> This patch supports dgram in virtio and on the vhost side.
> Hello,
> 
> sorry, i don't understand, how this maintains message boundaries? Or it
> is unnecessary for SOCK_DGRAM?
> 
> Thanks
>>>
>>> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
>>> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>>> ---
>>>  drivers/vhost/vsock.c                   |   2 +-
>>>  include/net/af_vsock.h                  |   2 +
>>>  include/uapi/linux/virtio_vsock.h       |   1 +
>>>  net/vmw_vsock/af_vsock.c                |  26 +++-
>>>  net/vmw_vsock/virtio_transport.c        |   2 +-
>>>  net/vmw_vsock/virtio_transport_common.c | 173 ++++++++++++++++++++++--
>>>  6 files changed, 186 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index a5d1bdb786fe..3dc72a5647ca 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
>>>  	int ret;
>>>  
>>>  	ret = vsock_core_register(&vhost_transport.transport,
>>> -				  VSOCK_TRANSPORT_F_H2G);
>>> +				  VSOCK_TRANSPORT_F_H2G | VSOCK_TRANSPORT_F_DGRAM);
>>>  	if (ret < 0)
>>>  		return ret;
>>>  
>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>> index 1c53c4c4d88f..37e55c81e4df 100644
>>> --- a/include/net/af_vsock.h
>>> +++ b/include/net/af_vsock.h
>>> @@ -78,6 +78,8 @@ struct vsock_sock {
>>>  s64 vsock_stream_has_data(struct vsock_sock *vsk);
>>>  s64 vsock_stream_has_space(struct vsock_sock *vsk);
>>>  struct sock *vsock_create_connected(struct sock *parent);
>>> +int vsock_bind_stream(struct vsock_sock *vsk,
>>> +		      struct sockaddr_vm *addr);
>>>  
>>>  /**** TRANSPORT ****/
>>>  
>>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>>> index 857df3a3a70d..0975b9c88292 100644
>>> --- a/include/uapi/linux/virtio_vsock.h
>>> +++ b/include/uapi/linux/virtio_vsock.h
>>> @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
>>>  enum virtio_vsock_type {
>>>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
>>>  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
>>> +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
>>>  };
>>>  
>>>  enum virtio_vsock_op {
>>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>> index 1893f8aafa48..87e4ae1866d3 100644
>>> --- a/net/vmw_vsock/af_vsock.c
>>> +++ b/net/vmw_vsock/af_vsock.c
>>> @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>>>  	return 0;
>>>  }
>>>  
>>> +int vsock_bind_stream(struct vsock_sock *vsk,
>>> +		      struct sockaddr_vm *addr)
>>> +{
>>> +	int retval;
>>> +
>>> +	spin_lock_bh(&vsock_table_lock);
>>> +	retval = __vsock_bind_connectible(vsk, addr);
>>> +	spin_unlock_bh(&vsock_table_lock);
>>> +
>>> +	return retval;
>>> +}
>>> +EXPORT_SYMBOL(vsock_bind_stream);
>>> +
>>>  static int __vsock_bind_dgram(struct vsock_sock *vsk,
>>>  			      struct sockaddr_vm *addr)
>>>  {
>>> @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>>>  	}
>>>  
>>>  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
>>> -		if (t_dgram) {
>>> -			err = -EBUSY;
>>> -			goto err_busy;
>>> +		/* TODO: always chose the G2H variant over others, support nesting later */
>>> +		if (features & VSOCK_TRANSPORT_F_G2H) {
>>> +			if (t_dgram)
>>> +				pr_warn("virtio_vsock: t_dgram already set\n");
>>> +			t_dgram = t;
>>> +		}
>>> +
>>> +		if (!t_dgram) {
>>> +			t_dgram = t;
>>>  		}
>>> -		t_dgram = t;
>>>  	}
>>>  
>>>  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>> index 073314312683..d4526ca462d2 100644
>>> --- a/net/vmw_vsock/virtio_transport.c
>>> +++ b/net/vmw_vsock/virtio_transport.c
>>> @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
>>>  		return -ENOMEM;
>>>  
>>>  	ret = vsock_core_register(&virtio_transport.transport,
>>> -				  VSOCK_TRANSPORT_F_G2H);
>>> +				  VSOCK_TRANSPORT_F_G2H | VSOCK_TRANSPORT_F_DGRAM);
>>>  	if (ret)
>>>  		goto out_wq;
>>>  
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index bdf16fff054f..aedb48728677 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -229,7 +229,9 @@ EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
>>>  
>>>  static u16 virtio_transport_get_type(struct sock *sk)
>>>  {
>>> -	if (sk->sk_type == SOCK_STREAM)
>>> +	if (sk->sk_type == SOCK_DGRAM)
>>> +		return VIRTIO_VSOCK_TYPE_DGRAM;
>>> +	else if (sk->sk_type == SOCK_STREAM)
>>>  		return VIRTIO_VSOCK_TYPE_STREAM;
>>>  	else
>>>  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>> @@ -287,22 +289,29 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>  	vvs = vsk->trans;
>>>  
>>>  	/* we can send less than pkt_len bytes */
>>> -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
>>> -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>> +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
>>> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
>>> +			pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>> +		else
>>> +			return 0;
>>> +	}
>>>  
>>> -	/* virtio_transport_get_credit might return less than pkt_len credit */
>>> -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>>> +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
>>> +		/* virtio_transport_get_credit might return less than pkt_len credit */
>>> +		pkt_len = virtio_transport_get_credit(vvs, pkt_len);
>>>  
>>> -	/* Do not send zero length OP_RW pkt */
>>> -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>> -		return pkt_len;
>>> +		/* Do not send zero length OP_RW pkt */
>>> +		if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>> +			return pkt_len;
>>> +	}
>>>  
>>>  	skb = virtio_transport_alloc_skb(info, pkt_len,
>>>  					 src_cid, src_port,
>>>  					 dst_cid, dst_port,
>>>  					 &err);
>>>  	if (!skb) {
>>> -		virtio_transport_put_credit(vvs, pkt_len);
>>> +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
>>> +			virtio_transport_put_credit(vvs, pkt_len);
>>>  		return err;
>>>  	}
>>>  
>>> @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
>>>  }
>>>  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
>>>  
>>> +static ssize_t
>>> +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
>>> +				  struct msghdr *msg, size_t len)
>>> +{
>>> +	struct virtio_vsock_sock *vvs = vsk->trans;
>>> +	struct sk_buff *skb;
>>> +	size_t total = 0;
>>> +	u32 free_space;
>>> +	int err = -EFAULT;
>>> +
>>> +	spin_lock_bh(&vvs->rx_lock);
>>> +	if (total < len && !skb_queue_empty_lockless(&vvs->rx_queue)) {
>>> +		skb = __skb_dequeue(&vvs->rx_queue);
>>> +
>>> +		total = len;
>>> +		if (total > skb->len - vsock_metadata(skb)->off)
>>> +			total = skb->len - vsock_metadata(skb)->off;
>>> +		else if (total < skb->len - vsock_metadata(skb)->off)
>>> +			msg->msg_flags |= MSG_TRUNC;
>>> +
>>> +		/* sk_lock is held by caller so no one else can dequeue.
>>> +		 * Unlock rx_lock since memcpy_to_msg() may sleep.
>>> +		 */
>>> +		spin_unlock_bh(&vvs->rx_lock);
>>> +
>>> +		err = memcpy_to_msg(msg, skb->data + vsock_metadata(skb)->off, total);
>>> +		if (err)
>>> +			return err;
>>> +
>>> +		spin_lock_bh(&vvs->rx_lock);
>>> +
>>> +		virtio_transport_dec_rx_pkt(vvs, skb);
>>> +		consume_skb(skb);
>>> +	}
>>> +
>>> +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
>>> +
>>> +	spin_unlock_bh(&vvs->rx_lock);
>>> +
>>> +	if (total > 0 && msg->msg_name) {
>>> +		/* Provide the address of the sender. */
>>> +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
>>> +
>>> +		vsock_addr_init(vm_addr, le64_to_cpu(vsock_hdr(skb)->src_cid),
>>> +				le32_to_cpu(vsock_hdr(skb)->src_port));
>>> +		msg->msg_namelen = sizeof(*vm_addr);
>>> +	}
>>> +	return total;
>>> +}
>>> +
>>> +static s64 virtio_transport_dgram_has_data(struct vsock_sock *vsk)
>>> +{
>>> +	return virtio_transport_stream_has_data(vsk);
>>> +}
>>> +
>>>  int
>>>  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
>>>  				   struct msghdr *msg,
>>> @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
>>>  			       struct msghdr *msg,
>>>  			       size_t len, int flags)
>>>  {
>>> -	return -EOPNOTSUPP;
>>> +	struct sock *sk;
>>> +	size_t err = 0;
>>> +	long timeout;
>>> +
>>> +	DEFINE_WAIT(wait);
>>> +
>>> +	sk = &vsk->sk;
>>> +	err = 0;
>>> +
>>> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags & MSG_PEEK)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	lock_sock(sk);
>>> +
>>> +	if (!len)
>>> +		goto out;
>>> +
>>> +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
>>> +
>>> +	while (1) {
>>> +		s64 ready;
>>> +
>>> +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>>> +		ready = virtio_transport_dgram_has_data(vsk);
>>> +
>>> +		if (ready == 0) {
>>> +			if (timeout == 0) {
>>> +				err = -EAGAIN;
>>> +				finish_wait(sk_sleep(sk), &wait);
>>> +				break;
>>> +			}
>>> +
>>> +			release_sock(sk);
>>> +			timeout = schedule_timeout(timeout);
>>> +			lock_sock(sk);
>>> +
>>> +			if (signal_pending(current)) {
>>> +				err = sock_intr_errno(timeout);
>>> +				finish_wait(sk_sleep(sk), &wait);
>>> +				break;
>>> +			} else if (timeout == 0) {
>>> +				err = -EAGAIN;
>>> +				finish_wait(sk_sleep(sk), &wait);
>>> +				break;
>>> +			}
>>> +		} else {
>>> +			finish_wait(sk_sleep(sk), &wait);
>>> +
>>> +			if (ready < 0) {
>>> +				err = -ENOMEM;
>>> +				goto out;
>>> +			}
>>> +
>>> +			err = virtio_transport_dgram_do_dequeue(vsk, msg, len);
>>> +			break;
>>> +		}
>>> +	}
>>> +out:
>>> +	release_sock(sk);
>>> +	return err;
>>>  }
>>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
^^^
May be, this generic data waiting logic should be in af_vsock.c, as for stream/seqpacket?
In this way, another transport which supports SOCK_DGRAM could reuse it.
>>>  
>>> @@ -819,13 +942,13 @@ EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>>>  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
>>>  				struct sockaddr_vm *addr)
>>>  {
>>> -	return -EOPNOTSUPP;
>>> +	return vsock_bind_stream(vsk, addr);
>>>  }
>>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
>>>  
>>>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
>>>  {
>>> -	return false;
>>> +	return true;
>>>  }
>>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
>>>  
>>> @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>>>  			       struct msghdr *msg,
>>>  			       size_t dgram_len)
>>>  {
>>> -	return -EOPNOTSUPP;
>>> +	struct virtio_vsock_pkt_info info = {
>>> +		.op = VIRTIO_VSOCK_OP_RW,
>>> +		.msg = msg,
>>> +		.pkt_len = dgram_len,
>>> +		.vsk = vsk,
>>> +		.remote_cid = remote_addr->svm_cid,
>>> +		.remote_port = remote_addr->svm_port,
>>> +	};
>>> +
>>> +	return virtio_transport_send_pkt_info(vsk, &info);
>>>  }
>>>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>>>  
>>> @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct sock *sk,
>>>  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
>>>  	int err = 0;
>>>  
>>> +	if (le16_to_cpu(vsock_hdr(skb)->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
>>> +		virtio_transport_recv_enqueue(vsk, skb);
>>> +		sk->sk_data_ready(sk);
>>> +		return err;
>>> +	}
>>> +
>>>  	switch (le16_to_cpu(hdr->op)) {
>>>  	case VIRTIO_VSOCK_OP_RW:
>>>  		virtio_transport_recv_enqueue(vsk, skb);
>>> @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
>>>  static bool virtio_transport_valid_type(u16 type)
>>>  {
>>>  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
>>> -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
>>> +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
>>> +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
>>>  }
>>>  
>>>  /* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
>>> @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>>>  		goto free_pkt;
>>>  	}
>>>  
>>> +	if (sk->sk_type == SOCK_DGRAM) {
>>> +		virtio_transport_recv_connected(sk, skb);
>>> +		goto out;
>>> +	}
>>> +
>>>  	space_available = virtio_transport_space_update(sk, skb);
>>>  
>>>  	/* Update CID in case it has changed after a transport reset event */
>>> @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>>>  		break;
>>>  	}
>>>  
>>> +out:
>>>  	release_sock(sk);
>>>  
>>>  	/* Release refcnt obtained when we fetched this socket out of the
>>> -- 
>>> 2.35.1
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (8 preceding siblings ...)
       [not found] ` <CAGxU2F7+L-UiNPtUm4EukOgTVJ1J6Orqs1LMvhWWvfL9zWb23g@mail.gmail.com>
@ 2022-08-17  6:54 ` Michael S. Tsirkin
  2022-08-16  9:42   ` Bobby Eshleman
  2022-08-18  4:28   ` Jason Wang
  2022-09-06 13:26 ` Stefan Hajnoczi
  2022-09-26 13:42 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Stefano Garzarella
  11 siblings, 2 replies; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-08-17  6:54 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> Hey everybody,
> 
> This series introduces datagrams, packet scheduling, and sk_buff usage
> to virtio vsock.
> 
> The usage of struct sk_buff benefits users by a) preparing vsock to use
> other related systems that require sk_buff, such as sockmap and qdisc,
> b) supporting basic congestion control via sock_alloc_send_skb, and c)
> reducing copying when delivering packets to TAP.
> 
> The socket layer no longer forces errors to be -ENOMEM, as typically
> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
> messages are being sent with option MSG_DONTWAIT.
> 
> The datagram work is based off previous patches by Jiang Wang[1].
> 
> The introduction of datagrams creates a transport layer fairness issue
> where datagrams may freely starve streams of queue access. This happens
> because, unlike streams, datagrams lack the transactions necessary for
> calculating credits and throttling.
> 
> Previous proposals introduce changes to the spec to add an additional
> virtqueue pair for datagrams[1]. Although this solution works, using
> Linux's qdisc for packet scheduling leverages already existing systems,
> avoids the need to change the virtio specification, and gives additional
> capabilities. The usage of SFQ or fq_codel, for example, may solve the
> transport layer starvation problem. It is easy to imagine other use
> cases as well. For example, services of varying importance may be
> assigned different priorities, and qdisc will apply appropriate
> priority-based scheduling. By default, the system default pfifo qdisc is
> used. The qdisc may be bypassed and legacy queuing is resumed by simply
> setting the virtio-vsock%d network device to state DOWN. This technique
> still allows vsock to work with zero-configuration.

The basic question to answer then is this: with a net device qdisc
etc in the picture, how is this different from virtio net then?
Why do you still want to use vsock?

> In summary, this series introduces these major changes to vsock:
> 
> - virtio vsock supports datagrams
> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
>   - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
>     which applies the throttling threshold sk_sndbuf.
> - The vsock socket layer supports returning errors other than -ENOMEM.
>   - This is used to return -EAGAIN when the sk_sndbuf threshold is
>     reached.
> - virtio vsock uses a net_device, through which qdisc may be used.
>  - qdisc allows scheduling policies to be applied to vsock flows.
>   - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
>     it may avoid datagrams from flooding out stream flows. The benefit
>     to this is that additional virtqueues are not needed for datagrams.
>   - The net_device and qdisc is bypassed by simply setting the
>     net_device state to DOWN.
> 
> [1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/
> 
> Bobby Eshleman (5):
>   vsock: replace virtio_vsock_pkt with sk_buff
>   vsock: return errors other than -ENOMEM to socket
>   vsock: add netdev to vhost/virtio vsock
>   virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>   virtio/vsock: add support for dgram
> 
> Jiang Wang (1):
>   vsock_test: add tests for vsock dgram
> 
>  drivers/vhost/vsock.c                   | 238 ++++----
>  include/linux/virtio_vsock.h            |  73 ++-
>  include/net/af_vsock.h                  |   2 +
>  include/uapi/linux/virtio_vsock.h       |   2 +
>  net/vmw_vsock/af_vsock.c                |  30 +-
>  net/vmw_vsock/hyperv_transport.c        |   2 +-
>  net/vmw_vsock/virtio_transport.c        | 237 +++++---
>  net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
>  net/vmw_vsock/vmci_transport.c          |   9 +-
>  net/vmw_vsock/vsock_loopback.c          |  51 +-
>  tools/testing/vsock/util.c              | 105 ++++
>  tools/testing/vsock/util.h              |   4 +
>  tools/testing/vsock/vsock_test.c        | 195 ++++++
>  13 files changed, 1176 insertions(+), 543 deletions(-)
> 
> -- 
> 2.35.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-16  9:42   ` Bobby Eshleman
@ 2022-08-17 17:02     ` Michael S. Tsirkin
  2022-08-16 11:08       ` Bobby Eshleman
  0 siblings, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-08-17 17:02 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > The basic question to answer then is this: with a net device qdisc
> > etc in the picture, how is this different from virtio net then?
> > Why do you still want to use vsock?
> > 
> 
> When using virtio-net, users looking for inter-VM communication are
> required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> etc... and then finally when you have a network, you can open a socket
> on an IP address and port. This is the configuration that vsock avoids.
> For vsock, we just need a CID and a port, but no network configuration.

Surely when you mention DNS you are going overboard? vsock doesn't
remove the need for DNS as much as it does not support it.

-- 
MST


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-16 10:50               ` Bobby Eshleman
@ 2022-08-17 17:20                 ` Michael S. Tsirkin
  2022-08-18  4:34                   ` Jason Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-08-17 17:20 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Jakub Kicinski, Bobby Eshleman, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, Jason Wang,
	David S. Miller, Eric Dumazet, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel, Toke Høiland-Jørgensen

On Tue, Aug 16, 2022 at 10:50:55AM +0000, Bobby Eshleman wrote:
> > > > Eh, I was hoping it was a side channel of an existing virtio_net 
> > > > which is not the case. Given the zero-config requirement IDK if 
> > > > we'll be able to fit this into netdev semantics :(  
> > > 
> > > It's certainly possible that it may not fit :/ I feel that it partially
> > > depends on what we mean by zero-config. Is it "no config required to
> > > have a working socket" or is it "no config required, but also no
> > > tuning/policy/etc... supported"?
> > 
> > The value of tuning vs confusion of a strange netdev floating around
> > in the system is hard to estimate upfront. 
> 
> I think "a strange netdev floating around" is a total
> mischaracterization... vsock is a networking device and it supports
> vsock networks. Sure, it is a virtual device and the routing is done in
> host software, but the same is true for virtio-net and VM-to-VM vlan.
> 
> This patch actually uses netdev for its intended purpose: to support and
> manage the transmission of packets via a network device to a network.
> 
> Furthermore, it actually prepares vsock to eliminate a "strange" use of
> a netdev. The netdev in vsockmon isn't even used to transmit
> packets, it's "floating around" for no other reason than it is needed to
> support packet capture, which vsock couldn't support because it didn't
> have a netdev.
> 
> Something smells when we are required to build workaround kernel modules
> that use netdev for ciphoning packets off to userspace, when we could
> instead be using netdev for its intended purpose and get the same and
> more benefit.

So what happens when userspace inevitably attempts to bind a raw
packet socket to this device? Assign it an IP? Set up some firewall
rules?

These things all need to be addressed before merging since they affect UAPI.


> > 
> > The nice thing about using a built-in fq with no user visible knobs is
> > that there's no extra uAPI. We can always rip it out and replace later.
> > And it shouldn't be controversial, making the path to upstream smoother.
> 
> The issue is that after pulling in fq for one kind of flow management,
> then as users observe other flow issues, we will need to re-implement
> pfifo, and then TBF, and then we need to build an interface to let users
> select one, and to choose queue sizes... and then after awhile we've
> needlessly re-implemented huge chunks of the tc system.
> 
> I don't see any good reason to restrict vsock users to using suboptimal
> and rigid queuing.
> 
> Thanks.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-16 11:08       ` Bobby Eshleman
@ 2022-08-17 17:53         ` Michael S. Tsirkin
  2022-08-16 12:10           ` Bobby Eshleman
  0 siblings, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-08-17 17:53 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Tue, Aug 16, 2022 at 11:08:26AM +0000, Bobby Eshleman wrote:
> On Wed, Aug 17, 2022 at 01:02:52PM -0400, Michael S. Tsirkin wrote:
> > On Tue, Aug 16, 2022 at 09:42:51AM +0000, Bobby Eshleman wrote:
> > > > The basic question to answer then is this: with a net device qdisc
> > > > etc in the picture, how is this different from virtio net then?
> > > > Why do you still want to use vsock?
> > > > 
> > > 
> > > When using virtio-net, users looking for inter-VM communication are
> > > required to setup bridges, TAPs, allocate IP addresses or setup DNS,
> > > etc... and then finally when you have a network, you can open a socket
> > > on an IP address and port. This is the configuration that vsock avoids.
> > > For vsock, we just need a CID and a port, but no network configuration.
> > 
> > Surely when you mention DNS you are going overboard? vsock doesn't
> > remove the need for DNS as much as it does not support it.
> > 
> 
> Oops, s/DNS/dhcp.

That too.

-- 
MST


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-17  6:54 ` Michael S. Tsirkin
  2022-08-16  9:42   ` Bobby Eshleman
@ 2022-08-18  4:28   ` Jason Wang
  2022-09-06  9:00     ` Stefano Garzarella
  1 sibling, 1 reply; 67+ messages in thread
From: Jason Wang @ 2022-08-18  4:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv


在 2022/8/17 14:54, Michael S. Tsirkin 写道:
> On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>> Hey everybody,
>>
>> This series introduces datagrams, packet scheduling, and sk_buff usage
>> to virtio vsock.
>>
>> The usage of struct sk_buff benefits users by a) preparing vsock to use
>> other related systems that require sk_buff, such as sockmap and qdisc,
>> b) supporting basic congestion control via sock_alloc_send_skb, and c)
>> reducing copying when delivering packets to TAP.
>>
>> The socket layer no longer forces errors to be -ENOMEM, as typically
>> userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
>> messages are being sent with option MSG_DONTWAIT.
>>
>> The datagram work is based off previous patches by Jiang Wang[1].
>>
>> The introduction of datagrams creates a transport layer fairness issue
>> where datagrams may freely starve streams of queue access. This happens
>> because, unlike streams, datagrams lack the transactions necessary for
>> calculating credits and throttling.
>>
>> Previous proposals introduce changes to the spec to add an additional
>> virtqueue pair for datagrams[1]. Although this solution works, using
>> Linux's qdisc for packet scheduling leverages already existing systems,
>> avoids the need to change the virtio specification, and gives additional
>> capabilities. The usage of SFQ or fq_codel, for example, may solve the
>> transport layer starvation problem. It is easy to imagine other use
>> cases as well. For example, services of varying importance may be
>> assigned different priorities, and qdisc will apply appropriate
>> priority-based scheduling. By default, the system default pfifo qdisc is
>> used. The qdisc may be bypassed and legacy queuing is resumed by simply
>> setting the virtio-vsock%d network device to state DOWN. This technique
>> still allows vsock to work with zero-configuration.
> The basic question to answer then is this: with a net device qdisc
> etc in the picture, how is this different from virtio net then?
> Why do you still want to use vsock?


Or maybe it's time to revisit an old idea[1] to unify at least the 
driver part (e.g using virtio-net driver for vsock then we can all 
features that vsock is lacking now)?

Thanks

[1] 
https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039783.html


>
>> In summary, this series introduces these major changes to vsock:
>>
>> - virtio vsock supports datagrams
>> - virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
>>    - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
>>      which applies the throttling threshold sk_sndbuf.
>> - The vsock socket layer supports returning errors other than -ENOMEM.
>>    - This is used to return -EAGAIN when the sk_sndbuf threshold is
>>      reached.
>> - virtio vsock uses a net_device, through which qdisc may be used.
>>   - qdisc allows scheduling policies to be applied to vsock flows.
>>    - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
>>      it may avoid datagrams from flooding out stream flows. The benefit
>>      to this is that additional virtqueues are not needed for datagrams.
>>    - The net_device and qdisc is bypassed by simply setting the
>>      net_device state to DOWN.
>>
>> [1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/
>>
>> Bobby Eshleman (5):
>>    vsock: replace virtio_vsock_pkt with sk_buff
>>    vsock: return errors other than -ENOMEM to socket
>>    vsock: add netdev to vhost/virtio vsock
>>    virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>>    virtio/vsock: add support for dgram
>>
>> Jiang Wang (1):
>>    vsock_test: add tests for vsock dgram
>>
>>   drivers/vhost/vsock.c                   | 238 ++++----
>>   include/linux/virtio_vsock.h            |  73 ++-
>>   include/net/af_vsock.h                  |   2 +
>>   include/uapi/linux/virtio_vsock.h       |   2 +
>>   net/vmw_vsock/af_vsock.c                |  30 +-
>>   net/vmw_vsock/hyperv_transport.c        |   2 +-
>>   net/vmw_vsock/virtio_transport.c        | 237 +++++---
>>   net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
>>   net/vmw_vsock/vmci_transport.c          |   9 +-
>>   net/vmw_vsock/vsock_loopback.c          |  51 +-
>>   tools/testing/vsock/util.c              | 105 ++++
>>   tools/testing/vsock/util.h              |   4 +
>>   tools/testing/vsock/vsock_test.c        | 195 ++++++
>>   13 files changed, 1176 insertions(+), 543 deletions(-)
>>
>> -- 
>> 2.35.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-17 17:20                 ` Michael S. Tsirkin
@ 2022-08-18  4:34                   ` Jason Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Wang @ 2022-08-18  4:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Bobby Eshleman
  Cc: Jakub Kicinski, Bobby Eshleman, Bobby Eshleman, Cong Wang,
	Jiang Wang, Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Paolo Abeni, kvm, virtualization, netdev,
	linux-kernel, Toke Høiland-Jørgensen


在 2022/8/18 01:20, Michael S. Tsirkin 写道:
> On Tue, Aug 16, 2022 at 10:50:55AM +0000, Bobby Eshleman wrote:
>>>>> Eh, I was hoping it was a side channel of an existing virtio_net
>>>>> which is not the case. Given the zero-config requirement IDK if
>>>>> we'll be able to fit this into netdev semantics :(
>>>> It's certainly possible that it may not fit :/ I feel that it partially
>>>> depends on what we mean by zero-config. Is it "no config required to
>>>> have a working socket" or is it "no config required, but also no
>>>> tuning/policy/etc... supported"?
>>> The value of tuning vs confusion of a strange netdev floating around
>>> in the system is hard to estimate upfront.
>> I think "a strange netdev floating around" is a total
>> mischaracterization... vsock is a networking device and it supports
>> vsock networks. Sure, it is a virtual device and the routing is done in
>> host software, but the same is true for virtio-net and VM-to-VM vlan.
>>
>> This patch actually uses netdev for its intended purpose: to support and
>> manage the transmission of packets via a network device to a network.
>>
>> Furthermore, it actually prepares vsock to eliminate a "strange" use of
>> a netdev. The netdev in vsockmon isn't even used to transmit
>> packets, it's "floating around" for no other reason than it is needed to
>> support packet capture, which vsock couldn't support because it didn't
>> have a netdev.
>>
>> Something smells when we are required to build workaround kernel modules
>> that use netdev for ciphoning packets off to userspace, when we could
>> instead be using netdev for its intended purpose and get the same and
>> more benefit.
> So what happens when userspace inevitably attempts to bind a raw
> packet socket to this device? Assign it an IP? Set up some firewall
> rules?
>
> These things all need to be addressed before merging since they affect UAPI.


It's possible if

1) extend virtio-net to have vsock queues

2) present vsock device on top of virtio-net via e.g auxiliary bus

Then raw socket still work at ethernet level while vsock works too.

The value is to share codes between the two type of devices (queues).

Thanks


>
>>> The nice thing about using a built-in fq with no user visible knobs is
>>> that there's no extra uAPI. We can always rip it out and replace later.
>>> And it shouldn't be controversial, making the path to upstream smoother.
>> The issue is that after pulling in fq for one kind of flow management,
>> then as users observe other flow issues, we will need to re-implement
>> pfifo, and then TBF, and then we need to build an interface to let users
>> select one, and to choose queue sizes... and then after awhile we've
>> needlessly re-implemented huge chunks of the tc system.
>>
>> I don't see any good reason to restrict vsock users to using suboptimal
>> and rigid queuing.
>>
>> Thanks.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-16  9:57       ` Bobby Eshleman
@ 2022-08-18  8:24         ` Arseniy Krasnov
  0 siblings, 0 replies; 67+ messages in thread
From: Arseniy Krasnov @ 2022-08-18  8:24 UTC (permalink / raw)
  To: bobbyeshleman
  Cc: kvm, jasowang, bobby.eshleman, davem, virtio-dev, stefanha,
	bobby.eshleman, linux-kernel, pabeni, edumazet, jiang.wang,
	sgarzare, kuba, cong.wang, netdev, virtualization, mst

On Tue, 2022-08-16 at 09:57 +0000, Bobby Eshleman wrote:
> On Wed, Aug 17, 2022 at 05:01:00AM +0000, Arseniy Krasnov wrote:
> > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > CC'ing virtio-dev@lists.oasis-open.org
> > > 
> > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> > > > This patch supports dgram in virtio and on the vhost side.
> > Hello,
> > 
> > sorry, i don't understand, how this maintains message boundaries?
> > Or it
> > is unnecessary for SOCK_DGRAM?
> > 
> > Thanks
> 
> If I understand your question, the length is included in the header,
> so
> receivers always know that header start + header length + payload
> length
> marks the message boundary.

I mean, consider the following case: host sends 5kb packet to guest.
Guest uses 4kb virtio rx buffers, so in drivers/vhost/vsock.c this 5kb
packet(e.g. its payload) will be placed to 2 virtio rx buffers - 4kb
to first buffer and rest 1kb to second buffer. Is it implemented, that
receiver gets whole 5kb piece of data during single 'read()/recv()'
system call?

Thanks

> 
> > > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > > > ---
> > > >  drivers/vhost/vsock.c                   |   2 +-
> > > >  include/net/af_vsock.h                  |   2 +
> > > >  include/uapi/linux/virtio_vsock.h       |   1 +
> > > >  net/vmw_vsock/af_vsock.c                |  26 +++-
> > > >  net/vmw_vsock/virtio_transport.c        |   2 +-
> > > >  net/vmw_vsock/virtio_transport_common.c | 173
> > > > ++++++++++++++++++++++--
> > > >  6 files changed, 186 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > --- a/drivers/vhost/vsock.c
> > > > +++ b/drivers/vhost/vsock.c
> > > > @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> > > >  	int ret;
> > > >  
> > > >  	ret = vsock_core_register(&vhost_transport.transport,
> > > > -				  VSOCK_TRANSPORT_F_H2G);
> > > > +				  VSOCK_TRANSPORT_F_H2G |
> > > > VSOCK_TRANSPORT_F_DGRAM);
> > > >  	if (ret < 0)
> > > >  		return ret;
> > > >  
> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > --- a/include/net/af_vsock.h
> > > > +++ b/include/net/af_vsock.h
> > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > >  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > >  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > >  struct sock *vsock_create_connected(struct sock *parent);
> > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > +		      struct sockaddr_vm *addr);
> > > >  
> > > >  /**** TRANSPORT ****/
> > > >  
> > > > diff --git a/include/uapi/linux/virtio_vsock.h
> > > > b/include/uapi/linux/virtio_vsock.h
> > > > index 857df3a3a70d..0975b9c88292 100644
> > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > >  enum virtio_vsock_type {
> > > >  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > >  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > >  };
> > > >  
> > > >  enum virtio_vsock_op {
> > > > diff --git a/net/vmw_vsock/af_vsock.c
> > > > b/net/vmw_vsock/af_vsock.c
> > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > --- a/net/vmw_vsock/af_vsock.c
> > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > @@ -675,6 +675,19 @@ static int __vsock_bind_connectible(struct
> > > > vsock_sock *vsk,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > +		      struct sockaddr_vm *addr)
> > > > +{
> > > > +	int retval;
> > > > +
> > > > +	spin_lock_bh(&vsock_table_lock);
> > > > +	retval = __vsock_bind_connectible(vsk, addr);
> > > > +	spin_unlock_bh(&vsock_table_lock);
> > > > +
> > > > +	return retval;
> > > > +}
> > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > +
> > > >  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > >  			      struct sockaddr_vm *addr)
> > > >  {
> > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct
> > > > vsock_transport *t, int features)
> > > >  	}
> > > >  
> > > >  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > -		if (t_dgram) {
> > > > -			err = -EBUSY;
> > > > -			goto err_busy;
> > > > +		/* TODO: always chose the G2H variant over
> > > > others, support nesting later */
> > > > +		if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > +			if (t_dgram)
> > > > +				pr_warn("virtio_vsock: t_dgram
> > > > already set\n");
> > > > +			t_dgram = t;
> > > > +		}
> > > > +
> > > > +		if (!t_dgram) {
> > > > +			t_dgram = t;
> > > >  		}
> > > > -		t_dgram = t;
> > > >  	}
> > > >  
> > > >  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > diff --git a/net/vmw_vsock/virtio_transport.c
> > > > b/net/vmw_vsock/virtio_transport.c
> > > > index 073314312683..d4526ca462d2 100644
> > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> > > >  		return -ENOMEM;
> > > >  
> > > >  	ret = vsock_core_register(&virtio_transport.transport,
> > > > -				  VSOCK_TRANSPORT_F_G2H);
> > > > +				  VSOCK_TRANSPORT_F_G2H |
> > > > VSOCK_TRANSPORT_F_DGRAM);
> > > >  	if (ret)
> > > >  		goto out_wq;
> > > >  
> > > > diff --git a/net/vmw_vsock/virtio_transport_common.c
> > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > index bdf16fff054f..aedb48728677 100644
> > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > @@ -229,7 +229,9 @@
> > > > EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > >  
> > > >  static u16 virtio_transport_get_type(struct sock *sk)
> > > >  {
> > > > -	if (sk->sk_type == SOCK_STREAM)
> > > > +	if (sk->sk_type == SOCK_DGRAM)
> > > > +		return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > +	else if (sk->sk_type == SOCK_STREAM)
> > > >  		return VIRTIO_VSOCK_TYPE_STREAM;
> > > >  	else
> > > >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > @@ -287,22 +289,29 @@ static int
> > > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > >  	vvs = vsk->trans;
> > > >  
> > > >  	/* we can send less than pkt_len bytes */
> > > > -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > +			pkt_len =
> > > > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > +		else
> > > > +			return 0;
> > > > +	}
> > > >  
> > > > -	/* virtio_transport_get_credit might return less than
> > > > pkt_len credit */
> > > > -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > +		/* virtio_transport_get_credit might return
> > > > less than pkt_len credit */
> > > > +		pkt_len = virtio_transport_get_credit(vvs,
> > > > pkt_len);
> > > >  
> > > > -	/* Do not send zero length OP_RW pkt */
> > > > -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > -		return pkt_len;
> > > > +		/* Do not send zero length OP_RW pkt */
> > > > +		if (pkt_len == 0 && info->op ==
> > > > VIRTIO_VSOCK_OP_RW)
> > > > +			return pkt_len;
> > > > +	}
> > > >  
> > > >  	skb = virtio_transport_alloc_skb(info, pkt_len,
> > > >  					 src_cid, src_port,
> > > >  					 dst_cid, dst_port,
> > > >  					 &err);
> > > >  	if (!skb) {
> > > > -		virtio_transport_put_credit(vvs, pkt_len);
> > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > +			virtio_transport_put_credit(vvs,
> > > > pkt_len);
> > > >  		return err;
> > > >  	}
> > > >  
> > > > @@ -586,6 +595,61 @@ virtio_transport_seqpacket_dequeue(struct
> > > > vsock_sock *vsk,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> > > >  
> > > > +static ssize_t
> > > > +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> > > > +				  struct msghdr *msg, size_t
> > > > len)
> > > > +{
> > > > +	struct virtio_vsock_sock *vvs = vsk->trans;
> > > > +	struct sk_buff *skb;
> > > > +	size_t total = 0;
> > > > +	u32 free_space;
> > > > +	int err = -EFAULT;
> > > > +
> > > > +	spin_lock_bh(&vvs->rx_lock);
> > > > +	if (total < len && !skb_queue_empty_lockless(&vvs-
> > > > >rx_queue)) {
> > > > +		skb = __skb_dequeue(&vvs->rx_queue);
> > > > +
> > > > +		total = len;
> > > > +		if (total > skb->len - vsock_metadata(skb)-
> > > > >off)
> > > > +			total = skb->len - vsock_metadata(skb)-
> > > > >off;
> > > > +		else if (total < skb->len -
> > > > vsock_metadata(skb)->off)
> > > > +			msg->msg_flags |= MSG_TRUNC;
> > > > +
> > > > +		/* sk_lock is held by caller so no one else can
> > > > dequeue.
> > > > +		 * Unlock rx_lock since memcpy_to_msg() may
> > > > sleep.
> > > > +		 */
> > > > +		spin_unlock_bh(&vvs->rx_lock);
> > > > +
> > > > +		err = memcpy_to_msg(msg, skb->data +
> > > > vsock_metadata(skb)->off, total);
> > > > +		if (err)
> > > > +			return err;
> > > > +
> > > > +		spin_lock_bh(&vvs->rx_lock);
> > > > +
> > > > +		virtio_transport_dec_rx_pkt(vvs, skb);
> > > > +		consume_skb(skb);
> > > > +	}
> > > > +
> > > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs-
> > > > >last_fwd_cnt);
> > > > +
> > > > +	spin_unlock_bh(&vvs->rx_lock);
> > > > +
> > > > +	if (total > 0 && msg->msg_name) {
> > > > +		/* Provide the address of the sender. */
> > > > +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr,
> > > > msg->msg_name);
> > > > +
> > > > +		vsock_addr_init(vm_addr,
> > > > le64_to_cpu(vsock_hdr(skb)->src_cid),
> > > > +				le32_to_cpu(vsock_hdr(skb)-
> > > > >src_port));
> > > > +		msg->msg_namelen = sizeof(*vm_addr);
> > > > +	}
> > > > +	return total;
> > > > +}
> > > > +
> > > > +static s64 virtio_transport_dgram_has_data(struct vsock_sock
> > > > *vsk)
> > > > +{
> > > > +	return virtio_transport_stream_has_data(vsk);
> > > > +}
> > > > +
> > > >  int
> > > >  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> > > >  				   struct msghdr *msg,
> > > > @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct
> > > > vsock_sock *vsk,
> > > >  			       struct msghdr *msg,
> > > >  			       size_t len, int flags)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	struct sock *sk;
> > > > +	size_t err = 0;
> > > > +	long timeout;
> > > > +
> > > > +	DEFINE_WAIT(wait);
> > > > +
> > > > +	sk = &vsk->sk;
> > > > +	err = 0;
> > > > +
> > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags &
> > > > MSG_PEEK)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	lock_sock(sk);
> > > > +
> > > > +	if (!len)
> > > > +		goto out;
> > > > +
> > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > +
> > > > +	while (1) {
> > > > +		s64 ready;
> > > > +
> > > > +		prepare_to_wait(sk_sleep(sk), &wait,
> > > > TASK_INTERRUPTIBLE);
> > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > +
> > > > +		if (ready == 0) {
> > > > +			if (timeout == 0) {
> > > > +				err = -EAGAIN;
> > > > +				finish_wait(sk_sleep(sk),
> > > > &wait);
> > > > +				break;
> > > > +			}
> > > > +
> > > > +			release_sock(sk);
> > > > +			timeout = schedule_timeout(timeout);
> > > > +			lock_sock(sk);
> > > > +
> > > > +			if (signal_pending(current)) {
> > > > +				err = sock_intr_errno(timeout);
> > > > +				finish_wait(sk_sleep(sk),
> > > > &wait);
> > > > +				break;
> > > > +			} else if (timeout == 0) {
> > > > +				err = -EAGAIN;
> > > > +				finish_wait(sk_sleep(sk),
> > > > &wait);
> > > > +				break;
> > > > +			}
> > > > +		} else {
> > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > +
> > > > +			if (ready < 0) {
> > > > +				err = -ENOMEM;
> > > > +				goto out;
> > > > +			}
> > > > +
> > > > +			err =
> > > > virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +out:
> > > > +	release_sock(sk);
> > > > +	return err;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > > >  
> > > > @@ -819,13 +942,13 @@
> > > > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > >  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > >  				struct sockaddr_vm *addr)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	return vsock_bind_stream(vsk, addr);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > >  
> > > >  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > >  {
> > > > -	return false;
> > > > +	return true;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > >  
> > > > @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct
> > > > vsock_sock *vsk,
> > > >  			       struct msghdr *msg,
> > > >  			       size_t dgram_len)
> > > >  {
> > > > -	return -EOPNOTSUPP;
> > > > +	struct virtio_vsock_pkt_info info = {
> > > > +		.op = VIRTIO_VSOCK_OP_RW,
> > > > +		.msg = msg,
> > > > +		.pkt_len = dgram_len,
> > > > +		.vsk = vsk,
> > > > +		.remote_cid = remote_addr->svm_cid,
> > > > +		.remote_port = remote_addr->svm_port,
> > > > +	};
> > > > +
> > > > +	return virtio_transport_send_pkt_info(vsk, &info);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > > >  
> > > > @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct
> > > > sock *sk,
> > > >  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > >  	int err = 0;
> > > >  
> > > > +	if (le16_to_cpu(vsock_hdr(skb)->type) ==
> > > > VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > +		virtio_transport_recv_enqueue(vsk, skb);
> > > > +		sk->sk_data_ready(sk);
> > > > +		return err;
> > > > +	}
> > > > +
> > > >  	switch (le16_to_cpu(hdr->op)) {
> > > >  	case VIRTIO_VSOCK_OP_RW:
> > > >  		virtio_transport_recv_enqueue(vsk, skb);
> > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct sock
> > > > *sk, struct sk_buff *skb,
> > > >  static bool virtio_transport_valid_type(u16 type)
> > > >  {
> > > >  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > >  }
> > > >  
> > > >  /* We are under the virtio-vsock's vsock->rx_lock or vhost-
> > > > vsock's vq->mutex
> > > > @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct
> > > > virtio_transport *t,
> > > >  		goto free_pkt;
> > > >  	}
> > > >  
> > > > +	if (sk->sk_type == SOCK_DGRAM) {
> > > > +		virtio_transport_recv_connected(sk, skb);
> > > > +		goto out;
> > > > +	}
> > > > +
> > > >  	space_available = virtio_transport_space_update(sk,
> > > > skb);
> > > >  
> > > >  	/* Update CID in case it has changed after a transport
> > > > reset event */
> > > > @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct
> > > > virtio_transport *t,
> > > >  		break;
> > > >  	}
> > > >  
> > > > +out:
> > > >  	release_sock(sk);
> > > >  
> > > >  	/* Release refcnt obtained when we fetched this socket
> > > > out of the
> > > > -- 
> > > > 2.35.1
> > > > 
> > > 
> > > ---------------------------------------------------------------
> > > ------
> > > To unsubscribe, e-mail: 
> > > virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: 
> > > virtio-dev-help@lists.oasis-open.org
> > > 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-16  9:58         ` Bobby Eshleman
@ 2022-08-18  8:35           ` Arseniy Krasnov
  2022-08-16 20:52             ` Bobby Eshleman
  0 siblings, 1 reply; 67+ messages in thread
From: Arseniy Krasnov @ 2022-08-18  8:35 UTC (permalink / raw)
  To: bobbyeshleman
  Cc: kvm, jasowang, bobby.eshleman, davem, virtio-dev, stefanha,
	bobby.eshleman, linux-kernel, pabeni, edumazet, jiang.wang,
	sgarzare, kuba, cong.wang, netdev, virtualization, mst

On Tue, 2022-08-16 at 09:58 +0000, Bobby Eshleman wrote:
> On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> > On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > > CC'ing virtio-dev@lists.oasis-open.org
> > > > 
> > > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman wrote:
> > > > > This patch supports dgram in virtio and on the vhost side.
> > > Hello,
> > > 
> > > sorry, i don't understand, how this maintains message boundaries?
> > > Or it
> > > is unnecessary for SOCK_DGRAM?
> > > 
> > > Thanks
> > > > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > > > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > > > > ---
> > > > >  drivers/vhost/vsock.c                   |   2 +-
> > > > >  include/net/af_vsock.h                  |   2 +
> > > > >  include/uapi/linux/virtio_vsock.h       |   1 +
> > > > >  net/vmw_vsock/af_vsock.c                |  26 +++-
> > > > >  net/vmw_vsock/virtio_transport.c        |   2 +-
> > > > >  net/vmw_vsock/virtio_transport_common.c | 173
> > > > > ++++++++++++++++++++++--
> > > > >  6 files changed, 186 insertions(+), 20 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > > --- a/drivers/vhost/vsock.c
> > > > > +++ b/drivers/vhost/vsock.c
> > > > > @@ -925,7 +925,7 @@ static int __init vhost_vsock_init(void)
> > > > >  	int ret;
> > > > >  
> > > > >  	ret = vsock_core_register(&vhost_transport.transport,
> > > > > -				  VSOCK_TRANSPORT_F_H2G);
> > > > > +				  VSOCK_TRANSPORT_F_H2G |
> > > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > >  	if (ret < 0)
> > > > >  		return ret;
> > > > >  
> > > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > > --- a/include/net/af_vsock.h
> > > > > +++ b/include/net/af_vsock.h
> > > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > > >  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > > >  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > > >  struct sock *vsock_create_connected(struct sock *parent);
> > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > +		      struct sockaddr_vm *addr);
> > > > >  
> > > > >  /**** TRANSPORT ****/
> > > > >  
> > > > > diff --git a/include/uapi/linux/virtio_vsock.h
> > > > > b/include/uapi/linux/virtio_vsock.h
> > > > > index 857df3a3a70d..0975b9c88292 100644
> > > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > > >  enum virtio_vsock_type {
> > > > >  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > > >  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > > +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > > >  };
> > > > >  
> > > > >  enum virtio_vsock_op {
> > > > > diff --git a/net/vmw_vsock/af_vsock.c
> > > > > b/net/vmw_vsock/af_vsock.c
> > > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > > --- a/net/vmw_vsock/af_vsock.c
> > > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > > @@ -675,6 +675,19 @@ static int
> > > > > __vsock_bind_connectible(struct vsock_sock *vsk,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > +		      struct sockaddr_vm *addr)
> > > > > +{
> > > > > +	int retval;
> > > > > +
> > > > > +	spin_lock_bh(&vsock_table_lock);
> > > > > +	retval = __vsock_bind_connectible(vsk, addr);
> > > > > +	spin_unlock_bh(&vsock_table_lock);
> > > > > +
> > > > > +	return retval;
> > > > > +}
> > > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > > +
> > > > >  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > > >  			      struct sockaddr_vm *addr)
> > > > >  {
> > > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const struct
> > > > > vsock_transport *t, int features)
> > > > >  	}
> > > > >  
> > > > >  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > > -		if (t_dgram) {
> > > > > -			err = -EBUSY;
> > > > > -			goto err_busy;
> > > > > +		/* TODO: always chose the G2H variant over
> > > > > others, support nesting later */
> > > > > +		if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > > +			if (t_dgram)
> > > > > +				pr_warn("virtio_vsock: t_dgram
> > > > > already set\n");
> > > > > +			t_dgram = t;
> > > > > +		}
> > > > > +
> > > > > +		if (!t_dgram) {
> > > > > +			t_dgram = t;
> > > > >  		}
> > > > > -		t_dgram = t;
> > > > >  	}
> > > > >  
> > > > >  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > > diff --git a/net/vmw_vsock/virtio_transport.c
> > > > > b/net/vmw_vsock/virtio_transport.c
> > > > > index 073314312683..d4526ca462d2 100644
> > > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > > @@ -850,7 +850,7 @@ static int __init virtio_vsock_init(void)
> > > > >  		return -ENOMEM;
> > > > >  
> > > > >  	ret = vsock_core_register(&virtio_transport.transport,
> > > > > -				  VSOCK_TRANSPORT_F_G2H);
> > > > > +				  VSOCK_TRANSPORT_F_G2H |
> > > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > >  	if (ret)
> > > > >  		goto out_wq;
> > > > >  
> > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c
> > > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > > index bdf16fff054f..aedb48728677 100644
> > > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > > @@ -229,7 +229,9 @@
> > > > > EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > > >  
> > > > >  static u16 virtio_transport_get_type(struct sock *sk)
> > > > >  {
> > > > > -	if (sk->sk_type == SOCK_STREAM)
> > > > > +	if (sk->sk_type == SOCK_DGRAM)
> > > > > +		return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > > +	else if (sk->sk_type == SOCK_STREAM)
> > > > >  		return VIRTIO_VSOCK_TYPE_STREAM;
> > > > >  	else
> > > > >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > > @@ -287,22 +289,29 @@ static int
> > > > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > > >  	vvs = vsk->trans;
> > > > >  
> > > > >  	/* we can send less than pkt_len bytes */
> > > > > -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > > -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > +			pkt_len =
> > > > > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > +		else
> > > > > +			return 0;
> > > > > +	}
> > > > >  
> > > > > -	/* virtio_transport_get_credit might return less than
> > > > > pkt_len credit */
> > > > > -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > > +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > +		/* virtio_transport_get_credit might return
> > > > > less than pkt_len credit */
> > > > > +		pkt_len = virtio_transport_get_credit(vvs,
> > > > > pkt_len);
> > > > >  
> > > > > -	/* Do not send zero length OP_RW pkt */
> > > > > -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > > -		return pkt_len;
> > > > > +		/* Do not send zero length OP_RW pkt */
> > > > > +		if (pkt_len == 0 && info->op ==
> > > > > VIRTIO_VSOCK_OP_RW)
> > > > > +			return pkt_len;
> > > > > +	}
> > > > >  
> > > > >  	skb = virtio_transport_alloc_skb(info, pkt_len,
> > > > >  					 src_cid, src_port,
> > > > >  					 dst_cid, dst_port,
> > > > >  					 &err);
> > > > >  	if (!skb) {
> > > > > -		virtio_transport_put_credit(vvs, pkt_len);
> > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > +			virtio_transport_put_credit(vvs,
> > > > > pkt_len);
> > > > >  		return err;
> > > > >  	}
> > > > >  
> > > > > @@ -586,6 +595,61 @@
> > > > > virtio_transport_seqpacket_dequeue(struct vsock_sock *vsk,
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> > > > >  
> > > > > +static ssize_t
> > > > > +virtio_transport_dgram_do_dequeue(struct vsock_sock *vsk,
> > > > > +				  struct msghdr *msg, size_t
> > > > > len)
> > > > > +{
> > > > > +	struct virtio_vsock_sock *vvs = vsk->trans;
> > > > > +	struct sk_buff *skb;
> > > > > +	size_t total = 0;
> > > > > +	u32 free_space;
> > > > > +	int err = -EFAULT;
> > > > > +
> > > > > +	spin_lock_bh(&vvs->rx_lock);
> > > > > +	if (total < len && !skb_queue_empty_lockless(&vvs-
> > > > > >rx_queue)) {
> > > > > +		skb = __skb_dequeue(&vvs->rx_queue);
> > > > > +
> > > > > +		total = len;
> > > > > +		if (total > skb->len - vsock_metadata(skb)-
> > > > > >off)
> > > > > +			total = skb->len - vsock_metadata(skb)-
> > > > > >off;
> > > > > +		else if (total < skb->len -
> > > > > vsock_metadata(skb)->off)
> > > > > +			msg->msg_flags |= MSG_TRUNC;
> > > > > +
> > > > > +		/* sk_lock is held by caller so no one else can
> > > > > dequeue.
> > > > > +		 * Unlock rx_lock since memcpy_to_msg() may
> > > > > sleep.
> > > > > +		 */
> > > > > +		spin_unlock_bh(&vvs->rx_lock);
> > > > > +
> > > > > +		err = memcpy_to_msg(msg, skb->data +
> > > > > vsock_metadata(skb)->off, total);
> > > > > +		if (err)
> > > > > +			return err;
> > > > > +
> > > > > +		spin_lock_bh(&vvs->rx_lock);
> > > > > +
> > > > > +		virtio_transport_dec_rx_pkt(vvs, skb);
> > > > > +		consume_skb(skb);
> > > > > +	}
> > > > > +
> > > > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs-
> > > > > >last_fwd_cnt);
> > > > > +
> > > > > +	spin_unlock_bh(&vvs->rx_lock);
> > > > > +
> > > > > +	if (total > 0 && msg->msg_name) {
> > > > > +		/* Provide the address of the sender. */
> > > > > +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr,
> > > > > msg->msg_name);
> > > > > +
> > > > > +		vsock_addr_init(vm_addr,
> > > > > le64_to_cpu(vsock_hdr(skb)->src_cid),
> > > > > +				le32_to_cpu(vsock_hdr(skb)-
> > > > > >src_port));
> > > > > +		msg->msg_namelen = sizeof(*vm_addr);
> > > > > +	}
> > > > > +	return total;
> > > > > +}
> > > > > +
> > > > > +static s64 virtio_transport_dgram_has_data(struct vsock_sock
> > > > > *vsk)
> > > > > +{
> > > > > +	return virtio_transport_stream_has_data(vsk);
> > > > > +}
> > > > > +
> > > > >  int
> > > > >  virtio_transport_seqpacket_enqueue(struct vsock_sock *vsk,
> > > > >  				   struct msghdr *msg,
> > > > > @@ -611,7 +675,66 @@ virtio_transport_dgram_dequeue(struct
> > > > > vsock_sock *vsk,
> > > > >  			       struct msghdr *msg,
> > > > >  			       size_t len, int flags)
> > > > >  {
> > > > > -	return -EOPNOTSUPP;
> > > > > +	struct sock *sk;
> > > > > +	size_t err = 0;
> > > > > +	long timeout;
> > > > > +
> > > > > +	DEFINE_WAIT(wait);
> > > > > +
> > > > > +	sk = &vsk->sk;
> > > > > +	err = 0;
> > > > > +
> > > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags &
> > > > > MSG_PEEK)
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	lock_sock(sk);
> > > > > +
> > > > > +	if (!len)
> > > > > +		goto out;
> > > > > +
> > > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > > +
> > > > > +	while (1) {
> > > > > +		s64 ready;
> > > > > +
> > > > > +		prepare_to_wait(sk_sleep(sk), &wait,
> > > > > TASK_INTERRUPTIBLE);
> > > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > > +
> > > > > +		if (ready == 0) {
> > > > > +			if (timeout == 0) {
> > > > > +				err = -EAGAIN;
> > > > > +				finish_wait(sk_sleep(sk),
> > > > > &wait);
> > > > > +				break;
> > > > > +			}
> > > > > +
> > > > > +			release_sock(sk);
> > > > > +			timeout = schedule_timeout(timeout);
> > > > > +			lock_sock(sk);
> > > > > +
> > > > > +			if (signal_pending(current)) {
> > > > > +				err = sock_intr_errno(timeout);
> > > > > +				finish_wait(sk_sleep(sk),
> > > > > &wait);
> > > > > +				break;
> > > > > +			} else if (timeout == 0) {
> > > > > +				err = -EAGAIN;
> > > > > +				finish_wait(sk_sleep(sk),
> > > > > &wait);
> > > > > +				break;
> > > > > +			}
> > > > > +		} else {
> > > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > > +
> > > > > +			if (ready < 0) {
> > > > > +				err = -ENOMEM;
> > > > > +				goto out;
> > > > > +			}
> > > > > +
> > > > > +			err =
> > > > > virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +out:
> > > > > +	release_sock(sk);
> > > > > +	return err;
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > ^^^
> > May be, this generic data waiting logic should be in af_vsock.c, as
> > for stream/seqpacket?
> > In this way, another transport which supports SOCK_DGRAM could
> > reuse it.
> 
> I think that is a great idea. I'll test that change for v2.
> 
> Thanks.

Also for v2, i tested Your patchset a little bit(write here to not
spread over all mails):
1) seqpacket test in vsock_test.c fails(seems MSG_EOR flag issue)
2) i can't do rmmod with the following config(after testing):
   CONFIG_VSOCKETS=m
   CONFIG_VIRTIO_VSOCKETS=m
   CONFIG_VIRTIO_VSOCKETS_COMMON=m
   CONFIG_VHOST=m
   CONFIG_VHOST_VSOCK=m
   Guest is shutdown, but rmmod fails.
3) virtio_transport_init + virtio_transport_exit seems must be
   under EXPORT_SYMBOL_GPL(), because both used in another module.
4) I tried to send 5kb(or 20kb not matter) piece of data, but got      
   kernel panic both in guest and later in host.

Thank You
> 
> > > > >  
> > > > > @@ -819,13 +942,13 @@
> > > > > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > > >  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > > >  				struct sockaddr_vm *addr)
> > > > >  {
> > > > > -	return -EOPNOTSUPP;
> > > > > +	return vsock_bind_stream(vsk, addr);
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > > >  
> > > > >  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > > >  {
> > > > > -	return false;
> > > > > +	return true;
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > > >  
> > > > > @@ -861,7 +984,16 @@ virtio_transport_dgram_enqueue(struct
> > > > > vsock_sock *vsk,
> > > > >  			       struct msghdr *msg,
> > > > >  			       size_t dgram_len)
> > > > >  {
> > > > > -	return -EOPNOTSUPP;
> > > > > +	struct virtio_vsock_pkt_info info = {
> > > > > +		.op = VIRTIO_VSOCK_OP_RW,
> > > > > +		.msg = msg,
> > > > > +		.pkt_len = dgram_len,
> > > > > +		.vsk = vsk,
> > > > > +		.remote_cid = remote_addr->svm_cid,
> > > > > +		.remote_port = remote_addr->svm_port,
> > > > > +	};
> > > > > +
> > > > > +	return virtio_transport_send_pkt_info(vsk, &info);
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > > > >  
> > > > > @@ -1165,6 +1297,12 @@ virtio_transport_recv_connected(struct
> > > > > sock *sk,
> > > > >  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > > >  	int err = 0;
> > > > >  
> > > > > +	if (le16_to_cpu(vsock_hdr(skb)->type) ==
> > > > > VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > +		virtio_transport_recv_enqueue(vsk, skb);
> > > > > +		sk->sk_data_ready(sk);
> > > > > +		return err;
> > > > > +	}
> > > > > +
> > > > >  	switch (le16_to_cpu(hdr->op)) {
> > > > >  	case VIRTIO_VSOCK_OP_RW:
> > > > >  		virtio_transport_recv_enqueue(vsk, skb);
> > > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct
> > > > > sock *sk, struct sk_buff *skb,
> > > > >  static bool virtio_transport_valid_type(u16 type)
> > > > >  {
> > > > >  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > > -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > > +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > > +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > > >  }
> > > > >  
> > > > >  /* We are under the virtio-vsock's vsock->rx_lock or vhost-
> > > > > vsock's vq->mutex
> > > > > @@ -1384,6 +1523,11 @@ void virtio_transport_recv_pkt(struct
> > > > > virtio_transport *t,
> > > > >  		goto free_pkt;
> > > > >  	}
> > > > >  
> > > > > +	if (sk->sk_type == SOCK_DGRAM) {
> > > > > +		virtio_transport_recv_connected(sk, skb);
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > >  	space_available = virtio_transport_space_update(sk,
> > > > > skb);
> > > > >  
> > > > >  	/* Update CID in case it has changed after a transport
> > > > > reset event */
> > > > > @@ -1415,6 +1559,7 @@ void virtio_transport_recv_pkt(struct
> > > > > virtio_transport *t,
> > > > >  		break;
> > > > >  	}
> > > > >  
> > > > > +out:
> > > > >  	release_sock(sk);
> > > > >  
> > > > >  	/* Release refcnt obtained when we fetched this socket
> > > > > out of the
> > > > > -- 
> > > > > 2.35.1
> > > > > 
> > > > 
> > > > -------------------------------------------------------------
> > > > --------
> > > > To unsubscribe, e-mail: 
> > > > virtio-dev-unsubscribe@lists.oasis-open.org
> > > > For additional commands, e-mail: 
> > > > virtio-dev-help@lists.oasis-open.org
> > > > 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-09-06 10:58   ` Michael S. Tsirkin
@ 2022-08-18 14:20     ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-18 14:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Tue, Sep 06, 2022 at 06:58:32AM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> > In order to support usage of qdisc on vsock traffic, this commit
> > introduces a struct net_device to vhost and virtio vsock.
> > 
> > Two new devices are created, vhost-vsock for vhost and virtio-vsock
> > for virtio. The devices are attached to the respective transports.
> > 
> > To bypass the usage of the device, the user may "down" the associated
> > network interface using common tools. For example, "ip link set dev
> > virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> > simply using the FIFO logic of the prior implementation.
> > 
> > For both hosts and guests, there is one device for all G2H vsock sockets
> > and one device for all H2G vsock sockets. This makes sense for guests
> > because the driver only supports a single vsock channel (one pair of
> > TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> > seem ideal for some workloads. However, it is possible to use a
> > multi-queue qdisc, where a given queue is responsible for a range of
> > sockets. This seems to be a better solution than having one device per
> > socket, which may yield a very large number of devices and qdiscs, all
> > of which are dynamically being created and destroyed. Because of this
> > dynamism, it would also require a complex policy management daemon, as
> > devices would constantly be spun up and down as sockets were created and
> > destroyed. To avoid this, one device and qdisc also applies to all H2G
> > sockets.
> > 
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> 
> 
> I've been thinking about this generally. vsock currently
> assumes reliability, but with qdisc can't we get
> packet drops e.g. depending on the queueing?
> 
> What prevents user from configuring such a discipline?
> One thing people like about vsock is that it's very hard
> to break H2G communication even with misconfigured
> networking.
> 

If qdisc decides to discard a packet, it returns NET_XMIT_CN via
dev_queue_xmit(). This v1 allows this quietly, but v2 could return
an error to the user (-ENOMEM or maybe -ENOBUFS) when this happens, similar
to when vsock is unable to enqueue a packet currently.

The user could still, for example, choose the noop qdisc. Assuming the
v2 change mentioned above, their sendmsg() calls will return errors.
Similar to how if they choose the wrong CID they will get an error when
connecting a socket.

Best,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-09-06 13:26 ` Stefan Hajnoczi
@ 2022-08-18 14:39   ` Bobby Eshleman
  2022-09-08  8:30     ` Stefano Garzarella
  2022-09-08 14:36     ` Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc] Stefano Garzarella
  0 siblings, 2 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-08-18 14:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefano Garzarella, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

On Tue, Sep 06, 2022 at 09:26:33AM -0400, Stefan Hajnoczi wrote:
> Hi Bobby,
> If you are attending Linux Foundation conferences in Dublin, Ireland
> next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
> Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
> Stefano Garzarella and others to discuss this patch series.
> 
> Using netdev and sk_buff is a big change to vsock. Discussing your
> requirements and the future direction of vsock in person could help.
> 
> If you won't be in Dublin, don't worry. You can schedule a video call if
> you feel it would be helpful to discuss these topics.
> 
> Stefan

Hey Stefan,

That sounds like a great idea! I was unable to make the Dublin trip work
so I think a video call would be best, of course if okay with everyone.

Thanks,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [virtio-dev] Re: [PATCH 5/6] virtio/vsock: add support for dgram
  2022-08-16 20:52             ` Bobby Eshleman
@ 2022-08-19  4:30               ` Arseniy Krasnov
  0 siblings, 0 replies; 67+ messages in thread
From: Arseniy Krasnov @ 2022-08-19  4:30 UTC (permalink / raw)
  To: bobbyeshleman
  Cc: kvm, jasowang, bobby.eshleman, davem, virtio-dev, stefanha,
	bobby.eshleman, linux-kernel, pabeni, edumazet, jiang.wang,
	sgarzare, kuba, cong.wang, netdev, virtualization, mst

On Tue, 2022-08-16 at 20:52 +0000, Bobby Eshleman wrote:
> On Thu, Aug 18, 2022 at 08:35:48AM +0000, Arseniy Krasnov wrote:
> > On Tue, 2022-08-16 at 09:58 +0000, Bobby Eshleman wrote:
> > > On Wed, Aug 17, 2022 at 05:42:08AM +0000, Arseniy Krasnov wrote:
> > > > On 17.08.2022 08:01, Arseniy Krasnov wrote:
> > > > > On 16.08.2022 05:32, Bobby Eshleman wrote:
> > > > > > CC'ing virtio-dev@lists.oasis-open.org
> > > > > > 
> > > > > > On Mon, Aug 15, 2022 at 10:56:08AM -0700, Bobby Eshleman
> > > > > > wrote:
> > > > > > > This patch supports dgram in virtio and on the vhost
> > > > > > > side.
> > > > > Hello,
> > > > > 
> > > > > sorry, i don't understand, how this maintains message
> > > > > boundaries?
> > > > > Or it
> > > > > is unnecessary for SOCK_DGRAM?
> > > > > 
> > > > > Thanks
> > > > > > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > > > > > > Signed-off-by: Bobby Eshleman <
> > > > > > > bobby.eshleman@bytedance.com>
> > > > > > > ---
> > > > > > >  drivers/vhost/vsock.c                   |   2 +-
> > > > > > >  include/net/af_vsock.h                  |   2 +
> > > > > > >  include/uapi/linux/virtio_vsock.h       |   1 +
> > > > > > >  net/vmw_vsock/af_vsock.c                |  26 +++-
> > > > > > >  net/vmw_vsock/virtio_transport.c        |   2 +-
> > > > > > >  net/vmw_vsock/virtio_transport_common.c | 173
> > > > > > > ++++++++++++++++++++++--
> > > > > > >  6 files changed, 186 insertions(+), 20 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/vhost/vsock.c
> > > > > > > b/drivers/vhost/vsock.c
> > > > > > > index a5d1bdb786fe..3dc72a5647ca 100644
> > > > > > > --- a/drivers/vhost/vsock.c
> > > > > > > +++ b/drivers/vhost/vsock.c
> > > > > > > @@ -925,7 +925,7 @@ static int __init
> > > > > > > vhost_vsock_init(void)
> > > > > > >  	int ret;
> > > > > > >  
> > > > > > >  	ret = vsock_core_register(&vhost_transport.transport,
> > > > > > > -				  VSOCK_TRANSPORT_F_H2G);
> > > > > > > +				  VSOCK_TRANSPORT_F_H2G |
> > > > > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > > > >  	if (ret < 0)
> > > > > > >  		return ret;
> > > > > > >  
> > > > > > > diff --git a/include/net/af_vsock.h
> > > > > > > b/include/net/af_vsock.h
> > > > > > > index 1c53c4c4d88f..37e55c81e4df 100644
> > > > > > > --- a/include/net/af_vsock.h
> > > > > > > +++ b/include/net/af_vsock.h
> > > > > > > @@ -78,6 +78,8 @@ struct vsock_sock {
> > > > > > >  s64 vsock_stream_has_data(struct vsock_sock *vsk);
> > > > > > >  s64 vsock_stream_has_space(struct vsock_sock *vsk);
> > > > > > >  struct sock *vsock_create_connected(struct sock
> > > > > > > *parent);
> > > > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > > > +		      struct sockaddr_vm *addr);
> > > > > > >  
> > > > > > >  /**** TRANSPORT ****/
> > > > > > >  
> > > > > > > diff --git a/include/uapi/linux/virtio_vsock.h
> > > > > > > b/include/uapi/linux/virtio_vsock.h
> > > > > > > index 857df3a3a70d..0975b9c88292 100644
> > > > > > > --- a/include/uapi/linux/virtio_vsock.h
> > > > > > > +++ b/include/uapi/linux/virtio_vsock.h
> > > > > > > @@ -70,6 +70,7 @@ struct virtio_vsock_hdr {
> > > > > > >  enum virtio_vsock_type {
> > > > > > >  	VIRTIO_VSOCK_TYPE_STREAM = 1,
> > > > > > >  	VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> > > > > > > +	VIRTIO_VSOCK_TYPE_DGRAM = 3,
> > > > > > >  };
> > > > > > >  
> > > > > > >  enum virtio_vsock_op {
> > > > > > > diff --git a/net/vmw_vsock/af_vsock.c
> > > > > > > b/net/vmw_vsock/af_vsock.c
> > > > > > > index 1893f8aafa48..87e4ae1866d3 100644
> > > > > > > --- a/net/vmw_vsock/af_vsock.c
> > > > > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > > > > @@ -675,6 +675,19 @@ static int
> > > > > > > __vsock_bind_connectible(struct vsock_sock *vsk,
> > > > > > >  	return 0;
> > > > > > >  }
> > > > > > >  
> > > > > > > +int vsock_bind_stream(struct vsock_sock *vsk,
> > > > > > > +		      struct sockaddr_vm *addr)
> > > > > > > +{
> > > > > > > +	int retval;
> > > > > > > +
> > > > > > > +	spin_lock_bh(&vsock_table_lock);
> > > > > > > +	retval = __vsock_bind_connectible(vsk, addr);
> > > > > > > +	spin_unlock_bh(&vsock_table_lock);
> > > > > > > +
> > > > > > > +	return retval;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(vsock_bind_stream);
> > > > > > > +
> > > > > > >  static int __vsock_bind_dgram(struct vsock_sock *vsk,
> > > > > > >  			      struct sockaddr_vm *addr)
> > > > > > >  {
> > > > > > > @@ -2363,11 +2376,16 @@ int vsock_core_register(const
> > > > > > > struct
> > > > > > > vsock_transport *t, int features)
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> > > > > > > -		if (t_dgram) {
> > > > > > > -			err = -EBUSY;
> > > > > > > -			goto err_busy;
> > > > > > > +		/* TODO: always chose the G2H variant over
> > > > > > > others, support nesting later */
> > > > > > > +		if (features & VSOCK_TRANSPORT_F_G2H) {
> > > > > > > +			if (t_dgram)
> > > > > > > +				pr_warn("virtio_vsock: t_dgram
> > > > > > > already set\n");
> > > > > > > +			t_dgram = t;
> > > > > > > +		}
> > > > > > > +
> > > > > > > +		if (!t_dgram) {
> > > > > > > +			t_dgram = t;
> > > > > > >  		}
> > > > > > > -		t_dgram = t;
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	if (features & VSOCK_TRANSPORT_F_LOCAL) {
> > > > > > > diff --git a/net/vmw_vsock/virtio_transport.c
> > > > > > > b/net/vmw_vsock/virtio_transport.c
> > > > > > > index 073314312683..d4526ca462d2 100644
> > > > > > > --- a/net/vmw_vsock/virtio_transport.c
> > > > > > > +++ b/net/vmw_vsock/virtio_transport.c
> > > > > > > @@ -850,7 +850,7 @@ static int __init
> > > > > > > virtio_vsock_init(void)
> > > > > > >  		return -ENOMEM;
> > > > > > >  
> > > > > > >  	ret = vsock_core_register(&virtio_transport.transport,
> > > > > > > -				  VSOCK_TRANSPORT_F_G2H);
> > > > > > > +				  VSOCK_TRANSPORT_F_G2H |
> > > > > > > VSOCK_TRANSPORT_F_DGRAM);
> > > > > > >  	if (ret)
> > > > > > >  		goto out_wq;
> > > > > > >  
> > > > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c
> > > > > > > b/net/vmw_vsock/virtio_transport_common.c
> > > > > > > index bdf16fff054f..aedb48728677 100644
> > > > > > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > > > > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > > > > > @@ -229,7 +229,9 @@
> > > > > > > EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> > > > > > >  
> > > > > > >  static u16 virtio_transport_get_type(struct sock *sk)
> > > > > > >  {
> > > > > > > -	if (sk->sk_type == SOCK_STREAM)
> > > > > > > +	if (sk->sk_type == SOCK_DGRAM)
> > > > > > > +		return VIRTIO_VSOCK_TYPE_DGRAM;
> > > > > > > +	else if (sk->sk_type == SOCK_STREAM)
> > > > > > >  		return VIRTIO_VSOCK_TYPE_STREAM;
> > > > > > >  	else
> > > > > > >  		return VIRTIO_VSOCK_TYPE_SEQPACKET;
> > > > > > > @@ -287,22 +289,29 @@ static int
> > > > > > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> > > > > > >  	vvs = vsk->trans;
> > > > > > >  
> > > > > > >  	/* we can send less than pkt_len bytes */
> > > > > > > -	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> > > > > > > -		pkt_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > > > +	if (pkt_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > > > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > > > +			pkt_len =
> > > > > > > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> > > > > > > +		else
> > > > > > > +			return 0;
> > > > > > > +	}
> > > > > > >  
> > > > > > > -	/* virtio_transport_get_credit might return less than
> > > > > > > pkt_len credit */
> > > > > > > -	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
> > > > > > > +	if (info->type != VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > > +		/* virtio_transport_get_credit might return
> > > > > > > less than pkt_len credit */
> > > > > > > +		pkt_len = virtio_transport_get_credit(vvs,
> > > > > > > pkt_len);
> > > > > > >  
> > > > > > > -	/* Do not send zero length OP_RW pkt */
> > > > > > > -	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > > > > > > -		return pkt_len;
> > > > > > > +		/* Do not send zero length OP_RW pkt */
> > > > > > > +		if (pkt_len == 0 && info->op ==
> > > > > > > VIRTIO_VSOCK_OP_RW)
> > > > > > > +			return pkt_len;
> > > > > > > +	}
> > > > > > >  
> > > > > > >  	skb = virtio_transport_alloc_skb(info, pkt_len,
> > > > > > >  					 src_cid, src_port,
> > > > > > >  					 dst_cid, dst_port,
> > > > > > >  					 &err);
> > > > > > >  	if (!skb) {
> > > > > > > -		virtio_transport_put_credit(vvs, pkt_len);
> > > > > > > +		if (info->type != VIRTIO_VSOCK_TYPE_DGRAM)
> > > > > > > +			virtio_transport_put_credit(vvs,
> > > > > > > pkt_len);
> > > > > > >  		return err;
> > > > > > >  	}
> > > > > > >  
> > > > > > > @@ -586,6 +595,61 @@
> > > > > > > virtio_transport_seqpacket_dequeue(struct vsock_sock
> > > > > > > *vsk,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_seqpacket_dequeue);
> > > > > > >  
> > > > > > > +static ssize_t
> > > > > > > +virtio_transport_dgram_do_dequeue(struct vsock_sock
> > > > > > > *vsk,
> > > > > > > +				  struct msghdr *msg, size_t
> > > > > > > len)
> > > > > > > +{
> > > > > > > +	struct virtio_vsock_sock *vvs = vsk->trans;
> > > > > > > +	struct sk_buff *skb;
> > > > > > > +	size_t total = 0;
> > > > > > > +	u32 free_space;
> > > > > > > +	int err = -EFAULT;
> > > > > > > +
> > > > > > > +	spin_lock_bh(&vvs->rx_lock);
> > > > > > > +	if (total < len && !skb_queue_empty_lockless(&vvs-
> > > > > > > > rx_queue)) {
> > > > > > > +		skb = __skb_dequeue(&vvs->rx_queue);
> > > > > > > +
> > > > > > > +		total = len;
> > > > > > > +		if (total > skb->len - vsock_metadata(skb)-
> > > > > > > > off)
> > > > > > > +			total = skb->len - vsock_metadata(skb)-
> > > > > > > > off;
> > > > > > > +		else if (total < skb->len -
> > > > > > > vsock_metadata(skb)->off)
> > > > > > > +			msg->msg_flags |= MSG_TRUNC;
> > > > > > > +
> > > > > > > +		/* sk_lock is held by caller so no one else can
> > > > > > > dequeue.
> > > > > > > +		 * Unlock rx_lock since memcpy_to_msg() may
> > > > > > > sleep.
> > > > > > > +		 */
> > > > > > > +		spin_unlock_bh(&vvs->rx_lock);
> > > > > > > +
> > > > > > > +		err = memcpy_to_msg(msg, skb->data +
> > > > > > > vsock_metadata(skb)->off, total);
> > > > > > > +		if (err)
> > > > > > > +			return err;
> > > > > > > +
> > > > > > > +		spin_lock_bh(&vvs->rx_lock);
> > > > > > > +
> > > > > > > +		virtio_transport_dec_rx_pkt(vvs, skb);
> > > > > > > +		consume_skb(skb);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs-
> > > > > > > > last_fwd_cnt);
> > > > > > > +
> > > > > > > +	spin_unlock_bh(&vvs->rx_lock);
> > > > > > > +
> > > > > > > +	if (total > 0 && msg->msg_name) {
> > > > > > > +		/* Provide the address of the sender. */
> > > > > > > +		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr,
> > > > > > > msg->msg_name);
> > > > > > > +
> > > > > > > +		vsock_addr_init(vm_addr,
> > > > > > > le64_to_cpu(vsock_hdr(skb)->src_cid),
> > > > > > > +				le32_to_cpu(vsock_hdr(skb)-
> > > > > > > > src_port));
> > > > > > > +		msg->msg_namelen = sizeof(*vm_addr);
> > > > > > > +	}
> > > > > > > +	return total;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static s64 virtio_transport_dgram_has_data(struct
> > > > > > > vsock_sock
> > > > > > > *vsk)
> > > > > > > +{
> > > > > > > +	return virtio_transport_stream_has_data(vsk);
> > > > > > > +}
> > > > > > > +
> > > > > > >  int
> > > > > > >  virtio_transport_seqpacket_enqueue(struct vsock_sock
> > > > > > > *vsk,
> > > > > > >  				   struct msghdr *msg,
> > > > > > > @@ -611,7 +675,66 @@
> > > > > > > virtio_transport_dgram_dequeue(struct
> > > > > > > vsock_sock *vsk,
> > > > > > >  			       struct msghdr *msg,
> > > > > > >  			       size_t len, int flags)
> > > > > > >  {
> > > > > > > -	return -EOPNOTSUPP;
> > > > > > > +	struct sock *sk;
> > > > > > > +	size_t err = 0;
> > > > > > > +	long timeout;
> > > > > > > +
> > > > > > > +	DEFINE_WAIT(wait);
> > > > > > > +
> > > > > > > +	sk = &vsk->sk;
> > > > > > > +	err = 0;
> > > > > > > +
> > > > > > > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE || flags &
> > > > > > > MSG_PEEK)
> > > > > > > +		return -EOPNOTSUPP;
> > > > > > > +
> > > > > > > +	lock_sock(sk);
> > > > > > > +
> > > > > > > +	if (!len)
> > > > > > > +		goto out;
> > > > > > > +
> > > > > > > +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> > > > > > > +
> > > > > > > +	while (1) {
> > > > > > > +		s64 ready;
> > > > > > > +
> > > > > > > +		prepare_to_wait(sk_sleep(sk), &wait,
> > > > > > > TASK_INTERRUPTIBLE);
> > > > > > > +		ready = virtio_transport_dgram_has_data(vsk);
> > > > > > > +
> > > > > > > +		if (ready == 0) {
> > > > > > > +			if (timeout == 0) {
> > > > > > > +				err = -EAGAIN;
> > > > > > > +				finish_wait(sk_sleep(sk),
> > > > > > > &wait);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +
> > > > > > > +			release_sock(sk);
> > > > > > > +			timeout = schedule_timeout(timeout);
> > > > > > > +			lock_sock(sk);
> > > > > > > +
> > > > > > > +			if (signal_pending(current)) {
> > > > > > > +				err = sock_intr_errno(timeout);
> > > > > > > +				finish_wait(sk_sleep(sk),
> > > > > > > &wait);
> > > > > > > +				break;
> > > > > > > +			} else if (timeout == 0) {
> > > > > > > +				err = -EAGAIN;
> > > > > > > +				finish_wait(sk_sleep(sk),
> > > > > > > &wait);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +		} else {
> > > > > > > +			finish_wait(sk_sleep(sk), &wait);
> > > > > > > +
> > > > > > > +			if (ready < 0) {
> > > > > > > +				err = -ENOMEM;
> > > > > > > +				goto out;
> > > > > > > +			}
> > > > > > > +
> > > > > > > +			err =
> > > > > > > virtio_transport_dgram_do_dequeue(vsk, msg, len);
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +out:
> > > > > > > +	release_sock(sk);
> > > > > > > +	return err;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > > > ^^^
> > > > May be, this generic data waiting logic should be in
> > > > af_vsock.c, as
> > > > for stream/seqpacket?
> > > > In this way, another transport which supports SOCK_DGRAM could
> > > > reuse it.
> > > 
> > > I think that is a great idea. I'll test that change for v2.
> > > 
> > > Thanks.
> > 
> > Also for v2, i tested Your patchset a little bit(write here to not
> > spread over all mails):
> > 1) seqpacket test in vsock_test.c fails(seems MSG_EOR flag issue)
> 
> I will investigate.
> 
> > 2) i can't do rmmod with the following config(after testing):
> >    CONFIG_VSOCKETS=m
> >    CONFIG_VIRTIO_VSOCKETS=m
> >    CONFIG_VIRTIO_VSOCKETS_COMMON=m
> >    CONFIG_VHOST=m
> >    CONFIG_VHOST_VSOCK=m
> >    Guest is shutdown, but rmmod fails.
> > 3) virtio_transport_init + virtio_transport_exit seems must be
> >    under EXPORT_SYMBOL_GPL(), because both used in another module.
> 
> Definitely, will fix.
> 
> > 4) I tried to send 5kb(or 20kb not matter) piece of data, but
> > got      
> >    kernel panic both in guest and later in host.
> > 
> 
> Thanks for catching that. I can reproduce it intermittently, but only
> for seqpacket. Did you happen to see this for other socket types as
> well?
> 
> Thanks

I got this for SOCK_DGRAM, i didnt test seqpacket or stream.

Thanks, Arseniy

> 
> > Thank You
> > > > > > >  
> > > > > > > @@ -819,13 +942,13 @@
> > > > > > > EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > > > > > >  int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > > > > > >  				struct sockaddr_vm *addr)
> > > > > > >  {
> > > > > > > -	return -EOPNOTSUPP;
> > > > > > > +	return vsock_bind_stream(vsk, addr);
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > > > > > >  
> > > > > > >  bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > > > > > >  {
> > > > > > > -	return false;
> > > > > > > +	return true;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > > > > > >  
> > > > > > > @@ -861,7 +984,16 @@
> > > > > > > virtio_transport_dgram_enqueue(struct
> > > > > > > vsock_sock *vsk,
> > > > > > >  			       struct msghdr *msg,
> > > > > > >  			       size_t dgram_len)
> > > > > > >  {
> > > > > > > -	return -EOPNOTSUPP;
> > > > > > > +	struct virtio_vsock_pkt_info info = {
> > > > > > > +		.op = VIRTIO_VSOCK_OP_RW,
> > > > > > > +		.msg = msg,
> > > > > > > +		.pkt_len = dgram_len,
> > > > > > > +		.vsk = vsk,
> > > > > > > +		.remote_cid = remote_addr->svm_cid,
> > > > > > > +		.remote_port = remote_addr->svm_port,
> > > > > > > +	};
> > > > > > > +
> > > > > > > +	return virtio_transport_send_pkt_info(vsk, &info);
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > > > > > >  
> > > > > > > @@ -1165,6 +1297,12 @@
> > > > > > > virtio_transport_recv_connected(struct
> > > > > > > sock *sk,
> > > > > > >  	struct virtio_vsock_hdr *hdr = vsock_hdr(skb);
> > > > > > >  	int err = 0;
> > > > > > >  
> > > > > > > +	if (le16_to_cpu(vsock_hdr(skb)->type) ==
> > > > > > > VIRTIO_VSOCK_TYPE_DGRAM) {
> > > > > > > +		virtio_transport_recv_enqueue(vsk, skb);
> > > > > > > +		sk->sk_data_ready(sk);
> > > > > > > +		return err;
> > > > > > > +	}
> > > > > > > +
> > > > > > >  	switch (le16_to_cpu(hdr->op)) {
> > > > > > >  	case VIRTIO_VSOCK_OP_RW:
> > > > > > >  		virtio_transport_recv_enqueue(vsk, skb);
> > > > > > > @@ -1320,7 +1458,8 @@ virtio_transport_recv_listen(struct
> > > > > > > sock *sk, struct sk_buff *skb,
> > > > > > >  static bool virtio_transport_valid_type(u16 type)
> > > > > > >  {
> > > > > > >  	return (type == VIRTIO_VSOCK_TYPE_STREAM) ||
> > > > > > > -	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET);
> > > > > > > +	       (type == VIRTIO_VSOCK_TYPE_SEQPACKET) ||
> > > > > > > +	       (type == VIRTIO_VSOCK_TYPE_DGRAM);
> > > > > > >  }
> > > > > > >  
> > > > > > >  /* We are under the virtio-vsock's vsock->rx_lock or
> > > > > > > vhost-
> > > > > > > vsock's vq->mutex
> > > > > > > @@ -1384,6 +1523,11 @@ void
> > > > > > > virtio_transport_recv_pkt(struct
> > > > > > > virtio_transport *t,
> > > > > > >  		goto free_pkt;
> > > > > > >  	}
> > > > > > >  
> > > > > > > +	if (sk->sk_type == SOCK_DGRAM) {
> > > > > > > +		virtio_transport_recv_connected(sk, skb);
> > > > > > > +		goto out;
> > > > > > > +	}
> > > > > > > +
> > > > > > >  	space_available = virtio_transport_space_update(sk,
> > > > > > > skb);
> > > > > > >  
> > > > > > >  	/* Update CID in case it has changed after a transport
> > > > > > > reset event */
> > > > > > > @@ -1415,6 +1559,7 @@ void
> > > > > > > virtio_transport_recv_pkt(struct
> > > > > > > virtio_transport *t,
> > > > > > >  		break;
> > > > > > >  	}
> > > > > > >  
> > > > > > > +out:
> > > > > > >  	release_sock(sk);
> > > > > > >  
> > > > > > >  	/* Release refcnt obtained when we fetched this socket
> > > > > > > out of the
> > > > > > > -- 
> > > > > > > 2.35.1
> > > > > > > 
> > > > > > 
> > > > > > ---------------------------------------------------------
> > > > > > ----
> > > > > > --------
> > > > > > To unsubscribe, e-mail: 
> > > > > > virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > For additional commands, e-mail: 
> > > > > > virtio-dev-help@lists.oasis-open.org
> > > > > > 
> > > 
> > > ---------------------------------------------------------------
> > > ------
> > > To unsubscribe, e-mail: 
> > > virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: 
> > > virtio-dev-help@lists.oasis-open.org
> > > 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-18  4:28   ` Jason Wang
@ 2022-09-06  9:00     ` Stefano Garzarella
  0 siblings, 0 replies; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-06  9:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Bobby Eshleman, Bobby Eshleman,
	Bobby Eshleman, Cong Wang, Jiang Wang, Stefan Hajnoczi,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

On Thu, Aug 18, 2022 at 12:28:48PM +0800, Jason Wang wrote:
>
>在 2022/8/17 14:54, Michael S. Tsirkin 写道:
>>On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>>>Hey everybody,
>>>
>>>This series introduces datagrams, packet scheduling, and sk_buff usage
>>>to virtio vsock.
>>>
>>>The usage of struct sk_buff benefits users by a) preparing vsock to use
>>>other related systems that require sk_buff, such as sockmap and qdisc,
>>>b) supporting basic congestion control via sock_alloc_send_skb, and c)
>>>reducing copying when delivering packets to TAP.
>>>
>>>The socket layer no longer forces errors to be -ENOMEM, as typically
>>>userspace expects -EAGAIN when the sk_sndbuf threshold is reached and
>>>messages are being sent with option MSG_DONTWAIT.
>>>
>>>The datagram work is based off previous patches by Jiang Wang[1].
>>>
>>>The introduction of datagrams creates a transport layer fairness issue
>>>where datagrams may freely starve streams of queue access. This happens
>>>because, unlike streams, datagrams lack the transactions necessary for
>>>calculating credits and throttling.
>>>
>>>Previous proposals introduce changes to the spec to add an additional
>>>virtqueue pair for datagrams[1]. Although this solution works, using
>>>Linux's qdisc for packet scheduling leverages already existing systems,
>>>avoids the need to change the virtio specification, and gives additional
>>>capabilities. The usage of SFQ or fq_codel, for example, may solve the
>>>transport layer starvation problem. It is easy to imagine other use
>>>cases as well. For example, services of varying importance may be
>>>assigned different priorities, and qdisc will apply appropriate
>>>priority-based scheduling. By default, the system default pfifo qdisc is
>>>used. The qdisc may be bypassed and legacy queuing is resumed by simply
>>>setting the virtio-vsock%d network device to state DOWN. This technique
>>>still allows vsock to work with zero-configuration.
>>The basic question to answer then is this: with a net device qdisc
>>etc in the picture, how is this different from virtio net then?
>>Why do you still want to use vsock?
>
>
>Or maybe it's time to revisit an old idea[1] to unify at least the 
>driver part (e.g using virtio-net driver for vsock then we can all 
>features that vsock is lacking now)?

Sorry for coming late to the discussion!

This would be great, though, last time I had looked at it, I had found 
it quite complicated. The main problem is trying to avoid all the 
net-specific stuff (MTU, ethernet header, HW offloading, etc.).

Maybe we could start thinking about this idea by adding a new transport 
to vsock (e.g. virtio-net-vsock) completely separate from what we have 
now.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/6] vsock: add netdev to vhost/virtio vsock
  2022-08-15 17:56 ` [PATCH 3/6] vsock: add netdev to vhost/virtio vsock Bobby Eshleman
  2022-08-16  2:31   ` Bobby Eshleman
  2022-08-16 16:38   ` Michael S. Tsirkin
@ 2022-09-06 10:58   ` Michael S. Tsirkin
  2022-08-18 14:20     ` Bobby Eshleman
  2 siblings, 1 reply; 67+ messages in thread
From: Michael S. Tsirkin @ 2022-09-06 10:58 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Stefano Garzarella, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Mon, Aug 15, 2022 at 10:56:06AM -0700, Bobby Eshleman wrote:
> In order to support usage of qdisc on vsock traffic, this commit
> introduces a struct net_device to vhost and virtio vsock.
> 
> Two new devices are created, vhost-vsock for vhost and virtio-vsock
> for virtio. The devices are attached to the respective transports.
> 
> To bypass the usage of the device, the user may "down" the associated
> network interface using common tools. For example, "ip link set dev
> virtio-vsock down" lets vsock bypass the net_device and qdisc entirely,
> simply using the FIFO logic of the prior implementation.
> 
> For both hosts and guests, there is one device for all G2H vsock sockets
> and one device for all H2G vsock sockets. This makes sense for guests
> because the driver only supports a single vsock channel (one pair of
> TX/RX virtqueues), so one device and qdisc fits. For hosts, this may not
> seem ideal for some workloads. However, it is possible to use a
> multi-queue qdisc, where a given queue is responsible for a range of
> sockets. This seems to be a better solution than having one device per
> socket, which may yield a very large number of devices and qdiscs, all
> of which are dynamically being created and destroyed. Because of this
> dynamism, it would also require a complex policy management daemon, as
> devices would constantly be spun up and down as sockets were created and
> destroyed. To avoid this, one device and qdisc also applies to all H2G
> sockets.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>


I've been thinking about this generally. vsock currently
assumes reliability, but with qdisc can't we get
packet drops e.g. depending on the queueing?

What prevents user from configuring such a discipline?
One thing people like about vsock is that it's very hard
to break H2G communication even with misconfigured
networking.

-- 
MST


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (9 preceding siblings ...)
  2022-08-17  6:54 ` Michael S. Tsirkin
@ 2022-09-06 13:26 ` Stefan Hajnoczi
  2022-08-18 14:39   ` Bobby Eshleman
  2022-09-26 13:42 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Stefano Garzarella
  11 siblings, 1 reply; 67+ messages in thread
From: Stefan Hajnoczi @ 2022-09-06 13:26 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefano Garzarella, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv

[-- Attachment #1: Type: text/plain, Size: 552 bytes --]

Hi Bobby,
If you are attending Linux Foundation conferences in Dublin, Ireland
next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
Stefano Garzarella and others to discuss this patch series.

Using netdev and sk_buff is a big change to vsock. Discussing your
requirements and the future direction of vsock in person could help.

If you won't be in Dublin, don't worry. You can schedule a video call if
you feel it would be helpful to discuss these topics.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-18 14:39   ` Bobby Eshleman
@ 2022-09-08  8:30     ` Stefano Garzarella
  2022-09-08 14:36     ` Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc] Stefano Garzarella
  1 sibling, 0 replies; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-08  8:30 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman, Cong Wang,
	Jiang Wang, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Thu, Aug 18, 2022 at 02:39:32PM +0000, Bobby Eshleman wrote:
>On Tue, Sep 06, 2022 at 09:26:33AM -0400, Stefan Hajnoczi wrote:
>> Hi Bobby,
>> If you are attending Linux Foundation conferences in Dublin, Ireland
>> next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
>> Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
>> Stefano Garzarella and others to discuss this patch series.
>>
>> Using netdev and sk_buff is a big change to vsock. Discussing your
>> requirements and the future direction of vsock in person could help.
>>
>> If you won't be in Dublin, don't worry. You can schedule a video call if
>> you feel it would be helpful to discuss these topics.
>>
>> Stefan
>
>Hey Stefan,
>
>That sounds like a great idea!

Yep, I agree!

>I was unable to make the Dublin trip work
>so I think a video call would be best, of course if okay with everyone.

It will work for me, but I'll be a bit busy in the next 2 weeks:

 From Sep 12 to Sep 14 I'll be at KVM Forum, so it may be difficult to 
arrange, but we can try.

Sep 15 I'm not available.
Sep 16 I'm traveling, but early in my morning, so I should be available.

Form Sep 10 to Sep 23 I'll be mostly off, but I can try to find some 
slots if needed.

 From Sep 26 I'm back and fully available.

Let's see if others are available and try to find a slot :-)

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc]
  2022-08-18 14:39   ` Bobby Eshleman
  2022-09-08  8:30     ` Stefano Garzarella
@ 2022-09-08 14:36     ` Stefano Garzarella
  2022-09-09 18:13       ` Bobby Eshleman
  1 sibling, 1 reply; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-08 14:36 UTC (permalink / raw)
  To: Bobby Eshleman, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Michael S. Tsirkin, Jason Wang, Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman, Cong Wang,
	Jiang Wang, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Thu, Aug 18, 2022 at 02:39:32PM +0000, Bobby Eshleman wrote:
>On Tue, Sep 06, 2022 at 09:26:33AM -0400, Stefan Hajnoczi wrote:
>> Hi Bobby,
>> If you are attending Linux Foundation conferences in Dublin, Ireland
>> next week (Linux Plumbers Conference, Open Source Summit Europe, KVM
>> Forum, ContainerCon Europe, CloudOpen Europe, etc) then you could meet
>> Stefano Garzarella and others to discuss this patch series.
>>
>> Using netdev and sk_buff is a big change to vsock. Discussing your
>> requirements and the future direction of vsock in person could help.
>>
>> If you won't be in Dublin, don't worry. You can schedule a video call if
>> you feel it would be helpful to discuss these topics.
>>
>> Stefan
>
>Hey Stefan,
>
>That sounds like a great idea! I was unable to make the Dublin trip work
>so I think a video call would be best, of course if okay with everyone.

Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 
16:30 UTC.

Could this work for you?

It would be nice to also have HyperV and VMCI people in the call and 
anyone else who is interested of course.

@Dexuan @Bryan @Vishnu can you attend?

@MST @Jason @Stefan if you can be there that would be great, we could 
connect together from Dublin.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc]
  2022-09-08 14:36     ` Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc] Stefano Garzarella
@ 2022-09-09 18:13       ` Bobby Eshleman
  2022-09-12 18:12         ` Stefano Garzarella
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-09-09 18:13 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Dexuan Cui, Bryan Tan, Vishnu Dasa, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman,
	Cong Wang, Jiang Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv

Hey Stefano, thanks for sending this out.

On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> 
> Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> UTC.
> 
> Could this work for you?

Unfortunately, I can't make this time slot.

My schedule also opens up a lot the week of the 26th, especially between
16:00 and 19:00 UTC, as well as after 22:00 UTC.

Best,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc]
  2022-09-12 18:12         ` Stefano Garzarella
@ 2022-09-09 23:33           ` Bobby Eshleman
  2022-09-16  3:51             ` Stefano Garzarella
  0 siblings, 1 reply; 67+ messages in thread
From: Bobby Eshleman @ 2022-09-09 23:33 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Dexuan Cui, Bryan Tan, Vishnu Dasa, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman,
	Cong Wang, Jiang Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv

On Mon, Sep 12, 2022 at 08:12:58PM +0200, Stefano Garzarella wrote:
> On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> >
> > Hey Stefano, thanks for sending this out.
> >
> > On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> > >
> > > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > > UTC.
> > >
> > > Could this work for you?
> >
> > Unfortunately, I can't make this time slot.
> 
> No problem at all!
> 
> >
> > My schedule also opens up a lot the week of the 26th, especially between
> > 16:00 and 19:00 UTC, as well as after 22:00 UTC.
> 
> Great, that week works for me too.
> What about Sep 27 @ 16:00 UTC?
> 

That time works for me!

Thanks,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc]
  2022-09-16  3:51             ` Stefano Garzarella
@ 2022-09-10 16:29               ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-09-10 16:29 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Dexuan Cui, Bryan Tan, Vishnu Dasa, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman,
	Cong Wang, Jiang Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv

On Fri, Sep 16, 2022 at 05:51:22AM +0200, Stefano Garzarella wrote:
> On Mon, Sep 12, 2022 at 8:28 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> >
> > On Mon, Sep 12, 2022 at 08:12:58PM +0200, Stefano Garzarella wrote:
> > > On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> > > >
> > > > Hey Stefano, thanks for sending this out.
> > > >
> > > > On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> > > > >
> > > > > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > > > > UTC.
> > > > >
> > > > > Could this work for you?
> > > >
> > > > Unfortunately, I can't make this time slot.
> > >
> > > No problem at all!
> > >
> > > >
> > > > My schedule also opens up a lot the week of the 26th, especially between
> > > > 16:00 and 19:00 UTC, as well as after 22:00 UTC.
> > >
> > > Great, that week works for me too.
> > > What about Sep 27 @ 16:00 UTC?
> > >
> >
> > That time works for me!
> 
> Great! I sent you an invitation.
> 

Awesome, see you then!

Thanks,
Bobby

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc]
  2022-09-09 18:13       ` Bobby Eshleman
@ 2022-09-12 18:12         ` Stefano Garzarella
  2022-09-09 23:33           ` Bobby Eshleman
  0 siblings, 1 reply; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-12 18:12 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Dexuan Cui, Bryan Tan, Vishnu Dasa, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman,
	Cong Wang, Jiang Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv

On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>
> Hey Stefano, thanks for sending this out.
>
> On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> >
> > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > UTC.
> >
> > Could this work for you?
>
> Unfortunately, I can't make this time slot.

No problem at all!

>
> My schedule also opens up a lot the week of the 26th, especially between
> 16:00 and 19:00 UTC, as well as after 22:00 UTC.

Great, that week works for me too.
What about Sep 27 @ 16:00 UTC?

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc]
  2022-09-09 23:33           ` Bobby Eshleman
@ 2022-09-16  3:51             ` Stefano Garzarella
  2022-09-10 16:29               ` Bobby Eshleman
  0 siblings, 1 reply; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-16  3:51 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Dexuan Cui, Bryan Tan, Vishnu Dasa, Michael S. Tsirkin,
	Jason Wang, Stefan Hajnoczi, Bobby Eshleman, Bobby Eshleman,
	Cong Wang, Jiang Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv

On Mon, Sep 12, 2022 at 8:28 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>
> On Mon, Sep 12, 2022 at 08:12:58PM +0200, Stefano Garzarella wrote:
> > On Fri, Sep 9, 2022 at 8:13 PM Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> > >
> > > Hey Stefano, thanks for sending this out.
> > >
> > > On Thu, Sep 08, 2022 at 04:36:52PM +0200, Stefano Garzarella wrote:
> > > >
> > > > Looking better at the KVM forum sched, I found 1h slot for Sep 15 at 16:30
> > > > UTC.
> > > >
> > > > Could this work for you?
> > >
> > > Unfortunately, I can't make this time slot.
> >
> > No problem at all!
> >
> > >
> > > My schedule also opens up a lot the week of the 26th, especially between
> > > 16:00 and 19:00 UTC, as well as after 22:00 UTC.
> >
> > Great, that week works for me too.
> > What about Sep 27 @ 16:00 UTC?
> >
>
> That time works for me!

Great! I sent you an invitation.

For others that want to join the discussion, we will meet Sep 27 @
16:00 UTC at this room: https://meet.google.com/fxi-vuzr-jjb

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
  2022-08-15 17:56 ` [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit Bobby Eshleman
  2022-08-16  2:31   ` Bobby Eshleman
@ 2022-09-26 13:17   ` Stefano Garzarella
  2022-09-26 21:52     ` Bobby Eshleman
  1 sibling, 1 reply; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-26 13:17 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm, virtualization,
	netdev, linux-kernel

On Mon, Aug 15, 2022 at 10:56:07AM -0700, Bobby Eshleman wrote:
>This commit adds a feature bit for virtio vsock to support datagrams.
>
>Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
>Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>---
> drivers/vhost/vsock.c             | 3 ++-
> include/uapi/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport.c  | 8 ++++++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index b20ddec2664b..a5d1bdb786fe 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -32,7 +32,8 @@
> enum {
> 	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> 			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
>-			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
>+			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
>+			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
> };
>
> enum {
>diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>index 64738838bee5..857df3a3a70d 100644
>--- a/include/uapi/linux/virtio_vsock.h
>+++ b/include/uapi/linux/virtio_vsock.h
>@@ -40,6 +40,7 @@
>
> /* The feature bitmap for virtio vsock */
> #define VIRTIO_VSOCK_F_SEQPACKET	1	/* SOCK_SEQPACKET supported */
>+#define VIRTIO_VSOCK_F_DGRAM		2	/* Host support dgram vsock */

We already allocated bit 2 for F_NO_IMPLIED_STREAM , so we should use 3:
https://github.com/oasis-tcs/virtio-spec/blob/26ed30ccb049fd51d6e20aad3de2807d678edb3a/virtio-vsock.tex#L22
(I'll send patches to implement F_STREAM and F_NO_IMPLIED_STREAM 
negotiation soon).

As long as it's RFC it's fine to introduce F_DGRAM, but we should first 
change virtio-spec before merging this series.

About the patch, we should only negotiate the new feature when we really 
have DGRAM support. So, it's better to move this patch after adding 
support for datagram.

Thanks,
Stefano

>
> struct virtio_vsock_config {
> 	__le64 guest_cid;
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index c6212eb38d3c..073314312683 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /* 
>forward declaration */
> struct virtio_vsock {
> 	struct virtio_device *vdev;
> 	struct virtqueue *vqs[VSOCK_VQ_MAX];
>+	bool has_dgram;
>
> 	/* Virtqueue processing is deferred to a workqueue */
> 	struct work_struct tx_work;
>@@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> 	}
>
> 	vsock->vdev = vdev;
>-
> 	vsock->rx_buf_nr = 0;
> 	vsock->rx_buf_max_nr = 0;
> 	atomic_set(&vsock->queued_replies, 0);
>@@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> 	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> 		vsock->seqpacket_allow = true;
>
>+	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
>+		vsock->has_dgram = true;
>+
> 	vdev->priv = vsock;
>
> 	ret = virtio_vsock_vqs_init(vsock);
>@@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
> };
>
> static unsigned int features[] = {
>-	VIRTIO_VSOCK_F_SEQPACKET
>+	VIRTIO_VSOCK_F_SEQPACKET,
>+	VIRTIO_VSOCK_F_DGRAM
> };
>
> static struct virtio_driver virtio_vsock_driver = {
>-- 
>2.35.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
                     ` (3 preceding siblings ...)
  2022-08-16  2:30   ` Bobby Eshleman
@ 2022-09-26 13:21   ` Stefano Garzarella
  2022-09-26 21:30     ` Bobby Eshleman
  4 siblings, 1 reply; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-26 13:21 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Mon, Aug 15, 2022 at 10:56:05AM -0700, Bobby Eshleman wrote:
>This commit allows vsock implementations to return errors
>to the socket layer other than -ENOMEM. One immediate effect
>of this is that upon the sk_sndbuf threshold being reached -EAGAIN
>will be returned and userspace may throttle appropriately.
>
>Resultingly, a known issue with uperf is resolved[1].
>
>Additionally, to preserve legacy behavior for non-virtio
>implementations, hyperv/vmci force errors to be -ENOMEM so that behavior
>is unchanged.
>
>[1]: https://gitlab.com/vsock/vsock/-/issues/1
>
>Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>---
> include/linux/virtio_vsock.h            | 3 +++
> net/vmw_vsock/af_vsock.c                | 3 ++-
> net/vmw_vsock/hyperv_transport.c        | 2 +-
> net/vmw_vsock/virtio_transport_common.c | 3 ---
> net/vmw_vsock/vmci_transport.c          | 9 ++++++++-
> 5 files changed, 14 insertions(+), 6 deletions(-)
>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index 17ed01466875..9a37eddbb87a 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -8,6 +8,9 @@
> #include <net/sock.h>
> #include <net/af_vsock.h>
>
>+/* Threshold for detecting small packets to copy */
>+#define GOOD_COPY_LEN  128
>+

This change seems unrelated.

Please move it in the patch where you need this.
Maybe it's better to add a prefix if we move it in an header file (e.g.  
VIRTIO_VSOCK_...).

Thanks,
Stefano

> enum virtio_vsock_metadata_flags {
> 	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
> 	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index e348b2d09eac..1893f8aafa48 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1844,8 +1844,9 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> 			written = transport->stream_enqueue(vsk,
> 					msg, len - total_written);
> 		}
>+
> 		if (written < 0) {
>-			err = -ENOMEM;
>+			err = written;
> 			goto out_err;
> 		}
>
>diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
>index fd98229e3db3..e99aea571f6f 100644
>--- a/net/vmw_vsock/hyperv_transport.c
>+++ b/net/vmw_vsock/hyperv_transport.c
>@@ -687,7 +687,7 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
> 	if (bytes_written)
> 		ret = bytes_written;
> 	kfree(send_buf);
>-	return ret;
>+	return ret < 0 ? -ENOMEM : ret;
> }
>
> static s64 hvs_stream_has_data(struct vsock_sock *vsk)
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 920578597bb9..d5780599fe93 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -23,9 +23,6 @@
> /* How long to wait for graceful shutdown of a connection */
> #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
>
>-/* Threshold for detecting small packets to copy */
>-#define GOOD_COPY_LEN  128
>-
> static const struct virtio_transport *
> virtio_transport_get_ops(struct vsock_sock *vsk)
> {
>diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>index b14f0ed7427b..c927a90dc859 100644
>--- a/net/vmw_vsock/vmci_transport.c
>+++ b/net/vmw_vsock/vmci_transport.c
>@@ -1838,7 +1838,14 @@ static ssize_t vmci_transport_stream_enqueue(
> 	struct msghdr *msg,
> 	size_t len)
> {
>-	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
>+	int err;
>+
>+	err = vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
>+
>+	if (err < 0)
>+		err = -ENOMEM;
>+
>+	return err;
> }
>
> static s64 vmci_transport_stream_has_data(struct vsock_sock *vsk)
>-- 
>2.35.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
                   ` (10 preceding siblings ...)
  2022-09-06 13:26 ` Stefan Hajnoczi
@ 2022-09-26 13:42 ` Stefano Garzarella
  2022-09-26 21:44   ` Bobby Eshleman
  2022-09-27 17:45   ` Stefano Garzarella
  11 siblings, 2 replies; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-26 13:42 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

Hi,

On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>Hey everybody,
>
>This series introduces datagrams, packet scheduling, and sk_buff usage
>to virtio vsock.

Just a reminder for those who are interested, tomorrow Sep 27 @ 16:00 
UTC we will discuss more about the next steps for this series in this 
room: https://meet.google.com/fxi-vuzr-jjb
(I'll try to record it and take notes that we will share)

Bobby, thank you so much for working on this! It would be great to solve 
the fairness issue and support datagram!

I took a look at the series, left some comments in the individual 
patches, and add some advice here that we could pick up tomorrow:
- it would be nice to run benchmarks (e.g., iperf-vsock, uperf, etc.) to
   see how much the changes cost (e.g. sk_buff use)
- we should take care also of other transports (i.e. vmci, hyperv), the 
   uAPI should be as close as possible regardless of the transport

About the use of netdev, it seems the most controversial point and I 
understand Jakub and Michael's concerns. Tomorrow would be great if you 
can update us if you have found any way to avoid it, just reusing a 
packet scheduler somehow.
It would be great if we could make it available for all transports (I'm 
not asking you to implement it for all, but to have a generic api that 
others can use).

But we can talk about that tomorrow!
Thanks,
Stefano

>
>The usage of struct sk_buff benefits users by a) preparing vsock to use
>other related systems that require sk_buff, such as sockmap and qdisc,
>b) supporting basic congestion control via sock_alloc_send_skb, and c)
>reducing copying when delivering packets to TAP.
>
>The socket layer no longer forces errors to be -ENOMEM, as typically
>userspace expects -EAGAIN when the sk_sndbuf threshold )s reached and
>messages are being sent with option MSG_DONTWAIT.
>
>The datagram work is based off previous patches by Jiang Wang[1].
>
>The introduction of datagrams creates a transport layer fairness issue
>where datagrams may freely starve streams of queue access. This happens
>because, unlike streams, datagrams lack the transactions necessary for
>calculating credits and throttling.
>
>Previous proposals introduce changes to the spec to add an additional
>virtqueue pair for datagrams[1]. Although this solution works, using
>Linux's qdisc for packet scheduling leverages already existing systems,
>avoids the need to change the virtio specification, and gives additional
>capabilities. The usage of SFQ or fq_codel, for example, may solve the
>transport layer starvation problem. It is easy to imagine other use
>cases as well. For example, services of varying importance may be
>assigned different priorities, and qdisc will apply appropriate
>priority-based scheduling. By default, the system default pfifo qdisc is
>used. The qdisc may be bypassed and legacy queuing is resumed by simply
>setting the virtio-vsock%d network device to state DOWN. This technique
>still allows vsock to work with zero-configuration.
>
>In summary, this series introduces these major changes to vsock:
>
>- virtio vsock supports datagrams
>- virtio vsock uses struct sk_buff instead of virtio_vsock_pkt
>  - Because virtio vsock uses sk_buff, it also uses sock_alloc_send_skb,
>    which applies the throttling threshold sk_sndbuf.
>- The vsock socket layer supports returning errors other than -ENOMEM.
>  - This is used to return -EAGAIN when the sk_sndbuf threshold is
>    reached.
>- virtio vsock uses a net_device, through which qdisc may be used.
> - qdisc allows scheduling policies to be applied to vsock flows.
>  - Some qdiscs, like SFQ, may allow vsock to avoid transport layer congestion. That is,
>    it may avoid datagrams from flooding out stream flows. The benefit
>    to this is that additional virtqueues are not needed for datagrams.
>  - The net_device and qdisc is bypassed by simply setting the
>    net_device state to DOWN.
>
>[1]: https://lore.kernel.org/all/20210914055440.3121004-1-jiang.wang@bytedance.com/
>
>Bobby Eshleman (5):
>  vsock: replace virtio_vsock_pkt with sk_buff
>  vsock: return errors other than -ENOMEM to socket
>  vsock: add netdev to vhost/virtio vsock
>  virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>  virtio/vsock: add support for dgram
>
>Jiang Wang (1):
>  vsock_test: add tests for vsock dgram
>
> drivers/vhost/vsock.c                   | 238 ++++----
> include/linux/virtio_vsock.h            |  73 ++-
> include/net/af_vsock.h                  |   2 +
> include/uapi/linux/virtio_vsock.h       |   2 +
> net/vmw_vsock/af_vsock.c                |  30 +-
> net/vmw_vsock/hyperv_transport.c        |   2 +-
> net/vmw_vsock/virtio_transport.c        | 237 +++++---
> net/vmw_vsock/virtio_transport_common.c | 771 ++++++++++++++++--------
> net/vmw_vsock/vmci_transport.c          |   9 +-
> net/vmw_vsock/vsock_loopback.c          |  51 +-
> tools/testing/vsock/util.c              | 105 ++++
> tools/testing/vsock/util.h              |   4 +
> tools/testing/vsock/vsock_test.c        | 195 ++++++
> 13 files changed, 1176 insertions(+), 543 deletions(-)
>
>-- 
>2.35.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/6] vsock: return errors other than -ENOMEM to socket
  2022-09-26 13:21   ` Stefano Garzarella
@ 2022-09-26 21:30     ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-09-26 21:30 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, Wei Liu, Cong Wang, Stephen Hemminger,
	Bobby Eshleman, Jiang Wang, Michael S. Tsirkin, Dexuan Cui,
	Haiyang Zhang, linux-kernel, virtualization, Eric Dumazet,
	netdev, Stefan Hajnoczi, kvm, Jakub Kicinski, Paolo Abeni,
	linux-hyperv, David S. Miller

On Mon, Sep 26, 2022 at 03:21:45PM +0200, Stefano Garzarella wrote:
> On Mon, Aug 15, 2022 at 10:56:05AM -0700, Bobby Eshleman wrote:
> > This commit allows vsock implementations to return errors
> > to the socket layer other than -ENOMEM. One immediate effect
> > of this is that upon the sk_sndbuf threshold being reached -EAGAIN
> > will be returned and userspace may throttle appropriately.
> > 
> > Resultingly, a known issue with uperf is resolved[1].
> > 
> > Additionally, to preserve legacy behavior for non-virtio
> > implementations, hyperv/vmci force errors to be -ENOMEM so that behavior
> > is unchanged.
> > 
> > [1]: https://gitlab.com/vsock/vsock/-/issues/1
> > 
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > include/linux/virtio_vsock.h            | 3 +++
> > net/vmw_vsock/af_vsock.c                | 3 ++-
> > net/vmw_vsock/hyperv_transport.c        | 2 +-
> > net/vmw_vsock/virtio_transport_common.c | 3 ---
> > net/vmw_vsock/vmci_transport.c          | 9 ++++++++-
> > 5 files changed, 14 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index 17ed01466875..9a37eddbb87a 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -8,6 +8,9 @@
> > #include <net/sock.h>
> > #include <net/af_vsock.h>
> > 
> > +/* Threshold for detecting small packets to copy */
> > +#define GOOD_COPY_LEN  128
> > +
> 
> This change seems unrelated.
> 
> Please move it in the patch where you need this.
> Maybe it's better to add a prefix if we move it in an header file (e.g.
> VIRTIO_VSOCK_...).
> 
> Thanks,
> Stefano
> 

Oh yes, definitely.

Thanks,
Bobby

> > enum virtio_vsock_metadata_flags {
> > 	VIRTIO_VSOCK_METADATA_FLAGS_REPLY		= BIT(0),
> > 	VIRTIO_VSOCK_METADATA_FLAGS_TAP_DELIVERED	= BIT(1),
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index e348b2d09eac..1893f8aafa48 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -1844,8 +1844,9 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> > 			written = transport->stream_enqueue(vsk,
> > 					msg, len - total_written);
> > 		}
> > +
> > 		if (written < 0) {
> > -			err = -ENOMEM;
> > +			err = written;
> > 			goto out_err;
> > 		}
> > 
> > diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> > index fd98229e3db3..e99aea571f6f 100644
> > --- a/net/vmw_vsock/hyperv_transport.c
> > +++ b/net/vmw_vsock/hyperv_transport.c
> > @@ -687,7 +687,7 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
> > 	if (bytes_written)
> > 		ret = bytes_written;
> > 	kfree(send_buf);
> > -	return ret;
> > +	return ret < 0 ? -ENOMEM : ret;
> > }
> > 
> > static s64 hvs_stream_has_data(struct vsock_sock *vsk)
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 920578597bb9..d5780599fe93 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -23,9 +23,6 @@
> > /* How long to wait for graceful shutdown of a connection */
> > #define VSOCK_CLOSE_TIMEOUT (8 * HZ)
> > 
> > -/* Threshold for detecting small packets to copy */
> > -#define GOOD_COPY_LEN  128
> > -
> > static const struct virtio_transport *
> > virtio_transport_get_ops(struct vsock_sock *vsk)
> > {
> > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> > index b14f0ed7427b..c927a90dc859 100644
> > --- a/net/vmw_vsock/vmci_transport.c
> > +++ b/net/vmw_vsock/vmci_transport.c
> > @@ -1838,7 +1838,14 @@ static ssize_t vmci_transport_stream_enqueue(
> > 	struct msghdr *msg,
> > 	size_t len)
> > {
> > -	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
> > +	int err;
> > +
> > +	err = vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg, len, 0);
> > +
> > +	if (err < 0)
> > +		err = -ENOMEM;
> > +
> > +	return err;
> > }
> > 
> > static s64 vmci_transport_stream_has_data(struct vsock_sock *vsk)
> > -- 
> > 2.35.1
> > 
> 
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-09-26 13:42 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Stefano Garzarella
@ 2022-09-26 21:44   ` Bobby Eshleman
  2022-09-27 17:45   ` Stefano Garzarella
  1 sibling, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-09-26 21:44 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, Wei Liu, Cong Wang, Stephen Hemminger,
	Bobby Eshleman, Jiang Wang, Michael S. Tsirkin, Dexuan Cui,
	Haiyang Zhang, linux-kernel, virtualization, Eric Dumazet,
	netdev, Stefan Hajnoczi, kvm, Jakub Kicinski, Paolo Abeni,
	linux-hyperv, David S. Miller

On Mon, Sep 26, 2022 at 03:42:19PM +0200, Stefano Garzarella wrote:
> Hi,
> 
> On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
> > Hey everybody,
> > 
> > This series introduces datagrams, packet scheduling, and sk_buff usage
> > to virtio vsock.
> 
> Just a reminder for those who are interested, tomorrow Sep 27 @ 16:00 UTC we
> will discuss more about the next steps for this series in this room:
> https://meet.google.com/fxi-vuzr-jjb
> (I'll try to record it and take notes that we will share)
> 
> Bobby, thank you so much for working on this! It would be great to solve the
> fairness issue and support datagram!
>

I appreciate that, thanks!

> I took a look at the series, left some comments in the individual patches,
> and add some advice here that we could pick up tomorrow:
> - it would be nice to run benchmarks (e.g., iperf-vsock, uperf, etc.) to
>   see how much the changes cost (e.g. sk_buff use)
> - we should take care also of other transports (i.e. vmci, hyperv), the
> uAPI should be as close as possible regardless of the transport
>

Duly noted. I have some measurements with uperf, I'll put the data
together and send that out here.

Regarding the uAPI topic, I'll save that topic for our conversation
tomorrow as I think the netdev topic will weigh on it.

> About the use of netdev, it seems the most controversial point and I
> understand Jakub and Michael's concerns. Tomorrow would be great if you can
> update us if you have found any way to avoid it, just reusing a packet
> scheduler somehow.
> It would be great if we could make it available for all transports (I'm not
> asking you to implement it for all, but to have a generic api that others
> can use).
>
> But we can talk about that tomorrow!

Sounds good, talk to you then!

Best,
Bobby


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
  2022-09-26 13:17   ` Stefano Garzarella
@ 2022-09-26 21:52     ` Bobby Eshleman
  0 siblings, 0 replies; 67+ messages in thread
From: Bobby Eshleman @ 2022-09-26 21:52 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bobby Eshleman, Cong Wang, Bobby Eshleman, Jiang Wang,
	Michael S. Tsirkin, netdev, linux-kernel, virtualization,
	Eric Dumazet, Stefan Hajnoczi, kvm, Jakub Kicinski, Paolo Abeni,
	David S. Miller

On Mon, Sep 26, 2022 at 03:17:51PM +0200, Stefano Garzarella wrote:
> On Mon, Aug 15, 2022 at 10:56:07AM -0700, Bobby Eshleman wrote:
> > This commit adds a feature bit for virtio vsock to support datagrams.
> > 
> > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> > ---
> > drivers/vhost/vsock.c             | 3 ++-
> > include/uapi/linux/virtio_vsock.h | 1 +
> > net/vmw_vsock/virtio_transport.c  | 8 ++++++--
> > 3 files changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > index b20ddec2664b..a5d1bdb786fe 100644
> > --- a/drivers/vhost/vsock.c
> > +++ b/drivers/vhost/vsock.c
> > @@ -32,7 +32,8 @@
> > enum {
> > 	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> > 			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> > -			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> > +			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> > +			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
> > };
> > 
> > enum {
> > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> > index 64738838bee5..857df3a3a70d 100644
> > --- a/include/uapi/linux/virtio_vsock.h
> > +++ b/include/uapi/linux/virtio_vsock.h
> > @@ -40,6 +40,7 @@
> > 
> > /* The feature bitmap for virtio vsock */
> > #define VIRTIO_VSOCK_F_SEQPACKET	1	/* SOCK_SEQPACKET supported */
> > +#define VIRTIO_VSOCK_F_DGRAM		2	/* Host support dgram vsock */
> 
> We already allocated bit 2 for F_NO_IMPLIED_STREAM , so we should use 3:
> https://github.com/oasis-tcs/virtio-spec/blob/26ed30ccb049fd51d6e20aad3de2807d678edb3a/virtio-vsock.tex#L22
> (I'll send patches to implement F_STREAM and F_NO_IMPLIED_STREAM negotiation
> soon).
> 
> As long as it's RFC it's fine to introduce F_DGRAM, but we should first
> change virtio-spec before merging this series.
> 
> About the patch, we should only negotiate the new feature when we really
> have DGRAM support. So, it's better to move this patch after adding support
> for datagram.

Roger that, I'll reorder that for v2 and also clarify the series by
prefixing it with RFC.

Before removing "RFC" from the series, I'll be sure to send out
virtio-spec patches first.

Thanks,
Bobby

> 
> Thanks,
> Stefano
> 
> > 
> > struct virtio_vsock_config {
> > 	__le64 guest_cid;
> > diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> > index c6212eb38d3c..073314312683 100644
> > --- a/net/vmw_vsock/virtio_transport.c
> > +++ b/net/vmw_vsock/virtio_transport.c
> > @@ -35,6 +35,7 @@ static struct virtio_transport virtio_transport; /*
> > forward declaration */
> > struct virtio_vsock {
> > 	struct virtio_device *vdev;
> > 	struct virtqueue *vqs[VSOCK_VQ_MAX];
> > +	bool has_dgram;
> > 
> > 	/* Virtqueue processing is deferred to a workqueue */
> > 	struct work_struct tx_work;
> > @@ -709,7 +710,6 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> > 	}
> > 
> > 	vsock->vdev = vdev;
> > -
> > 	vsock->rx_buf_nr = 0;
> > 	vsock->rx_buf_max_nr = 0;
> > 	atomic_set(&vsock->queued_replies, 0);
> > @@ -726,6 +726,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
> > 	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
> > 		vsock->seqpacket_allow = true;
> > 
> > +	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> > +		vsock->has_dgram = true;
> > +
> > 	vdev->priv = vsock;
> > 
> > 	ret = virtio_vsock_vqs_init(vsock);
> > @@ -820,7 +823,8 @@ static struct virtio_device_id id_table[] = {
> > };
> > 
> > static unsigned int features[] = {
> > -	VIRTIO_VSOCK_F_SEQPACKET
> > +	VIRTIO_VSOCK_F_SEQPACKET,
> > +	VIRTIO_VSOCK_F_DGRAM
> > };
> > 
> > static struct virtio_driver virtio_vsock_driver = {
> > -- 
> > 2.35.1
> > 
> 
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc
  2022-09-26 13:42 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Stefano Garzarella
  2022-09-26 21:44   ` Bobby Eshleman
@ 2022-09-27 17:45   ` Stefano Garzarella
  1 sibling, 0 replies; 67+ messages in thread
From: Stefano Garzarella @ 2022-09-27 17:45 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Bobby Eshleman, Cong Wang, Jiang Wang,
	Stefan Hajnoczi, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Wei Liu, Dexuan Cui, kvm,
	virtualization, netdev, linux-kernel, linux-hyperv

On Mon, Sep 26, 2022 at 03:42:19PM +0200, Stefano Garzarella wrote:
>Hi,
>
>On Mon, Aug 15, 2022 at 10:56:03AM -0700, Bobby Eshleman wrote:
>>Hey everybody,
>>
>>This series introduces datagrams, packet scheduling, and sk_buff usage
>>to virtio vsock.
>
>Just a reminder for those who are interested, tomorrow Sep 27 @ 16:00 
>UTC we will discuss more about the next steps for this series in this 
>room: https://meet.google.com/fxi-vuzr-jjb
>(I'll try to record it and take notes that we will share)
>

Thank you all for participating in the call!
I'm attaching video/audio recording and notes (feel free to update it).

Notes: 
https://docs.google.com/document/d/14UHH0tEaBKfElLZjNkyKUs_HnOgHhZZBqIS86VEIqR0/edit?usp=sharing
Video recording: 
https://drive.google.com/file/d/1vUvTc_aiE1mB30tLPeJjANnb915-CIKa/view?usp=sharing

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2022-09-27 17:49 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-15 17:56 [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Bobby Eshleman
2022-08-15 17:56 ` [PATCH 1/6] vsock: replace virtio_vsock_pkt with sk_buff Bobby Eshleman
2022-08-16  2:30   ` Bobby Eshleman
2022-08-15 17:56 ` [PATCH 2/6] vsock: return errors other than -ENOMEM to socket Bobby Eshleman
2022-08-15 20:01   ` kernel test robot
2022-08-15 23:13   ` kernel test robot
2022-08-16  2:16   ` kernel test robot
2022-08-16  2:30   ` Bobby Eshleman
2022-08-17  5:28     ` [virtio-dev] " Arseniy Krasnov
2022-09-26 13:21   ` Stefano Garzarella
2022-09-26 21:30     ` Bobby Eshleman
2022-08-15 17:56 ` [PATCH 3/6] vsock: add netdev to vhost/virtio vsock Bobby Eshleman
2022-08-16  2:31   ` Bobby Eshleman
2022-08-16 16:38   ` Michael S. Tsirkin
2022-08-16  6:18     ` Bobby Eshleman
2022-08-16 18:07     ` Jakub Kicinski
2022-08-16  7:02       ` Bobby Eshleman
2022-08-16 23:07         ` Jakub Kicinski
2022-08-16  8:29           ` Bobby Eshleman
2022-08-17  1:15             ` Jakub Kicinski
2022-08-16 10:50               ` Bobby Eshleman
2022-08-17 17:20                 ` Michael S. Tsirkin
2022-08-18  4:34                   ` Jason Wang
2022-08-17  1:23           ` [External] " Cong Wang .
2022-09-06 10:58   ` Michael S. Tsirkin
2022-08-18 14:20     ` Bobby Eshleman
2022-08-15 17:56 ` [PATCH 4/6] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit Bobby Eshleman
2022-08-16  2:31   ` Bobby Eshleman
2022-09-26 13:17   ` Stefano Garzarella
2022-09-26 21:52     ` Bobby Eshleman
2022-08-15 17:56 ` [PATCH 5/6] virtio/vsock: add support for dgram Bobby Eshleman
2022-08-15 21:02   ` kernel test robot
2022-08-16  2:32   ` Bobby Eshleman
2022-08-17  5:01     ` [virtio-dev] " Arseniy Krasnov
2022-08-16  9:57       ` Bobby Eshleman
2022-08-18  8:24         ` Arseniy Krasnov
2022-08-17  5:42       ` Arseniy Krasnov
2022-08-16  9:58         ` Bobby Eshleman
2022-08-18  8:35           ` Arseniy Krasnov
2022-08-16 20:52             ` Bobby Eshleman
2022-08-19  4:30               ` Arseniy Krasnov
2022-08-15 17:56 ` [PATCH 6/6] vsock_test: add tests for vsock dgram Bobby Eshleman
2022-08-16  2:32   ` Bobby Eshleman
2022-08-15 20:39 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Michael S. Tsirkin
2022-08-16  1:55   ` Bobby Eshleman
2022-08-16  2:29 ` Bobby Eshleman
     [not found] ` <CAGxU2F7+L-UiNPtUm4EukOgTVJ1J6Orqs1LMvhWWvfL9zWb23g@mail.gmail.com>
2022-08-16  2:35   ` Bobby Eshleman
2022-08-17  6:54 ` Michael S. Tsirkin
2022-08-16  9:42   ` Bobby Eshleman
2022-08-17 17:02     ` Michael S. Tsirkin
2022-08-16 11:08       ` Bobby Eshleman
2022-08-17 17:53         ` Michael S. Tsirkin
2022-08-16 12:10           ` Bobby Eshleman
2022-08-18  4:28   ` Jason Wang
2022-09-06  9:00     ` Stefano Garzarella
2022-09-06 13:26 ` Stefan Hajnoczi
2022-08-18 14:39   ` Bobby Eshleman
2022-09-08  8:30     ` Stefano Garzarella
2022-09-08 14:36     ` Call to discuss vsock netdev/sk_buff [was Re: [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc] Stefano Garzarella
2022-09-09 18:13       ` Bobby Eshleman
2022-09-12 18:12         ` Stefano Garzarella
2022-09-09 23:33           ` Bobby Eshleman
2022-09-16  3:51             ` Stefano Garzarella
2022-09-10 16:29               ` Bobby Eshleman
2022-09-26 13:42 ` [PATCH 0/6] virtio/vsock: introduce dgrams, sk_buff, and qdisc Stefano Garzarella
2022-09-26 21:44   ` Bobby Eshleman
2022-09-27 17:45   ` Stefano Garzarella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).