netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
@ 2012-05-02  3:41 Jason Wang
  2012-05-02  3:41 ` [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
                   ` (9 more replies)
  0 siblings, 10 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:41 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

This is an updated since the last series of vhost/macvtap zerocopy fixes which
fixes the the possible transmission stall, host kernel stack overflow and other
misc fixes.

Changes from V1:
- Addressing comments from Eric and Michael.
- Adding more fixes into the seires.

---

Jason Wang (9):
      macvtap: zerocopy: fix offset calculation when building skb
      macvtap: zerocopy: fix truesize underestimation
      macvtap: zerocopy: put page when fail to get all requested user pages
      macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
      macvtap: zerocopy: validate vectors before building skb
      vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
      vhost_net: re-poll only on EAGAIN or ENOBUFS
      vhost_net: zerocopy: adding and signalling immediately when fully copied
      vhost: zerocopy: poll vq in zerocopy callback


 drivers/net/macvtap.c |   57 ++++++++++++++++++++++++++++++++++---------------
 drivers/vhost/net.c   |    7 ++++--
 drivers/vhost/vhost.c |    1 +
 3 files changed, 46 insertions(+), 19 deletions(-)

-- 
Jason Wang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
@ 2012-05-02  3:41 ` Jason Wang
  2012-05-15 17:17   ` Shirley Ma
  2012-05-02  3:41 ` [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation Jason Wang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:41 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

This patch fixes the offset calculation when building skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 0427c65..bd4a70d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -505,10 +505,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		if (copy > size) {
 			++from;
 			--count;
-		}
+			offset = 0;
+		} else
+			offset += size;
 		copy -= size;
 		offset1 += size;
-		offset = 0;
 	}
 
 	if (len == offset1)
@@ -519,13 +520,13 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		int num_pages;
 		unsigned long base;
 
-		len = from->iov_len - offset1;
+		len = from->iov_len - offset;
 		if (!len) {
-			offset1 = 0;
+			offset = 0;
 			++from;
 			continue;
 		}
-		base = (unsigned long)from->iov_base + offset1;
+		base = (unsigned long)from->iov_base + offset;
 		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
 		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
 		if ((num_pages != size) ||
@@ -546,7 +547,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 			len -= size;
 			i++;
 		}
-		offset1 = 0;
+		offset = 0;
 		++from;
 	}
 	return 0;

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
  2012-05-02  3:41 ` [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
@ 2012-05-02  3:41 ` Jason Wang
  2012-05-15 17:26   ` Shirley Ma
  2012-05-02  3:41 ` [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages Jason Wang
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:41 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index bd4a70d..7cb2684 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		struct page *page[MAX_SKB_FRAGS];
 		int num_pages;
 		unsigned long base;
+		unsigned long truesize;
 
 		len = from->iov_len - offset;
 		if (!len) {
@@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
 			/* put_page is in skb free */
 			return -EFAULT;
+		truesize = size * PAGE_SIZE;
 		skb->data_len += len;
 		skb->len += len;
-		skb->truesize += len;
-		atomic_add(len, &skb->sk->sk_wmem_alloc);
+		skb->truesize += truesize;
+		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
 		while (len) {
 			int off = base & ~PAGE_MASK;
 			int size = min_t(int, len, PAGE_SIZE - off);

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
  2012-05-02  3:41 ` [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
  2012-05-02  3:41 ` [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation Jason Wang
@ 2012-05-02  3:41 ` Jason Wang
  2012-05-15 17:33   ` Shirley Ma
  2012-05-02  3:42 ` [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:41 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

When get_user_pages_fast() fails to get all requested pages, we could not use
kfree_skb() to free it as it has not been put in the skb fragments. So we need
to call put_page() instead.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7cb2684..9ab182a 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -531,9 +531,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
 		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
 		if ((num_pages != size) ||
-		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
-			/* put_page is in skb free */
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags)) {
+			for (i = 0; i < num_pages; i++)
+				put_page(page[i]);
 			return -EFAULT;
+		}
 		truesize = size * PAGE_SIZE;
 		skb->data_len += len;
 		skb->len += len;

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (2 preceding siblings ...)
  2012-05-02  3:41 ` [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages Jason Wang
@ 2012-05-02  3:42 ` Jason Wang
  2012-05-15 17:44   ` Shirley Ma
  2012-05-02  3:42 ` [V2 PATCH 5/9] macvtap: zerocopy: validate vectors before building skb Jason Wang
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:42 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 9ab182a..a4ff694 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -699,10 +699,9 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 	if (!skb)
 		goto err;
 
-	if (zerocopy) {
+	if (zerocopy)
 		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
-		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
-	} else
+	else
 		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
 						   len);
 	if (err)
@@ -721,8 +720,10 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 	rcu_read_lock_bh();
 	vlan = rcu_dereference_bh(q->vlan);
 	/* copy skb_ubuf_info for callback when skb has no error */
-	if (zerocopy)
+	if (zerocopy) {
 		skb_shinfo(skb)->destructor_arg = m->msg_control;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	}
 	if (vlan)
 		macvlan_start_xmit(skb, vlan->dev);
 	else

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 5/9] macvtap: zerocopy: validate vectors before building skb
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (3 preceding siblings ...)
  2012-05-02  3:42 ` [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
@ 2012-05-02  3:42 ` Jason Wang
  2012-05-02  3:42 ` [V2 PATCH 6/9] vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs Jason Wang
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:42 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

There're several reasons that the vectors need to be validated:

- Return error when caller provides vectors whose num is greater than UIO_MAXIOV.
- Linearize part of skb when userspace provides vectors grater than MAX_SKB_FRAGS.
- Return error when userspace provides vectors whose total length may exceed
- MAX_SKB_FRAGS * PAGE_SIZE.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |   25 +++++++++++++++++++++----
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a4ff694..163559c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -529,9 +529,10 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		}
 		base = (unsigned long)from->iov_base + offset;
 		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (i + size > MAX_SKB_FRAGS)
+			return -EMSGSIZE;
 		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
-		if ((num_pages != size) ||
-		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags)) {
+		if (num_pages != size) {
 			for (i = 0; i < num_pages; i++)
 				put_page(page[i]);
 			return -EFAULT;
@@ -651,7 +652,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 	int err;
 	struct virtio_net_hdr vnet_hdr = { 0 };
 	int vnet_hdr_len = 0;
-	int copylen;
+	int copylen = 0;
 	bool zerocopy = false;
 
 	if (q->flags & IFF_VNET_HDR) {
@@ -680,15 +681,31 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 	if (unlikely(len < ETH_HLEN))
 		goto err;
 
+	err = -EMSGSIZE;
+	if (unlikely(count > UIO_MAXIOV))
+		goto err;
+
 	if (m && m->msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY))
 		zerocopy = true;
 
 	if (zerocopy) {
+		/* Userspace may produce vectors with count greater than
+		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
+		 * to let the rest of data to be fit in the frags.
+		 */
+		if (count > MAX_SKB_FRAGS) {
+			copylen = iov_length(iv, count - MAX_SKB_FRAGS);
+			if (copylen < vnet_hdr_len)
+				copylen = 0;
+			else
+				copylen -= vnet_hdr_len;
+		}
 		/* There are 256 bytes to be copied in skb, so there is enough
 		 * room for skb expand head in case it is used.
 		 * The rest buffer is mapped from userspace.
 		 */
-		copylen = vnet_hdr.hdr_len;
+		if (copylen < vnet_hdr.hdr_len)
+			copylen = vnet_hdr.hdr_len;
 		if (!copylen)
 			copylen = GOODCOPY_LEN;
 	} else

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 6/9] vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (4 preceding siblings ...)
  2012-05-02  3:42 ` [V2 PATCH 5/9] macvtap: zerocopy: validate vectors before building skb Jason Wang
@ 2012-05-02  3:42 ` Jason Wang
  2012-05-02  3:42 ` [V2 PATCH 7/9] vhost_net: re-poll only on EAGAIN or ENOBUFS Jason Wang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:42 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.

As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().
---
 drivers/vhost/net.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0da2c3..ffdc0d8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -166,7 +166,7 @@ static void handle_tx(struct vhost_net *net)
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
 	hdr_size = vq->vhost_hlen;
-	zcopy = vhost_sock_zcopy(sock);
+	zcopy = vq->ubufs;
 
 	for (;;) {
 		/* Release DMAs done buffers first */

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 7/9] vhost_net: re-poll only on EAGAIN or ENOBUFS
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (5 preceding siblings ...)
  2012-05-02  3:42 ` [V2 PATCH 6/9] vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs Jason Wang
@ 2012-05-02  3:42 ` Jason Wang
  2012-05-02  3:42 ` [V2 PATCH 8/9] vhost_net: zerocopy: adding and signalling immediately when fully copied Jason Wang
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:42 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.

The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ffdc0d8..62828aa 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -257,7 +257,8 @@ static void handle_tx(struct vhost_net *net)
 					UIO_MAXIOV;
 			}
 			vhost_discard_vq_desc(vq, 1);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN || err == -ENOBUFS)
+				tx_poll_start(net, sock);
 			break;
 		}
 		if (err != len)

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 8/9] vhost_net: zerocopy: adding and signalling immediately when fully copied
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (6 preceding siblings ...)
  2012-05-02  3:42 ` [V2 PATCH 7/9] vhost_net: re-poll only on EAGAIN or ENOBUFS Jason Wang
@ 2012-05-02  3:42 ` Jason Wang
  2012-05-02  3:42 ` [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback Jason Wang
  2012-05-02  5:50 ` [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Michael S. Tsirkin
  9 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:42 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 62828aa..1dc2aeb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -266,6 +266,8 @@ static void handle_tx(struct vhost_net *net)
 				 " len %d != %zd\n", err, len);
 		if (!zcopy)
 			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		else
+			vhost_zerocopy_signal_used(vq);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (7 preceding siblings ...)
  2012-05-02  3:42 ` [V2 PATCH 8/9] vhost_net: zerocopy: adding and signalling immediately when fully copied Jason Wang
@ 2012-05-02  3:42 ` Jason Wang
  2012-05-15 16:50   ` Shirley Ma
  2012-05-02  5:50 ` [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Michael S. Tsirkin
  9 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-02  3:42 UTC (permalink / raw)
  To: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

We add used and signal guest in worker thread but did not poll the virtqueue
during the zero copy callback. This may lead the missing of adding and
signalling during zerocopy. Solve this by polling the virtqueue and let it
wakeup the worker during callback.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 947f00d..7b75fdf 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
 	struct vhost_ubuf_ref *ubufs = ubuf->arg;
 	struct vhost_virtqueue *vq = ubufs->vq;
 
+	vhost_poll_queue(&vq->poll);
 	/* set len = 1 to mark this desc buffers done DMA */
 	vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
 	kref_put(&ubufs->kref, vhost_zerocopy_done_signal);

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
  2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
                   ` (8 preceding siblings ...)
  2012-05-02  3:42 ` [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback Jason Wang
@ 2012-05-02  5:50 ` Michael S. Tsirkin
  2012-05-02  6:44   ` David Miller
  9 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2012-05-02  5:50 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Wed, May 02, 2012 at 11:41:21AM +0800, Jason Wang wrote:
> This is an updated since the last series of vhost/macvtap zerocopy fixes which
> fixes the the possible transmission stall, host kernel stack overflow and other
> misc fixes.
> 
> Changes from V1:
> - Addressing comments from Eric and Michael.
> - Adding more fixes into the seires.

Thanks for fixing this.
Acked-by: Michael S. Tsirkin <mst@redhat.com>

Dave, can you merge this for 3.4 please?
Thanks!

> ---
> 
> Jason Wang (9):
>       macvtap: zerocopy: fix offset calculation when building skb
>       macvtap: zerocopy: fix truesize underestimation
>       macvtap: zerocopy: put page when fail to get all requested user pages
>       macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
>       macvtap: zerocopy: validate vectors before building skb
>       vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
>       vhost_net: re-poll only on EAGAIN or ENOBUFS
>       vhost_net: zerocopy: adding and signalling immediately when fully copied
>       vhost: zerocopy: poll vq in zerocopy callback
> 
> 
>  drivers/net/macvtap.c |   57 ++++++++++++++++++++++++++++++++++---------------
>  drivers/vhost/net.c   |    7 ++++--
>  drivers/vhost/vhost.c |    1 +
>  3 files changed, 46 insertions(+), 19 deletions(-)
> 
> -- 
> Jason Wang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
  2012-05-02  5:50 ` [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Michael S. Tsirkin
@ 2012-05-02  6:44   ` David Miller
  2012-05-02  8:11     ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: David Miller @ 2012-05-02  6:44 UTC (permalink / raw)
  To: mst; +Cc: jasowang, eric.dumazet, netdev, linux-kernel, ebiederm

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 2 May 2012 08:50:09 +0300

> On Wed, May 02, 2012 at 11:41:21AM +0800, Jason Wang wrote:
>> This is an updated since the last series of vhost/macvtap zerocopy fixes which
>> fixes the the possible transmission stall, host kernel stack overflow and other
>> misc fixes.
>> 
>> Changes from V1:
>> - Addressing comments from Eric and Michael.
>> - Adding more fixes into the seires.
> 
> Thanks for fixing this.
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Dave, can you merge this for 3.4 please?

It's rather late in the -RC for such a large patch set.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
  2012-05-02  6:44   ` David Miller
@ 2012-05-02  8:11     ` Michael S. Tsirkin
  2012-05-02 19:40       ` Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2012-05-02  8:11 UTC (permalink / raw)
  To: David Miller; +Cc: jasowang, eric.dumazet, netdev, linux-kernel, ebiederm

On Wed, May 02, 2012 at 02:44:27AM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Wed, 2 May 2012 08:50:09 +0300
> 
> > On Wed, May 02, 2012 at 11:41:21AM +0800, Jason Wang wrote:
> >> This is an updated since the last series of vhost/macvtap zerocopy fixes which
> >> fixes the the possible transmission stall, host kernel stack overflow and other
> >> misc fixes.
> >> 
> >> Changes from V1:
> >> - Addressing comments from Eric and Michael.
> >> - Adding more fixes into the seires.
> > 
> > Thanks for fixing this.
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > Dave, can you merge this for 3.4 please?
> 
> It's rather late in the -RC for such a large patch set.

I was in doubt that's why I asked instead of just merging through my tree.
OK, I'll apply to my tree and send pull request for net-next a bit later.
-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
  2012-05-02  8:11     ` Michael S. Tsirkin
@ 2012-05-02 19:40       ` Eric W. Biederman
  2012-05-02 21:31         ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2012-05-02 19:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, jasowang, eric.dumazet, netdev, linux-kernel


Side question.  Are you aware that macvtap/vhost net is broken
in the presence of  vlan accelleration and that vlan accelleration
is not optional if you are using vlan headers?

Eric

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
  2012-05-02 19:40       ` Eric W. Biederman
@ 2012-05-02 21:31         ` Michael S. Tsirkin
  2012-05-02 21:54           ` Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2012-05-02 21:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, jasowang, eric.dumazet, netdev, linux-kernel

On Wed, May 02, 2012 at 12:40:55PM -0700, Eric W. Biederman wrote:
> 
> Side question.  Are you aware that macvtap/vhost net is broken
> in the presence of  vlan accelleration and that vlan accelleration
> is not optional if you are using vlan headers?
> 
> Eric

I didn't know. Could you explain please?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
  2012-05-02 21:31         ` Michael S. Tsirkin
@ 2012-05-02 21:54           ` Eric W. Biederman
  0 siblings, 0 replies; 47+ messages in thread
From: Eric W. Biederman @ 2012-05-02 21:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, jasowang, eric.dumazet, netdev, linux-kernel, Basil Gor

"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Wed, May 02, 2012 at 12:40:55PM -0700, Eric W. Biederman wrote:
>> 
>> Side question.  Are you aware that macvtap/vhost net is broken
>> in the presence of  vlan accelleration and that vlan accelleration
>> is not optional if you are using vlan headers?
>> 
>> Eric
>
> I didn't know. Could you explain please?

It is worth looking at the netdev history for some recent patches by
Basil Gor, as he has been hit by this issue and has been trying to come
up with a clean fix.  I was burned by this issue in other parts of the
networking stack and so have been doing some basic review.

The short version is that on any normal path through the networking
stack we implement vlan header accelleration in hardware or emulation
in software.  The result is that the ethernet vlan header is not
on the packet and is instead in the skb->tci field.

When coming on out the pf_packet sockets we don't include the vlan
header in the packet but instead the vlan header is put in aux data.

The result of all of this is that vhost/net.c looses the vlan
header when coming from pf_packet socket.

Furthermore macvtap_recvmsg does not act like pf_packet sockets
when there is a vlan header persent so weirdness ensues.  Especially
when the packet is coming from a path where the vlan header has
been stripped and placed in skb->tci.

Eric

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-02  3:42 ` [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback Jason Wang
@ 2012-05-15 16:50   ` Shirley Ma
  2012-05-16  2:58     ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-15 16:50 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
> We add used and signal guest in worker thread but did not poll the
> virtqueue
> during the zero copy callback. This may lead the missing of adding and
> signalling during zerocopy. Solve this by polling the virtqueue and
> let it
> wakeup the worker during callback.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 947f00d..7b75fdf 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
>         struct vhost_ubuf_ref *ubufs = ubuf->arg;
>         struct vhost_virtqueue *vq = ubufs->vq;
> 
> +       vhost_poll_queue(&vq->poll);
>         /* set len = 1 to mark this desc buffers done DMA */
>         vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>         kref_put(&ubufs->kref, vhost_zerocopy_done_signal);

Doing so, we might have redundant vhost_poll_queue(). Do you know in
which scenario there might be missing of adding and signaling during
zerocopy?

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb
  2012-05-02  3:41 ` [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
@ 2012-05-15 17:17   ` Shirley Ma
  0 siblings, 0 replies; 47+ messages in thread
From: Shirley Ma @ 2012-05-15 17:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> This patch fixes the offset calculation when building skb:
> 
> - offset1 were used as skb data offset not vector offset
> - reset offset to zero only when we advance to next vector

I tested the original code in all scenario, it worked well.

However this patch makes the code more clear.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-02  3:41 ` [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation Jason Wang
@ 2012-05-15 17:26   ` Shirley Ma
  2012-05-16  3:04     ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-15 17:26 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> As the skb fragment were pinned/built from user pages, we should
> account the page instead of length for truesize.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/macvtap.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index bd4a70d..7cb2684 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff
> *skb, const struct iovec *from,
>                 struct page *page[MAX_SKB_FRAGS];
>                 int num_pages;
>                 unsigned long base;
> +               unsigned long truesize;
> 
>                 len = from->iov_len - offset;
>                 if (!len) {
> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff
> *skb, const struct iovec *from,
>                     (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags))
>                         /* put_page is in skb free */
>                         return -EFAULT;
> +               truesize = size * PAGE_SIZE;

Here should be truesize = size * PAGE_SIZE - offset, right?

>                 skb->data_len += len;
>                 skb->len += len;
> -               skb->truesize += len;
> -               atomic_add(len, &skb->sk->sk_wmem_alloc);
> +               skb->truesize += truesize;
> +               atomic_add(truesize, &skb->sk->sk_wmem_alloc);
>                 while (len) {
>                         int off = base & ~PAGE_MASK;
>                         int size = min_t(int, len, PAGE_SIZE - off);
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages
  2012-05-02  3:41 ` [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages Jason Wang
@ 2012-05-15 17:33   ` Shirley Ma
  0 siblings, 0 replies; 47+ messages in thread
From: Shirley Ma @ 2012-05-15 17:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> When get_user_pages_fast() fails to get all requested pages, we could
> not use
> kfree_skb() to free it as it has not been put in the skb fragments. So
> we need
> to call put_page() instead.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/macvtap.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 7cb2684..9ab182a 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -531,9 +531,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff
> *skb, const struct iovec *from,
>                 size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >>
> PAGE_SHIFT;
>                 num_pages = get_user_pages_fast(base, size, 0,
> &page[i]);
>                 if ((num_pages != size) ||
> -                   (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags))
> -                       /* put_page is in skb free */
> +                   (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags)) {
> +                       for (i = 0; i < num_pages; i++)
> +                               put_page(page[i]);
>                         return -EFAULT;
> +               }
>                 truesize = size * PAGE_SIZE;
>                 skb->data_len += len;
>                 skb->len += len; 

Good catch. I don't know why I thought put_page would be called in
skb_free for these pages which hadn't been added to skb frags before. :(

thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
  2012-05-02  3:42 ` [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
@ 2012-05-15 17:44   ` Shirley Ma
  2012-05-16  3:17     ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-15 17:44 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
> Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
> zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
> fails to build zerocopy skb because destructor_arg was not
> initialized. Solve this by set this flag after the skb were built
> successfully.

I thought this flag was needed for zerocopy skb free even in err case.
I've checked it back again, it's OK to move the flag after the skb
successfully built. Or we can fix it by modify skb free with
destructor_arg NULL check as below:
...
skb_release_data() {
...
                if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
                        struct ubuf_info *uarg;

                        uarg = skb_shinfo(skb)->destructor_arg;
                        if (uarg && uarg->callback)
                                uarg->callback(uarg);
                }

...
}
Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-15 16:50   ` Shirley Ma
@ 2012-05-16  2:58     ` Jason Wang
  2012-05-16 15:10       ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-16  2:58 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On 05/16/2012 12:50 AM, Shirley Ma wrote:
> On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
>> We add used and signal guest in worker thread but did not poll the
>> virtqueue
>> during the zero copy callback. This may lead the missing of adding and
>> signalling during zerocopy. Solve this by polling the virtqueue and
>> let it
>> wakeup the worker during callback.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/vhost/vhost.c |    1 +
>>   1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 947f00d..7b75fdf 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
>>          struct vhost_ubuf_ref *ubufs = ubuf->arg;
>>          struct vhost_virtqueue *vq = ubufs->vq;
>>
>> +       vhost_poll_queue(&vq->poll);
>>          /* set len = 1 to mark this desc buffers done DMA */
>>          vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>>          kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> Doing so, we might have redundant vhost_poll_queue(). Do you know in
> which scenario there might be missing of adding and signaling during
> zerocopy?

Yes, as we only do signaling and adding during tx work, if there's no tx 
work when the skb were sent, we may lose the opportunity to let guest 
know about the completion. It's easy to be reproduced with netperf test.

Thanks
> Thanks
> Shirley
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-15 17:26   ` Shirley Ma
@ 2012-05-16  3:04     ` Jason Wang
  2012-05-16 15:03       ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-16  3:04 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On 05/16/2012 01:26 AM, Shirley Ma wrote:
> On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
>> As the skb fragment were pinned/built from user pages, we should
>> account the page instead of length for truesize.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/net/macvtap.c |    6 ++++--
>>   1 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> index bd4a70d..7cb2684 100644
>> --- a/drivers/net/macvtap.c
>> +++ b/drivers/net/macvtap.c
>> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff
>> *skb, const struct iovec *from,
>>                  struct page *page[MAX_SKB_FRAGS];
>>                  int num_pages;
>>                  unsigned long base;
>> +               unsigned long truesize;
>>
>>                  len = from->iov_len - offset;
>>                  if (!len) {
>> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff
>> *skb, const struct iovec *from,
>>                      (num_pages>  MAX_SKB_FRAGS -
>> skb_shinfo(skb)->nr_frags))
>>                          /* put_page is in skb free */
>>                          return -EFAULT;
>> +               truesize = size * PAGE_SIZE;
> Here should be truesize = size * PAGE_SIZE - offset, right?
>

We get the whole user page, so need to account them all. Also this is 
aligned with skb_copy_ubufs().
>>                  skb->data_len += len;
>>                  skb->len += len;
>> -               skb->truesize += len;
>> -               atomic_add(len,&skb->sk->sk_wmem_alloc);
>> +               skb->truesize += truesize;
>> +               atomic_add(truesize,&skb->sk->sk_wmem_alloc);
>>                  while (len) {
>>                          int off = base&  ~PAGE_MASK;
>>                          int size = min_t(int, len, PAGE_SIZE - off);
>>
>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
  2012-05-15 17:44   ` Shirley Ma
@ 2012-05-16  3:17     ` Jason Wang
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-16  3:17 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On 05/16/2012 01:44 AM, Shirley Ma wrote:
> On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
>> Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
>> zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
>> fails to build zerocopy skb because destructor_arg was not
>> initialized. Solve this by set this flag after the skb were built
>> successfully.
> I thought this flag was needed for zerocopy skb free even in err case.
> I've checked it back again, it's OK to move the flag after the skb
> successfully built. Or we can fix it by modify skb free with
> destructor_arg NULL check as below:
> ...
> skb_release_data() {
> ...
>                  if (skb_shinfo(skb)->tx_flags&  SKBTX_DEV_ZEROCOPY) {
>                          struct ubuf_info *uarg;
>
>                          uarg = skb_shinfo(skb)->destructor_arg;
>                          if (uarg&&  uarg->callback)
>                                  uarg->callback(uarg);
>                  }
>
> ...
> }
> Thanks
> Shirley
>
Yes, both are ok. Since the code were merged, let's just use current method.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-16  3:04     ` Jason Wang
@ 2012-05-16 15:03       ` Shirley Ma
  2012-05-17  2:59         ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-16 15:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-16 at 11:04 +0800, Jason Wang wrote:
> >> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> >> index bd4a70d..7cb2684 100644
> >> --- a/drivers/net/macvtap.c
> >> +++ b/drivers/net/macvtap.c
> >> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct
> sk_buff
> >> *skb, const struct iovec *from,
> >>                  struct page *page[MAX_SKB_FRAGS];
> >>                  int num_pages;
> >>                  unsigned long base;
> >> +               unsigned long truesize;
> >>
> >>                  len = from->iov_len - offset;
> >>                  if (!len) {
> >> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct
> sk_buff
> >> *skb, const struct iovec *from,
> >>                      (num_pages>  MAX_SKB_FRAGS -
> >> skb_shinfo(skb)->nr_frags))
> >>                          /* put_page is in skb free */
> >>                          return -EFAULT;
> >> +               truesize = size * PAGE_SIZE;
> > Here should be truesize = size * PAGE_SIZE - offset, right?
> >
> 
> We get the whole user page, so need to account them all. Also this is 
> aligned with skb_copy_ubufs(). 

Then this would double count the size of "first" offset left from
previous copy, both skb->len and truesize.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16  2:58     ` Jason Wang
@ 2012-05-16 15:10       ` Shirley Ma
  2012-05-16 15:14         ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-16 15:10 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> >>   drivers/vhost/vhost.c |    1 +
> >>   1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index 947f00d..7b75fdf 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> >>          struct vhost_ubuf_ref *ubufs = ubuf->arg;
> >>          struct vhost_virtqueue *vq = ubufs->vq;
> >>
> >> +       vhost_poll_queue(&vq->poll);
> >>          /* set len = 1 to mark this desc buffers done DMA */
> >>          vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> >>          kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > Doing so, we might have redundant vhost_poll_queue(). Do you know in
> > which scenario there might be missing of adding and signaling during
> > zerocopy?
> 
> Yes, as we only do signaling and adding during tx work, if there's no
> tx 
> work when the skb were sent, we may lose the opportunity to let guest 
> know about the completion. It's easy to be reproduced with netperf
> test. 

The reason which host signals guest is to free guest tx buffers, if
there is no tx work, then it's not necessary to signal the guest unless
guest runs out of memory. The pending buffers will be released
virtio_net device gone.

What's the behavior of netperf test when you hit this situation?

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16 15:10       ` Shirley Ma
@ 2012-05-16 15:14         ` Michael S. Tsirkin
  2012-05-16 17:32           ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2012-05-16 15:14 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > >>   drivers/vhost/vhost.c |    1 +
> > >>   1 files changed, 1 insertions(+), 0 deletions(-)
> > >>
> > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > >> index 947f00d..7b75fdf 100644
> > >> --- a/drivers/vhost/vhost.c
> > >> +++ b/drivers/vhost/vhost.c
> > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> > >>          struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > >>          struct vhost_virtqueue *vq = ubufs->vq;
> > >>
> > >> +       vhost_poll_queue(&vq->poll);
> > >>          /* set len = 1 to mark this desc buffers done DMA */
> > >>          vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > >>          kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > > Doing so, we might have redundant vhost_poll_queue(). Do you know in
> > > which scenario there might be missing of adding and signaling during
> > > zerocopy?
> > 
> > Yes, as we only do signaling and adding during tx work, if there's no
> > tx 
> > work when the skb were sent, we may lose the opportunity to let guest 
> > know about the completion. It's easy to be reproduced with netperf
> > test. 
> 
> The reason which host signals guest is to free guest tx buffers, if
> there is no tx work, then it's not necessary to signal the guest unless
> guest runs out of memory. The pending buffers will be released
> virtio_net device gone.
> 
> What's the behavior of netperf test when you hit this situation?
> 
> Thanks
> Shirley

IIRC guest networking seems to be lost.


-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16 15:14         ` Michael S. Tsirkin
@ 2012-05-16 17:32           ` Shirley Ma
  2012-05-16 18:36             ` Michael S. Tsirkin
  2012-05-17  2:50             ` Jason Wang
  0 siblings, 2 replies; 47+ messages in thread
From: Shirley Ma @ 2012-05-16 17:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
> On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> > On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > > >>   drivers/vhost/vhost.c |    1 +
> > > >>   1 files changed, 1 insertions(+), 0 deletions(-)
> > > >>
> > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > >> index 947f00d..7b75fdf 100644
> > > >> --- a/drivers/vhost/vhost.c
> > > >> +++ b/drivers/vhost/vhost.c
> > > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> > > >>          struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > > >>          struct vhost_virtqueue *vq = ubufs->vq;
> > > >>
> > > >> +       vhost_poll_queue(&vq->poll);
> > > >>          /* set len = 1 to mark this desc buffers done DMA */
> > > >>          vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > > >>          kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > > > Doing so, we might have redundant vhost_poll_queue(). Do you
> know in
> > > > which scenario there might be missing of adding and signaling
> during
> > > > zerocopy?
> > > 
> > > Yes, as we only do signaling and adding during tx work, if there's
> no
> > > tx 
> > > work when the skb were sent, we may lose the opportunity to let
> guest 
> > > know about the completion. It's easy to be reproduced with netperf
> > > test. 
> > 
> > The reason which host signals guest is to free guest tx buffers, if
> > there is no tx work, then it's not necessary to signal the guest
> unless
> > guest runs out of memory. The pending buffers will be released
> > virtio_net device gone.
> > 
> > What's the behavior of netperf test when you hit this situation?
> > 
> > Thanks
> > Shirley
> 
> IIRC guest networking seems to be lost. 

It seems vhost_enable_notify is missing in somewhere else?

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16 17:32           ` Shirley Ma
@ 2012-05-16 18:36             ` Michael S. Tsirkin
  2012-05-16 19:08               ` Shirley Ma
  2012-05-17  2:50             ` Jason Wang
  1 sibling, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2012-05-16 18:36 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Wed, May 16, 2012 at 10:32:05AM -0700, Shirley Ma wrote:
> On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> > > On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > > > >>   drivers/vhost/vhost.c |    1 +
> > > > >>   1 files changed, 1 insertions(+), 0 deletions(-)
> > > > >>
> > > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > >> index 947f00d..7b75fdf 100644
> > > > >> --- a/drivers/vhost/vhost.c
> > > > >> +++ b/drivers/vhost/vhost.c
> > > > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> > > > >>          struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > > > >>          struct vhost_virtqueue *vq = ubufs->vq;
> > > > >>
> > > > >> +       vhost_poll_queue(&vq->poll);
> > > > >>          /* set len = 1 to mark this desc buffers done DMA */
> > > > >>          vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > > > >>          kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > > > > Doing so, we might have redundant vhost_poll_queue(). Do you
> > know in
> > > > > which scenario there might be missing of adding and signaling
> > during
> > > > > zerocopy?
> > > > 
> > > > Yes, as we only do signaling and adding during tx work, if there's
> > no
> > > > tx 
> > > > work when the skb were sent, we may lose the opportunity to let
> > guest 
> > > > know about the completion. It's easy to be reproduced with netperf
> > > > test. 
> > > 
> > > The reason which host signals guest is to free guest tx buffers, if
> > > there is no tx work, then it's not necessary to signal the guest
> > unless
> > > guest runs out of memory. The pending buffers will be released
> > > virtio_net device gone.
> > > 
> > > What's the behavior of netperf test when you hit this situation?
> > > 
> > > Thanks
> > > Shirley
> > 
> > IIRC guest networking seems to be lost. 
> 
> It seems vhost_enable_notify is missing in somewhere else?
> 
> Thanks
> Shirley

Donnu. I see virtio sending packets but they do not get
to tun on host. debugging.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16 18:36             ` Michael S. Tsirkin
@ 2012-05-16 19:08               ` Shirley Ma
  2012-05-21  5:22                 ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-16 19:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Wed, 2012-05-16 at 21:36 +0300, Michael S. Tsirkin wrote:
> On Wed, May 16, 2012 at 10:32:05AM -0700, Shirley Ma wrote:
> > On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> > > > On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > > > > >>   drivers/vhost/vhost.c |    1 +
> > > > > >>   1 files changed, 1 insertions(+), 0 deletions(-)
> > > > > >>
> > > > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > >> index 947f00d..7b75fdf 100644
> > > > > >> --- a/drivers/vhost/vhost.c
> > > > > >> +++ b/drivers/vhost/vhost.c
> > > > > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void
> *arg)
> > > > > >>          struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > > > > >>          struct vhost_virtqueue *vq = ubufs->vq;
> > > > > >>
> > > > > >> +       vhost_poll_queue(&vq->poll);
> > > > > >>          /* set len = 1 to mark this desc buffers done DMA
> */
> > > > > >>          vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > > > > >>          kref_put(&ubufs->kref,
> vhost_zerocopy_done_signal);
> > > > > > Doing so, we might have redundant vhost_poll_queue(). Do you
> > > know in
> > > > > > which scenario there might be missing of adding and
> signaling
> > > during
> > > > > > zerocopy?
> > > > > 
> > > > > Yes, as we only do signaling and adding during tx work, if
> there's
> > > no
> > > > > tx 
> > > > > work when the skb were sent, we may lose the opportunity to
> let
> > > guest 
> > > > > know about the completion. It's easy to be reproduced with
> netperf
> > > > > test. 
> > > > 
> > > > The reason which host signals guest is to free guest tx buffers,
> if
> > > > there is no tx work, then it's not necessary to signal the guest
> > > unless
> > > > guest runs out of memory. The pending buffers will be released
> > > > virtio_net device gone.
> > > > 
> > > > What's the behavior of netperf test when you hit this situation?
> > > > 
> > > > Thanks
> > > > Shirley
> > > 
> > > IIRC guest networking seems to be lost. 
> > 
> > It seems vhost_enable_notify is missing in somewhere else?
> > 
> > Thanks
> > Shirley
> 
> Donnu. I see virtio sending packets but they do not get
> to tun on host. debugging. 

I looked at the code, if (zerocopy) check is needed for below code:

+	if (zerocopy) {
                        num_pends = likely(vq->upend_idx >= vq->done_idx) ?
                                    (vq->upend_idx - vq->done_idx) :
                                    (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
                        if (unlikely(num_pends > VHOST_MAX_PEND)) {
                                tx_poll_start(net, sock);
				vhost_poll_queue
                                set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
                                break;
                        }
+	}
                        if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                                vhost_disable_notify(&net->dev, vq);
                                continue;
                        }
                        break;


Second, whether it's possible the problem comes from tx_poll_start()? In
some situation, vhost_poll_wakeup() is not being called?

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16 17:32           ` Shirley Ma
  2012-05-16 18:36             ` Michael S. Tsirkin
@ 2012-05-17  2:50             ` Jason Wang
  2012-05-17 15:34               ` Shirley Ma
  1 sibling, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-17  2:50 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/17/2012 01:32 AM, Shirley Ma wrote:
> On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
>> On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
>>> On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
>>>>>>    drivers/vhost/vhost.c |    1 +
>>>>>>    1 files changed, 1 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>> index 947f00d..7b75fdf 100644
>>>>>> --- a/drivers/vhost/vhost.c
>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
>>>>>>           struct vhost_ubuf_ref *ubufs = ubuf->arg;
>>>>>>           struct vhost_virtqueue *vq = ubufs->vq;
>>>>>>
>>>>>> +       vhost_poll_queue(&vq->poll);
>>>>>>           /* set len = 1 to mark this desc buffers done DMA */
>>>>>>           vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>>>>>>           kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
>>>>> Doing so, we might have redundant vhost_poll_queue(). Do you
>> know in
>>>>> which scenario there might be missing of adding and signaling
>> during
>>>>> zerocopy?
>>>> Yes, as we only do signaling and adding during tx work, if there's
>> no
>>>> tx
>>>> work when the skb were sent, we may lose the opportunity to let
>> guest
>>>> know about the completion. It's easy to be reproduced with netperf
>>>> test.
>>> The reason which host signals guest is to free guest tx buffers, if
>>> there is no tx work, then it's not necessary to signal the guest
>> unless
>>> guest runs out of memory. The pending buffers will be released
>>> virtio_net device gone.

Looks like we only free the skbs in .ndo_start_xmit().
>>>
>>> What's the behavior of netperf test when you hit this situation?
>>>
>>> Thanks
>>> Shirley
>> IIRC guest networking seems to be lost.
> It seems vhost_enable_notify is missing in somewhere else?
>
> Thanks
> Shirley
>

The problem is we may stop the tx queue when there no enough capacity to 
place packets, at this moment  we depends on the tx interrupt to 
re-enable the tx queue. So if we didn't poll the vhost during callback, 
guest may lose the tx interrupt to re-enable the tx queue which could 
stall the whole tx queue.

Thanks

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-16 15:03       ` Shirley Ma
@ 2012-05-17  2:59         ` Jason Wang
  2012-05-17 15:28           ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-17  2:59 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On 05/16/2012 11:03 PM, Shirley Ma wrote:
> On Wed, 2012-05-16 at 11:04 +0800, Jason Wang wrote:
>>>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>>>> index bd4a70d..7cb2684 100644
>>>> --- a/drivers/net/macvtap.c
>>>> +++ b/drivers/net/macvtap.c
>>>> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct
>> sk_buff
>>>> *skb, const struct iovec *from,
>>>>                   struct page *page[MAX_SKB_FRAGS];
>>>>                   int num_pages;
>>>>                   unsigned long base;
>>>> +               unsigned long truesize;
>>>>
>>>>                   len = from->iov_len - offset;
>>>>                   if (!len) {
>>>> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct
>> sk_buff
>>>> *skb, const struct iovec *from,
>>>>                       (num_pages>   MAX_SKB_FRAGS -
>>>> skb_shinfo(skb)->nr_frags))
>>>>                           /* put_page is in skb free */
>>>>                           return -EFAULT;
>>>> +               truesize = size * PAGE_SIZE;
>>> Here should be truesize = size * PAGE_SIZE - offset, right?
>>>
>> We get the whole user page, so need to account them all. Also this is
>> aligned with skb_copy_ubufs().
> Then this would double count the size of "first" offset left from
> previous copy, both skb->len and truesize.
>
> Thanks
> Shirley
>

Didn't see how this affact skb->len. And for truesize, I think they are 
different, when the offset were not zero, the data in this vector were 
divided into two parts. First part is copied into skb directly, and the 
second were pinned from a whole userspace page by get_user_pages_fast(), 
so we need count the whole page to the socket limit to prevent evil 
application.

Thanks

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-17  2:59         ` Jason Wang
@ 2012-05-17 15:28           ` Shirley Ma
  2012-05-18 10:10             ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-17 15:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
> Didn't see how this affact skb->len. And for truesize, I think they
> are 
> different, when the offset were not zero, the data in this vector
> were 
> divided into two parts. First part is copied into skb directly, and
> the 
> second were pinned from a whole userspace page by
> get_user_pages_fast(), 
> so we need count the whole page to the socket limit to prevent evil 
> application. 

What I meant that the code for skb->truesize has double added the first
offset if any left from that vector (partically copied into skb
directly, and then count pagesize which includes the offset (truesize +=
PAGE_SIZE)).

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-17  2:50             ` Jason Wang
@ 2012-05-17 15:34               ` Shirley Ma
  2012-05-18  9:58                 ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-17 15:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
> The problem is we may stop the tx queue when there no enough capacity
> to 
> place packets, at this moment  we depends on the tx interrupt to 
> re-enable the tx queue. So if we didn't poll the vhost during
> callback, 
> guest may lose the tx interrupt to re-enable the tx queue which could 
> stall the whole tx queue. 

VHOST_MAX_PEND should handle the capacity. 

Hasn't the above situation been handled in handle_tx() code?:
...
                        if (unlikely(num_pends > VHOST_MAX_PEND)) {
                                tx_poll_start(net, sock);
                                set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
                                break;
                        }
...

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-17 15:34               ` Shirley Ma
@ 2012-05-18  9:58                 ` Jason Wang
  2012-05-18 15:29                   ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-18  9:58 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/17/2012 11:34 PM, Shirley Ma wrote:
> On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
>> The problem is we may stop the tx queue when there no enough capacity
>> to
>> place packets, at this moment  we depends on the tx interrupt to
>> re-enable the tx queue. So if we didn't poll the vhost during
>> callback,
>> guest may lose the tx interrupt to re-enable the tx queue which could
>> stall the whole tx queue.
> VHOST_MAX_PEND should handle the capacity.
>
> Hasn't the above situation been handled in handle_tx() code?:
> ...
>                          if (unlikely(num_pends>  VHOST_MAX_PEND)) {
>                                  tx_poll_start(net, sock);
>                                  set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
>                                  break;
>                          }
> ...
>
> Thanks
> Shirley

It may not help in because:

- tx polling depends on skb_orphan() which is often called by device 
driver when it place the packet into the queue of the devices instead 
of  when the packets were sent. So it was too early for vhost to be 
notified.

- it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's 
highly possible that guest needs to be notified when the pending packets 
isn't so much.

So this piece of code may not help and could be removed and we need to 
poll the virt-queue during zerocopy callback ( through it could be 
further optimized but may not be easy).
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-17 15:28           ` Shirley Ma
@ 2012-05-18 10:10             ` Jason Wang
  2012-05-18 15:22               ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-18 10:10 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On 05/17/2012 11:28 PM, Shirley Ma wrote:
> On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
>> Didn't see how this affact skb->len. And for truesize, I think they
>> are
>> different, when the offset were not zero, the data in this vector
>> were
>> divided into two parts. First part is copied into skb directly, and
>> the
>> second were pinned from a whole userspace page by
>> get_user_pages_fast(),
>> so we need count the whole page to the socket limit to prevent evil
>> application.
> What I meant that the code for skb->truesize has double added the first
> offset if any left from that vector (partically copied into skb
> directly, and then count pagesize which includes the offset (truesize +=
> PAGE_SIZE)).

Yes, I get you mean. There's no difference between first frag and 
others: it's also possible for other frags that didn't occupy the whole 
page. Since we pin the whole user page, better to count the whole page 
size to prevent evil application.
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-18 10:10             ` Jason Wang
@ 2012-05-18 15:22               ` Shirley Ma
  2012-05-21  6:15                 ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-18 15:22 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On Fri, 2012-05-18 at 18:10 +0800, Jason Wang wrote:
> > On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
> >> Didn't see how this affact skb->len. And for truesize, I think they
> >> are
> >> different, when the offset were not zero, the data in this vector
> >> were
> >> divided into two parts. First part is copied into skb directly, and
> >> the
> >> second were pinned from a whole userspace page by
> >> get_user_pages_fast(),
> >> so we need count the whole page to the socket limit to prevent evil
> >> application.
> > What I meant that the code for skb->truesize has double added the
> first
> > offset if any left from that vector (partically copied into skb
> > directly, and then count pagesize which includes the offset
> (truesize +=
> > PAGE_SIZE)).
> 
> Yes, I get you mean. There's no difference between first frag and 
> others: it's also possible for other frags that didn't occupy the
> whole 
> page. Since we pin the whole user page, better to count the whole
> page 
> size to prevent evil application. 

The difference between first frags and others is other frags might not
occupy the whole page, but the first frags extra offset was doubled
added in skb truesize.

So it's ok for skb->truesize to be bigger than all the skb pinned pages
here?

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-18  9:58                 ` Jason Wang
@ 2012-05-18 15:29                   ` Shirley Ma
  2012-05-21  6:05                     ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-18 15:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Fri, 2012-05-18 at 17:58 +0800, Jason Wang wrote:
> On 05/17/2012 11:34 PM, Shirley Ma wrote:
> > On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
> >> The problem is we may stop the tx queue when there no enough
> capacity
> >> to
> >> place packets, at this moment  we depends on the tx interrupt to
> >> re-enable the tx queue. So if we didn't poll the vhost during
> >> callback,
> >> guest may lose the tx interrupt to re-enable the tx queue which
> could
> >> stall the whole tx queue.
> > VHOST_MAX_PEND should handle the capacity.
> >
> > Hasn't the above situation been handled in handle_tx() code?:
> > ...
> >                          if (unlikely(num_pends>  VHOST_MAX_PEND)) {
> >                                  tx_poll_start(net, sock);
> >
> set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> >                                  break;
> >                          }
> > ...
> >
> > Thanks
> > Shirley
> 
> It may not help in because:
> 
> - tx polling depends on skb_orphan() which is often called by device 
> driver when it place the packet into the queue of the devices instead 
> of  when the packets were sent. So it was too early for vhost to be 
> notified.
Then do you think it's better to replace with vhost_poll_queue here
instead?

> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's 
> highly possible that guest needs to be notified when the pending
> packets 
> isn't so much.
In which situation the guest needs to be notified when there is no TX
besides buffers run out?

> So this piece of code may not help and could be removed and we need
> to 
> poll the virt-queue during zerocopy callback ( through it could be 
> further optimized but may not be easy). 

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-16 19:08               ` Shirley Ma
@ 2012-05-21  5:22                 ` Jason Wang
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-21  5:22 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/17/2012 03:08 AM, Shirley Ma wrote:
> On Wed, 2012-05-16 at 21:36 +0300, Michael S. Tsirkin wrote:
>> On Wed, May 16, 2012 at 10:32:05AM -0700, Shirley Ma wrote:
>>> On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
>>>> On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
>>>>> On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
>>>>>>>>    drivers/vhost/vhost.c |    1 +
>>>>>>>>    1 files changed, 1 insertions(+), 0 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>> index 947f00d..7b75fdf 100644
>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void
>> *arg)
>>>>>>>>           struct vhost_ubuf_ref *ubufs = ubuf->arg;
>>>>>>>>           struct vhost_virtqueue *vq = ubufs->vq;
>>>>>>>>
>>>>>>>> +       vhost_poll_queue(&vq->poll);
>>>>>>>>           /* set len = 1 to mark this desc buffers done DMA
>> */
>>>>>>>>           vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>>>>>>>>           kref_put(&ubufs->kref,
>> vhost_zerocopy_done_signal);
>>>>>>> Doing so, we might have redundant vhost_poll_queue(). Do you
>>>> know in
>>>>>>> which scenario there might be missing of adding and
>> signaling
>>>> during
>>>>>>> zerocopy?
>>>>>> Yes, as we only do signaling and adding during tx work, if
>> there's
>>>> no
>>>>>> tx
>>>>>> work when the skb were sent, we may lose the opportunity to
>> let
>>>> guest
>>>>>> know about the completion. It's easy to be reproduced with
>> netperf
>>>>>> test.
>>>>> The reason which host signals guest is to free guest tx buffers,
>> if
>>>>> there is no tx work, then it's not necessary to signal the guest
>>>> unless
>>>>> guest runs out of memory. The pending buffers will be released
>>>>> virtio_net device gone.
>>>>>
>>>>> What's the behavior of netperf test when you hit this situation?
>>>>>
>>>>> Thanks
>>>>> Shirley
>>>> IIRC guest networking seems to be lost.
>>> It seems vhost_enable_notify is missing in somewhere else?
>>>
>>> Thanks
>>> Shirley
>> Donnu. I see virtio sending packets but they do not get
>> to tun on host. debugging.
> I looked at the code, if (zerocopy) check is needed for below code:
>
> +	if (zerocopy) {
>                          num_pends = likely(vq->upend_idx>= vq->done_idx) ?
>                                      (vq->upend_idx - vq->done_idx) :
>                                      (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
>                          if (unlikely(num_pends>  VHOST_MAX_PEND)) {
>                                  tx_poll_start(net, sock);
> 				vhost_poll_queue
>                                  set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
>                                  break;
>                          }
> +	}
>                          if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>                                  vhost_disable_notify(&net->dev, vq);
>                                  continue;
>                          }
>                          break;

No objection to add this but looks no difference to me as we won't touch 
upend_idx and done_idx when zercocopy is not used.

>
> Second, whether it's possible the problem comes from tx_poll_start()? In
> some situation, vhost_poll_wakeup() is not being called?
>
> Thanks
> Shirley
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-18 15:29                   ` Shirley Ma
@ 2012-05-21  6:05                     ` Jason Wang
  2012-05-21 15:42                       ` Shirley Ma
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-21  6:05 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/18/2012 11:29 PM, Shirley Ma wrote:
> On Fri, 2012-05-18 at 17:58 +0800, Jason Wang wrote:
>> On 05/17/2012 11:34 PM, Shirley Ma wrote:
>>> On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
>>>> The problem is we may stop the tx queue when there no enough
>> capacity
>>>> to
>>>> place packets, at this moment  we depends on the tx interrupt to
>>>> re-enable the tx queue. So if we didn't poll the vhost during
>>>> callback,
>>>> guest may lose the tx interrupt to re-enable the tx queue which
>> could
>>>> stall the whole tx queue.
>>> VHOST_MAX_PEND should handle the capacity.
>>>
>>> Hasn't the above situation been handled in handle_tx() code?:
>>> ...
>>>                           if (unlikely(num_pends>   VHOST_MAX_PEND)) {
>>>                                   tx_poll_start(net, sock);
>>>
>> set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
>>>                                   break;
>>>                           }
>>> ...
>>>
>>> Thanks
>>> Shirley
>> It may not help in because:
>>
>> - tx polling depends on skb_orphan() which is often called by device
>> driver when it place the packet into the queue of the devices instead
>> of  when the packets were sent. So it was too early for vhost to be
>> notified.
> Then do you think it's better to replace with vhost_poll_queue here
> instead?

Just like what does this patch do - calling vhost_poll_queue() in 
vhost_zerocopy_callback().
>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
>> highly possible that guest needs to be notified when the pending
>> packets
>> isn't so much.
> In which situation the guest needs to be notified when there is no TX
> besides buffers run out?

Consider guest call virtqueue_enable_cb_delayed() which means it only 
need to be notified when 3/4 of pending buffers ( about 178 buffers 
(256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would notify 
guest when about 60 buffers were pending. Since tx polling is only 
enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work 
would not be notified to run and guest would never get the interrupt it 
expected to re-enable the queue.

And just like what we've discussed, tx polling based adding and 
signaling is too early for vhost.
>> So this piece of code may not help and could be removed and we need
>> to
>> poll the virt-queue during zerocopy callback ( through it could be
>> further optimized but may not be easy).
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
  2012-05-18 15:22               ` Shirley Ma
@ 2012-05-21  6:15                 ` Jason Wang
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-21  6:15 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem

On 05/18/2012 11:22 PM, Shirley Ma wrote:
> On Fri, 2012-05-18 at 18:10 +0800, Jason Wang wrote:
>>> On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
>>>> Didn't see how this affact skb->len. And for truesize, I think they
>>>> are
>>>> different, when the offset were not zero, the data in this vector
>>>> were
>>>> divided into two parts. First part is copied into skb directly, and
>>>> the
>>>> second were pinned from a whole userspace page by
>>>> get_user_pages_fast(),
>>>> so we need count the whole page to the socket limit to prevent evil
>>>> application.
>>> What I meant that the code for skb->truesize has double added the
>> first
>>> offset if any left from that vector (partically copied into skb
>>> directly, and then count pagesize which includes the offset
>> (truesize +=
>>> PAGE_SIZE)).
>> Yes, I get you mean. There's no difference between first frag and
>> others: it's also possible for other frags that didn't occupy the
>> whole
>> page. Since we pin the whole user page, better to count the whole
>> page
>> size to prevent evil application.
> The difference between first frags and others is other frags might not
> occupy the whole page, but the first frags extra offset was doubled
> added in skb truesize.
>
> So it's ok for skb->truesize to be bigger than all the skb pinned pages
> here?

I think it's ok here and we could find other example such as virtio_net 
driver.
>
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-21  6:05                     ` Jason Wang
@ 2012-05-21 15:42                       ` Shirley Ma
  2012-05-21 16:12                         ` Shirley Ma
  2012-05-22 10:05                         ` Jason Wang
  0 siblings, 2 replies; 47+ messages in thread
From: Shirley Ma @ 2012-05-21 15:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> >> - tx polling depends on skb_orphan() which is often called by
> device
> >> driver when it place the packet into the queue of the devices
> instead
> >> of  when the packets were sent. So it was too early for vhost to be
> >> notified.
> > Then do you think it's better to replace with vhost_poll_queue here
> > instead?
> 
> Just like what does this patch do - calling vhost_poll_queue() in 
> vhost_zerocopy_callback().
> >> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
> >> highly possible that guest needs to be notified when the pending
> >> packets
> >> isn't so much.
> > In which situation the guest needs to be notified when there is no
> TX
> > besides buffers run out?
> 
> Consider guest call virtqueue_enable_cb_delayed() which means it only 
> need to be notified when 3/4 of pending buffers ( about 178 buffers 
> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> notify 
> guest when about 60 buffers were pending. Since tx polling is only 
> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work 
> would not be notified to run and guest would never get the interrupt
> it 
> expected to re-enable the queue.

So it seems we still need vhost_enable_notify() in handle_tx when there
is no tx in zerocopy case.

Do you know which one is more expensive: the cost of vhost_poll_queue()
in each zerocopy callback or calling vhost_enable_notify()?

Have you compared the results by removing below code in handle_tx()?

-                       if (unlikely(num_pends > VHOST_MAX_PEND)) {
-                                tx_poll_start(net, sock);
-                                set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-                                break;
-                        }

> 
> And just like what we've discussed, tx polling based adding and 
> signaling is too early for vhost. 

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-21 15:42                       ` Shirley Ma
@ 2012-05-21 16:12                         ` Shirley Ma
  2012-05-22 10:13                           ` Jason Wang
  2012-05-22 10:05                         ` Jason Wang
  1 sibling, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-21 16:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Mon, 2012-05-21 at 08:42 -0700, Shirley Ma wrote:
> On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> > >> - tx polling depends on skb_orphan() which is often called by
> > device
> > >> driver when it place the packet into the queue of the devices
> > instead
> > >> of  when the packets were sent. So it was too early for vhost to
> be
> > >> notified.
> > > Then do you think it's better to replace with vhost_poll_queue
> here
> > > instead?
> > 
> > Just like what does this patch do - calling vhost_poll_queue() in 
> > vhost_zerocopy_callback().
> > >> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
> it's
> > >> highly possible that guest needs to be notified when the pending
> > >> packets
> > >> isn't so much.
> > > In which situation the guest needs to be notified when there is no
> > TX
> > > besides buffers run out?
> > 
> > Consider guest call virtqueue_enable_cb_delayed() which means it
> only 
> > need to be notified when 3/4 of pending buffers ( about 178 buffers 
> > (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> > notify 
> > guest when about 60 buffers were pending. Since tx polling is only 
> > enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work 
> > would not be notified to run and guest would never get the interrupt
> > it 
> > expected to re-enable the queue.
> 
> So it seems we still need vhost_enable_notify() in handle_tx when there
> is no tx in zerocopy case.
> 
> Do you know which one is more expensive: the cost of
> vhost_poll_queue()
> in each zerocopy callback or calling vhost_enable_notify()?
> 
> Have you compared the results by removing below code in handle_tx()?
> 
> -                       if (unlikely(num_pends > VHOST_MAX_PEND)) {
> -                                tx_poll_start(net, sock);
> -                                set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> -                                break;
> -                        }
> > 
> > And just like what we've discussed, tx polling based adding and 
> > signaling is too early for vhost. 
> 

Then it could be too early for vhost to notify guest anywhere in
handle_tx for zerocopy. Then we might need to remove any notification in
handle_tx for zerocopy to vhost zerocopy callback instead.

Adding vhost_poll_queue in vhost zerocopy callback unconditionally would
consume unnecessary cpu.

We need to think about a better solution here.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-21 15:42                       ` Shirley Ma
  2012-05-21 16:12                         ` Shirley Ma
@ 2012-05-22 10:05                         ` Jason Wang
  2012-05-22 15:55                           ` Shirley Ma
  1 sibling, 1 reply; 47+ messages in thread
From: Jason Wang @ 2012-05-22 10:05 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/21/2012 11:42 PM, Shirley Ma wrote:
> On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
>>>> - tx polling depends on skb_orphan() which is often called by
>> device
>>>> driver when it place the packet into the queue of the devices
>> instead
>>>> of  when the packets were sent. So it was too early for vhost to be
>>>> notified.
>>> Then do you think it's better to replace with vhost_poll_queue here
>>> instead?
>> Just like what does this patch do - calling vhost_poll_queue() in
>> vhost_zerocopy_callback().
>>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
>>>> highly possible that guest needs to be notified when the pending
>>>> packets
>>>> isn't so much.
>>> In which situation the guest needs to be notified when there is no
>> TX
>>> besides buffers run out?
>> Consider guest call virtqueue_enable_cb_delayed() which means it only
>> need to be notified when 3/4 of pending buffers ( about 178 buffers
>> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
>> notify
>> guest when about 60 buffers were pending. Since tx polling is only
>> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
>> would not be notified to run and guest would never get the interrupt
>> it
>> expected to re-enable the queue.
> So it seems we still need vhost_enable_notify() in handle_tx when there
> is no tx in zerocopy case.
>
> Do you know which one is more expensive: the cost of vhost_poll_queue()
> in each zerocopy callback or calling vhost_enable_notify()?

Didn't follow here, do you mean vhost_signal() here?
>
> Have you compared the results by removing below code in handle_tx()?
>
> -                       if (unlikely(num_pends>  VHOST_MAX_PEND)) {
> -                                tx_poll_start(net, sock);
> -                                set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> -                                break;
> -                        }

I remember I've done some basic test when I send this patch, there's no 
much increasing of cpu utilization. Would double check this again.
>> And just like what we've discussed, tx polling based adding and
>> signaling is too early for vhost.
> Thanks
> Shirley
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-21 16:12                         ` Shirley Ma
@ 2012-05-22 10:13                           ` Jason Wang
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-22 10:13 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/22/2012 12:12 AM, Shirley Ma wrote:
> Then it could be too early for vhost to notify guest anywhere in
> handle_tx for zerocopy. Then we might need to remove any notification in
> handle_tx for zerocopy to vhost zerocopy callback instead.
>
> Adding vhost_poll_queue in vhost zerocopy callback unconditionally would
> consume unnecessary cpu.
>
> We need to think about a better solution here.

A possible idea is only call vhost_poll_queue only when the packet of 
used_event were sent: if there's no out-of-order completion vhost could 
signal the guest; if there is, let the pending packets before used_event 
call vhost_poll_queue.
> Thanks
> Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-22 10:05                         ` Jason Wang
@ 2012-05-22 15:55                           ` Shirley Ma
  2012-05-23 10:31                             ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Shirley Ma @ 2012-05-22 15:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On Tue, 2012-05-22 at 18:05 +0800, Jason Wang wrote:
> On 05/21/2012 11:42 PM, Shirley Ma wrote:
> > On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> >>>> - tx polling depends on skb_orphan() which is often called by
> >> device
> >>>> driver when it place the packet into the queue of the devices
> >> instead
> >>>> of  when the packets were sent. So it was too early for vhost to
> be
> >>>> notified.
> >>> Then do you think it's better to replace with vhost_poll_queue
> here
> >>> instead?
> >> Just like what does this patch do - calling vhost_poll_queue() in
> >> vhost_zerocopy_callback().
> >>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
> it's
> >>>> highly possible that guest needs to be notified when the pending
> >>>> packets
> >>>> isn't so much.
> >>> In which situation the guest needs to be notified when there is no
> >> TX
> >>> besides buffers run out?
> >> Consider guest call virtqueue_enable_cb_delayed() which means it
> only
> >> need to be notified when 3/4 of pending buffers ( about 178 buffers
> >> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> >> notify
> >> guest when about 60 buffers were pending. Since tx polling is only
> >> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
> >> would not be notified to run and guest would never get the
> interrupt
> >> it
> >> expected to re-enable the queue.
> > So it seems we still need vhost_enable_notify() in handle_tx when
> there
> > is no tx in zerocopy case.
> >
> > Do you know which one is more expensive: the cost of
> vhost_poll_queue()
> > in each zerocopy callback or calling vhost_enable_notify()?
> 
> Didn't follow here, do you mean vhost_signal() here? 

I meant removing the code in handle_tx for zerocopy as below:

+	if (zcopy) {
                        /* If more outstanding DMAs, queue the work.
                         * Handle upend_idx wrap around
                         */
                        num_pends = likely(vq->upend_idx >= vq->done_idx) ?
                                    (vq->upend_idx - vq->done_idx) :
                                    (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
+			/* zerocopy vhost_enable_notify is under zerocopy callback
+			 * since it could be too early to notify here */
+			break;
+	}
-                       if (unlikely(num_pends > VHOST_MAX_PEND)) {
-                                tx_poll_start(net, sock);
-                                set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-                                break;
-                        }
                        if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                                vhost_disable_notify(&net->dev, vq);
                                continue;
                        }
                        break;

Thanks
Shirley

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
  2012-05-22 15:55                           ` Shirley Ma
@ 2012-05-23 10:31                             ` Jason Wang
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2012-05-23 10:31 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm, davem

On 05/22/2012 11:55 PM, Shirley Ma wrote:
> On Tue, 2012-05-22 at 18:05 +0800, Jason Wang wrote:
>> On 05/21/2012 11:42 PM, Shirley Ma wrote:
>>> On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
>>>>>> - tx polling depends on skb_orphan() which is often called by
>>>> device
>>>>>> driver when it place the packet into the queue of the devices
>>>> instead
>>>>>> of  when the packets were sent. So it was too early for vhost to
>> be
>>>>>> notified.
>>>>> Then do you think it's better to replace with vhost_poll_queue
>> here
>>>>> instead?
>>>> Just like what does this patch do - calling vhost_poll_queue() in
>>>> vhost_zerocopy_callback().
>>>>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
>> it's
>>>>>> highly possible that guest needs to be notified when the pending
>>>>>> packets
>>>>>> isn't so much.
>>>>> In which situation the guest needs to be notified when there is no
>>>> TX
>>>>> besides buffers run out?
>>>> Consider guest call virtqueue_enable_cb_delayed() which means it
>> only
>>>> need to be notified when 3/4 of pending buffers ( about 178 buffers
>>>> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
>>>> notify
>>>> guest when about 60 buffers were pending. Since tx polling is only
>>>> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
>>>> would not be notified to run and guest would never get the
>> interrupt
>>>> it
>>>> expected to re-enable the queue.
>>> So it seems we still need vhost_enable_notify() in handle_tx when
>> there
>>> is no tx in zerocopy case.
>>>
>>> Do you know which one is more expensive: the cost of
>> vhost_poll_queue()
>>> in each zerocopy callback or calling vhost_enable_notify()?
>> Didn't follow here, do you mean vhost_signal() here?
> I meant removing the code in handle_tx for zerocopy as below:
>
> +	if (zcopy) {
>                          /* If more outstanding DMAs, queue the work.
>                           * Handle upend_idx wrap around
>                           */
>                          num_pends = likely(vq->upend_idx>= vq->done_idx) ?
>                                      (vq->upend_idx - vq->done_idx) :
>                                      (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
> +			/* zerocopy vhost_enable_notify is under zerocopy callback
> +			 * since it could be too early to notify here */
> +			break;
> +	}
> -                       if (unlikely(num_pends>  VHOST_MAX_PEND)) {
> -                                tx_poll_start(net, sock);
> -                                set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> -                                break;
> -                        }
>                          if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>                                  vhost_disable_notify(&net->dev, vq);
>                                  continue;
>                          }
>                          break;

Didn't think this can work well as the notification from guest were 
disabled forever.

>
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2012-05-23 10:30 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-02  3:41 [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Jason Wang
2012-05-02  3:41 ` [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
2012-05-15 17:17   ` Shirley Ma
2012-05-02  3:41 ` [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation Jason Wang
2012-05-15 17:26   ` Shirley Ma
2012-05-16  3:04     ` Jason Wang
2012-05-16 15:03       ` Shirley Ma
2012-05-17  2:59         ` Jason Wang
2012-05-17 15:28           ` Shirley Ma
2012-05-18 10:10             ` Jason Wang
2012-05-18 15:22               ` Shirley Ma
2012-05-21  6:15                 ` Jason Wang
2012-05-02  3:41 ` [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages Jason Wang
2012-05-15 17:33   ` Shirley Ma
2012-05-02  3:42 ` [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
2012-05-15 17:44   ` Shirley Ma
2012-05-16  3:17     ` Jason Wang
2012-05-02  3:42 ` [V2 PATCH 5/9] macvtap: zerocopy: validate vectors before building skb Jason Wang
2012-05-02  3:42 ` [V2 PATCH 6/9] vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs Jason Wang
2012-05-02  3:42 ` [V2 PATCH 7/9] vhost_net: re-poll only on EAGAIN or ENOBUFS Jason Wang
2012-05-02  3:42 ` [V2 PATCH 8/9] vhost_net: zerocopy: adding and signalling immediately when fully copied Jason Wang
2012-05-02  3:42 ` [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback Jason Wang
2012-05-15 16:50   ` Shirley Ma
2012-05-16  2:58     ` Jason Wang
2012-05-16 15:10       ` Shirley Ma
2012-05-16 15:14         ` Michael S. Tsirkin
2012-05-16 17:32           ` Shirley Ma
2012-05-16 18:36             ` Michael S. Tsirkin
2012-05-16 19:08               ` Shirley Ma
2012-05-21  5:22                 ` Jason Wang
2012-05-17  2:50             ` Jason Wang
2012-05-17 15:34               ` Shirley Ma
2012-05-18  9:58                 ` Jason Wang
2012-05-18 15:29                   ` Shirley Ma
2012-05-21  6:05                     ` Jason Wang
2012-05-21 15:42                       ` Shirley Ma
2012-05-21 16:12                         ` Shirley Ma
2012-05-22 10:13                           ` Jason Wang
2012-05-22 10:05                         ` Jason Wang
2012-05-22 15:55                           ` Shirley Ma
2012-05-23 10:31                             ` Jason Wang
2012-05-02  5:50 ` [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes Michael S. Tsirkin
2012-05-02  6:44   ` David Miller
2012-05-02  8:11     ` Michael S. Tsirkin
2012-05-02 19:40       ` Eric W. Biederman
2012-05-02 21:31         ` Michael S. Tsirkin
2012-05-02 21:54           ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).