All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb
@ 2012-04-16  6:07 Jason Wang
  2012-04-16  6:07 ` [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation Jason Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Jason Wang @ 2012-04-16  6:07 UTC (permalink / raw)
  To: netdev, xma, davem, linux-kernel, mst; +Cc: ebiederm

This patch fixes the offset calculation when build skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 0427c65..bd4a70d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -505,10 +505,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		if (copy > size) {
 			++from;
 			--count;
-		}
+			offset = 0;
+		} else
+			offset += size;
 		copy -= size;
 		offset1 += size;
-		offset = 0;
 	}
 
 	if (len == offset1)
@@ -519,13 +520,13 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		int num_pages;
 		unsigned long base;
 
-		len = from->iov_len - offset1;
+		len = from->iov_len - offset;
 		if (!len) {
-			offset1 = 0;
+			offset = 0;
 			++from;
 			continue;
 		}
-		base = (unsigned long)from->iov_base + offset1;
+		base = (unsigned long)from->iov_base + offset;
 		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
 		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
 		if ((num_pages != size) ||
@@ -546,7 +547,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 			len -= size;
 			i++;
 		}
-		offset1 = 0;
+		offset = 0;
 		++from;
 	}
 	return 0;


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation
  2012-04-16  6:07 [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
@ 2012-04-16  6:07 ` Jason Wang
  2012-04-16  7:14   ` Michael S. Tsirkin
  2012-04-16  6:08 ` [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages Jason Wang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-16  6:07 UTC (permalink / raw)
  To: netdev, xma, davem, linux-kernel, mst; +Cc: ebiederm

As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index bd4a70d..7cb2684 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		struct page *page[MAX_SKB_FRAGS];
 		int num_pages;
 		unsigned long base;
+		unsigned long truesize;
 
 		len = from->iov_len - offset;
 		if (!len) {
@@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
 			/* put_page is in skb free */
 			return -EFAULT;
+		truesize = size * PAGE_SIZE;
 		skb->data_len += len;
 		skb->len += len;
-		skb->truesize += len;
-		atomic_add(len, &skb->sk->sk_wmem_alloc);
+		skb->truesize += truesize;
+		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
 		while (len) {
 			int off = base & ~PAGE_MASK;
 			int size = min_t(int, len, PAGE_SIZE - off);


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-16  6:07 [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
  2012-04-16  6:07 ` [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation Jason Wang
@ 2012-04-16  6:08 ` Jason Wang
  2012-04-16  6:53   ` Eric Dumazet
  2012-04-16  7:58   ` Michael S. Tsirkin
  2012-04-16  6:08 ` [PATCH 4/6] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: Jason Wang @ 2012-04-16  6:08 UTC (permalink / raw)
  To: netdev, xma, davem, linux-kernel, mst; +Cc: ebiederm

Currently we do not validate the vector length before calling
get_user_pages_fast(), host stack would be easily overflowed by
malicious guest driver who give us a descriptor with length greater
than MAX_SKB_FRAGS. Solve this problem by checking the free entries
before trying to pin user pages.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7cb2684..d197a78 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -529,6 +529,8 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
 		}
 		base = (unsigned long)from->iov_base + offset;
 		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (i + size >= MAX_SKB_FRAGS)
+			return -EFAULT;
 		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
 		if ((num_pages != size) ||
 		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/6] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
  2012-04-16  6:07 [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
  2012-04-16  6:07 ` [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation Jason Wang
  2012-04-16  6:08 ` [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages Jason Wang
@ 2012-04-16  6:08 ` Jason Wang
  2012-04-16  6:08 ` [PATCH 5/6] vhost_net: fix use after free of vq->ubufs Jason Wang
  2012-04-16  6:08 ` [PATCH 6/6] vhost_net: don't poll on -EFAULT Jason Wang
  4 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2012-04-16  6:08 UTC (permalink / raw)
  To: netdev, xma, davem, linux-kernel, mst; +Cc: ebiederm

Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index d197a78..c030468 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -699,10 +699,9 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 	if (!skb)
 		goto err;
 
-	if (zerocopy) {
+	if (zerocopy)
 		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
-		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
-	} else
+	else
 		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
 						   len);
 	if (err)
@@ -721,8 +720,10 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
 	rcu_read_lock_bh();
 	vlan = rcu_dereference_bh(q->vlan);
 	/* copy skb_ubuf_info for callback when skb has no error */
-	if (zerocopy)
+	if (zerocopy) {
 		skb_shinfo(skb)->destructor_arg = m->msg_control;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	}
 	if (vlan)
 		macvlan_start_xmit(skb, vlan->dev);
 	else


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/6] vhost_net: fix use after free of vq->ubufs
  2012-04-16  6:07 [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
                   ` (2 preceding siblings ...)
  2012-04-16  6:08 ` [PATCH 4/6] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
@ 2012-04-16  6:08 ` Jason Wang
  2012-04-16 13:28   ` Michael S. Tsirkin
  2012-04-16  6:08 ` [PATCH 6/6] vhost_net: don't poll on -EFAULT Jason Wang
  4 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-16  6:08 UTC (permalink / raw)
  To: netdev, xma, davem, linux-kernel, mst; +Cc: ebiederm

When zerocopy socket is used, ubufs pointer were used in handle_tx()
without any validation. This would cause NULL pointer deference after
it has been freed in vhost_net_set_backend(). Fix this by check the
pointer before using it.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0da2c3..29abd65 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -163,10 +163,15 @@ static void handle_tx(struct vhost_net *net)
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(&net->dev, vq);
 
+	zcopy = vhost_sock_zcopy(sock);
+	if (zcopy && !vq->ubufs) {
+		mutex_unlock(&vq->mutex);
+		return;
+	}
+
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
 	hdr_size = vq->vhost_hlen;
-	zcopy = vhost_sock_zcopy(sock);
 
 	for (;;) {
 		/* Release DMAs done buffers first */


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-16  6:07 [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
                   ` (3 preceding siblings ...)
  2012-04-16  6:08 ` [PATCH 5/6] vhost_net: fix use after free of vq->ubufs Jason Wang
@ 2012-04-16  6:08 ` Jason Wang
  2012-04-16  7:16   ` Michael S. Tsirkin
  4 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-16  6:08 UTC (permalink / raw)
  To: netdev, xma, davem, linux-kernel, mst; +Cc: ebiederm

Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers as it's
only needed when the socket send buffer were exceeded. Fix this by
restart the tx polling only when sendmsg() returns value other than
-EFAULT.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 29abd65..035fa95 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -262,7 +262,8 @@ static void handle_tx(struct vhost_net *net)
 					UIO_MAXIOV;
 			}
 			vhost_discard_vq_desc(vq, 1);
-			tx_poll_start(net, sock);
+			if (err != -EFAULT)
+				tx_poll_start(net, sock);
 			break;
 		}
 		if (err != len)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-16  6:08 ` [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages Jason Wang
@ 2012-04-16  6:53   ` Eric Dumazet
  2012-04-16  8:21     ` Jason Wang
  2012-04-17  6:19     ` Michael S. Tsirkin
  2012-04-16  7:58   ` Michael S. Tsirkin
  1 sibling, 2 replies; 30+ messages in thread
From: Eric Dumazet @ 2012-04-16  6:53 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, mst, ebiederm

On Mon, 2012-04-16 at 14:08 +0800, Jason Wang wrote:
> Currently we do not validate the vector length before calling
> get_user_pages_fast(), host stack would be easily overflowed by
> malicious guest driver who give us a descriptor with length greater
> than MAX_SKB_FRAGS. Solve this problem by checking the free entries
> before trying to pin user pages.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/macvtap.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 7cb2684..d197a78 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -529,6 +529,8 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>  		}
>  		base = (unsigned long)from->iov_base + offset;
>  		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +		if (i + size >= MAX_SKB_FRAGS)
> +			return -EFAULT;
>  		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
>  		if ((num_pages != size) ||
>  		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> 

Hi Jason

Why is -EFAULT the right error code ?

Also, why not removing the "(num_pages > MAX_SKB_FRAGS -
skb_shinfo(skb)->nr_frags)" test done few lines after ?

Also, if get_user_pages_fast() returns truncated result (size - 1 for
example), we apparently dont free the references on pages.

The comment applies only on pages that were added in skb frags.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation
  2012-04-16  6:07 ` [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation Jason Wang
@ 2012-04-16  7:14   ` Michael S. Tsirkin
  2012-04-16  8:23     ` Jason Wang
  2012-04-16  8:49     ` Eric Dumazet
  0 siblings, 2 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16  7:14 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 02:07:59PM +0800, Jason Wang wrote:
> As the skb fragment were pinned/built from user pages, we should
> account the page instead of length for truesize.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

I'm not sure this is right: the skb does *not* consume the
whole page, userspace uses the rest of the page
for other skbs. So we'll end up accounting for the
same page twice.
Eric, what's the right thing to do here in your opinion?

> ---
>  drivers/net/macvtap.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index bd4a70d..7cb2684 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>  		struct page *page[MAX_SKB_FRAGS];
>  		int num_pages;
>  		unsigned long base;
> +		unsigned long truesize;
>  
>  		len = from->iov_len - offset;
>  		if (!len) {
> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>  		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
>  			/* put_page is in skb free */
>  			return -EFAULT;
> +		truesize = size * PAGE_SIZE;
>  		skb->data_len += len;
>  		skb->len += len;
> -		skb->truesize += len;
> -		atomic_add(len, &skb->sk->sk_wmem_alloc);
> +		skb->truesize += truesize;
> +		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
>  		while (len) {
>  			int off = base & ~PAGE_MASK;
>  			int size = min_t(int, len, PAGE_SIZE - off);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-16  6:08 ` [PATCH 6/6] vhost_net: don't poll on -EFAULT Jason Wang
@ 2012-04-16  7:16   ` Michael S. Tsirkin
  2012-04-16  8:28     ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16  7:16 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
> Currently, we restart tx polling unconditionally when sendmsg()
> fails. This would cause unnecessary wakeups of vhost wokers as it's
> only needed when the socket send buffer were exceeded.

Why is this a problem?

> Fix this by
> restart the tx polling only when sendmsg() returns value other than
> -EFAULT.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 29abd65..035fa95 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -262,7 +262,8 @@ static void handle_tx(struct vhost_net *net)
>  					UIO_MAXIOV;
>  			}
>  			vhost_discard_vq_desc(vq, 1);
> -			tx_poll_start(net, sock);
> +			if (err != -EFAULT)
> +				tx_poll_start(net, sock);
>  			break;
>  		}
>  		if (err != len)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-16  6:08 ` [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages Jason Wang
  2012-04-16  6:53   ` Eric Dumazet
@ 2012-04-16  7:58   ` Michael S. Tsirkin
  1 sibling, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16  7:58 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 02:08:07PM +0800, Jason Wang wrote:
> Currently we do not validate the vector length before calling
> get_user_pages_fast(), host stack would be easily overflowed by
> malicious guest driver who give us a descriptor with length greater
> than MAX_SKB_FRAGS. Solve this problem by checking the free entries
> before trying to pin user pages.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

We could handle this by copying and linearising some fragments.
That would be better.

> ---
>  drivers/net/macvtap.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 7cb2684..d197a78 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -529,6 +529,8 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>  		}
>  		base = (unsigned long)from->iov_base + offset;
>  		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +		if (i + size >= MAX_SKB_FRAGS)
> +			return -EFAULT;
>  		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
>  		if ((num_pages != size) ||
>  		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-16  6:53   ` Eric Dumazet
@ 2012-04-16  8:21     ` Jason Wang
  2012-04-17  5:33       ` Eric Dumazet
  2012-04-17  6:19     ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-16  8:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, xma, davem, linux-kernel, mst, ebiederm

Hi:
On 04/16/2012 02:53 PM, Eric Dumazet wrote:
> On Mon, 2012-04-16 at 14:08 +0800, Jason Wang wrote:
>> Currently we do not validate the vector length before calling
>> get_user_pages_fast(), host stack would be easily overflowed by
>> malicious guest driver who give us a descriptor with length greater
>> than MAX_SKB_FRAGS. Solve this problem by checking the free entries
>> before trying to pin user pages.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/net/macvtap.c |    2 ++
>>   1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> index 7cb2684..d197a78 100644
>> --- a/drivers/net/macvtap.c
>> +++ b/drivers/net/macvtap.c
>> @@ -529,6 +529,8 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>>   		}
>>   		base = (unsigned long)from->iov_base + offset;
>>   		size = ((base&  ~PAGE_MASK) + len + ~PAGE_MASK)>>  PAGE_SHIFT;
>> +		if (i + size>= MAX_SKB_FRAGS)
>> +			return -EFAULT;
>>   		num_pages = get_user_pages_fast(base, size, 0,&page[i]);
>>   		if ((num_pages != size) ||
>>   		    (num_pages>  MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
>>
> Hi Jason
>
> Why is -EFAULT the right error code ?

E2BIG or is there any error code you prefer?
>
> Also, why not removing the "(num_pages>  MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags)" test done few lines after ?

Yes, it can be removed.
>
> Also, if get_user_pages_fast() returns truncated result (size - 1 for
> example), we apparently dont free the references on pages.

It's a bug of this patch, thanks.
> The comment applies only on pages that were added in skb frags.
>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation
  2012-04-16  7:14   ` Michael S. Tsirkin
@ 2012-04-16  8:23     ` Jason Wang
  2012-04-16  8:49     ` Eric Dumazet
  1 sibling, 0 replies; 30+ messages in thread
From: Jason Wang @ 2012-04-16  8:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, xma, davem, linux-kernel, ebiederm, Eric Dumazet

On 04/16/2012 03:14 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 02:07:59PM +0800, Jason Wang wrote:
>> As the skb fragment were pinned/built from user pages, we should
>> account the page instead of length for truesize.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> I'm not sure this is right: the skb does *not* consume the
> whole page, userspace uses the rest of the page
> for other skbs. So we'll end up accounting for the
> same page twice.
> Eric, what's the right thing to do here in your opinion?

Or at very least, we need to do this in skb_copy_ubufs() as it allocate 
whole new pages.
>> ---
>>   drivers/net/macvtap.c |    6 ++++--
>>   1 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> index bd4a70d..7cb2684 100644
>> --- a/drivers/net/macvtap.c
>> +++ b/drivers/net/macvtap.c
>> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>>   		struct page *page[MAX_SKB_FRAGS];
>>   		int num_pages;
>>   		unsigned long base;
>> +		unsigned long truesize;
>>
>>   		len = from->iov_len - offset;
>>   		if (!len) {
>> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
>>   		    (num_pages>  MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
>>   			/* put_page is in skb free */
>>   			return -EFAULT;
>> +		truesize = size * PAGE_SIZE;
>>   		skb->data_len += len;
>>   		skb->len += len;
>> -		skb->truesize += len;
>> -		atomic_add(len,&skb->sk->sk_wmem_alloc);
>> +		skb->truesize += truesize;
>> +		atomic_add(truesize,&skb->sk->sk_wmem_alloc);
>>   		while (len) {
>>   			int off = base&  ~PAGE_MASK;
>>   			int size = min_t(int, len, PAGE_SIZE - off);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-16  7:16   ` Michael S. Tsirkin
@ 2012-04-16  8:28     ` Jason Wang
  2012-04-16 13:39       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-16  8:28 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
>> Currently, we restart tx polling unconditionally when sendmsg()
>> fails. This would cause unnecessary wakeups of vhost wokers as it's
>> only needed when the socket send buffer were exceeded.
> Why is this a problem?

This issue is when guest driver is able to hit the -EFAULT, vhost 
discard the the descriptor and restart the polling. This would wake 
vhost thread and repeat the loop again which waste cpu.

Another possible solution is don't discard the descriptor.
>
>> Fix this by
>> restart the tx polling only when sendmsg() returns value other than
>> -EFAULT.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/vhost/net.c |    3 ++-
>>   1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 29abd65..035fa95 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -262,7 +262,8 @@ static void handle_tx(struct vhost_net *net)
>>   					UIO_MAXIOV;
>>   			}
>>   			vhost_discard_vq_desc(vq, 1);
>> -			tx_poll_start(net, sock);
>> +			if (err != -EFAULT)
>> +				tx_poll_start(net, sock);
>>   			break;
>>   		}
>>   		if (err != len)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation
  2012-04-16  7:14   ` Michael S. Tsirkin
  2012-04-16  8:23     ` Jason Wang
@ 2012-04-16  8:49     ` Eric Dumazet
  2012-04-16 13:25       ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2012-04-16  8:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, netdev, xma, davem, linux-kernel, ebiederm

On Mon, 2012-04-16 at 10:14 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 02:07:59PM +0800, Jason Wang wrote:
> > As the skb fragment were pinned/built from user pages, we should
> > account the page instead of length for truesize.
> > 
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> I'm not sure this is right: the skb does *not* consume the
> whole page, userspace uses the rest of the page
> for other skbs. So we'll end up accounting for the
> same page twice.
> Eric, what's the right thing to do here in your opinion?

Problem is we dont know for sure userspace wont free pages right after
this syscall. So an evil application could consume more kernel memory
than what socket limit allowed.

Its same problem with vmsplice(mem -> pipe) + splice(pipe -> socket)

When we clone skb with frags, resulting skb will have same truesize,
even if the pages are shared ...




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation
  2012-04-16  8:49     ` Eric Dumazet
@ 2012-04-16 13:25       ` Michael S. Tsirkin
  0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 13:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jason Wang, netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 10:49:53AM +0200, Eric Dumazet wrote:
> On Mon, 2012-04-16 at 10:14 +0300, Michael S. Tsirkin wrote:
> > On Mon, Apr 16, 2012 at 02:07:59PM +0800, Jason Wang wrote:
> > > As the skb fragment were pinned/built from user pages, we should
> > > account the page instead of length for truesize.
> > > 
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > 
> > I'm not sure this is right: the skb does *not* consume the
> > whole page, userspace uses the rest of the page
> > for other skbs. So we'll end up accounting for the
> > same page twice.
> > Eric, what's the right thing to do here in your opinion?
> 
> Problem is we dont know for sure userspace wont free pages right after
> this syscall. So an evil application could consume more kernel memory
> than what socket limit allowed.
> 
> Its same problem with vmsplice(mem -> pipe) + splice(pipe -> socket)
> 
> When we clone skb with frags, resulting skb will have same truesize,
> even if the pages are shared ...
> 

I see, thanks for the clarification.

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/6] vhost_net: fix use after free of vq->ubufs
  2012-04-16  6:08 ` [PATCH 5/6] vhost_net: fix use after free of vq->ubufs Jason Wang
@ 2012-04-16 13:28   ` Michael S. Tsirkin
  2012-04-17  3:19     ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 13:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 02:08:25PM +0800, Jason Wang wrote:
> When zerocopy socket is used, ubufs pointer were used in handle_tx()
> without any validation. This would cause NULL pointer deference after
> it has been freed in vhost_net_set_backend(). Fix this by check the
> pointer before using it.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>


OK so it's NULL dereference and not user after free :)
Also could you clarify how does this happen pls?
Don't we always initialize ubufs when vhost_sock_zcopy is set?

> ---
>  drivers/vhost/net.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f0da2c3..29abd65 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -163,10 +163,15 @@ static void handle_tx(struct vhost_net *net)
>  	mutex_lock(&vq->mutex);
>  	vhost_disable_notify(&net->dev, vq);
>  
> +	zcopy = vhost_sock_zcopy(sock);
> +	if (zcopy && !vq->ubufs) {
> +		mutex_unlock(&vq->mutex);
> +		return;
> +	}
> +
>  	if (wmem < sock->sk->sk_sndbuf / 2)
>  		tx_poll_stop(net);
>  	hdr_size = vq->vhost_hlen;
> -	zcopy = vhost_sock_zcopy(sock);
>  
>  	for (;;) {
>  		/* Release DMAs done buffers first */

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-16  8:28     ` Jason Wang
@ 2012-04-16 13:39       ` Michael S. Tsirkin
  2012-04-17  3:27         ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-16 13:39 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
> On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
> >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
> >>Currently, we restart tx polling unconditionally when sendmsg()
> >>fails. This would cause unnecessary wakeups of vhost wokers as it's
> >>only needed when the socket send buffer were exceeded.
> >Why is this a problem?
> 
> This issue is when guest driver is able to hit the -EFAULT, vhost
> discard the the descriptor and restart the polling. This would wake
> vhost thread and repeat the loop again which waste cpu.

Does same thing happen if we get an error from copy from user?

> Another possible solution is don't discard the descriptor.
> >
> >>Fix this by
> >>restart the tx polling only when sendmsg() returns value other than
> >>-EFAULT.
> >>
> >>Signed-off-by: Jason Wang<jasowang@redhat.com>
> >>---
> >>  drivers/vhost/net.c |    3 ++-
> >>  1 files changed, 2 insertions(+), 1 deletions(-)
> >>
> >>diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >>index 29abd65..035fa95 100644
> >>--- a/drivers/vhost/net.c
> >>+++ b/drivers/vhost/net.c
> >>@@ -262,7 +262,8 @@ static void handle_tx(struct vhost_net *net)
> >>  					UIO_MAXIOV;
> >>  			}
> >>  			vhost_discard_vq_desc(vq, 1);
> >>-			tx_poll_start(net, sock);
> >>+			if (err != -EFAULT)
> >>+				tx_poll_start(net, sock);
> >>  			break;
> >>  		}
> >>  		if (err != len)
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/6] vhost_net: fix use after free of vq->ubufs
  2012-04-16 13:28   ` Michael S. Tsirkin
@ 2012-04-17  3:19     ` Jason Wang
  2012-04-17 10:22       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-17  3:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/16/2012 09:28 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 02:08:25PM +0800, Jason Wang wrote:
>> When zerocopy socket is used, ubufs pointer were used in handle_tx()
>> without any validation. This would cause NULL pointer deference after
>> it has been freed in vhost_net_set_backend(). Fix this by check the
>> pointer before using it.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>
> OK so it's NULL dereference and not user after free :)
> Also could you clarify how does this happen pls?
> Don't we always initialize ubufs when vhost_sock_zcopy is set?

The problem happens when we want to disable backend. At this time ubufs 
were assigned to NULL and it may be dereferenced by handle_tx():

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa0474c57>] handle_tx+0x267/0x5c0 [vhost_net]
PGD 4230b1067 PUD 42227a067 PMD 0
Oops: 0000 [#1] SMP
CPU 7
Modules linked in: vhost_net ip6table_filter ip6_tables ebtable_nat 
ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM 
iptable_mangle iptable_filter ip_tables nfsd lockd nfs_acl auth_rpcgss 
sunrpc exportfs cpufreq_ondemand acpi_cpufreq freq_table mperf bridge 
stp llc ipv6 tun kvm_intel kvm hp_wmi sparse_keymap rfkill coretemp 
crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr sg iTCO_wdt 
iTCO_vendor_support tg3 snd_hda_codec_realtek snd_hda_intel 
snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd 
soundcore snd_page_alloc i7core_edac edac_core ixgbe dca mdio ext4 
mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom firewire_ohci firewire_core 
crc_itu_t aesni_intel cryptd aes_x86_64 aes_generic mptsas mptscsih 
mptbase scsi_transport_sas ahci libahci floppy nouveau ttm 
drm_kms_helper drm i2c_core mxm_wmi video wmi dm_mirror dm_region_hash 
dm_log dm_mod [last unloaded: vhost_net]

Pid: 3993, comm: vhost-3991 Not tainted 3.3.0+ #10 Hewlett-Packard HP 
Z800 Workstation/0AECh
RIP: 0010:[<ffffffffa0474c57>] [<ffffffffa0474c57>] 
handle_tx+0x267/0x5c0 [vhost_net]
RSP: 0018:ffff880220b41d90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8804222242b0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f5f495e6002
RBP: ffff880220b41e60 R08: 0000000000000000 R09: 0000000000000002
R10: 000000007fffffff R11: ffff8802246b7340 R12: ffff8804222243d8
R13: ffff8804222283d8 R14: 0000000004e20000 R15: ffff880422a1f690
FS: 0000000000000000(0000) GS:ffff88042fc60000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000424786000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vhost-3991 (pid: 3993, threadinfo ffff880220b40000, task 
ffff880222606a70)
Stack:
0000000000000000 0000000000000000 ffff880422220000 ffff880220b41df0
0000000000000007 ffff880200000000 0000000000000000 ffff8804222242b8
0000000000000000 0000000000000000 01ff880220b41fd8 ffff880422220000
Call Trace:
[<ffffffffa0474fc5>] handle_tx_net+0x15/0x20 [vhost_net]
[<ffffffffa0472cf4>] vhost_worker+0xe4/0x180 [vhost_net]
[<ffffffffa0472c10>] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net]
[<ffffffffa0472c10>] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net]
[<ffffffff8107096e>] kthread+0x9e/0xb0
[<ffffffff81516364>] kernel_thread_helper+0x4/0x10
[<ffffffff810708d0>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff81516360>] ? gs_change+0x13/0x13
Code: 00 00 48 89 50 08 48 63 93 20 42 00 00 48 89 50 10 48 89 45 b0 48 
c7 45 b8 08 00 00 00 48 8b 83 30 42 00 00 48 89 85 60 ff ff ff <8b> 00 
85 c0 0f 84 1c 03 00 00 48 8b 85 60 ff ff ff f0 ff 00 8b
RIP [<ffffffffa0474c57>] handle_tx+0x267/0x5c0 [vhost_net]
RSP <ffff880220b41d90>
CR2: 0000000000000000
>
>> ---
>>   drivers/vhost/net.c |    7 ++++++-
>>   1 files changed, 6 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index f0da2c3..29abd65 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -163,10 +163,15 @@ static void handle_tx(struct vhost_net *net)
>>   	mutex_lock(&vq->mutex);
>>   	vhost_disable_notify(&net->dev, vq);
>>
>> +	zcopy = vhost_sock_zcopy(sock);
>> +	if (zcopy&&  !vq->ubufs) {
>> +		mutex_unlock(&vq->mutex);
>> +		return;
>> +	}
>> +
>>   	if (wmem<  sock->sk->sk_sndbuf / 2)
>>   		tx_poll_stop(net);
>>   	hdr_size = vq->vhost_hlen;
>> -	zcopy = vhost_sock_zcopy(sock);
>>
>>   	for (;;) {
>>   		/* Release DMAs done buffers first */
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-16 13:39       ` Michael S. Tsirkin
@ 2012-04-17  3:27         ` Jason Wang
  2012-04-17  4:57           ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-17  3:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
>> >  On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
>>> >  >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
>>>> >  >>Currently, we restart tx polling unconditionally when sendmsg()
>>>> >  >>fails. This would cause unnecessary wakeups of vhost wokers as it's
>>>> >  >>only needed when the socket send buffer were exceeded.
>>> >  >Why is this a problem?
>> >  
>> >  This issue is when guest driver is able to hit the -EFAULT, vhost
>> >  discard the the descriptor and restart the polling. This would wake
>> >  vhost thread and repeat the loop again which waste cpu.
> Does same thing happen if we get an error from copy from user?
>

Right, so do you think it makes sense that we only restart polling on 
-EAGAIN or -ENOBUFS?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-17  3:27         ` Jason Wang
@ 2012-04-17  4:57           ` Michael S. Tsirkin
  2012-04-17  5:54             ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-17  4:57 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Tue, Apr 17, 2012 at 11:27:01AM +0800, Jason Wang wrote:
> On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
> >On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
> >>>  On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
> >>>>  >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
> >>>>>  >>Currently, we restart tx polling unconditionally when sendmsg()
> >>>>>  >>fails. This would cause unnecessary wakeups of vhost wokers as it's
> >>>>>  >>only needed when the socket send buffer were exceeded.
> >>>>  >Why is this a problem?
> >>>  >  This issue is when guest driver is able to hit the
> >>-EFAULT, vhost
> >>>  discard the the descriptor and restart the polling. This would wake
> >>>  vhost thread and repeat the loop again which waste cpu.
> >Does same thing happen if we get an error from copy from user?
> >
> 
> Right, so do you think it makes sense that we only restart polling
> on -EAGAIN or -ENOBUFS?

Sounds OK. BTW how do you test this?

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-16  8:21     ` Jason Wang
@ 2012-04-17  5:33       ` Eric Dumazet
  2012-04-17  5:43         ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2012-04-17  5:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, mst, ebiederm

On Mon, 2012-04-16 at 16:21 +0800, Jason Wang wrote:
> Hi:
> On 04/16/2012 02:53 PM, Eric Dumazet wrote:
> 	if ((num_pages != size) ||
> >>   		    (num_pages>  MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> >>
> > Hi Jason
> >
> > Why is -EFAULT the right error code ?
> 
> E2BIG or is there any error code you prefer?

Might be good yes.

However it sounds strange user cant write any size he wants (and kernel
needs to build several skbs to fulfill user request)




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-17  5:33       ` Eric Dumazet
@ 2012-04-17  5:43         ` Michael S. Tsirkin
  0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-17  5:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jason Wang, netdev, xma, davem, linux-kernel, ebiederm

On Tue, Apr 17, 2012 at 07:33:28AM +0200, Eric Dumazet wrote:
> On Mon, 2012-04-16 at 16:21 +0800, Jason Wang wrote:
> > Hi:
> > On 04/16/2012 02:53 PM, Eric Dumazet wrote:
> > 	if ((num_pages != size) ||
> > >>   		    (num_pages>  MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> > >>
> > > Hi Jason
> > >
> > > Why is -EFAULT the right error code ?
> > 
> > E2BIG or is there any error code you prefer?
> 
> Might be good yes.
> 
> However it sounds strange user cant write any size he wants (and kernel
> needs to build several skbs to fulfill user request)

We never supported arbitrary length writes:
macvtap is exactly like packet sockets in this regard.

-- 
MST


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-17  4:57           ` Michael S. Tsirkin
@ 2012-04-17  5:54             ` Jason Wang
  2012-04-17  6:07               ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-17  5:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/17/2012 12:57 PM, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2012 at 11:27:01AM +0800, Jason Wang wrote:
>> On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
>>> On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
>>>>>   On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
>>>>>>   >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
>>>>>>>   >>Currently, we restart tx polling unconditionally when sendmsg()
>>>>>>>   >>fails. This would cause unnecessary wakeups of vhost wokers as it's
>>>>>>>   >>only needed when the socket send buffer were exceeded.
>>>>>>   >Why is this a problem?
>>>>>   >   This issue is when guest driver is able to hit the
>>>> -EFAULT, vhost
>>>>>   discard the the descriptor and restart the polling. This would wake
>>>>>   vhost thread and repeat the loop again which waste cpu.
>>> Does same thing happen if we get an error from copy from user?
>>>
>> Right, so do you think it makes sense that we only restart polling
>> on -EAGAIN or -ENOBUFS?
> Sounds OK. BTW how do you test this?
>

Not very hard, w/o this patch, we can see almost 100% cpu utilization 
for vhost thread if guest hit EFAULT or EINVAL. With this patch, the cpu 
utilization should be very low I think.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-17  5:54             ` Jason Wang
@ 2012-04-17  6:07               ` Michael S. Tsirkin
  2012-04-17  6:30                 ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-17  6:07 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Tue, Apr 17, 2012 at 01:54:55PM +0800, Jason Wang wrote:
> On 04/17/2012 12:57 PM, Michael S. Tsirkin wrote:
> >On Tue, Apr 17, 2012 at 11:27:01AM +0800, Jason Wang wrote:
> >>On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
> >>>On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
> >>>>>  On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
> >>>>>>  >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
> >>>>>>>  >>Currently, we restart tx polling unconditionally when sendmsg()
> >>>>>>>  >>fails. This would cause unnecessary wakeups of vhost wokers as it's
> >>>>>>>  >>only needed when the socket send buffer were exceeded.
> >>>>>>  >Why is this a problem?
> >>>>>  >   This issue is when guest driver is able to hit the
> >>>>-EFAULT, vhost
> >>>>>  discard the the descriptor and restart the polling. This would wake
> >>>>>  vhost thread and repeat the loop again which waste cpu.
> >>>Does same thing happen if we get an error from copy from user?
> >>>
> >>Right, so do you think it makes sense that we only restart polling
> >>on -EAGAIN or -ENOBUFS?
> >Sounds OK. BTW how do you test this?
> >
> 
> Not very hard, w/o this patch, we can see almost 100% cpu
> utilization for vhost thread if guest hit EFAULT or EINVAL. With
> this patch, the cpu utilization should be very low I think.

Yes but do you have a test that makes guest hit EFAULT or EINVAL?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages
  2012-04-16  6:53   ` Eric Dumazet
  2012-04-16  8:21     ` Jason Wang
@ 2012-04-17  6:19     ` Michael S. Tsirkin
  1 sibling, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-17  6:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jason Wang, netdev, xma, davem, linux-kernel, ebiederm

On Mon, Apr 16, 2012 at 08:53:03AM +0200, Eric Dumazet wrote:
> On Mon, 2012-04-16 at 14:08 +0800, Jason Wang wrote:
> > Currently we do not validate the vector length before calling
> > get_user_pages_fast(), host stack would be easily overflowed by
> > malicious guest driver who give us a descriptor with length greater
> > than MAX_SKB_FRAGS. Solve this problem by checking the free entries
> > before trying to pin user pages.
> > 
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  drivers/net/macvtap.c |    2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> > index 7cb2684..d197a78 100644
> > --- a/drivers/net/macvtap.c
> > +++ b/drivers/net/macvtap.c
> > @@ -529,6 +529,8 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
> >  		}
> >  		base = (unsigned long)from->iov_base + offset;
> >  		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> > +		if (i + size >= MAX_SKB_FRAGS)
> > +			return -EFAULT;
> >  		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
> >  		if ((num_pages != size) ||
> >  		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> > 
> 
> Hi Jason
> 
> Why is -EFAULT the right error code ?


skb_copy_datagram_from_iovec return -EFAULT on iovs
that are too large, too, so this is at least consistent.

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-17  6:07               ` Michael S. Tsirkin
@ 2012-04-17  6:30                 ` Jason Wang
  2012-04-17 10:18                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2012-04-17  6:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/17/2012 02:07 PM, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2012 at 01:54:55PM +0800, Jason Wang wrote:
>> On 04/17/2012 12:57 PM, Michael S. Tsirkin wrote:
>>> On Tue, Apr 17, 2012 at 11:27:01AM +0800, Jason Wang wrote:
>>>> On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
>>>>>>>   On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
>>>>>>>>   >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
>>>>>>>>>   >>Currently, we restart tx polling unconditionally when sendmsg()
>>>>>>>>>   >>fails. This would cause unnecessary wakeups of vhost wokers as it's
>>>>>>>>>   >>only needed when the socket send buffer were exceeded.
>>>>>>>>   >Why is this a problem?
>>>>>>>   >    This issue is when guest driver is able to hit the
>>>>>> -EFAULT, vhost
>>>>>>>   discard the the descriptor and restart the polling. This would wake
>>>>>>>   vhost thread and repeat the loop again which waste cpu.
>>>>> Does same thing happen if we get an error from copy from user?
>>>>>
>>>> Right, so do you think it makes sense that we only restart polling
>>>> on -EAGAIN or -ENOBUFS?
>>> Sounds OK. BTW how do you test this?
>>>
>> Not very hard, w/o this patch, we can see almost 100% cpu
>> utilization for vhost thread if guest hit EFAULT or EINVAL. With
>> this patch, the cpu utilization should be very low I think.
> Yes but do you have a test that makes guest hit EFAULT or EINVAL?

Looks like we can do this by supplying an invalid hdr_len in vnet header 
as tap does the check for this.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-17  6:30                 ` Jason Wang
@ 2012-04-17 10:18                   ` Michael S. Tsirkin
  2012-04-17 10:46                     ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-17 10:18 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Tue, Apr 17, 2012 at 02:30:27PM +0800, Jason Wang wrote:
> On 04/17/2012 02:07 PM, Michael S. Tsirkin wrote:
> >On Tue, Apr 17, 2012 at 01:54:55PM +0800, Jason Wang wrote:
> >>On 04/17/2012 12:57 PM, Michael S. Tsirkin wrote:
> >>>On Tue, Apr 17, 2012 at 11:27:01AM +0800, Jason Wang wrote:
> >>>>On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
> >>>>>On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
> >>>>>>>  On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
> >>>>>>>>  >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
> >>>>>>>>>  >>Currently, we restart tx polling unconditionally when sendmsg()
> >>>>>>>>>  >>fails. This would cause unnecessary wakeups of vhost wokers as it's
> >>>>>>>>>  >>only needed when the socket send buffer were exceeded.
> >>>>>>>>  >Why is this a problem?
> >>>>>>>  >    This issue is when guest driver is able to hit the
> >>>>>>-EFAULT, vhost
> >>>>>>>  discard the the descriptor and restart the polling. This would wake
> >>>>>>>  vhost thread and repeat the loop again which waste cpu.
> >>>>>Does same thing happen if we get an error from copy from user?
> >>>>>
> >>>>Right, so do you think it makes sense that we only restart polling
> >>>>on -EAGAIN or -ENOBUFS?
> >>>Sounds OK. BTW how do you test this?
> >>>
> >>Not very hard, w/o this patch, we can see almost 100% cpu
> >>utilization for vhost thread if guest hit EFAULT or EINVAL. With
> >>this patch, the cpu utilization should be very low I think.
> >Yes but do you have a test that makes guest hit EFAULT or EINVAL?
> 
> Looks like we can do this by supplying an invalid hdr_len in vnet
> header as tap does the check for this.

Ah so you patched qemu to do this? Cool. Can you post the patch for testing pls?

> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/6] vhost_net: fix use after free of vq->ubufs
  2012-04-17  3:19     ` Jason Wang
@ 2012-04-17 10:22       ` Michael S. Tsirkin
  2012-04-17 10:47         ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-04-17 10:22 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On Tue, Apr 17, 2012 at 11:19:42AM +0800, Jason Wang wrote:
> On 04/16/2012 09:28 PM, Michael S. Tsirkin wrote:
> >On Mon, Apr 16, 2012 at 02:08:25PM +0800, Jason Wang wrote:
> >>When zerocopy socket is used, ubufs pointer were used in handle_tx()
> >>without any validation. This would cause NULL pointer deference after
> >>it has been freed in vhost_net_set_backend(). Fix this by check the
> >>pointer before using it.
> >>
> >>Signed-off-by: Jason Wang<jasowang@redhat.com>
> >
> >OK so it's NULL dereference and not user after free :)
> >Also could you clarify how does this happen pls?
> >Don't we always initialize ubufs when vhost_sock_zcopy is set?
> 
> The problem happens when we want to disable backend. At this time
> ubufs were assigned to NULL and it may be dereferenced by
> handle_tx():

Heh, I see. How about
- zcopy = vhost_sock_zcopy(sock);
+ zcopy = vq->ubufs;

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/6] vhost_net: don't poll on -EFAULT
  2012-04-17 10:18                   ` Michael S. Tsirkin
@ 2012-04-17 10:46                     ` Jason Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2012-04-17 10:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/17/2012 06:18 PM, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2012 at 02:30:27PM +0800, Jason Wang wrote:
>> >  On 04/17/2012 02:07 PM, Michael S. Tsirkin wrote:
>>> >  >On Tue, Apr 17, 2012 at 01:54:55PM +0800, Jason Wang wrote:
>>>> >  >>On 04/17/2012 12:57 PM, Michael S. Tsirkin wrote:
>>>>> >  >>>On Tue, Apr 17, 2012 at 11:27:01AM +0800, Jason Wang wrote:
>>>>>> >  >>>>On 04/16/2012 09:39 PM, Michael S. Tsirkin wrote:
>>>>>>> >  >>>>>On Mon, Apr 16, 2012 at 04:28:10PM +0800, Jason Wang wrote:
>>>>>>>>> >  >>>>>>>    On 04/16/2012 03:16 PM, Michael S. Tsirkin wrote:
>>>>>>>>>> >  >>>>>>>>    >On Mon, Apr 16, 2012 at 02:08:33PM +0800, Jason Wang wrote:
>>>>>>>>>>> >  >>>>>>>>>    >>Currently, we restart tx polling unconditionally when sendmsg()
>>>>>>>>>>> >  >>>>>>>>>    >>fails. This would cause unnecessary wakeups of vhost wokers as it's
>>>>>>>>>>> >  >>>>>>>>>    >>only needed when the socket send buffer were exceeded.
>>>>>>>>>> >  >>>>>>>>    >Why is this a problem?
>>>>>>>>> >  >>>>>>>    >     This issue is when guest driver is able to hit the
>>>>>>>> >  >>>>>>-EFAULT, vhost
>>>>>>>>> >  >>>>>>>    discard the the descriptor and restart the polling. This would wake
>>>>>>>>> >  >>>>>>>    vhost thread and repeat the loop again which waste cpu.
>>>>>>> >  >>>>>Does same thing happen if we get an error from copy from user?
>>>>>>> >  >>>>>
>>>>>> >  >>>>Right, so do you think it makes sense that we only restart polling
>>>>>> >  >>>>on -EAGAIN or -ENOBUFS?
>>>>> >  >>>Sounds OK. BTW how do you test this?
>>>>> >  >>>
>>>> >  >>Not very hard, w/o this patch, we can see almost 100% cpu
>>>> >  >>utilization for vhost thread if guest hit EFAULT or EINVAL. With
>>>> >  >>this patch, the cpu utilization should be very low I think.
>>> >  >Yes but do you have a test that makes guest hit EFAULT or EINVAL?
>> >  
>> >  Looks like we can do this by supplying an invalid hdr_len in vnet
>> >  header as tap does the check for this.
> Ah so you patched qemu to do this? Cool. Can you post the patch for testing pls?
>
No, I mean patch the guest driver like this:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 019da01..6e2f487 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -582,7 +582,7 @@ static int xmit_skb(struct virtnet_info *vi, struct 
sk_buff *skb)
         }

         if (skb_is_gso(skb)) {
-               hdr->hdr.hdr_len = skb_headlen(skb);
+               hdr->hdr.hdr_len = 65535;
                 hdr->hdr.gso_size = skb_shinfo(skb)->gso_size;
                 if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
                         hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;

btw, If we still choose to drop the packet, we can hit -EFAULT by send a 
descriptor with a large number of pages.

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/6] vhost_net: fix use after free of vq->ubufs
  2012-04-17 10:22       ` Michael S. Tsirkin
@ 2012-04-17 10:47         ` Jason Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2012-04-17 10:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, xma, davem, linux-kernel, ebiederm

On 04/17/2012 06:22 PM, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2012 at 11:19:42AM +0800, Jason Wang wrote:
>> >  On 04/16/2012 09:28 PM, Michael S. Tsirkin wrote:
>>> >  >On Mon, Apr 16, 2012 at 02:08:25PM +0800, Jason Wang wrote:
>>>> >  >>When zerocopy socket is used, ubufs pointer were used in handle_tx()
>>>> >  >>without any validation. This would cause NULL pointer deference after
>>>> >  >>it has been freed in vhost_net_set_backend(). Fix this by check the
>>>> >  >>pointer before using it.
>>>> >  >>
>>>> >  >>Signed-off-by: Jason Wang<jasowang@redhat.com>
>>> >  >
>>> >  >OK so it's NULL dereference and not user after free:)
>>> >  >Also could you clarify how does this happen pls?
>>> >  >Don't we always initialize ubufs when vhost_sock_zcopy is set?
>> >  
>> >  The problem happens when we want to disable backend. At this time
>> >  ubufs were assigned to NULL and it may be dereferenced by
>> >  handle_tx():
> Heh, I see. How about
> - zcopy = vhost_sock_zcopy(sock);
> + zcopy = vq->ubufs;

Yes, It's more is simpler.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-04-17 10:47 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-16  6:07 [PATCH 1/6] macvtap: zerocopy: fix offset calculation when building skb Jason Wang
2012-04-16  6:07 ` [PATCH 2/6] macvtap: zerocopy: fix truesize underestimation Jason Wang
2012-04-16  7:14   ` Michael S. Tsirkin
2012-04-16  8:23     ` Jason Wang
2012-04-16  8:49     ` Eric Dumazet
2012-04-16 13:25       ` Michael S. Tsirkin
2012-04-16  6:08 ` [PATCH 3/6] macvtap: zerocopy: validate vector length before pinning user pages Jason Wang
2012-04-16  6:53   ` Eric Dumazet
2012-04-16  8:21     ` Jason Wang
2012-04-17  5:33       ` Eric Dumazet
2012-04-17  5:43         ` Michael S. Tsirkin
2012-04-17  6:19     ` Michael S. Tsirkin
2012-04-16  7:58   ` Michael S. Tsirkin
2012-04-16  6:08 ` [PATCH 4/6] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Jason Wang
2012-04-16  6:08 ` [PATCH 5/6] vhost_net: fix use after free of vq->ubufs Jason Wang
2012-04-16 13:28   ` Michael S. Tsirkin
2012-04-17  3:19     ` Jason Wang
2012-04-17 10:22       ` Michael S. Tsirkin
2012-04-17 10:47         ` Jason Wang
2012-04-16  6:08 ` [PATCH 6/6] vhost_net: don't poll on -EFAULT Jason Wang
2012-04-16  7:16   ` Michael S. Tsirkin
2012-04-16  8:28     ` Jason Wang
2012-04-16 13:39       ` Michael S. Tsirkin
2012-04-17  3:27         ` Jason Wang
2012-04-17  4:57           ` Michael S. Tsirkin
2012-04-17  5:54             ` Jason Wang
2012-04-17  6:07               ` Michael S. Tsirkin
2012-04-17  6:30                 ` Jason Wang
2012-04-17 10:18                   ` Michael S. Tsirkin
2012-04-17 10:46                     ` Jason Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.