netdev.vger.kernel.org archive mirror
* [PATCH net-next] tcp: rcvbuf autotuning improvements
@ 2013-10-03  7:56 Daniel Borkmann
  2013-10-03 13:03 ` Eric Dumazet
  2013-10-03 13:13 ` [PATCH net-next] tcp: rcvbuf autotuning improvements Eric Dumazet
  0 siblings, 2 replies; 11+ messages in thread
From: Daniel Borkmann @ 2013-10-03  7:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, eric.dumazet, Francesco Fusco

This is a complementary patch to commit 6ae705323 ("tcp: sndbuf
autotuning improvements") that fixes a performance regression on the
receiver side in setups with low-to-mid latency, high throughput, and
senders with TSO/GSO off (receivers w/ default settings).

The following measurements in Mbit/s were done for 60sec w/ netperf
on virtio w/ TSO/GSO off:

(ms)    1)              2)              3)
  0     2762.11         1150.32         2906.17
 10     1083.61          538.89         1091.03
 25      471.81          313.18          474.60
 50      242.33          187.84          242.36
 75      162.14          134.45          161.95
100      121.55          101.96          121.49
150       80.64           57.75           80.48
200       58.97           54.11           59.90
250       47.10           46.92           47.31

Same setup w/ TSO/GSO on:

(ms)    1)              2)              3)
  0     12225.91        12366.89        16514.37
 10      1526.64         1525.79         2176.63
 25       655.13          647.79          871.52
 50       338.51          377.88          439.46
 75       246.49          278.46          295.62
100       210.93          207.56          217.34
150       127.88          129.56          141.33
200        94.95           94.50          107.29
250        67.39           73.88           88.35

As in 6ae705323, we fixed up the power-of-two rounding and took the
cached mss into account, thus bringing the per_mss calculations of
sender and receiver closer to each other; the rest stays as is.
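
As a rough worked example (with illustrative x86_64 numbers; the exact
values of MAX_TCP_HEADER and the struct sizes depend on kernel config),
for advmss = 1460 and no larger cached mss:

  per_mss = 1460 + MAX_TCP_HEADER
            + SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
          ~ 1460 + 320 + 320 = 2100
  per_mss = roundup_pow_of_two(2100)
            + SKB_DATA_ALIGN(sizeof(struct sk_buff))
          ~ 4096 + 256 = 4352

i.e. rcvmem is now derived from an estimate of the real skb truesize
per MSS rather than from the advertised MSS alone.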

We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
consistent with tcp_sndbuf_expand().

While we do think that 6ae705323b71 is the right way to go, this
follow-up also seems necessary to restore performance for receivers.

For the evaluation, the same kernels were used on each host:

1) net-next (4fbef95af), which is before 6ae705323
2) net-next (6ae705323), which includes the sndbuf improvements
3) net-next (6ae705323), plus this patch on top

This was done in joint work with Francesco Fusco.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
---
 net/ipv4/tcp_input.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cd65674..ed37b1d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -367,13 +367,19 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 }
 
 /* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
+static void tcp_rcvbuf_expand(struct sock *sk)
 {
-	u32 mss = tcp_sk(sk)->advmss;
-	int rcvmem;
+	const struct tcp_sock *tp = tcp_sk(sk);
+	int rcvmem, per_mss;
+
+	per_mss = max_t(u32, tp->advmss, tp->mss_cache) +
+		  MAX_TCP_HEADER +
+		  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	per_mss = roundup_pow_of_two(per_mss) +
+		  SKB_DATA_ALIGN(sizeof(struct sk_buff));
 
-	rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
-		 tcp_default_init_rwnd(mss);
+	rcvmem = 2 * tcp_default_init_rwnd(per_mss) * per_mss;
 
 	/* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
 	 * Allow enough cushion so that sender is not limited by our window
@@ -394,7 +400,7 @@ void tcp_init_buffer_space(struct sock *sk)
 	int maxwin;
 
 	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
-		tcp_fixup_rcvbuf(sk);
+		tcp_rcvbuf_expand(sk);
 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);
 
-- 
1.7.11.7


* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
  2013-10-03  7:56 [PATCH net-next] tcp: rcvbuf autotuning improvements Daniel Borkmann
@ 2013-10-03 13:03 ` Eric Dumazet
  2013-10-04  6:56   ` Michael Dalton
  2013-10-03 13:13 ` [PATCH net-next] tcp: rcvbuf autotuning improvements Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-03 13:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, netdev, Francesco Fusco, Michael Dalton, ycheng, ncardwell

On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
> This is a complementary patch to commit 6ae705323 ("tcp: sndbuf
> autotuning improvements") that fixes a performance regression on the
> receiver side in setups with low-to-mid latency, high throughput, and
> senders with TSO/GSO off (receivers w/ default settings).
> 
> The following measurements in Mbit/s were done for 60sec w/ netperf
> on virtio w/ TSO/GSO off:
> 
> (ms)    1)              2)              3)
>   0     2762.11         1150.32         2906.17
>  10     1083.61          538.89         1091.03
>  25      471.81          313.18          474.60
>  50      242.33          187.84          242.36
>  75      162.14          134.45          161.95
> 100      121.55          101.96          121.49
> 150       80.64           57.75           80.48
> 200       58.97           54.11           59.90
> 250       47.10           46.92           47.31
> 
> Same setup w/ TSO/GSO on:
> 
> (ms)    1)              2)              3)
>   0     12225.91        12366.89        16514.37
>  10      1526.64         1525.79         2176.63
>  25       655.13          647.79          871.52
>  50       338.51          377.88          439.46
>  75       246.49          278.46          295.62
> 100       210.93          207.56          217.34
> 150       127.88          129.56          141.33
> 200        94.95           94.50          107.29
> 250        67.39           73.88           88.35
> 
> As in 6ae705323, we fixed up the power-of-two rounding and took the
> cached mss into account, thus bringing the per_mss calculations of
> sender and receiver closer to each other; the rest stays as is.
> 
> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().
> 
> While we do think that 6ae705323b71 is the right way to go, this
> follow-up also seems necessary to restore performance for receivers.

Hmm, I think you based this patch on some virtio requirements.

I would rather fix virtio, because virtio has a poor truesize/payload
ratio.

Michael Dalton is working on this right now.

Really, I don't understand how 'fixing' the initial rcvbuf could
explain such a difference in a 60 second transfer.

Normally, if autotuning were working, the first sk_rcvbuf value would
only matter in the very beginning of a flow (maybe one, two or even
three RTTs).

It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
so you probably have to fix the autotuning, or virtio to give normal
skbs, not fat ones ;)


Thanks


* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
  2013-10-03  7:56 [PATCH net-next] tcp: rcvbuf autotuning improvements Daniel Borkmann
  2013-10-03 13:03 ` Eric Dumazet
@ 2013-10-03 13:13 ` Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-03 13:13 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, Francesco Fusco

On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:

> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().

BTW, we renamed the function only because it was used both for initial
sizing and from tcp_new_space().

As is, tcp_fixup_rcvbuf() is only called at connection setup.


* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
  2013-10-03 13:03 ` Eric Dumazet
@ 2013-10-04  6:56   ` Michael Dalton
  2013-10-13  0:25     ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Dalton @ 2013-10-04  6:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

Thanks Eric,

I believe this issue may be related to one that I encountered
recently: poor performance with MTU-sized packets in virtio_net when
mergeable receive buffers are enabled. Performance was quite low
relative to virtio_net with mergeable receive buffers disabled when
MTU-sized packets are received. The issue can be reliably reproduced
via netperf TCP_STREAM when mergeable receive buffers are enabled but
GRO is disabled (to force MTU-sized packets on receive).

I found the root cause was the memory allocation strategy employed by
virtio_net -- when mergeable receive buffers are enabled, every
receive ring packet buffer is allocated using a full page via the page
allocator, so the SKB truesize is 4096 + skb header +
128 (GOOD_COPY_LEN). This means that there is >100% overhead
(truesize vs. the number of bytes actually used to store packet data)
for MTU-sized packets, impacting TCP.
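
As a rough illustration (the skb struct overhead below is an assumed
ballpark, not an exact figure), for an MTU-sized frame received into
such a full-page buffer:

  truesize ~ 4096 (page) + ~256 (skb) + 128 (GOOD_COPY_LEN) = ~4480
  payload  <= 1500
  overhead ~ (4480 - 1500) / 1500 = ~200%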

The issue can be resolved by switching mergeable receive packet
allocation to use netdev_alloc_frag(), allocating MTU-sized (or
slightly larger) buffers, and handling the rare edge case where the
number of frags exceeds SKB_MAX_FRAGS (this occurs for extremely large
GRO'd packets and is permitted by the virtio specification) by using
the SKB frag list. I will update this thread with a patch when one is
ready, hopefully in the next few days. Thanks!

Best,

Mike

On Thu, Oct 3, 2013 at 6:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Hmm, I think you based this patch on some virtio requirements.
>
> I would rather fix virtio, because virtio has a poor truesize/payload
> ratio.
>
> Michael Dalton is working on this right now.
>
> Really, I don't understand how 'fixing' the initial rcvbuf could
> explain such a difference in a 60 second transfer.
>
> Normally, if autotuning were working, the first sk_rcvbuf value would
> only matter in the very beginning of a flow (maybe one, two or even
> three RTTs).
>
> It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
> so you probably have to fix the autotuning, or virtio to give normal
> skbs, not fat ones ;)
>
>
> Thanks


* [PATCH net-next] net: refactor sk_page_frag_refill()
  2013-10-04  6:56   ` Michael Dalton
@ 2013-10-13  0:25     ` Eric Dumazet
  2013-10-13  0:55       ` Eric Dumazet
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13  0:25 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

From: Eric Dumazet <edumazet@google.com>

While working on a new virtio_net allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.

This patch splits sk_page_frag_refill() into two parts, adding
page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context,
or page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).
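
For illustration only, a minimal sketch of a hypothetical socket-less
caller from process context (not part of this patch; error handling
and page refcounting for handed-off data remain the caller's job):

/* The caller owns a struct page_frag across calls. On success, sz
 * bytes are available at pfrag->offset; the caller advances the
 * offset itself. A user that hands the data off (e.g. attaches it to
 * an skb) must also take its own page reference with get_page().
 */
static void *frag_alloc(struct page_frag *pfrag, unsigned int sz)
{
	void *data;

	if (!page_frag_refill(sz, pfrag, GFP_KERNEL))
		return NULL;	/* allocation failed under memory pressure */

	data = page_address(pfrag->page) + pfrag->offset;
	pfrag->offset += sz;	/* consume the room we were granted */
	return data;
}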

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
 include/linux/skbuff.h |    2 ++
 net/core/sock.c        |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..0c5d40f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
 	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
 }
 
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
 /**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..e87d624 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 {
 	int order;
 
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 			pfrag->offset = 0;
 			return true;
 		}
-		if (pfrag->offset < pfrag->size)
+		if (pfrag->offset + sz < pfrag->size)
 			return true;
 		put_page(pfrag->page);
 	}
 
 	/* We restrict high order allocations to users that can afford to wait */
-	order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
 
 	do {
-		gfp_t gfp = sk->sk_allocation;
+		gfp_t gfp = prio;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 		}
 	} while (--order >= 0);
 
+	return false;
+}
+EXPORT_SYMBOL(page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (page_frag_refill(32U, pfrag, sk->sk_allocation))
+		return true;
+
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;


* Re: [PATCH net-next] net: refactor sk_page_frag_refill()
  2013-10-13  0:25     ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
@ 2013-10-13  0:55       ` Eric Dumazet
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13  0:55 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

On Sat, 2013-10-12 at 17:25 -0700, Eric Dumazet wrote:

> -		if (pfrag->offset < pfrag->size)
> +		if (pfrag->offset + sz < pfrag->size)
>  			return true;
>  		put_page(pfrag->page);

This needs to be:

if (pfrag->offset + sz <= pfrag->size)
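
e.g. with pfrag->size == 4096, pfrag->offset == 4064 and sz == 32,
offset + sz == 4096: the remaining room holds exactly 32 bytes, but
'<' would wrongly discard the page and allocate a new one.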

I'll send a v2


* [PATCH v2 net-next] net: refactor sk_page_frag_refill()
  2013-10-13  0:25     ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
  2013-10-13  0:55       ` Eric Dumazet
@ 2013-10-13  4:46       ` Eric Dumazet
  2013-10-17 19:46         ` David Miller
  2013-10-17 23:27         ` [PATCH v3 " Eric Dumazet
  1 sibling, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13  4:46 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

From: Eric Dumazet <edumazet@google.com>

While working on a new virtio_net allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.

This patch splits sk_page_frag_refill() into two parts, adding
page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context,
or page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
v2: fix an off-by-one error

 include/linux/skbuff.h |    2 ++
 net/core/sock.c        |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..0c5d40f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
 	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
 }
 
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
 /**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..7824d60 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 {
 	int order;
 
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 			pfrag->offset = 0;
 			return true;
 		}
-		if (pfrag->offset < pfrag->size)
+		if (pfrag->offset + sz <= pfrag->size)
 			return true;
 		put_page(pfrag->page);
 	}
 
 	/* We restrict high order allocations to users that can afford to wait */
-	order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
 
 	do {
-		gfp_t gfp = sk->sk_allocation;
+		gfp_t gfp = prio;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 		}
 	} while (--order >= 0);
 
+	return false;
+}
+EXPORT_SYMBOL(page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (likely(page_frag_refill(32U, pfrag, sk->sk_allocation)))
+		return true;
+
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;


* Re: [PATCH v2 net-next] net: refactor sk_page_frag_refill()
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
@ 2013-10-17 19:46         ` David Miller
  2013-10-17 23:13           ` Eric Dumazet
  2013-10-17 23:27         ` [PATCH v3 " Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2013-10-17 19:46 UTC (permalink / raw)
  To: eric.dumazet
  Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 12 Oct 2013 21:46:31 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While working on a new virtio_net allocation strategy to increase the
> payload/truesize ratio, we found that refactoring sk_page_frag_refill()
> was needed.
> 
> This patch splits sk_page_frag_refill() into two parts, adding
> page_frag_refill() which can be used without a socket.
> 
> While we are at it, add a minimum frag size of 32 for
> sk_page_frag_refill().
> 
> Michael will either use netdev_alloc_frag() from softirq context,
> or page_frag_refill() from process context in refill_work()
> (GFP_KERNEL allocations).
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Please rename this to something like "skb_page_frag_refill" so that it
is clear that this is a networking interface.

Thanks Eric.


* Re: [PATCH v2 net-next] net: refactor sk_page_frag_refill()
  2013-10-17 19:46         ` David Miller
@ 2013-10-17 23:13           ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-17 23:13 UTC (permalink / raw)
  To: David Miller
  Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric

On Thu, 2013-10-17 at 15:46 -0400, David Miller wrote:

> Please rename this to something like "skb_page_frag_refill" so that it
> is clear that this is a networking interface.

Sure, will do, thanks!


* [PATCH v3 net-next] net: refactor sk_page_frag_refill()
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
  2013-10-17 19:46         ` David Miller
@ 2013-10-17 23:27         ` Eric Dumazet
  2013-10-18  4:09           ` David Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-17 23:27 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

From: Eric Dumazet <edumazet@google.com>

While working on a new virtio_net allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.

This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context,
or skb_page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
v3: page_frag_refill() -> skb_page_frag_refill()
v2: fix an off-by-one error

 include/linux/skbuff.h |    2 ++
 net/core/sock.c        |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..ba74474 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
 	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
 }
 
+bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
 /**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..440afdc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * skb_page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 {
 	int order;
 
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 			pfrag->offset = 0;
 			return true;
 		}
-		if (pfrag->offset < pfrag->size)
+		if (pfrag->offset + sz <= pfrag->size)
 			return true;
 		put_page(pfrag->page);
 	}
 
 	/* We restrict high order allocations to users that can afford to wait */
-	order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
 
 	do {
-		gfp_t gfp = sk->sk_allocation;
+		gfp_t gfp = prio;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 		}
 	} while (--order >= 0);
 
+	return false;
+}
+EXPORT_SYMBOL(skb_page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
+		return true;
+
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;


* Re: [PATCH v3 net-next] net: refactor sk_page_frag_refill()
  2013-10-17 23:27         ` [PATCH v3 " Eric Dumazet
@ 2013-10-18  4:09           ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2013-10-18  4:09 UTC (permalink / raw)
  To: eric.dumazet
  Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 Oct 2013 16:27:07 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While working on a new virtio_net allocation strategy to increase the
> payload/truesize ratio, we found that refactoring sk_page_frag_refill()
> was needed.
> 
> This patch splits sk_page_frag_refill() into two parts, adding
> skb_page_frag_refill() which can be used without a socket.
> 
> While we are at it, add a minimum frag size of 32 for
> sk_page_frag_refill().
> 
> Michael will either use netdev_alloc_frag() from softirq context,
> or skb_page_frag_refill() from process context in refill_work()
> (GFP_KERNEL allocations).
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
