* [PATCH net-next] tcp: rcvbuf autotuning improvements
@ 2013-10-03  7:56 Daniel Borkmann
  2013-10-03 13:03 ` Eric Dumazet
  2013-10-03 13:13 ` [PATCH net-next] tcp: rcvbuf autotuning improvements Eric Dumazet
  0 siblings, 2 replies; 11+ messages in thread
From: Daniel Borkmann @ 2013-10-03  7:56 UTC
  To: davem; +Cc: netdev, eric.dumazet, Francesco Fusco

This is a complementary patch to commit 6ae705323 ("tcp: sndbuf
autotuning improvements") that fixes a performance regression on
the receiver side in setups with low to mid latency, high
throughput, and senders with TSO/GSO off (receivers w/ default
settings).

The following measurements in Mbit/s were done for 60sec w/ netperf
on virtio w/ TSO/GSO off:

(ms)    1)              2)              3)
  0     2762.11         1150.32         2906.17
 10     1083.61          538.89         1091.03
 25      471.81          313.18          474.60
 50      242.33          187.84          242.36
 75      162.14          134.45          161.95
100      121.55          101.96          121.49
150       80.64           57.75           80.48
200       58.97           54.11           59.90
250       47.10           46.92           47.31

Same setup w/ TSO/GSO on:

(ms)    1)              2)              3)
  0     12225.91        12366.89        16514.37
 10      1526.64         1525.79         2176.63
 25       655.13          647.79          871.52
 50       338.51          377.88          439.46
 75       246.49          278.46          295.62
100       210.93          207.56          217.34
150       127.88          129.56          141.33
200        94.95           94.50          107.29
250        67.39           73.88           88.35
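
For reference, the numbers above can be reproduced along the following
lines (a sketch of the setup; the interface name and where the netem
delay is applied are assumptions, not part of the original report):

  # on both hosts: toggle segmentation offloads for the two test modes
  ethtool -K eth0 tso off gso off          # "on" for the second table

  # add the per-test delay of the first column, e.g. 25 ms
  tc qdisc add dev eth0 root netem delay 25ms

  # on the sender: 60 second bulk transfer towards the receiver
  netperf -H <receiver-ip> -l 60 -t TCP_STREAM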

As in 6ae705323, we fix up the power-of-two rounding and take the
cached MSS into account, bringing the per_mss calculations closer
together; the rest stays as is.
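
To make the arithmetic concrete, here is a small userspace sketch of
the new computation (MAX_TCP_HEADER and the SKB_DATA_ALIGN()ed struct
sizes are stand-in values, as the real ones depend on arch and config;
tcp_default_init_rwnd() is simplified from net-next):

#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int x)
{
        unsigned int r = 1;

        while (r < x)
                r <<= 1;
        return r;
}

/* Scale the default 2 * TCP_INIT_CWND window down for segments
 * larger than 1460 bytes, as net-next does.
 */
static unsigned int tcp_default_init_rwnd(unsigned int mss)
{
        unsigned int init_rwnd = 10 * 2;        /* TCP_INIT_CWND * 2 */

        if (mss > 1460) {
                init_rwnd = (1460 * init_rwnd) / mss;
                if (init_rwnd < 2)
                        init_rwnd = 2;
        }
        return init_rwnd;
}

int main(void)
{
        unsigned int advmss = 1460;             /* typical Ethernet MSS */
        unsigned int max_tcp_header = 224;      /* stand-in value */
        unsigned int shinfo_aligned = 320;      /* stand-in value */
        unsigned int skb_aligned = 256;         /* stand-in value */
        unsigned int per_mss, rcvmem;

        per_mss = advmss + max_tcp_header + shinfo_aligned;     /* 2004 */
        per_mss = roundup_pow_of_two(per_mss) + skb_aligned;    /* 2304 */
        rcvmem = 2 * tcp_default_init_rwnd(per_mss) * per_mss;  /* 55296 */
        printf("per_mss=%u rcvmem=%u\n", per_mss, rcvmem);
        return 0;
}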

We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
consistent with tcp_sndbuf_expand().

While we do think that 6ae705323b71 is the right way to go, this
follow-up also seems necessary to restore performance for
receivers.

For the evaluation, the same kernels were used on each host:

1) net-next (4fbef95af), which is before 6ae705323
2) net-next (6ae705323), which has the sndbuf improvements
3) net-next (6ae705323), plus this patch on top

This was joint work with Francesco Fusco.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
---
 net/ipv4/tcp_input.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cd65674..ed37b1d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -367,13 +367,19 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 }
 
 /* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
+static void tcp_rcvbuf_expand(struct sock *sk)
 {
-	u32 mss = tcp_sk(sk)->advmss;
-	int rcvmem;
+	const struct tcp_sock *tp = tcp_sk(sk);
+	int rcvmem, per_mss;
+
+	per_mss = max_t(u32, tp->advmss, tp->mss_cache) +
+		  MAX_TCP_HEADER +
+		  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	per_mss = roundup_pow_of_two(per_mss) +
+		  SKB_DATA_ALIGN(sizeof(struct sk_buff));
 
-	rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
-		 tcp_default_init_rwnd(mss);
+	rcvmem = 2 * tcp_default_init_rwnd(per_mss) * per_mss;
 
 	/* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
 	 * Allow enough cushion so that sender is not limited by our window
@@ -394,7 +400,7 @@ void tcp_init_buffer_space(struct sock *sk)
 	int maxwin;
 
 	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
-		tcp_fixup_rcvbuf(sk);
+		tcp_rcvbuf_expand(sk);
 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);
 
-- 
1.7.11.7


* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
  2013-10-03  7:56 [PATCH net-next] tcp: rcvbuf autotuning improvements Daniel Borkmann
@ 2013-10-03 13:03 ` Eric Dumazet
  2013-10-04  6:56   ` Michael Dalton
  2013-10-03 13:13 ` [PATCH net-next] tcp: rcvbuf autotuning improvements Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-03 13:03 UTC
  To: Daniel Borkmann
  Cc: davem, netdev, Francesco Fusco, Michael Dalton, ycheng, ncardwell

On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
> This is a complementary patch to commit 6ae705323 ("tcp: sndbuf
> autotuning improvements") that fixes a performance regression on
> the receiver side in setups with low to mid latency, high
> throughput, and senders with TSO/GSO off (receivers w/ default
> settings).
> 
> The following measurements in Mbit/s were done for 60sec w/ netperf
> on virtio w/ TSO/GSO off:
> 
> (ms)    1)              2)              3)
>   0     2762.11         1150.32         2906.17
>  10     1083.61          538.89         1091.03
>  25      471.81          313.18          474.60
>  50      242.33          187.84          242.36
>  75      162.14          134.45          161.95
> 100      121.55          101.96          121.49
> 150       80.64           57.75           80.48
> 200       58.97           54.11           59.90
> 250       47.10           46.92           47.31
> 
> Same setup w/ TSO/GSO on:
> 
> (ms)    1)              2)              3)
>   0     12225.91        12366.89        16514.37
>  10      1526.64         1525.79         2176.63
>  25       655.13          647.79          871.52
>  50       338.51          377.88          439.46
>  75       246.49          278.46          295.62
> 100       210.93          207.56          217.34
> 150       127.88          129.56          141.33
> 200        94.95           94.50          107.29
> 250        67.39           73.88           88.35
> 
> As in 6ae705323, we fix up the power-of-two rounding and take the
> cached MSS into account, bringing the per_mss calculations closer
> together; the rest stays as is.
> 
> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().
> 
> While we do think that 6ae705323b71 is the right way to go, this
> follow-up also seems necessary to restore performance for
> receivers.

Hmm, I think you based this patch on some virtio requirements.

I would rather fix virtio, because virtio has poor truesize/payload
ratio.

Michael Dalton is working on this right now.

Really I don't understand how 'fixing' the initial rcvbuf could explain
such a difference in a 60 second transfer.

Normally, if autotuning were working, the first sk_rcvbuf value would
only matter in the very beginning of a flow (maybe one, two or even
three RTTs).

It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
so you probably have to fix the autotuning, or fix virtio to give
normal skbs, not fat ones ;)
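
For illustration, the effect of autotuning on a live flow can be
watched on the receiver with something like (peer address is a
placeholder):

  watch -n1 'ss -tmi dst <sender-ip>'

where the rb value in the skmem:(...) output is the current receive
buffer size.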


Thanks


* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
  2013-10-03  7:56 [PATCH net-next] tcp: rcvbuf autotuning improvements Daniel Borkmann
  2013-10-03 13:03 ` Eric Dumazet
@ 2013-10-03 13:13 ` Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-03 13:13 UTC
  To: Daniel Borkmann; +Cc: davem, netdev, Francesco Fusco

On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:

> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().

BTW, we renamed the function only because it was used both for initial
sizing and from tcp_new_space().

As is, tcp_fixup_rcvbuf() is only called at connection setup.


* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
  2013-10-03 13:03 ` Eric Dumazet
@ 2013-10-04  6:56   ` Michael Dalton
  2013-10-13  0:25     ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Dalton @ 2013-10-04  6:56 UTC
  To: Eric Dumazet
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

Thanks Eric,

I believe this issue may be related to one that I encountered
recently - poor performance with MTU-sized packets in virtio_net when
mergeable receive buffers are enabled. Performance was quite low
relative to virtio_net with mergeable receive buffers disabled when
MTU-sized packets are received. The issue can be reliably reproduced
via netperf TCP_STREAM when mergeable receive buffers are enabled but
GRO is disabled (to force MTU-sized packets on receive).

I found the root cause was the memory allocation strategy employed for
virtio_net -- when mergeable receive buffers are enabled, every
receive ring packet buffer is allocated using a full page via the page
allocator, so the SKB truesize is 4096 + skb header +
128 (GOOD_COPY_LEN). This means that there is >100% overhead
(truesize vs. the number of bytes actually used to store packet data)
for MTU-sized packets, impacting TCP.
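
To put rough numbers on that (the skb header size is an assumed
figure):

  truesize ~ 4096 (page) + 256 (skb header) + 128 (GOOD_COPY_LEN)
           ~ 4480 bytes
  payload  ~ 1500 bytes for an MTU-sized packet
  overhead ~ (4480 - 1500) / 1500, i.e. roughly 200%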

The issue can be resolved by switching the allocation of mergeable
receive buffers to netdev_alloc_frag(), allocating MTU-sized (or
slightly larger) buffers, and handling the rare edge case where the
number of frags exceeds SKB_MAX_FRAGS (it occurs for extremely large
GRO'd packets and is permitted by the virtio specification) by using
the SKB frag list -- roughly as sketched below. I will update this
thread with a patch when one is ready, hopefully in the next few
days. Thanks!
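
Roughly what I have in mind, as a sketch only (the buffer sizing and
the helper name are illustrative, not the final patch):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical helper: take one MTU-sized mergeable receive buffer
 * from the per-CPU page-frag cache instead of allocating a full page.
 */
static void *alloc_mergeable_rx_buf(void)
{
        unsigned int len = SKB_DATA_ALIGN(1500 + 128) +
                           SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

        return netdev_alloc_frag(len);  /* NULL on allocation failure */
}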

Best,

Mike

On Thu, Oct 3, 2013 at 6:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
>> This is a complementary patch to commit 6ae705323 ("tcp: sndbuf
>> autotuning improvements") that fixes a performance regression on
>> the receiver side in setups with low to mid latency, high
>> throughput, and senders with TSO/GSO off (receivers w/ default
>> settings).
>>
>> The following measurements in Mbit/s were done for 60sec w/ netperf
>> on virtio w/ TSO/GSO off:
>>
>> (ms)    1)              2)              3)
>>   0     2762.11         1150.32         2906.17
>>  10     1083.61          538.89         1091.03
>>  25      471.81          313.18          474.60
>>  50      242.33          187.84          242.36
>>  75      162.14          134.45          161.95
>> 100      121.55          101.96          121.49
>> 150       80.64           57.75           80.48
>> 200       58.97           54.11           59.90
>> 250       47.10           46.92           47.31
>>
>> Same setup w/ TSO/GSO on:
>>
>> (ms)    1)              2)              3)
>>   0     12225.91        12366.89        16514.37
>>  10      1526.64         1525.79         2176.63
>>  25       655.13          647.79          871.52
>>  50       338.51          377.88          439.46
>>  75       246.49          278.46          295.62
>> 100       210.93          207.56          217.34
>> 150       127.88          129.56          141.33
>> 200        94.95           94.50          107.29
>> 250        67.39           73.88           88.35
>>
>> As in 6ae705323, we fix up the power-of-two rounding and take the
>> cached MSS into account, bringing the per_mss calculations closer
>> together; the rest stays as is.
>>
>> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
>> consistent with tcp_sndbuf_expand().
>>
>> While we do think that 6ae705323b71 is the right way to go, this
>> follow-up also seems necessary to restore performance for
>> receivers.
>
> Hmm, I think you based this patch on some virtio requirements.
>
> I would rather fix virtio, because virtio has poor truesize/payload
> ratio.
>
> Michael Dalton is working on this right now.
>
> Really I don't understand how 'fixing' the initial rcvbuf could explain
> such a difference in a 60 second transfer.
>
> Normally, if autotuning were working, the first sk_rcvbuf value would
> only matter in the very beginning of a flow (maybe one, two or even
> three RTTs).
>
> It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
> so you probably have to fix the autotuning, or fix virtio to give
> normal skbs, not fat ones ;)
>
>
> Thanks
>
>


* [PATCH net-next] net: refactor sk_page_frag_refill()
  2013-10-04  6:56   ` Michael Dalton
@ 2013-10-13  0:25     ` Eric Dumazet
  2013-10-13  0:55       ` Eric Dumazet
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13  0:25 UTC
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

From: Eric Dumazet <edumazet@google.com>

While working on a new virtio_net allocation strategy to increase
the payload/truesize ratio, we found that refactoring
sk_page_frag_refill() was needed.

This patch splits sk_page_frag_refill() into two parts, adding
page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context,
or page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
 include/linux/skbuff.h |    2 ++
 net/core/sock.c        |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..0c5d40f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
 	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
 }
 
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
 /**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..e87d624 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 {
 	int order;
 
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 			pfrag->offset = 0;
 			return true;
 		}
-		if (pfrag->offset < pfrag->size)
+		if (pfrag->offset + sz < pfrag->size)
 			return true;
 		put_page(pfrag->page);
 	}
 
 	/* We restrict high order allocations to users that can afford to wait */
-	order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
 
 	do {
-		gfp_t gfp = sk->sk_allocation;
+		gfp_t gfp = prio;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 		}
 	} while (--order >= 0);
 
+	return false;
+}
+EXPORT_SYMBOL(page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (page_frag_refill(32U, pfrag, sk->sk_allocation))
+		return true;
+
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;
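
For context, a minimal sketch of a non-socket user calling the new
helper from process context (the surrounding driver bits are
hypothetical; the helper is renamed skb_page_frag_refill() in v3):

#include <linux/mm.h>
#include <linux/skbuff.h>

static struct page_frag rx_frag;

static bool my_refill_one(unsigned int len)
{
        /* GFP_KERNEL: we may sleep, so high order pages can be tried */
        if (!page_frag_refill(len, &rx_frag, GFP_KERNEL))
                return false;

        /* hand rx_frag.page + rx_frag.offset (len bytes) to the device,
         * taking a page reference as needed, then consume the room
         */
        rx_frag.offset += len;
        return true;
}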


* Re: [PATCH net-next] net: refactor sk_page_frag_refill()
  2013-10-13  0:25     ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
@ 2013-10-13  0:55       ` Eric Dumazet
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13  0:55 UTC
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

On Sat, 2013-10-12 at 17:25 -0700, Eric Dumazet wrote:

> -		if (pfrag->offset < pfrag->size)
> +		if (pfrag->offset + sz < pfrag->size)
>  			return true;
>  		put_page(pfrag->page);

This needs to be:

if (pfrag->offset + sz <= pfrag->size)

since pfrag->offset + sz == pfrag->size means there are exactly sz
bytes of room left, which is enough; the strict < would needlessly
drop the page.

I'll send a v2


* [PATCH v2 net-next] net: refactor sk_page_frag_refill()
  2013-10-13  0:25     ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
  2013-10-13  0:55       ` Eric Dumazet
@ 2013-10-13  4:46       ` Eric Dumazet
  2013-10-17 19:46         ` David Miller
  2013-10-17 23:27         ` [PATCH v3 " Eric Dumazet
  1 sibling, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13  4:46 UTC
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

From: Eric Dumazet <edumazet@google.com>

While working on a new virtio_net allocation strategy to increase
the payload/truesize ratio, we found that refactoring
sk_page_frag_refill() was needed.

This patch splits sk_page_frag_refill() into two parts, adding
page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context,
or page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
v2: fix an off-by-one error

 include/linux/skbuff.h |    2 ++
 net/core/sock.c        |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..0c5d40f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
 	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
 }
 
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
 /**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..7824d60 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 {
 	int order;
 
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 			pfrag->offset = 0;
 			return true;
 		}
-		if (pfrag->offset < pfrag->size)
+		if (pfrag->offset + sz <= pfrag->size)
 			return true;
 		put_page(pfrag->page);
 	}
 
 	/* We restrict high order allocations to users that can afford to wait */
-	order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
 
 	do {
-		gfp_t gfp = sk->sk_allocation;
+		gfp_t gfp = prio;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 		}
 	} while (--order >= 0);
 
+	return false;
+}
+EXPORT_SYMBOL(page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (likely(page_frag_refill(32U, pfrag, sk->sk_allocation)))
+		return true;
+
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;


* Re: [PATCH v2 net-next] net: refactor sk_page_frag_refill()
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
@ 2013-10-17 19:46         ` David Miller
  2013-10-17 23:13           ` Eric Dumazet
  2013-10-17 23:27         ` [PATCH v3 " Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2013-10-17 19:46 UTC
  To: eric.dumazet
  Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 12 Oct 2013 21:46:31 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While working on a new virtio_net allocation strategy to increase
> the payload/truesize ratio, we found that refactoring
> sk_page_frag_refill() was needed.
> 
> This patch splits sk_page_frag_refill() into two parts, adding
> page_frag_refill() which can be used without a socket.
> 
> While we are at it, add a minimum frag size of 32 for
> sk_page_frag_refill().
> 
> Michael will either use netdev_alloc_frag() from softirq context,
> or page_frag_refill() from process context in refill_work()
> (GFP_KERNEL allocations).
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Please rename this to something like "skb_page_frag_refill" so that it
is clear that this is a networking interface.

Thanks Eric.


* Re: [PATCH v2 net-next] net: refactor sk_page_frag_refill()
  2013-10-17 19:46         ` David Miller
@ 2013-10-17 23:13           ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-17 23:13 UTC
  To: David Miller
  Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric

On Thu, 2013-10-17 at 15:46 -0400, David Miller wrote:

> Please rename this to something like "skb_page_frag_refill" so that it
> is clear that this is a networking interface.

Sure, will do, thanks!


* [PATCH v3 net-next] net: refactor sk_page_frag_refill()
  2013-10-13  4:46       ` [PATCH v2 " Eric Dumazet
  2013-10-17 19:46         ` David Miller
@ 2013-10-17 23:27         ` Eric Dumazet
  2013-10-18  4:09           ` David Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-17 23:27 UTC
  To: Michael Dalton
  Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
	Neal Cardwell, Eric Northup

From: Eric Dumazet <edumazet@google.com>

While working on a new virtio_net allocation strategy to increase
the payload/truesize ratio, we found that refactoring
sk_page_frag_refill() was needed.

This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context,
or skb_page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
v3: page_frag_refill() -> skb_page_frag_refill()
v2: fix an off-by-one error

 include/linux/skbuff.h |    2 ++
 net/core/sock.c        |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..ba74474 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
 	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
 }
 
+bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
 /**
  * skb_frag_dma_map - maps a paged fragment via the DMA API
  * @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..440afdc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 /* On 32bit arches, an skb frag is limited to 2^15 */
 #define SKB_FRAG_PAGE_ORDER	get_order(32768)
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * skb_page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 {
 	int order;
 
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 			pfrag->offset = 0;
 			return true;
 		}
-		if (pfrag->offset < pfrag->size)
+		if (pfrag->offset + sz <= pfrag->size)
 			return true;
 		put_page(pfrag->page);
 	}
 
 	/* We restrict high order allocations to users that can afford to wait */
-	order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
 
 	do {
-		gfp_t gfp = sk->sk_allocation;
+		gfp_t gfp = prio;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 		}
 	} while (--order >= 0);
 
+	return false;
+}
+EXPORT_SYMBOL(skb_page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
+		return true;
+
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;


* Re: [PATCH v3 net-next] net: refactor sk_page_frag_refill()
  2013-10-17 23:27         ` [PATCH v3 " Eric Dumazet
@ 2013-10-18  4:09           ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2013-10-18  4:09 UTC
  To: eric.dumazet
  Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 Oct 2013 16:27:07 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While working on a new virtio_net allocation strategy to increase
> the payload/truesize ratio, we found that refactoring
> sk_page_frag_refill() was needed.
> 
> This patch splits sk_page_frag_refill() into two parts, adding
> skb_page_frag_refill() which can be used without a socket.
> 
> While we are at it, add a minimum frag size of 32 for
> sk_page_frag_refill().
> 
> Michael will either use netdev_alloc_frag() from softirq context,
> or skb_page_frag_refill() from process context in refill_work()
> (GFP_KERNEL allocations).
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

