* [PATCH net-next] tcp: rcvbuf autotuning improvements
@ 2013-10-03 7:56 Daniel Borkmann
2013-10-03 13:03 ` Eric Dumazet
2013-10-03 13:13 ` [PATCH net-next] tcp: rcvbuf autotuning improvements Eric Dumazet
0 siblings, 2 replies; 11+ messages in thread
From: Daniel Borkmann @ 2013-10-03 7:56 UTC (permalink / raw)
To: davem; +Cc: netdev, eric.dumazet, Francesco Fusco
This is a complementary patch for commit 6ae705323 ("tcp: sndbuf
autotuning improvements") that fixes a performance regression on
the receiver side in setups with low to mid latency, high throughput,
and senders with TSO/GSO off (receivers w/ default settings).
The following measurements in Mbit/s were done for 60sec w/ netperf
on virtio w/ TSO/GSO off:
(ms) 1) 2) 3)
0 2762.11 1150.32 2906.17
10 1083.61 538.89 1091.03
25 471.81 313.18 474.60
50 242.33 187.84 242.36
75 162.14 134.45 161.95
100 121.55 101.96 121.49
150 80.64 57.75 80.48
200 58.97 54.11 59.90
250 47.10 46.92 47.31
Same setup w/ TSO/GSO on:
(ms) 1) 2) 3)
0 12225.91 12366.89 16514.37
10 1526.64 1525.79 2176.63
25 655.13 647.79 871.52
50 338.51 377.88 439.46
75 246.49 278.46 295.62
100 210.93 207.56 217.34
150 127.88 129.56 141.33
200 94.95 94.50 107.29
250 67.39 73.88 88.35
As in 6ae705323, we fixed up the power-of-two rounding and took the
cached mss into account, bringing the per_mss calculations closer to
each other; the rest stays as is.
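As a rough illustration of the new estimate (a standalone userspace sketch,
not kernel code; the MAX_TCP_HEADER and struct-size values below are assumed,
config-dependent ballparks, and tcp_default_init_rwnd() is taken at its usual
10-segment default):

	/* Sketch of the new per_mss / rcvmem estimate with assumed constants. */
	#include <stdio.h>

	static unsigned int roundup_pow_of_two(unsigned int x)
	{
		unsigned int r = 1;

		while (r < x)
			r <<= 1;
		return r;
	}

	int main(void)
	{
		unsigned int advmss = 1460, mss_cache = 1460; /* assumed MSS values */
		unsigned int max_tcp_header = 160;            /* assumed, config dependent */
		unsigned int shinfo = 320, skb = 256;         /* assumed aligned struct sizes */
		unsigned int init_rwnd = 10;                  /* tcp_default_init_rwnd() default */
		unsigned int mss = advmss > mss_cache ? advmss : mss_cache;
		unsigned int per_mss = mss + max_tcp_header + shinfo;

		per_mss = roundup_pow_of_two(per_mss) + skb;  /* 1940 -> 2048, + 256 = 2304 */
		printf("per_mss=%u rcvmem=%u\n", per_mss, 2 * init_rwnd * per_mss);
		return 0;
	}
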
We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
consistent with tcp_sndbuf_expand().
While we do think that 6ae705323b71 is the right way to go, this
follow-up also seems necessary to restore performance for receivers.
For the evaluation, the same kernels were used on each host:
1) net-next (4fbef95af), which is before 6ae705323
2) net-next (6ae705323), which includes the sndbuf improvements
3) net-next (6ae705323), plus this patch on top
This was done in joint work with Francesco Fusco.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
---
net/ipv4/tcp_input.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cd65674..ed37b1d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -367,13 +367,19 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
}
/* 3. Tuning rcvbuf, when connection enters established state. */
-static void tcp_fixup_rcvbuf(struct sock *sk)
+static void tcp_rcvbuf_expand(struct sock *sk)
{
- u32 mss = tcp_sk(sk)->advmss;
- int rcvmem;
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int rcvmem, per_mss;
+
+ per_mss = max_t(u32, tp->advmss, tp->mss_cache) +
+ MAX_TCP_HEADER +
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
- rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
- tcp_default_init_rwnd(mss);
+ rcvmem = 2 * tcp_default_init_rwnd(per_mss) * per_mss;
/* Dynamic Right Sizing (DRS) has 2 to 3 RTT latency
* Allow enough cushion so that sender is not limited by our window
@@ -394,7 +400,7 @@ void tcp_init_buffer_space(struct sock *sk)
int maxwin;
if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
- tcp_fixup_rcvbuf(sk);
+ tcp_rcvbuf_expand(sk);
if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
tcp_sndbuf_expand(sk);
--
1.7.11.7
* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
2013-10-03 7:56 [PATCH net-next] tcp: rcvbuf autotuning improvements Daniel Borkmann
@ 2013-10-03 13:03 ` Eric Dumazet
2013-10-04 6:56 ` Michael Dalton
2013-10-03 13:13 ` [PATCH net-next] tcp: rcvbuf autotuning improvements Eric Dumazet
1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-03 13:03 UTC (permalink / raw)
To: Daniel Borkmann
Cc: davem, netdev, Francesco Fusco, Michael Dalton, ycheng, ncardwell
On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
> This is a complementary patch for commit 6ae705323 ("tcp: sndbuf
> autotuning improvements") that fixes a performance regression on
> the receiver side in setups with low to mid latency, high throughput,
> and senders with TSO/GSO off (receivers w/ default settings).
>
> The following measurements in Mbit/s were done for 60sec w/ netperf
> on virtio w/ TSO/GSO off:
>
> (ms) 1) 2) 3)
> 0 2762.11 1150.32 2906.17
> 10 1083.61 538.89 1091.03
> 25 471.81 313.18 474.60
> 50 242.33 187.84 242.36
> 75 162.14 134.45 161.95
> 100 121.55 101.96 121.49
> 150 80.64 57.75 80.48
> 200 58.97 54.11 59.90
> 250 47.10 46.92 47.31
>
> Same setup w/ TSO/GSO on:
>
> (ms) 1) 2) 3)
> 0 12225.91 12366.89 16514.37
> 10 1526.64 1525.79 2176.63
> 25 655.13 647.79 871.52
> 50 338.51 377.88 439.46
> 75 246.49 278.46 295.62
> 100 210.93 207.56 217.34
> 150 127.88 129.56 141.33
> 200 94.95 94.50 107.29
> 250 67.39 73.88 88.35
>
> As in 6ae705323, we fixed up the power-of-two rounding and took the
> cached mss into account, bringing the per_mss calculations closer to
> each other; the rest stays as is.
>
> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().
>
> While we do think that 6ae705323b71 is the right way to go, this
> follow-up also seems necessary to restore performance for
> receivers.
Hmm, I think you based this patch on some virtio requirements.
I would rather fix virtio, because virtio has poor truesize/payload
ratio.
Michael Dalton is working on this right now.
Really I don't understand how 'fixing' the initial rcvbuf could explain such
a difference in a 60 second transfer.
Normally, if autotuning was working, the first sk_rcvbuf value would
only matter in the very beginning of a flow (maybe one, two or even
three RTTs).
It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
so you probably have to fix the autotuning, or virtio to give normal
skbs, not fat ones ;)
Thanks
* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
2013-10-03 7:56 [PATCH net-next] tcp: rcvbuf autotuning improvements Daniel Borkmann
2013-10-03 13:03 ` Eric Dumazet
@ 2013-10-03 13:13 ` Eric Dumazet
1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-03 13:13 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: davem, netdev, Francesco Fusco
On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
> consistent with tcp_sndbuf_expand().
BTW, we renamed the function only because it was used both for initial
sizing and from tcp_new_space().
As is, tcp_fixup_rcvbuf() is only called at connection setup.
* Re: [PATCH net-next] tcp: rcvbuf autotuning improvements
2013-10-03 13:03 ` Eric Dumazet
@ 2013-10-04 6:56 ` Michael Dalton
2013-10-13 0:25 ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
0 siblings, 1 reply; 11+ messages in thread
From: Michael Dalton @ 2013-10-04 6:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
Neal Cardwell, Eric Northup
Thanks Eric,
I believe this issue may be related to one that I encountered
recently - poor performance with MTU-sized packets in
virtio_net when mergeable receive buffers are enabled. Performance was
quite low relative to virtio_net where mergeable receive buffers are
disabled and MTU-sized packets are received. The issue can be reliably
reproduced via netperf TCP_STREAM when mergeable receive buffers are
enabled but GRO is disabled (to force MTU-sized packets on receive).
I found the root cause was the memory allocation strategy employed for
virtio_net -- when mergeable receive buffers are enabled, every
receive ring packet buffer is allocated using a full page via the page
allocator, so the SKB truesize is 4096 + skb header +
128 (GOOD_COPY_LEN). This means that there is >100% overhead
(truesize relative to the number of bytes actually used to store packet
data) for MTU-sized packets, impacting TCP.
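To put rough numbers on that (an illustrative back-of-the-envelope sketch;
the 256-byte skb header size is an assumed ballpark):

	#include <stdio.h>

	int main(void)
	{
		unsigned int payload  = 1500;             /* MTU-sized packet */
		unsigned int truesize = 4096 + 256 + 128; /* page + ~skb header + GOOD_COPY_LEN */

		printf("truesize %u for %u payload bytes (~%u%% overhead)\n",
		       truesize, payload, (truesize - payload) * 100 / payload);
		return 0;
	}
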
The issue can be resolved by switching mergeable receive packet
allocation to netdev_alloc_frag(), allocating MTU-sized (or
slightly larger) buffers, and handling the rare edge case where the
number of frags exceeds SKB_MAX_FRAGS (occurs for extremely large
GRO'd packets and is permitted by the virtio specification) by using
the SKB frag list. I will update this thread with a patch when one is
ready, hopefully in the next few days. Thanks!
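As a minimal sketch of that direction (kernel-style pseudocode under my
assumptions; the sizing arithmetic and the SKB_MAX_FRAGS / frag-list handling
are simplified away, and alloc_mergeable_rx_buf() is a hypothetical helper,
not an existing virtio_net function):

	/* Hypothetical: allocate an MTU-sized receive buffer from the per-CPU
	 * frag cache instead of a full page.
	 */
	static void *alloc_mergeable_rx_buf(unsigned int mtu)
	{
		unsigned int len = SKB_DATA_ALIGN(sizeof(struct virtio_net_hdr_mrg_rxbuf) +
						  ETH_HLEN + mtu);

		return netdev_alloc_frag(len);	/* NULL if allocation fails */
	}
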
Best,
Mike
On Thu, Oct 3, 2013 at 6:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-10-03 at 09:56 +0200, Daniel Borkmann wrote:
>> This is a complementary patch for commit 6ae705323 ("tcp: sndbuf
>> autotuning improvements") that fixes a performance regression on
>> the receiver side in setups with low to mid latency, high throughput,
>> and senders with TSO/GSO off (receivers w/ default settings).
>>
>> The following measurements in Mbit/s were done for 60sec w/ netperf
>> on virtio w/ TSO/GSO off:
>>
>> (ms) 1) 2) 3)
>> 0 2762.11 1150.32 2906.17
>> 10 1083.61 538.89 1091.03
>> 25 471.81 313.18 474.60
>> 50 242.33 187.84 242.36
>> 75 162.14 134.45 161.95
>> 100 121.55 101.96 121.49
>> 150 80.64 57.75 80.48
>> 200 58.97 54.11 59.90
>> 250 47.10 46.92 47.31
>>
>> Same setup w/ TSO/GSO on:
>>
>> (ms) 1) 2) 3)
>> 0 12225.91 12366.89 16514.37
>> 10 1526.64 1525.79 2176.63
>> 25 655.13 647.79 871.52
>> 50 338.51 377.88 439.46
>> 75 246.49 278.46 295.62
>> 100 210.93 207.56 217.34
>> 150 127.88 129.56 141.33
>> 200 94.95 94.50 107.29
>> 250 67.39 73.88 88.35
>>
>> As in 6ae705323, we fixed up the power-of-two rounding and took the
>> cached mss into account, bringing the per_mss calculations closer to
>> each other; the rest stays as is.
>>
>> We also renamed tcp_fixup_rcvbuf() to tcp_rcvbuf_expand() to be
>> consistent with tcp_sndbuf_expand().
>>
>> While we do think that 6ae705323b71 is the right way to go, this
>> follow-up also seems necessary to restore performance for
>> receivers.
>
> Hmm, I think you based this patch on some virtio requirements.
>
> I would rather fix virtio, because virtio has poor truesize/payload
> ratio.
>
> Michael Dalton is working on this right now.
>
> Really I don't understand how 'fixing' the initial rcvbuf could explain such
> a difference in a 60 second transfer.
>
> Normally, if autotuning was working, the first sk_rcvbuf value would
> only matter in the very beginning of a flow (maybe one, two or even
> three RTTs).
>
> It looks like you only need to set sk_rcvbuf to tcp_rmem[2],
> so you probably have to fix the autotuning, or virtio to give normal
> skbs, not fat ones ;)
>
>
> Thanks
>
>
* [PATCH net-next] net: refactor sk_page_frag_refill()
2013-10-04 6:56 ` Michael Dalton
@ 2013-10-13 0:25 ` Eric Dumazet
2013-10-13 0:55 ` Eric Dumazet
2013-10-13 4:46 ` [PATCH v2 " Eric Dumazet
0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13 0:25 UTC (permalink / raw)
To: Michael Dalton
Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
Neal Cardwell, Eric Northup
From: Eric Dumazet <edumazet@google.com>
While working on a new virtio_net allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.
This patch splits sk_page_frag_refill() into two parts, adding
page_frag_refill() which can be used without a socket.
While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill()
Michael will either use netdev_alloc_frag() from softirq context,
or page_frag_refill() from process context in refill_work() (GFP_KERNEL
allocations)
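For illustration, a rough sketch of a process-context caller (hypothetical
code, not part of this patch; a real user would also take a page reference
with get_page() for each buffer handed out):

	static void *refill_one_buffer(struct page_frag *pfrag, unsigned int sz)
	{
		void *buf;

		/* May sleep; falls back to order-0 pages under memory pressure. */
		if (!page_frag_refill(sz, pfrag, GFP_KERNEL))
			return NULL;

		buf = page_address(pfrag->page) + pfrag->offset;
		pfrag->offset += sz;
		return buf;
	}
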
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
include/linux/skbuff.h | 2 ++
net/core/sock.c | 27 +++++++++++++++++++++++----
2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..0c5d40f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
}
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
/**
* skb_frag_dma_map - maps a paged fragment via the DMA API
* @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..e87d624 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
/* On 32bit arches, an skb frag is limited to 2^15 */
#define SKB_FRAG_PAGE_ORDER get_order(32768)
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
{
int order;
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
pfrag->offset = 0;
return true;
}
- if (pfrag->offset < pfrag->size)
+ if (pfrag->offset + sz < pfrag->size)
return true;
put_page(pfrag->page);
}
/* We restrict high order allocations to users that can afford to wait */
- order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+ order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
do {
- gfp_t gfp = sk->sk_allocation;
+ gfp_t gfp = prio;
if (order)
gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
}
} while (--order >= 0);
+ return false;
+}
+EXPORT_SYMBOL(page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+ if (page_frag_refill(32U, pfrag, sk->sk_allocation))
+ return true;
+
sk_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
return false;
* Re: [PATCH net-next] net: refactor sk_page_frag_refill()
2013-10-13 0:25 ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
@ 2013-10-13 0:55 ` Eric Dumazet
2013-10-13 4:46 ` [PATCH v2 " Eric Dumazet
1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13 0:55 UTC (permalink / raw)
To: Michael Dalton
Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
Neal Cardwell, Eric Northup
On Sat, 2013-10-12 at 17:25 -0700, Eric Dumazet wrote:
> - if (pfrag->offset < pfrag->size)
> + if (pfrag->offset + sz < pfrag->size)
> return true;
> put_page(pfrag->page);
This needs to be :

	if (pfrag->offset + sz <= pfrag->size)

so that a request which exactly fits the remaining room (e.g. offset 4064,
sz 32, size 4096) is still satisfied instead of needlessly dropping the
page and allocating a new one.

I'll send a v2
* [PATCH v2 net-next] net: refactor sk_page_frag_refill()
2013-10-13 0:25 ` [PATCH net-next] net: refactor sk_page_frag_refill() Eric Dumazet
2013-10-13 0:55 ` Eric Dumazet
@ 2013-10-13 4:46 ` Eric Dumazet
2013-10-17 19:46 ` David Miller
2013-10-17 23:27 ` [PATCH v3 " Eric Dumazet
1 sibling, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-13 4:46 UTC (permalink / raw)
To: Michael Dalton
Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
Neal Cardwell, Eric Northup
From: Eric Dumazet <edumazet@google.com>
While working on a new virtio_net allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.
This patch splits sk_page_frag_refill() into two parts, adding
page_frag_refill() which can be used without a socket.
While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill()
Michael will either use netdev_alloc_frag() from softirq context,
or page_frag_refill() from process context in refill_work() (GFP_KERNEL
allocations)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
v2: fix an off-by-one error
include/linux/skbuff.h | 2 ++
net/core/sock.c | 27 +++++++++++++++++++++++----
2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..0c5d40f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
}
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
/**
* skb_frag_dma_map - maps a paged fragment via the DMA API
* @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..7824d60 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
/* On 32bit arches, an skb frag is limited to 2^15 */
#define SKB_FRAG_PAGE_ORDER get_order(32768)
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
{
int order;
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
pfrag->offset = 0;
return true;
}
- if (pfrag->offset < pfrag->size)
+ if (pfrag->offset + sz <= pfrag->size)
return true;
put_page(pfrag->page);
}
/* We restrict high order allocations to users that can afford to wait */
- order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+ order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
do {
- gfp_t gfp = sk->sk_allocation;
+ gfp_t gfp = prio;
if (order)
gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
}
} while (--order >= 0);
+ return false;
+}
+EXPORT_SYMBOL(page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+ if (likely(page_frag_refill(32U, pfrag, sk->sk_allocation)))
+ return true;
+
sk_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
return false;
* Re: [PATCH v2 net-next] net: refactor sk_page_frag_refill()
2013-10-13 4:46 ` [PATCH v2 " Eric Dumazet
@ 2013-10-17 19:46 ` David Miller
2013-10-17 23:13 ` Eric Dumazet
2013-10-17 23:27 ` [PATCH v3 " Eric Dumazet
1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2013-10-17 19:46 UTC (permalink / raw)
To: eric.dumazet
Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 12 Oct 2013 21:46:31 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> While working on a new virtio_net allocation strategy to increase the
> payload/truesize ratio, we found that refactoring sk_page_frag_refill()
> was needed.
>
> This patch splits sk_page_frag_refill() into two parts, adding
> page_frag_refill() which can be used without a socket.
>
> While we are at it, add a minimum frag size of 32 for
> sk_page_frag_refill()
>
> Michael will either use netdev_alloc_frag() from softirq context,
> or page_frag_refill() from process context in refill_work() (GFP_KERNEL
> allocations)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Please rename this to something like "skb_page_frag_refill" so that it
is clear that this is a networking interface.
Thanks Eric.
* Re: [PATCH v2 net-next] net: refactor sk_page_frag_refill()
2013-10-17 19:46 ` David Miller
@ 2013-10-17 23:13 ` Eric Dumazet
0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-17 23:13 UTC (permalink / raw)
To: David Miller
Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric
On Thu, 2013-10-17 at 15:46 -0400, David Miller wrote:
> Please rename this to something like "skb_page_frag_refill" so that it
> is clear that this is a networking interface.
Sure, will do, thanks!
* [PATCH v3 net-next] net: refactor sk_page_frag_refill()
2013-10-13 4:46 ` [PATCH v2 " Eric Dumazet
2013-10-17 19:46 ` David Miller
@ 2013-10-17 23:27 ` Eric Dumazet
2013-10-18 4:09 ` David Miller
1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-17 23:27 UTC (permalink / raw)
To: Michael Dalton
Cc: Daniel Borkmann, davem, netdev, Francesco Fusco, ycheng,
Neal Cardwell, Eric Northup
From: Eric Dumazet <edumazet@google.com>
While working on a new virtio_net allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.
This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.
While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill()
Michael will either use netdev_alloc_frag() from softirq context,
or skb_page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
---
v3: page_frag_refill() -> skb_page_frag_refill()
v2: fix an off-by-one error
include/linux/skbuff.h | 2 ++
net/core/sock.c | 27 +++++++++++++++++++++++----
2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1cd32f9..ba74474 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2062,6 +2062,8 @@ static inline void skb_frag_set_page(struct sk_buff *skb, int f,
__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
}
+bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+
/**
* skb_frag_dma_map - maps a paged fragment via the DMA API
* @dev: the device to map the fragment to
diff --git a/net/core/sock.c b/net/core/sock.c
index fd6afa2..440afdc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1847,7 +1847,17 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
/* On 32bit arches, an skb frag is limited to 2^15 */
#define SKB_FRAG_PAGE_ORDER get_order(32768)
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+/**
+ * skb_page_frag_refill - check that a page_frag contains enough room
+ * @sz: minimum size of the fragment we want to get
+ * @pfrag: pointer to page_frag
+ * @prio: priority for memory allocation
+ *
+ * Note: While this allocator tries to use high order pages, there is
+ * no guarantee that allocations succeed. Therefore, @sz MUST be
+ * less than or equal to PAGE_SIZE.
+ */
+bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
{
int order;
@@ -1856,16 +1866,16 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
pfrag->offset = 0;
return true;
}
- if (pfrag->offset < pfrag->size)
+ if (pfrag->offset + sz <= pfrag->size)
return true;
put_page(pfrag->page);
}
/* We restrict high order allocations to users that can afford to wait */
- order = (sk->sk_allocation & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
+ order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
do {
- gfp_t gfp = sk->sk_allocation;
+ gfp_t gfp = prio;
if (order)
gfp |= __GFP_COMP | __GFP_NOWARN;
@@ -1877,6 +1887,15 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
}
} while (--order >= 0);
+ return false;
+}
+EXPORT_SYMBOL(skb_page_frag_refill);
+
+bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+{
+ if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
+ return true;
+
sk_enter_memory_pressure(sk);
sk_stream_moderate_sndbuf(sk);
return false;
* Re: [PATCH v3 net-next] net: refactor sk_page_frag_refill()
2013-10-17 23:27 ` [PATCH v3 " Eric Dumazet
@ 2013-10-18 4:09 ` David Miller
0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2013-10-18 4:09 UTC (permalink / raw)
To: eric.dumazet
Cc: mwdalton, dborkman, netdev, ffusco, ycheng, ncardwell, digitaleric
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 Oct 2013 16:27:07 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> While working on a new virtio_net allocation strategy to increase the
> payload/truesize ratio, we found that refactoring sk_page_frag_refill()
> was needed.
>
> This patch splits sk_page_frag_refill() into two parts, adding
> skb_page_frag_refill() which can be used without a socket.
>
> While we are at it, add a minimum frag size of 32 for
> sk_page_frag_refill()
>
> Michael will either use netdev_alloc_frag() from softirq context,
> or skb_page_frag_refill() from process context in refill_work()
> (GFP_KERNEL allocations)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks Eric.