* [Question] net/mlx4_en: Memory consumption issue with mlx4_en driver
@ 2015-03-11 18:51 Martin Lau
From: Martin Lau @ 2015-03-11 18:51 UTC (permalink / raw)
To: Amir Vadai, Or Gerlitz; +Cc: netdev, kernel-team
Hi,
We have seen a memory consumption issue related to the mlx4 driver.
We suspect it is related to the page order used in alloc_pages().
The order starts at 3, and the next lower value is tried on each failure.
I have copied the alloc_pages() call site at the end of this email.
Is it a must to get order-3 pages? Based on the code and its comment,
it seems to be partly a functional and partly a performance choice.
Can you share performance numbers for the different page-order
allocations, e.g. 3 vs 2 vs 1?
It can be reproduced by:
1. At the netserver (receiver), set sysctl net.ipv4.tcp_rmem='4096 125000 67108864'
and net.core.rmem_max=67108864.
2. Start two netservers listening on 2 different ports:
- One for taking 1000 background netperf flows.
- Another for taking 200 netperf flows. It will be
suspended (ctrl-z) in the middle of the test.
3. Start 1000 background netperf TCP_STREAM flows.
4. Start another 200 netperf TCP_STREAM flows.
5. Suspend the netserver taking the 200 flows.
6. Observe the socket memory usage of the suspended netserver with 'ss -t -m'.
All 200 of its sockets will eventually reach 64MB rmem.
We observed that the total socket rmem usage reported by 'ss -t -m'
differs hugely from what /proc/meminfo shows; we have seen a ~6x-10x difference.
Any fragment queued in the suspended socket holds a refcount
on page->_count and stops 8 pages from being freed.
net.ipv4.tcp_mem does not seem to save us here, since it only
counts skb->truesize, which is 1536 in our setup.
Thanks,
--Martin
static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
			    struct mlx4_en_rx_alloc *page_alloc,
			    const struct mlx4_en_frag_info *frag_info,
			    gfp_t _gfp)
{
	int order;
	struct page *page;
	dma_addr_t dma;

	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
		gfp_t gfp = _gfp;

		if (order)
			gfp |= __GFP_COMP | __GFP_NOWARN;
		page = alloc_pages(gfp, order);
		if (likely(page))
			break;
		if (--order < 0 ||
		    ((PAGE_SIZE << order) < frag_info->frag_size))
			return -ENOMEM;
	}
	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
			   PCI_DMA_FROMDEVICE);
	if (dma_mapping_error(priv->ddev, dma)) {
		put_page(page);
		return -ENOMEM;
	}
	page_alloc->page_size = PAGE_SIZE << order;
	page_alloc->page = page;
	page_alloc->dma = dma;
	page_alloc->page_offset = 0;
	/* Not doing get_page() for each frag is a big win
	 * on asymmetric workloads. Note we can not use atomic_set().
	 */
	atomic_add(page_alloc->page_size / frag_info->frag_stride - 1,
		   &page->_count);
	return 0;
}
* Re: [Question] net/mlx4_en: Memory consumption issue with mlx4_en driver
From: Eric Dumazet @ 2015-03-11 20:21 UTC (permalink / raw)
To: Martin Lau; +Cc: Amir Vadai, Or Gerlitz, netdev, kernel-team
On Wed, 2015-03-11 at 11:51 -0700, Martin Lau wrote:
> We have seen a memory consumption issue related to the mlx4 driver.
> We suspect it is related to the page order used to do the alloc_pages().
> [...]
You know, even the order-3 allocations done for regular skb allocations
will hurt you: a single copybreaked skb stored for a long time in a TCP
receive queue will hold 32KB of memory.
Even 4KB can lead to disasters.
You could lower tcp_rmem so that collapsing happens sooner.
* Re: [Question] net/mlx4_en: Memory consumption issue with mlx4_en driver
From: Eric Dumazet @ 2015-03-11 20:23 UTC (permalink / raw)
To: Martin Lau; +Cc: Amir Vadai, Or Gerlitz, netdev, kernel-team
On Wed, 2015-03-11 at 13:21 -0700, Eric Dumazet wrote:
> You know, even the order-3 allocations done for regular skb allocations
> will hurt you : a single copybreaked skb stored a long time in a tcp
> receive queue will hold 32KB of memory.
>
> Even 4KB can lead to disasters.
>
> You could lower tcp_rmem so that collapsing happens sooner.
I also played with the following:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6af09a597d4f..118568267a2a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -458,10 +458,17 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 				      unsigned int length, gfp_t gfp_mask)
 {
 	struct sk_buff *skb = NULL;
-	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
-			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	unsigned int fragsz;
 
-	if (fragsz <= PAGE_SIZE && !(gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+	length = SKB_DATA_ALIGN(length + NET_SKB_PAD);
+	fragsz = length + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	/* if fragment is smaller than struct skb_shared_info overhead,
+	 * do not bother use a page fragment, because malicious traffic
+	 * could hold a full page (order-0 or order-3)
+	 */
+	if (fragsz <= PAGE_SIZE &&
+	    length > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) &&
+	    !(gfp_mask & (__GFP_WAIT | GFP_DMA))) {
 		void *data;
 
 		if (sk_memalloc_socks())
* Re: [Question] net/mlx4_en: Memory consumption issue with mlx4_en driver
From: Martin Lau @ 2015-03-12 16:56 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Amir Vadai, Or Gerlitz, netdev, kernel-team
On Wed, Mar 11, 2015 at 01:21:02PM -0700, Eric Dumazet wrote:
> You know, even the order-3 allocations done for regular skb allocations
> will hurt you : a single copybreaked skb stored a long time in a tcp
> receive queue will hold 32KB of memory.
>
> Even 4KB can lead to disasters.
Thanks for the pointer. I looked a little deeper at the allocation in skbuff.c
and I can see your point.
> You could lower tcp_rmem so that collapsing happens sooner.
That is what we did. However, a buggy process that accumulates enough stalled
sockets (it stops reading from them but does not close them) will re-surface
the problem.
Thanks,
--Martin
* Re: [Question] net/mlx4_en: Memory consumption issue with mlx4_en driver
From: Eric Dumazet @ 2015-03-12 17:24 UTC (permalink / raw)
To: Martin Lau; +Cc: Amir Vadai, Or Gerlitz, netdev, kernel-team
On Thu, 2015-03-12 at 09:56 -0700, Martin Lau wrote:
> It is what we did. However, a buggy process accommodated enough stalled
> sockets (stop reading from it but not closing it) will re-surface the problem.
That's where collapsing helps: the TCP stack reallocates linear skbs using
order-0 pages only, and fills them. Overhead is reduced to the strict
minimum.
Well, this collapsing code could be extended to add order-0 page frags,
so that overhead would be really minimal.
I'll send patches that we have been using here for a while.