linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
@ 2012-08-29  0:25 David Madore
  2012-08-29 19:32 ` Francois Romieu
  2012-09-01  2:21 ` Hugh Dickins
  0 siblings, 2 replies; 26+ messages in thread
From: David Madore @ 2012-08-29  0:25 UTC (permalink / raw)
  To: linux-kernel

Dear all,

I hope this is the right place to send this sort of backtrace dump.

I'm getting the following sort of dumps (below) on a 3.2.27 kernel on
an arm/kirkwood (actually DreamPlug) machine that's used as a router.

I imagine it being somehow related to the fact that it operates a
network bridge (I imagine this because I have another identical
machine with exactly the same kernel and a very similar config but not
running a bridge, and the warning never pops up).

Is this worth investigating?  (I will, of course, provide the config
file and any other relevant data if the answer is "yes".)  Is this
potentially serious?  (I'm getting hard lockups on this machine which
I suspect are due to hardware and unrelated to this, but if someone
tells me it could be the cause, I'd be more than happy to believe it.)

[24711.204492] ------------[ cut here ]------------
[24711.209151] WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
[24711.216667] Modules linked in: 8021q ath9k_htc mac80211 ath9k_common ath9k_hw ath cfg80211 bnep rfcomm sit tunnel4 sch_ingress cls_fw cls_u32 sch_sfq sch_htb pppoe pppox ppp_generic slhc bridge stp llc ip6t_REJECT ip6table_filter ip6table_mangle xt_NOTRACK ip6table_raw ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp ipt_REJECT xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat xt_TCPMSS xt_tcpudp xt_mark iptable_mangle ip_tables x_tables nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 orion_wdt ipv6 snd_usb_audio snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_rawmidi btmrvl_sdio btmrvl snd_seq snd_timer snd_seq_device snd bluetooth soundcore
[24711.280663] [<c000d728>] (unwind_backtrace+0x0/0xf0) from [<c0022f74>] (warn_slowpath_common+0x50/0x68)
[24711.290124] [<c0022f74>] (warn_slowpath_common+0x50/0x68) from [<c0022fa8>] (warn_slowpath_null+0x1c/0x24)
[24711.299845] [<c0022fa8>] (warn_slowpath_null+0x1c/0x24) from [<c009caec>] (__alloc_pages_nodemask+0x1d4/0x68c)
[24711.309914] [<c009caec>] (__alloc_pages_nodemask+0x1d4/0x68c) from [<c009cfb4>] (__get_free_pages+0x10/0x3c)
[24711.319805] [<c009cfb4>] (__get_free_pages+0x10/0x3c) from [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc)
[24711.329269] [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc) from [<c038d638>] (pskb_expand_head+0x68/0x298)
[24711.338901] [<c038d638>] (pskb_expand_head+0x68/0x298) from [<bf0dd3ec>] (ip6_forward+0x4d4/0x7bc [ipv6])
[24711.348638] [<bf0dd3ec>] (ip6_forward+0x4d4/0x7bc [ipv6]) from [<bf0dfebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6])
[24711.358333] [<bf0dfebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [<c0394870>] (__netif_receive_skb+0x544/0x66c)
[24711.368106] [<c0394870>] (__netif_receive_skb+0x544/0x66c) from [<bf1cd054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
[24711.379899] [<bf1cd054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) from [<bf1cdae8>] (br_nf_pre_routing+0x59c/0x67c [bridge])
[24711.392271] [<bf1cdae8>] (br_nf_pre_routing+0x59c/0x67c [bridge]) from [<c03bd2a4>] (nf_iterate+0x8c/0xb4)
[24711.401988] [<c03bd2a4>] (nf_iterate+0x8c/0xb4) from [<c03bd328>] (nf_hook_slow+0x5c/0x118)
[24711.410540] [<c03bd328>] (nf_hook_slow+0x5c/0x118) from [<bf1c7fa4>] (br_handle_frame+0x1b8/0x290 [bridge])
[24711.420367] [<bf1c7fa4>] (br_handle_frame+0x1b8/0x290 [bridge]) from [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c)
[24711.430872] [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c) from [<c031e254>] (mv643xx_eth_poll+0x540/0x734)
[24711.440680] [<c031e254>] (mv643xx_eth_poll+0x540/0x734) from [<c0397390>] (net_rx_action+0x118/0x314)
[24711.449970] [<c0397390>] (net_rx_action+0x118/0x314) from [<c0029924>] (__do_softirq+0xac/0x234)
[24711.458817] [<c0029924>] (__do_softirq+0xac/0x234) from [<c0029f00>] (irq_exit+0x94/0x9c)
[24711.467046] [<c0029f00>] (irq_exit+0x94/0x9c) from [<c00094b0>] (handle_IRQ+0x34/0x84)
[24711.475007] [<c00094b0>] (handle_IRQ+0x34/0x84) from [<c04398d4>] (__irq_svc+0x34/0x98)
[24711.483068] [<c04398d4>] (__irq_svc+0x34/0x98) from [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94)
[24711.491908] [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94) from [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c)
[24711.501532] [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c) from [<c0009764>] (cpu_idle+0x88/0xdc)
[24711.510201] [<c0009764>] (cpu_idle+0x88/0xdc) from [<c05d8720>] (start_kernel+0x2a0/0x2f0)
[24711.518512] ---[ end trace e1776fbe32468909 ]---

-- 
     David A. Madore
   ( http://www.madore.org/~david/ )

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-08-29  0:25 kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c() David Madore
@ 2012-08-29 19:32 ` Francois Romieu
  2012-08-31 10:59   ` David Madore
  2012-09-01  2:21 ` Hugh Dickins
  1 sibling, 1 reply; 26+ messages in thread
From: Francois Romieu @ 2012-08-29 19:32 UTC (permalink / raw)
  To: David Madore; +Cc: linux-kernel

David Madore <david+ml@madore.org> :
[...]
> I imagine it being somehow related to the fact that it operates a
> network bridge (I imagine this because I have another identical
> machine with exactly the same kernel and a very similar config but not
> running a bridge, and the warning never pops up).

Could it not be a genuine allocation failure ?

-- 
Ueimor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-08-29 19:32 ` Francois Romieu
@ 2012-08-31 10:59   ` David Madore
  2012-08-31 13:57     ` David Madore
  0 siblings, 1 reply; 26+ messages in thread
From: David Madore @ 2012-08-31 10:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: Francois Romieu

On Wed, Aug 29, 2012 at 09:32:20PM +0200, Francois Romieu wrote:
> David Madore <david+ml@madore.org> :
> [...]
> > I imagine it being somehow related to the fact that it operates a
> > network bridge (I imagine this because I have another identical
> > machine with exactly the same kernel and a very similar config but not
> > running a bridge, and the warning never pops up).
> 
> Could it not be a genuine allocation failure ?

I have no idea.  How can I tell?  In any case, if having 512MB RAM
isn't enough for the kernel in the router of a small home's network,
that's a bug somewhere, isn't it?

Also:

On Wed, Aug 29, 2012 at 02:25:48AM +0200, David Madore wrote:
> Is this worth investigating?  (I will, of course, provide the config
> file and any other relevant data if the answer is "yes".)  Is this
> potentially serious?  (I'm getting hard lockups on this machine which
> I suspect are due to hardware and unrelated to this, but if someone
> tells me it could be the cause, I'd be more than happy to believe it.)

I'm now inclined to believe the hard lockups are indeed related to
this (I can semi-reproducibly make them happen with only network
traffic - actually, with the messages of a compilation taking place on
another machine being routed through this box (over IPv6)).

So how can I help debug this?  (One difficulty is that I have only
remote access to this box, and it's not meant for experimenting with.)

-- 
     David A. Madore
   ( http://www.madore.org/~david/ )

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-08-31 10:59   ` David Madore
@ 2012-08-31 13:57     ` David Madore
  2012-08-31 18:12       ` Francois Romieu
  0 siblings, 1 reply; 26+ messages in thread
From: David Madore @ 2012-08-31 13:57 UTC (permalink / raw)
  To: linux-kernel

On Fri, Aug 31, 2012 at 12:59:36PM +0200, David Madore wrote:
> On Wed, Aug 29, 2012 at 09:32:20PM +0200, Francois Romieu wrote:
> > David Madore <david+ml@madore.org> :
> > [...]
> > > I imagine it being somehow related to the fact that it operates a
> > > network bridge (I imagine this because I have another identical
> > > machine with exactly the same kernel and a very similar config but not
> > > running a bridge, and the warning never pops up).
> > 
> > Could it not be a genuine allocation failure ?
> 
> I have no idea.  How can I tell?  In any case, if having 512MB RAM
> isn't enough for the kernel in the router of a small home's network,
> that's a bug somewhere, isn't it?

PS: I'm also getting the following kind of messages from a wlan
interface that's on the bridge:

[  268.976317] ieee80211 phy0: failed to reallocate TX buffer
[  716.880515] ieee80211 phy0: failed to reallocate TX buffer
[ 1160.877677] ieee80211 phy0: failed to reallocate TX buffer

Could they be related?

-- 
     David A. Madore
   ( http://www.madore.org/~david/ )

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-08-31 13:57     ` David Madore
@ 2012-08-31 18:12       ` Francois Romieu
  0 siblings, 0 replies; 26+ messages in thread
From: Francois Romieu @ 2012-08-31 18:12 UTC (permalink / raw)
  To: David Madore; +Cc: linux-kernel

David Madore <david+ml@madore.org> :
[...]
> [  268.976317] ieee80211 phy0: failed to reallocate TX buffer
> [  716.880515] ieee80211 phy0: failed to reallocate TX buffer
> [ 1160.877677] ieee80211 phy0: failed to reallocate TX buffer
> 
> Could they be related?

Yes. This is a pskb_expand_head failure from net/mac80211/tx.c.
The router is experiencing GFP_ATOMIC allocation failure

You should monitor /proc/slabinfo content. A pig may appear over time.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-08-29  0:25 kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c() David Madore
  2012-08-29 19:32 ` Francois Romieu
@ 2012-09-01  2:21 ` Hugh Dickins
  2012-09-01  8:20   ` Francois Romieu
  2012-10-04 16:02   ` Maxime Bizon
  1 sibling, 2 replies; 26+ messages in thread
From: Hugh Dickins @ 2012-09-01  2:21 UTC (permalink / raw)
  To: David Madore; +Cc: Francois Romieu, netdev, linux-kernel

[ Cc'ing original mail to netdev as the problem may be recognized there ]

On Wed, 29 Aug 2012, David Madore wrote:
> Dear all,
> 
> I hope this is the right place to send this sort of backtrace dump.
> 
> I'm getting the following sort of dumps (below) on a 3.2.27 kernel on
> an arm/kirkwood (actually DreamPlug) machine that's used as a router.
> 
> I imagine it being somehow related to the fact that it operates a
> network bridge (I imagine this because I have another identical
> machine with exactly the same kernel and a very similar config but not
> running a bridge, and the warning never pops up).
> 
> Is this worth investigating?  (I will, of course, provide the config
> file and any other relevant data if the answer is "yes".)  Is this
> potentially serious?  (I'm getting hard lockups on this machine which
> I suspect are due to hardware and unrelated to this, but if someone
> tells me it could be the cause, I'd be more than happy to believe it.)
> 
> [24711.204492] ------------[ cut here ]------------
> [24711.209151] WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
> [24711.216667] Modules linked in: 8021q ath9k_htc mac80211 ath9k_common ath9k_hw ath cfg80211 bnep rfcomm sit tunnel4 sch_ingress cls_fw cls_u32 sch_sfq sch_htb pppoe pppox ppp_generic slhc bridge stp llc ip6t_REJECT ip6table_filter ip6table_mangle xt_NOTRACK ip6table_raw ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp ipt_REJECT xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat xt_TCPMSS xt_tcpudp xt_mark iptable_mangle ip_tables x_tables nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 orion_wdt ipv6 snd_usb_audio snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_rawmidi btmrvl_sdio btmrvl snd_seq snd_timer snd_seq_device snd bluetooth soundcore
> [24711.280663] [<c000d728>] (unwind_backtrace+0x0/0xf0) from [<c0022f74>] (warn_slowpath_common+0x50/0x68)
> [24711.290124] [<c0022f74>] (warn_slowpath_common+0x50/0x68) from [<c0022fa8>] (warn_slowpath_null+0x1c/0x24)
> [24711.299845] [<c0022fa8>] (warn_slowpath_null+0x1c/0x24) from [<c009caec>] (__alloc_pages_nodemask+0x1d4/0x68c)
> [24711.309914] [<c009caec>] (__alloc_pages_nodemask+0x1d4/0x68c) from [<c009cfb4>] (__get_free_pages+0x10/0x3c)
> [24711.319805] [<c009cfb4>] (__get_free_pages+0x10/0x3c) from [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc)
> [24711.329269] [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc) from [<c038d638>] (pskb_expand_head+0x68/0x298)
> [24711.338901] [<c038d638>] (pskb_expand_head+0x68/0x298) from [<bf0dd3ec>] (ip6_forward+0x4d4/0x7bc [ipv6])
> [24711.348638] [<bf0dd3ec>] (ip6_forward+0x4d4/0x7bc [ipv6]) from [<bf0dfebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6])
> [24711.358333] [<bf0dfebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [<c0394870>] (__netif_receive_skb+0x544/0x66c)
> [24711.368106] [<c0394870>] (__netif_receive_skb+0x544/0x66c) from [<bf1cd054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
> [24711.379899] [<bf1cd054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) from [<bf1cdae8>] (br_nf_pre_routing+0x59c/0x67c [bridge])
> [24711.392271] [<bf1cdae8>] (br_nf_pre_routing+0x59c/0x67c [bridge]) from [<c03bd2a4>] (nf_iterate+0x8c/0xb4)
> [24711.401988] [<c03bd2a4>] (nf_iterate+0x8c/0xb4) from [<c03bd328>] (nf_hook_slow+0x5c/0x118)
> [24711.410540] [<c03bd328>] (nf_hook_slow+0x5c/0x118) from [<bf1c7fa4>] (br_handle_frame+0x1b8/0x290 [bridge])
> [24711.420367] [<bf1c7fa4>] (br_handle_frame+0x1b8/0x290 [bridge]) from [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c)
> [24711.430872] [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c) from [<c031e254>] (mv643xx_eth_poll+0x540/0x734)
> [24711.440680] [<c031e254>] (mv643xx_eth_poll+0x540/0x734) from [<c0397390>] (net_rx_action+0x118/0x314)
> [24711.449970] [<c0397390>] (net_rx_action+0x118/0x314) from [<c0029924>] (__do_softirq+0xac/0x234)
> [24711.458817] [<c0029924>] (__do_softirq+0xac/0x234) from [<c0029f00>] (irq_exit+0x94/0x9c)
> [24711.467046] [<c0029f00>] (irq_exit+0x94/0x9c) from [<c00094b0>] (handle_IRQ+0x34/0x84)
> [24711.475007] [<c00094b0>] (handle_IRQ+0x34/0x84) from [<c04398d4>] (__irq_svc+0x34/0x98)
> [24711.483068] [<c04398d4>] (__irq_svc+0x34/0x98) from [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94)
> [24711.491908] [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94) from [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c)
> [24711.501532] [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c) from [<c0009764>] (cpu_idle+0x88/0xdc)
> [24711.510201] [<c0009764>] (cpu_idle+0x88/0xdc) from [<c05d8720>] (start_kernel+0x2a0/0x2f0)
> [24711.518512] ---[ end trace e1776fbe32468909 ]---

Francois is right that a GFP_ATOMIC allocation from pskb_expand_head()
is failing, which can easily happen, and cause your "failed to reallocate
TX buffer" errors; but it's well worth looking up what's actually on
lines 2108 and 2109 of mm/page_alloc.c in 3.2.27:

	if (order >= MAX_ORDER) {
		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));

That was probably not a sane allocation request, it has gone out of range:
maybe the skb header is even corrupted.  If you're lucky, it might be
something that netdev will recognize as already fixed.

Hugh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-09-01  2:21 ` Hugh Dickins
@ 2012-09-01  8:20   ` Francois Romieu
  2012-09-02 22:51     ` David Madore
  2012-10-04 16:02   ` Maxime Bizon
  1 sibling, 1 reply; 26+ messages in thread
From: Francois Romieu @ 2012-09-01  8:20 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Madore, netdev, linux-kernel, Eric Dumazet

Hugh Dickins <hughd@google.com> :
> [ Cc'ing original mail to netdev as the problem may be recognized there ]
[...]
> Francois is right that a GFP_ATOMIC allocation from pskb_expand_head()
> is failing, which can easily happen, and cause your "failed to reallocate
> TX buffer" errors; but it's well worth looking up what's actually on
> lines 2108 and 2109 of mm/page_alloc.c in 3.2.27:
> 
> 	if (order >= MAX_ORDER) {
> 		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));

You are right. The wifi Tx path error is not related.

I overlooked it was using SLUB btw.

> That was probably not a sane allocation request, it has gone out of range:
> maybe the skb header is even corrupted.  If you're lucky, it might be
> something that netdev will recognize as already fixed.

Afaics nothing related to mv643xx_eth.c, nor pskb_expand_head nor
ip6_forward, at least nothing trivially close. Eric may provide a
better answer.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-09-01  8:20   ` Francois Romieu
@ 2012-09-02 22:51     ` David Madore
  0 siblings, 0 replies; 26+ messages in thread
From: David Madore @ 2012-09-02 22:51 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Hugh Dickins, netdev, linux-kernel, Eric Dumazet

Since I had a rare occasion to physically access the machine, I did
the following experiment: connect another machine to the serial
console, run

while true ; do date ; cat /proc/slabinfo ; echo '***' ; sleep 3 ; done

and generate lots of IPv6 traffic through the box (as I mentioned, for
some reason, a Firefox compilation through ssh seems particularly
effective).  So I now have lots of slabinfo data and, beyond the
initial WARNING, I also got messages along the lines of "swapper: page
allocation failure: order:10, mode:0x4020".

I put the full log in <URL:
http://www.madore.org/~david/.tmp/pollux-dump.0
 > (unfortunately a bit garbled, because sometimes the cat slabinfo
was interspaced with printk output, but there are still plenty of
usable lines of each sort).

For completeness, here's a sample message from a page allocation
failure, and a copy of /proc/slabinfo from just about that time (I
have no idea how to read this, but one thing I can say is that there
is no extraordinarily large number in this):

[  567.757489] swapper: page allocation failure: order:10, mode:0x4020
[  567.763815] [<c000d728>] (unwind_backtrace+0x0/0xf0) from [<c009a87c>] (warn_alloc_failed+0xcc/0x10c)
[  567.773119] [<c009a87c>] (warn_alloc_failed+0xcc/0x10c) from [<c009ce48>] (__alloc_pages_nodemask+0x530/0x68c)
[  567.783184] [<c009ce48>] (__alloc_pages_nodemask+0x530/0x68c) from [<c009cfb4>] (__get_free_pages+0x10/0x3c)
[  567.793084] [<c009cfb4>] (__get_free_pages+0x10/0x3c) from [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc)
[  567.802547] [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc) from [<c038d638>] (pskb_expand_head+0x68/0x298)
[  567.812317] [<c038d638>] (pskb_expand_head+0x68/0x298) from [<bf0e93ec>] (ip6_forward+0x4d4/0x7bc [ipv6])
[  567.822056] [<bf0e93ec>] (ip6_forward+0x4d4/0x7bc [ipv6]) from [<bf0ebebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6])
[  567.831751] [<bf0ebebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [<c0394870>] (__netif_receive_skb+0x544/0x66c)
[  567.841521] [<c0394870>] (__netif_receive_skb+0x544/0x66c) from [<bf1d9054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
[  567.853306] [<bf1d9054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) from [<bf1d9ae8>] (br_nf_pre_routing+0x59c/0x67c [bridge])
[  567.865673] [<bf1d9ae8>] (br_nf_pre_routing+0x59c/0x67c [bridge]) from [<c03bd2a4>] (nf_iterate+0x8c/0xb4)
[  567.875387] [<c03bd2a4>] (nf_iterate+0x8c/0xb4) from [<c03bd328>] (nf_hook_slow+0x5c/0x118)
[  567.883800] [<c03bd328>] (nf_hook_slow+0x5c/0x118) from [<bf1d3fa4>] (br_handle_frame+0x1b8/0x290 [bridge])
[  567.893624] [<bf1d3fa4>] (br_handle_frame+0x1b8/0x290 [bridge]) from [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c)
[  567.904137] [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c) from [<c031e254>] (mv643xx_eth_poll+0x540/0x734)
[  567.913928] [<c031e254>] (mv643xx_eth_poll+0x540/0x734) from [<c0397390>] (net_rx_action+0x118/0x314)
[  567.923215] [<c0397390>] (net_rx_action+0x118/0x314) from [<c0029924>] (__do_softirq+0xac/0x234)
[  567.932058] [<c0029924>] (__do_softirq+0xac/0x234) from [<c0029f00>] (irq_exit+0x94/0x9c)
[  567.940421] [<c0029f00>] (irq_exit+0x94/0x9c) from [<c00094b0>] (handle_IRQ+0x34/0x84)
[  567.948392] [<c00094b0>] (handle_IRQ+0x34/0x84) from [<c04398d4>] (__irq_svc+0x34/0x98)
[  567.956454] [<c04398d4>] (__irq_svc+0x34/0x98) from [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94)
[  567.965299] [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94) from [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c)
[  567.974925] [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c) from [<c0009764>] (cpu_idle+0x88/0xdc)
[  567.983581] [<c0009764>] (cpu_idle+0x88/0xdc) from [<c05d8720>] (start_kernel+0x2a0/0x2f0)
[  567.991893] Mem-info:
[  567.994185] Normal per-cpu:
[  567.996995] CPU    0: hi:  186, btch:  31 usd:  84
[  568.001815] active_anon:5592 inactive_anon:34 isolated_anon:0
[  568.001820]  active_file:2845 inactive_file:6118 isolated_file:0
[  568.001825]  unevictable:418 dirty:13 writeback:0 unstable:0
[  568.001829]  free:12507 slab_reclaimable:632 slab_unreclaimable:1124
[  568.001835]  mapped:2546 shmem:47 pagetables:152 bounce:0
[  568.031126] Normal free:50028kB min:2884kB low:3604kB high:4324kB active_anon:22368kB inactive_anon:136kB active_file:11380kB inactive_file:24472kB unevictable:1672kB isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:1672kB dirty:52kB writeback:0kB mapped:10184kB shmem:188kB slab_reclaimable:2528kB slab_unreclaimable:4496kB kernel_stack:584kB pagetables:608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  568.071329] lowmem_reserve[]: 0 0
[  568.074696] Normal: 1*4kB 1*8kB 8*16kB 7*32kB 0*64kB 2*128kB 1*256kB 4*512kB 20*1024kB 11*2048kB 1*4096kB = 50028kB
[  568.085282] 9350 total pagecache pages
[  568.089039] 0 pages in swap cache
[  568.092363] Swap cache stats: add 0, delete 0, find 0/0
[  568.097621] Free swap  = 0kB
[  568.100506] Total swap = 0kB
[  568.117927] 131072 pages of RAM
[  568.121087] 12771 free pages
[  568.123972] 2839 reserved pages
[  568.127140] 1361 slab pages
[  568.129945] 8003 pages shared
[  568.132919] 0 pages swap cached

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect      0      0    176   23    1 : tunables    0    0    0 : slabdata      0      0      0
nf_conntrack_c06d1258    128    128    248   16    1 : tunables    0    0    0 : slabdata      8      8      0
ip6_dst_cache         72     72    224   18    1 : tunables    0    0    0 : slabdata      4      4      0
RAWv6                 11     11    704   11    2 : tunables    0    0    0 : slabdata      1      1      0
UDPLITEv6              0      0    704   11    2 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                 33     33    704   11    2 : tunables    0    0    0 : slabdata      3      3      0
tw_sock_TCPv6          0      0    160   25    1 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 12     12   1344   12    4 : tunables    0    0    0 : slabdata      1      1      0
ubifs_inode_slab       0      0    408   10    1 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache      8      8    512    8    1 : tunables    0    0    0 : slabdata      1      1      0
jffs2_xattr_datum      0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
jffs2_refblock         0      0    248   16    1 : tunables    0    0    0 : slabdata      0      0      0
jffs2_i                0      0    360   11    1 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    392   10    1 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle       170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
revoke_record        256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
ext4_inode_cache       0      0    568   14    2 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data         0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context      0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end            0      0    576   14    2 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page           0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    448    9    1 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache     952    952    464   17    2 : tunables    0    0    0 : slabdata     56     56      0
ext3_xattr             0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                  0      0    160   25    1 : tunables    0    0    0 : slabdata      0      0      0
dnotify_mark          56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
dio                    0      0    320   12    1 : tunables    0    0    0 : slabdata      0      0      0
UNIX                  27     27    448    9    1 : tunables    0    0    0 : slabdata      3      3      0
UDP-Lite               0      0    544   15    2 : tunables    0    0    0 : slabdata      0      0      0
arp_cache             28     28    288   14    1 : tunables    0    0    0 : slabdata      2      2      0
RAW                   15     15    544   15    2 : tunables    0    0    0 : slabdata      1      1      0
UDP                   45     45    544   15    2 : tunables    0    0    0 : slabdata      3      3      0
tw_sock_TCP           32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
TCP                   13     13   1216   13    4 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_pwq        102    102     40  102    1 : tunables    0    0    0 : slabdata      1      1      0
blkdev_requests       54     54    216   18    1 : tunables    0    0    0 : slabdata      3      3      0
blkdev_ioc             0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
fsnotify_event_holder    256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
biovec-256            10     10   3072   10    8 : tunables    0    0    0 : slabdata      1      1      0
biovec-128             0      0   1536   10    4 : tunables    0    0    0 : slabdata      0      0      0
biovec-64             10     10    768   10    2 : tunables    0    0    0 : slabdata      1      1      0
sock_inode_cache     143    143    352   11    1 : tunables    0    0    0 : slabdata     13     13      0
file_lock_cache       36     36    112   36    1 : tunables    0    0    0 : slabdata      1      1      0
net_namespace          0      0   1248   13    4 : tunables    0    0    0 : slabdata      0      0      0
shmem_inode_cache    715    715    352   11    1 : tunables    0    0    0 : slabdata     65     65      0
taskstats             12     12    328   12    1 : tunables    0    0    0 : slabdata      1      1      0
proc_inode_cache     350    396    352   11    1 : tunables    0    0    0 : slabdata     36     36      0
sigqueue              56     56    144   28    1 : tunables    0    0    0 : slabdata      2      2      0
bdev_cache             9      9    448    9    1 : tunables    0    0    0 : slabdata      1      1      0
sysfs_dir_cache    11118  11118     80   51    1 : tunables    0    0    0 : slabdata    218    218      0
filp                 925    925    160   25    1 : tunables    0    0    0 : slabdata     37     37      0
inode_cache         2508   2508    320   12    1 : tunables    0    0    0 : slabdata    209    209      0
dentry              4928   4928    128   32    1 : tunables    0    0    0 : slabdata    154    154      0
buffer_head         1984   1984     64   64    1 : tunables    0    0    0 : slabdata     31     31      0
vm_area_struct      1932   1932     88   46    1 : tunables    0    0    0 : slabdata     42     42      0
mm_struct             50     50    384   10    1 : tunables    0    0    0 : slabdata      5      5      0
signal_cache          84     84    640   12    2 : tunables    0    0    0 : slabdata      7      7      0
sighand_cache         84     84   1312   12    4 : tunables    0    0    0 : slabdata      7      7      0
task_struct           90     90   1088   15    4 : tunables    0    0    0 : slabdata      6      6      0
anon_vma_chain      2210   2210     24  170    1 : tunables    0    0    0 : slabdata     13     13      0
anon_vma            1280   1280     32  128    1 : tunables    0    0    0 : slabdata     10     10      0
radix_tree_node      962    962    304   13    1 : tunables    0    0    0 : slabdata     74     74      0
idr_layer_cache      260    260    152   26    1 : tunables    0    0    0 : slabdata     10     10      0
kmalloc-8192          44     44   8192    4    8 : tunables    0    0    0 : slabdata     11     11      0
kmalloc-4096         114    128   4096    8    8 : tunables    0    0    0 : slabdata     16     16      0
kmalloc-2048          88    160   2048    8    4 : tunables    0    0    0 : slabdata     20     20      0
kmalloc-1024         176    176   1024    8    2 : tunables    0    0    0 : slabdata     22     22      0
kmalloc-512          504    504    512    8    1 : tunables    0    0    0 : slabdata     63     63      0
kmalloc-256          240    240    256   16    1 : tunables    0    0    0 : slabdata     15     15      0
kmalloc-128          512    512    128   32    1 : tunables    0    0    0 : slabdata     16     16      0
kmalloc-64          1728   1728     64   64    1 : tunables    0    0    0 : slabdata     27     27      0
kmalloc-32         10240  10240     32  128    1 : tunables    0    0    0 : slabdata     80     80      0
kmalloc-192          546    546    192   21    1 : tunables    0    0    0 : slabdata     26     26      0
kmalloc-96          1092   1092     96   42    1 : tunables    0    0    0 : slabdata     26     26      0
kmem_cache            32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
kmem_cache_node      128    128     32  128    1 : tunables    0    0    0 : slabdata      1      1      0

-- 
     David A. Madore
   ( http://www.madore.org/~david/ )

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-09-01  2:21 ` Hugh Dickins
  2012-09-01  8:20   ` Francois Romieu
@ 2012-10-04 16:02   ` Maxime Bizon
  2012-10-04 16:29     ` Eric Dumazet
  2012-10-04 16:50     ` Eric Dumazet
  1 sibling, 2 replies; 26+ messages in thread
From: Maxime Bizon @ 2012-10-04 16:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Fri, 2012-08-31 at 19:21 -0700, Hugh Dickins wrote:

Hi,

> Francois is right that a GFP_ATOMIC allocation from pskb_expand_head()
> is failing, which can easily happen, and cause your "failed to reallocate
> TX buffer" errors; but it's well worth looking up what's actually on
> lines 2108 and 2109 of mm/page_alloc.c in 3.2.27:
> 
> 	if (order >= MAX_ORDER) {
> 		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
> 
> That was probably not a sane allocation request, it has gone out of range:
> maybe the skb header is even corrupted.  If you're lucky, it might be
> something that netdev will recognize as already fixed.

I have the same problem on the exact same hardware and found the cause:

Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Tue Apr 10 20:08:39 2012 +0000

    net: allow pskb_expand_head() to get maximum tailroom
    
    [ Upstream commit 87151b8689d890dfb495081f7be9b9e257f7a2df ]
    

It turns out this change has a bad side effect on drivers that uses
skb_recycle(), in that case mv643xx_eth.c

Since skb_recycle() resets skb->data using (skb->head + NET_SKB_PAD), a
recycled skb going multiple times through a path that needs to expand
skb head will get bigger and bigger each time, and you eventually end up
with an allocation failure.

An idea to fix this would be to pass needed skb size to skb_resize() and
set skb->data to MIN(NET_SKB_PAD, (skb->end - skb->head - skb_size) / 2)

skb recycling gives a small speed boost, but does not get a lot of test
coverage since only 3 drivers uses it

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 16:02   ` Maxime Bizon
@ 2012-10-04 16:29     ` Eric Dumazet
  2012-10-05  7:41       ` Eric Dumazet
  2012-10-04 16:50     ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-04 16:29 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Thu, 2012-10-04 at 18:02 +0200, Maxime Bizon wrote:
> On Fri, 2012-08-31 at 19:21 -0700, Hugh Dickins wrote:
> 
> Hi,
> 
> > Francois is right that a GFP_ATOMIC allocation from pskb_expand_head()
> > is failing, which can easily happen, and cause your "failed to reallocate
> > TX buffer" errors; but it's well worth looking up what's actually on
> > lines 2108 and 2109 of mm/page_alloc.c in 3.2.27:
> > 
> > 	if (order >= MAX_ORDER) {
> > 		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
> > 
> > That was probably not a sane allocation request, it has gone out of range:
> > maybe the skb header is even corrupted.  If you're lucky, it might be
> > something that netdev will recognize as already fixed.
> 
> I have the same problem on the exact same hardware and found the cause:
> 
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Tue Apr 10 20:08:39 2012 +0000
> 
>     net: allow pskb_expand_head() to get maximum tailroom
>     
>     [ Upstream commit 87151b8689d890dfb495081f7be9b9e257f7a2df ]
>     
> 
> It turns out this change has a bad side effect on drivers that uses
> skb_recycle(), in that case mv643xx_eth.c
> 
> Since skb_recycle() resets skb->data using (skb->head + NET_SKB_PAD), a
> recycled skb going multiple times through a path that needs to expand
> skb head will get bigger and bigger each time, and you eventually end up
> with an allocation failure.
> 
> An idea to fix this would be to pass needed skb size to skb_resize() and
> set skb->data to MIN(NET_SKB_PAD, (skb->end - skb->head - skb_size) / 2)
> 
> skb recycling gives a small speed boost, but does not get a lot of test
> coverage since only 3 drivers uses it
> 

Thanks Maxime

Sure we can probably fix this issue, but its really not worth the pain.

I would get rid of it, its superseded by build_skb() to get cache hot
skbs anyway, and more over, rx path now uses skb->head allocated from a
page fragment for optimal GRO/TCP coalescing behavior.

skb_recycle() assumes skb allocation is slow, but its not per se.

Cache line misses are expensive, thats the real issue.




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 16:02   ` Maxime Bizon
  2012-10-04 16:29     ` Eric Dumazet
@ 2012-10-04 16:50     ` Eric Dumazet
  2012-10-04 17:09       ` Maxime Bizon
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-04 16:50 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Thu, 2012-10-04 at 18:02 +0200, Maxime Bizon wrote:
> O
> 
> Since skb_recycle() resets skb->data using (skb->head + NET_SKB_PAD), a
> recycled skb going multiple times through a path that needs to expand
> skb head will get bigger and bigger each time, and you eventually end up
> with an allocation failure.
> 

Because there is not enough headroom ?

> An idea to fix this would be to pass needed skb size to skb_resize() and
> set skb->data to MIN(NET_SKB_PAD, (skb->end - skb->head - skb_size) / 2)

I am trying to decode this but I cant ;)

What is skb_resize() ?
and what do you mean setting skb->data to MIN(NET_SKB_PAD, (skb->end -
skb->head - skb_size) / 2)

Care to explain again your idea ?

Thanks !




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 16:50     ` Eric Dumazet
@ 2012-10-04 17:09       ` Maxime Bizon
  2012-10-04 17:17         ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Maxime Bizon @ 2012-10-04 17:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Thu, 2012-10-04 at 18:50 +0200, Eric Dumazet wrote:

> > Since skb_recycle() resets skb->data using (skb->head + NET_SKB_PAD), a
> > recycled skb going multiple times through a path that needs to expand
> > skb head will get bigger and bigger each time, and you eventually end up
> > with an allocation failure.
> > 
> 
> Because there is not enough headroom ?

yes, on ipv6 forward path the default NET_SKB_PAD is too small, so each
packet forwarded has its headroom expanded, it is then recycled and gets
its original default headroom back, then it gets forwarded,
expanded, ...

> > An idea to fix this would be to pass needed skb size to skb_resize() and
> > set skb->data to MIN(NET_SKB_PAD, (skb->end - skb->head - skb_size) / 2)
> 
> I am trying to decode this but I cant ;)
> 
> What is skb_resize() ?
> and what do you mean setting skb->data to MIN(NET_SKB_PAD, (skb->end -
> skb->head - skb_size) / 2)
> 
> Care to explain again your idea ?

damn typo, I meant skb_recycle(), and my formula is probably wrong.

skb_size is passed to skb_recycle_check() to ensure the skb is at least
that big.

The idea is it to pass the same value to skb_recycle(), allowing it to
set skb->data somewhat at the middle of skb head space (after honoring
NET_SKB_PAD), that way the recycled skb won't have its head expanded
again if it takes the same path.

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 17:09       ` Maxime Bizon
@ 2012-10-04 17:17         ` Eric Dumazet
  2012-10-04 17:34           ` Maxime Bizon
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-04 17:17 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Thu, 2012-10-04 at 19:09 +0200, Maxime Bizon wrote:

> yes, on ipv6 forward path the default NET_SKB_PAD is too small, so each
> packet forwarded has its headroom expanded, it is then recycled and gets
> its original default headroom back, then it gets forwarded,
> expanded, ...

Hmm, this sounds bad (especially without recycle)

Might I assume NET_SKB_PAD is 32 on this arch ?

We might adjust NET_SKB_PAD to avoid all these memory copies...



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 17:17         ` Eric Dumazet
@ 2012-10-04 17:34           ` Maxime Bizon
  2012-10-04 21:27             ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Maxime Bizon @ 2012-10-04 17:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Thu, 2012-10-04 at 19:17 +0200, Eric Dumazet wrote:

> > yes, on ipv6 forward path the default NET_SKB_PAD is too small, so each
> > packet forwarded has its headroom expanded, it is then recycled and gets
> > its original default headroom back, then it gets forwarded,
> > expanded, ...
> 
> Hmm, this sounds bad (especially without recycle)
> 
> Might I assume NET_SKB_PAD is 32 on this arch ?

It is, I have a setup with 6to4 tunneling, so needed headroom on tx is
quite big.

I used to be careful about raising this value to avoid drivers using
slab-4096 instead of slab-2048, but since our boards no longer have 16MB
of RAM and with the recent changes in mainline it doesn't seem to be an
issue anymore.

It's not a that big issue in the non recycle case, just lower
performance if the tunable is not set correctly. Though it would be nice
to have a stat/counter so you know when you hit this kind of slow path.

But on the recycle case, skb->head is reallocated to twice the size each
time the packet is recycled and takes the same path again. This stresses
the VM and you eventually get packet loss (and scary printk)

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 17:34           ` Maxime Bizon
@ 2012-10-04 21:27             ` Eric Dumazet
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2012-10-04 21:27 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Thu, 2012-10-04 at 19:34 +0200, Maxime Bizon wrote:
> On Thu, 2012-10-04 at 19:17 +0200, Eric Dumazet wrote:
> 
> > > yes, on ipv6 forward path the default NET_SKB_PAD is too small, so each
> > > packet forwarded has its headroom expanded, it is then recycled and gets
> > > its original default headroom back, then it gets forwarded,
> > > expanded, ...
> > 
> > Hmm, this sounds bad (especially without recycle)
> > 
> > Might I assume NET_SKB_PAD is 32 on this arch ?
> 
> It is, I have a setup with 6to4 tunneling, so needed headroom on tx is
> quite big.
> 

If we change NET_SKB_PAD minimum to be 64 (as it is on x86), it should
be enough for the added tunnel encapsulation or not ?


> I used to be careful about raising this value to avoid drivers using
> slab-4096 instead of slab-2048, but since our boards no longer have 16MB
> of RAM and with the recent changes in mainline it doesn't seem to be an
> issue anymore.

Yes, granted we can allocate order-3 pages for delivering skb->head
fragments, adding 32 bytes doesnt switch to slab-4096 since we dont use
it anymore.

> 
> It's not a that big issue in the non recycle case, just lower
> performance if the tunable is not set correctly. Though it would be nice
> to have a stat/counter so you know when you hit this kind of slow path.
> 

Yeah, we already mentioned this idea in the past. We are lazy now we
have good performance tools (perf)

> But on the recycle case, skb->head is reallocated to twice the size each
> time the packet is recycled and takes the same path again. This stresses
> the VM and you eventually get packet loss (and scary printk)
> 

OK, so to fix this on stable trees, skb_recycle() should not recycle skb
if skb->head is too big.

By the way, another problem with skb_recycle() is that skb->truesize can
be wrong as well. (One skb might had one frag, with a truesize of
2048/4096 bytes, and this frag was pulled in skb->head, so skb->truesize
is slightly wrong.

So we also must check if skb->truesize is equal to
SKB_TRUESIZE(skb_end_offset(skb)), or reset it in skb_recycle(),
I have no strong opinion.

Something like this (untested) patch :

But I really think we should remove skb_recycle() when net-next is
opened again.


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b33a3a1..13ca215 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2659,7 +2659,10 @@ static inline bool skb_is_recycleable(const struct sk_buff *skb, int skb_size)
 	skb_size = SKB_DATA_ALIGN(skb_size + NET_SKB_PAD);
 	if (skb_end_offset(skb) < skb_size)
 		return false;
-
+	if (skb_end_offset(skb) > 2 * skb_size)
+		return false;
+	if (skb->truesize != SKB_TRUESIZE(skb_end_offset(skb)))
+		return false;
 	if (skb_shared(skb) || skb_cloned(skb))
 		return false;
 



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-04 16:29     ` Eric Dumazet
@ 2012-10-05  7:41       ` Eric Dumazet
  2012-10-05 10:49         ` Maxime Bizon
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05  7:41 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Thu, 2012-10-04 at 18:29 +0200, Eric Dumazet wrote:
> On Thu, 2012-10-04 at 18:02 +0200, Maxime Bizon wrote:
> > On Fri, 2012-08-31 at 19:21 -0700, Hugh Dickins wrote:
> > 
> > Hi,
> > 
> > > Francois is right that a GFP_ATOMIC allocation from pskb_expand_head()
> > > is failing, which can easily happen, and cause your "failed to reallocate
> > > TX buffer" errors; but it's well worth looking up what's actually on
> > > lines 2108 and 2109 of mm/page_alloc.c in 3.2.27:
> > > 
> > > 	if (order >= MAX_ORDER) {
> > > 		WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
> > > 
> > > That was probably not a sane allocation request, it has gone out of range:
> > > maybe the skb header is even corrupted.  If you're lucky, it might be
> > > something that netdev will recognize as already fixed.
> > 
> > I have the same problem on the exact same hardware and found the cause:
> > 
> > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Tue Apr 10 20:08:39 2012 +0000
> > 
> >     net: allow pskb_expand_head() to get maximum tailroom
> >     
> >     [ Upstream commit 87151b8689d890dfb495081f7be9b9e257f7a2df ]
> >     
> > 
> > It turns out this change has a bad side effect on drivers that uses
> > skb_recycle(), in that case mv643xx_eth.c
> > 
> > Since skb_recycle() resets skb->data using (skb->head + NET_SKB_PAD), a
> > recycled skb going multiple times through a path that needs to expand
> > skb head will get bigger and bigger each time, and you eventually end up
> > with an allocation failure.
> > 
> > An idea to fix this would be to pass needed skb size to skb_resize() and
> > set skb->data to MIN(NET_SKB_PAD, (skb->end - skb->head - skb_size) / 2)
> > 
> > skb recycling gives a small speed boost, but does not get a lot of test
> > coverage since only 3 drivers uses it
> > 
> 
> Thanks Maxime

By the way, the commit you pointed has no effect on the reallocation
performed by pskb_expand_head() :

int size = nhead + skb_end_offset(skb) + ntail;

So pskb_expand_head() always assumed the current head is fully used, and
because we have some kmalloc-power-of-two contraints, each time
pskb_expand_head() is called with a non zero (nhead + ntail) we double
the skb->head ksize.

So why are we using skb_end_offset(skb) here is the question.

I guess it could be (skb_tail_pointer(skb) - skb->head) on some uses.

(ie not counting the unused bytes from tail_pointer to end_pointer)

Its probably long to audit all pskb_expand_head() users (about 77 in
tree), but most of them use (nhead = 0, ntail = 0) 

It looks like the following patch should be fine.


diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cdc2859..dd42c6a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1053,11 +1053,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 {
 	int i;
 	u8 *data;
-	int size = nhead + skb_end_offset(skb) + ntail;
+	unsigned int tail_offset = skb_tail_pointer(skb) - skb->head;
+	int size = nhead + ntail;
 	long off;
 
 	BUG_ON(nhead < 0);
 
+	/* callers using nhead == 0 and ntail == 0 want to get a fresh copy,
+	 * so allocate same amount of memory (skb_end_offset)
+	 * For others, they want extra headroom or tailroom against the
+	 * currently used portion of header (skb->head -> skb_tail_pointer)
+	 */
+	if (!size)
+		size = skb_end_offset(skb);
+	else
+		size += tail_offset;
+
 	if (skb_shared(skb))
 		BUG();
 
@@ -1074,7 +1085,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	/* Copy only real data... and, alas, header. This should be
 	 * optimized for the cases when header is void.
 	 */
-	memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
+	memcpy(data + nhead, skb->head, tail_offset);
 
 	memcpy((struct skb_shared_info *)(data + size),
 	       skb_shinfo(skb),



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05  7:41       ` Eric Dumazet
@ 2012-10-05 10:49         ` Maxime Bizon
  2012-10-05 12:22           ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Maxime Bizon @ 2012-10-05 10:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Fri, 2012-10-05 at 09:41 +0200, Eric Dumazet wrote:

> By the way, the commit you pointed has no effect on the reallocation
> performed by pskb_expand_head() :

The commit has a side effect, because the problem appeared after it was
merged (and goes away if I revert it)

> int size = nhead + skb_end_offset(skb) + ntail;
> 
> So pskb_expand_head() always assumed the current head is fully used, and
> because we have some kmalloc-power-of-two contraints, each time
> pskb_expand_head() is called with a non zero (nhead + ntail) we double
> the skb->head ksize.

That is true, but only after the commit I mentioned.

Before that commit, we indeed reallocate skb->head to twice the size,
but skb->end is *not* positioned at the end of newly allocated data. So
on the next pskb_expand_head(), if head and tail are not big values, the
kmalloc() will be of the same size.


The commit adds this after allocation:

size = SKB_WITH_OVERHEAD(ksize(data))
[...]
skb->end      = skb->head + size;

so on the next pskb_expand_head, we are going to allocate twice the size
for sure.

> So why are we using skb_end_offset(skb) here is the question.
> 
> I guess it could be (skb_tail_pointer(skb) - skb->head) on some uses.

I think your patch is wrong, ntail is not the new tailroom size, it's
what missing to the current tailroom size, by adding ntail + nhead +
tail_offset we are removing previous tailroom.

We cannot shrink the skb that way here I guess, a caller may check
needed headroom & tailroom, calls with nhead=1/ntail=0 because only
headroom is missing, but after the call tailroom would be less than
before the call.

Why don't we juste reallocate to this size:

MAX(current_alloc_size, nhead + ntail + current_end - current_head)

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 10:49         ` Maxime Bizon
@ 2012-10-05 12:22           ` Eric Dumazet
  2012-10-05 12:37             ` Eric Dumazet
  2012-10-05 12:51             ` Maxime Bizon
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05 12:22 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Fri, 2012-10-05 at 12:49 +0200, Maxime Bizon wrote:
> On Fri, 2012-10-05 at 09:41 +0200, Eric Dumazet wrote:
> 
> > By the way, the commit you pointed has no effect on the reallocation
> > performed by pskb_expand_head() :
> 
> The commit has a side effect, because the problem appeared after it was
> merged (and goes away if I revert it)
> 
> > int size = nhead + skb_end_offset(skb) + ntail;
> > 
> > So pskb_expand_head() always assumed the current head is fully used, and
> > because we have some kmalloc-power-of-two contraints, each time
> > pskb_expand_head() is called with a non zero (nhead + ntail) we double
> > the skb->head ksize.
> 
> That is true, but only after the commit I mentioned.
> 
> Before that commit, we indeed reallocate skb->head to twice the size,
> but skb->end is *not* positioned at the end of newly allocated data. So
> on the next pskb_expand_head(), if head and tail are not big values, the
> kmalloc() will be of the same size.
> 
> 
> The commit adds this after allocation:
> 
> size = SKB_WITH_OVERHEAD(ksize(data))
> [...]
> skb->end      = skb->head + size;
> 
> so on the next pskb_expand_head, we are going to allocate twice the size
> for sure.

Yes, but the idea of the patch was to _avoid_ next pskb_expand_head()
calls...

Its defeated because you have a too small NET_SKB_PAD, and skb_recycle()
inability to properly detect ans skb is oversized.

> 
> > So why are we using skb_end_offset(skb) here is the question.
> > 
> > I guess it could be (skb_tail_pointer(skb) - skb->head) on some uses.
> 
> I think your patch is wrong, ntail is not the new tailroom size, it's
> what missing to the current tailroom size, by adding ntail + nhead +
> tail_offset we are removing previous tailroom.
> 



> We cannot shrink the skb that way here I guess, a caller may check
> needed headroom & tailroom, calls with nhead=1/ntail=0 because only
> headroom is missing, but after the call tailroom would be less than
> before the call.
> 
> Why don't we juste reallocate to this size:
> 
> MAX(current_alloc_size, nhead + ntail + current_end - current_head)

Hmm, 

this changes nothing assuming current_end == skb_end_offset(skb)
and current_head = skb->head

Not sure what you mean.

I guess the right answer is to change API, because we have no clue if
the callers want some tailroom or not.

If they have 'enough' they currently pass 0, so we dont really now how
many bytes they really need..

In fact very few call sites need to increase the tailroom, so it's
probably very easy to do this change.

New convention would be : pass number of needed bytes after current
tail, not after current end.




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 12:22           ` Eric Dumazet
@ 2012-10-05 12:37             ` Eric Dumazet
  2012-10-05 12:39               ` Eric Dumazet
  2012-10-05 12:51             ` Maxime Bizon
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05 12:37 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Fri, 2012-10-05 at 14:22 +0200, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 12:49 +0200, Maxime Bizon wrote:
> > On Fri, 2012-10-05 at 09:41 +0200, Eric Dumazet wrote:
> > 
> > > By the way, the commit you pointed has no effect on the reallocation
> > > performed by pskb_expand_head() :
> > 
> > The commit has a side effect, because the problem appeared after it was
> > merged (and goes away if I revert it)
> > 
> > > int size = nhead + skb_end_offset(skb) + ntail;
> > > 
> > > So pskb_expand_head() always assumed the current head is fully used, and
> > > because we have some kmalloc-power-of-two contraints, each time
> > > pskb_expand_head() is called with a non zero (nhead + ntail) we double
> > > the skb->head ksize.
> > 
> > That is true, but only after the commit I mentioned.
> > 
> > Before that commit, we indeed reallocate skb->head to twice the size,
> > but skb->end is *not* positioned at the end of newly allocated data. So
> > on the next pskb_expand_head(), if head and tail are not big values, the
> > kmalloc() will be of the same size.
> > 
> > 
> > The commit adds this after allocation:
> > 
> > size = SKB_WITH_OVERHEAD(ksize(data))
> > [...]
> > skb->end      = skb->head + size;
> > 
> > so on the next pskb_expand_head, we are going to allocate twice the size
> > for sure.
> 
> Yes, but the idea of the patch was to _avoid_ next pskb_expand_head()
> calls...
> 
> Its defeated because you have a too small NET_SKB_PAD, and skb_recycle()
> inability to properly detect ans skb is oversized.
> 
> > 
> > > So why are we using skb_end_offset(skb) here is the question.
> > > 
> > > I guess it could be (skb_tail_pointer(skb) - skb->head) on some uses.
> > 
> > I think your patch is wrong, ntail is not the new tailroom size, it's
> > what missing to the current tailroom size, by adding ntail + nhead +
> > tail_offset we are removing previous tailroom.
> > 
> 
> 
> 
> > We cannot shrink the skb that way here I guess, a caller may check
> > needed headroom & tailroom, calls with nhead=1/ntail=0 because only
> > headroom is missing, but after the call tailroom would be less than
> > before the call.
> > 
> > Why don't we juste reallocate to this size:
> > 
> > MAX(current_alloc_size, nhead + ntail + current_end - current_head)
> 
> Hmm, 
> 
> this changes nothing assuming current_end == skb_end_offset(skb)
> and current_head = skb->head
> 
> Not sure what you mean.

Following patch maybe ...

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cdc2859..f6c1f52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1053,11 +1053,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 {
 	int i;
 	u8 *data;
-	int size = nhead + skb_end_offset(skb) + ntail;
+	unsigned int tail_offset = skb_tail_pointer(skb) - skb->head;
+	int size = nhead + ntail;
 	long off;
 
 	BUG_ON(nhead < 0);
 
+	/* callers using nhead == 0 and ntail == 0 wants to get a fresh copy,
+	 * so allocate same amount of memory (skb_end_offset)
+	 * For others, they want extra head or tail against the currently
+	 * used portion of header (skb->head -> skb_tail_pointer).
+	 * But we dont shrink the head.
+	 */
+	if (size)
+		size += tail_offset;
+	size = max_t(int, size, skb_end_offset(skb));
+
 	if (skb_shared(skb))
 		BUG();
 
@@ -1074,7 +1085,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	/* Copy only real data... and, alas, header. This should be
 	 * optimized for the cases when header is void.
 	 */
-	memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
+	memcpy(data + nhead, skb->head, tail_offset);
 
 	memcpy((struct skb_shared_info *)(data + size),
 	       skb_shinfo(skb),



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 12:37             ` Eric Dumazet
@ 2012-10-05 12:39               ` Eric Dumazet
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05 12:39 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Fri, 2012-10-05 at 14:37 +0200, Eric Dumazet wrote:

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index cdc2859..f6c1f52 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1053,11 +1053,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  {
>  	int i;
>  	u8 *data;
> -	int size = nhead + skb_end_offset(skb) + ntail;
> +	unsigned int tail_offset = skb_tail_pointer(skb) - skb->head;
> +	int size = nhead + ntail;
>  	long off;
>  
>  	BUG_ON(nhead < 0);
>  
> +	/* callers using nhead == 0 and ntail == 0 wants to get a fresh copy,
> +	 * so allocate same amount of memory (skb_end_offset)
> +	 * For others, they want extra head or tail against the currently
> +	 * used portion of header (skb->head -> skb_tail_pointer).
> +	 * But we dont shrink the head.
> +	 */
> +	if (size)
> +		size += tail_offset;
> +	size = max_t(int, size, skb_end_offset(skb));
> +
>  

This can be factorized to :

	size = tail_offset + nhead + ntail;
	size = max_t(int, size, skb_end_offset(skb));




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 12:22           ` Eric Dumazet
  2012-10-05 12:37             ` Eric Dumazet
@ 2012-10-05 12:51             ` Maxime Bizon
  2012-10-05 13:02               ` Eric Dumazet
  1 sibling, 1 reply; 26+ messages in thread
From: Maxime Bizon @ 2012-10-05 12:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Fri, 2012-10-05 at 14:22 +0200, Eric Dumazet wrote:

> Yes, but the idea of the patch was to _avoid_ next pskb_expand_head()
> calls...

yes but we cannot be sure of that, the caller may not have a good idea
of the headroom needed for the whole lifetime of the skb

it's better to think we will reduce number of calls, not avoid them

that's why I think doubling the size each time is dangerous, since we
silently request bigger and bigger allocations if an skb takes an
unoptimized path

> Hmm, 
> 
> this changes nothing assuming current_end == skb_end_offset(skb)
> and current_head = skb->head

My idea was to leave skb->end at its last position even if we grow
skb->head.

Since we have a way to know the current allocation size of skb->head,
further pskb_expand_head() calls to request tailroom would just push
skb->tail & skb->end together if that fits in current ksize().

I've not looked at recent changes in mainline, since you changed how
skb->head is managed, that may be totally impossible.

Your proposed changed API change to expand_head will fill this anyway.

> New convention would be : pass number of needed bytes after current
> tail, not after current end.

Fully agree on this

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 12:51             ` Maxime Bizon
@ 2012-10-05 13:02               ` Eric Dumazet
  2012-10-05 14:50                 ` Maxime Bizon
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05 13:02 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Fri, 2012-10-05 at 14:51 +0200, Maxime Bizon wrote:

> > New convention would be : pass number of needed bytes after current
> > tail, not after current end.
> 
> Fully agree on this
> 

Here is the proposal :

Change all occurrences of :

pskb_expand_head(skb, 0, 0, gfp)

to

pskb_realloc_head(skb, gfp)   
[ an inline calling a common function ]



And few other pskb_expand_head() calls a new function

pskb_may_expand_head(skb, nhead, ntail, gfp) 
[ an inline calling same common function ]


nhead : number of needed bytes before [data,tail] portion
         (might be already < skb_headroom(skb))

ntail : number of needed bytes after [data,tail] portion
        (might be already > skb_tailroom(skb))

For example, net/xfrm/xfrm_output.c would be like the following :

(Note how all the various ugly tests will be centralized in the common
function)

And the common function would now know the caller intent.


diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 95a338c..15760df 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -24,18 +24,10 @@ static int xfrm_output2(struct sk_buff *skb);
 static int xfrm_skb_check_space(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
-	int nhead = dst->header_len + LL_RESERVED_SPACE(dst->dev)
-		- skb_headroom(skb);
-	int ntail = dst->dev->needed_tailroom - skb_tailroom(skb);
-
-	if (nhead <= 0) {
-		if (ntail <= 0)
-			return 0;
-		nhead = 0;
-	} else if (ntail < 0)
-		ntail = 0;
-
-	return pskb_expand_head(skb, nhead, ntail, GFP_ATOMIC);
+	int nhead = dst->header_len + LL_RESERVED_SPACE(dst->dev);
+	int ntail = dst->dev->needed_tailroom;
+
+	return pskb_may_expand_head(skb, nhead, ntail, GFP_ATOMIC);
 }
 
 static int xfrm_output_one(struct sk_buff *skb, int err)



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 13:02               ` Eric Dumazet
@ 2012-10-05 14:50                 ` Maxime Bizon
  2012-10-05 15:04                   ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Maxime Bizon @ 2012-10-05 14:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Fri, 2012-10-05 at 15:02 +0200, Eric Dumazet wrote:

> On Fri, 2012-10-05 at 14:51 +0200, Maxime Bizon wrote:
> 
> > > New convention would be : pass number of needed bytes after current
> > > tail, not after current end.
> > 
> > Fully agree on this
> > 
> 
> Here is the proposal :

Looks fine

What is your plan for the actual pskb_expand_head() code now that you
will have absolute values for headroom & tailroom ?

Because there will still be callers that have no clue of required
tailroom (nor further headroom requirement), like skb_cow() in
vlan_reorder_header().

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 14:50                 ` Maxime Bizon
@ 2012-10-05 15:04                   ` Eric Dumazet
  2012-10-05 15:15                     ` Maxime Bizon
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05 15:04 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Fri, 2012-10-05 at 16:50 +0200, Maxime Bizon wrote:
> On Fri, 2012-10-05 at 15:02 +0200, Eric Dumazet wrote:
> 
> > On Fri, 2012-10-05 at 14:51 +0200, Maxime Bizon wrote:
> > 
> > > > New convention would be : pass number of needed bytes after current
> > > > tail, not after current end.
> > > 
> > > Fully agree on this
> > > 
> > 
> > Here is the proposal :
> 
> Looks fine
> 
> What is your plan for the actual pskb_expand_head() code now that you
> will have absolute values for headroom & tailroom ?
> 

They stay relative values.

For example, netlink_trim() really wants to shrink the skb head.


> Because there will still be callers that have no clue of required
> tailroom (nor further headroom requirement), like skb_cow() in
> vlan_reorder_header().
> 

What I plan is to not shrink size, unless specifically asked.

Its 3.8 material anyway, so a stable fix is needed on skb_recycle() and
NET_SKB_PAD minimal value.




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 15:04                   ` Eric Dumazet
@ 2012-10-05 15:15                     ` Maxime Bizon
  2012-10-05 15:37                       ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Maxime Bizon @ 2012-10-05 15:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins


On Fri, 2012-10-05 at 17:04 +0200, Eric Dumazet wrote:
> 
> 
> What I plan is to not shrink size, unless specifically asked.
> 
> Its 3.8 material anyway, so a stable fix is needed on skb_recycle()
> and NET_SKB_PAD minimal value.

You think removing skb_recycle() is too big a change for stable ?

Driver change is simple, as recycling is not guaranteed today you have
this:

 if (!try_recycle(skb))
    skb = alloc_skb()
> 
we just remove the try_recycle part, we are not adding any new code
path.


I'm not the right person to give you a correct NET_SKB_PAD value, I have
a lot of out of tree patches in my kernels, some custom drivers as well
that needs quite a lot of headroom, and uncommon network setup.

-- 
Maxime



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
  2012-10-05 15:15                     ` Maxime Bizon
@ 2012-10-05 15:37                       ` Eric Dumazet
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2012-10-05 15:37 UTC (permalink / raw)
  To: mbizon; +Cc: David Madore, Francois Romieu, netdev, linux-kernel, Hugh Dickins

On Fri, 2012-10-05 at 17:15 +0200, Maxime Bizon wrote:

> You think removing skb_recycle() is too big a change for stable ?
> 
> Driver change is simple, as recycling is not guaranteed today you have
> this:
> 
>  if (!try_recycle(skb))
>     skb = alloc_skb()
> > 
> we just remove the try_recycle part, we are not adding any new code
> path.
> 
> 
> I'm not the right person to give you a correct NET_SKB_PAD value, I have
> a lot of out of tree patches in my kernels, some custom drivers as well
> that needs quite a lot of headroom, and uncommon network setup.
> 

OK, I'll send the patch to remove skb_recycle()



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-10-05 15:37 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-29  0:25 kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c() David Madore
2012-08-29 19:32 ` Francois Romieu
2012-08-31 10:59   ` David Madore
2012-08-31 13:57     ` David Madore
2012-08-31 18:12       ` Francois Romieu
2012-09-01  2:21 ` Hugh Dickins
2012-09-01  8:20   ` Francois Romieu
2012-09-02 22:51     ` David Madore
2012-10-04 16:02   ` Maxime Bizon
2012-10-04 16:29     ` Eric Dumazet
2012-10-05  7:41       ` Eric Dumazet
2012-10-05 10:49         ` Maxime Bizon
2012-10-05 12:22           ` Eric Dumazet
2012-10-05 12:37             ` Eric Dumazet
2012-10-05 12:39               ` Eric Dumazet
2012-10-05 12:51             ` Maxime Bizon
2012-10-05 13:02               ` Eric Dumazet
2012-10-05 14:50                 ` Maxime Bizon
2012-10-05 15:04                   ` Eric Dumazet
2012-10-05 15:15                     ` Maxime Bizon
2012-10-05 15:37                       ` Eric Dumazet
2012-10-04 16:50     ` Eric Dumazet
2012-10-04 17:09       ` Maxime Bizon
2012-10-04 17:17         ` Eric Dumazet
2012-10-04 17:34           ` Maxime Bizon
2012-10-04 21:27             ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).