* 4.6.3, pppoe + shaper workload,  skb_panic / skb_push / ppp_start_xmit
@ 2016-07-11 19:45 nuclearcat
  2016-07-12 17:31 ` Cong Wang
  0 siblings, 1 reply; 13+ messages in thread
From: nuclearcat @ 2016-07-11 19:45 UTC (permalink / raw)
  To: netdev

Hi

On the latest kernel I noticed a kernel panic happening 1-2 times per day. It
also happens on older kernels (at least 4.5.3).

Panic message received over netconsole:

[42916.416307] skbuff: skb_under_panic: text:ffffffffa00e8ce5 len:581 put:2 head:ffff8800b0bf2800 data:ffa00500b0bf284c tail:0x291 end:0x6c0 dev:ppp2828
[42916.416677] ------------[ cut here ]------------
[42916.416876] kernel BUG at net/core/skbuff.c:104!
[42916.417075] invalid opcode: 0000 [#1] SMP

[42916.417388] Modules linked in: cls_fw act_police cls_u32 sch_ingress sch_sfq sch_htb netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre pppoe pppox ppp_generic slhc tun xt_REDIRECT nf_nat_redirect xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set ts_bm xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc


[42916.421443] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.3-build-0105 #4
[42916.421643] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
[42916.421842] task: ffffffff8200b500 ti: ffffffff82000000 task.ti: ffffffff82000000
[42916.422178] RIP: 0010:[<ffffffff8184374e>]  [<ffffffff8184374e>] skb_panic+0x49/0x4b
[42916.422574] RSP: 0018:ffff880447403da8  EFLAGS: 00010296
[42916.422773] RAX: 0000000000000089 RBX: ffff880422c13900 RCX: 0000000000000000
[42916.422974] RDX: ffff88044740df50 RSI: ffff88044740c908 RDI: ffff88044740c908
[42916.423175] RBP: ffff880447403dc8 R08: 0000000000000001 R09: 0000000000000000
[42916.423439] R10: ffffffff820050c0 R11: ffff88041c7ee900 R12: ffff880423037000
[42916.423640] R13: 0000000000000000 R14: ffff880423037000 R15: 0000000000000000
[42916.423841] FS:  0000000000000000(0000) GS:ffff880447400000(0000) knlGS:0000000000000000
[42916.424179] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42916.424379] CR2: 00007effd0814b00 CR3: 0000000430ab2000 CR4: 00000000001406f0
[42916.424577] Stack:
[42916.424772]  ffa00500b0bf284c 0000000000000291 00000000000006c0 ffff880423037000
[42916.425333]  ffff880447403dd8 ffffffff81843786 ffff880447403e00 ffffffffa00e8ce5
[42916.425898]  ffff880422c13900 ffff8800ae7c6c00 ffffffff820b3210 ffff880447403e68
[42916.426463] Call Trace:
[42916.426658]  <IRQ>
[42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
[42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150 [ppp_generic]
[42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
[42916.427516]  [<ffffffff818530f2>] ? validate_xmit_skb.isra.107.part.108+0x11d/0x238
[42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
[42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
[42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
[42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
[42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
[42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
[42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
[42916.429263]  <EOI>
[42916.429324]  [<ffffffff8101be12>] ? mwait_idle+0x68/0x7e
[42916.429719]  [<ffffffff810d731c>] ? atomic_notifier_call_chain+0x13/0x15
[42916.429921]  [<ffffffff8101c212>] arch_cpu_idle+0xa/0xc
[42916.430121]  [<ffffffff810ea333>] default_idle_call+0x27/0x29
[42916.430323]  [<ffffffff810ea44a>] cpu_startup_entry+0x115/0x1bf
[42916.430526]  [<ffffffff818c5d7b>] rest_init+0x72/0x74
[42916.430727]  [<ffffffff820cdd8c>] start_kernel+0x3b7/0x3c4
[42916.430929]  [<ffffffff820cd422>] x86_64_start_reservations+0x2a/0x2c
[42916.431130]  [<ffffffff820cd4df>] x86_64_start_kernel+0xbb/0xbe
[42916.431332] Code: 78 50 8b 87 c0 00 00 00 50 8b 87 bc 00 00 00 50 ff b7 d0 00 00 00 31 c0 4c 8b 8f c8 00 00 00 48 c7 c7 49 10 e1 81 e8 0e 60 8e ff 0b 48 8b 97 d0 00 00 00 89 f0 01 77 78 48 29 c2 48 3b 97 c8
[42916.435514] RIP  [<ffffffff8184374e>] skb_panic+0x49/0x4b
[42916.439115]  RSP <ffff880447403da8>
  [42916.439336] ---[ end trace d7bfed0177be96d1 ]---
  [42916.445801] Kernel panic - not syncing: Fatal exception in interrupt
  [42916.446005] Kernel Offset: disabled
  [42916.477266] Rebooting in 5 seconds..

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-11 19:45 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit nuclearcat
@ 2016-07-12 17:31 ` Cong Wang
  2016-07-12 18:03   ` nuclearcat
  2016-07-28 11:09   ` Guillaume Nault
  0 siblings, 2 replies; 13+ messages in thread
From: Cong Wang @ 2016-07-12 17:31 UTC (permalink / raw)
  To: nuclearcat; +Cc: Linux Kernel Network Developers

On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
> Hi
>
> On latest kernel i noticed kernel panic happening 1-2 times per day. It is
> also happening on older kernel (at least 4.5.3).
>
...
>  [42916.426463] Call Trace:
>  [42916.426658]  <IRQ>
>
>  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
>  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
> [ppp_generic]
>  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
>  [42916.427516]  [<ffffffff818530f2>] ?
> validate_xmit_skb.isra.107.part.108+0x11d/0x238
>  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
>  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
>  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
>  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
>  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
>  [42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
>  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90

Interesting: we call skb_cow_head() before skb_push() in ppp_start_xmit(),
so I have no idea how this could happen.

Do you have any tc qdisc, filter or actions on this ppp device?


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-12 17:31 ` Cong Wang
@ 2016-07-12 18:03   ` nuclearcat
  2016-07-12 18:05     ` Cong Wang
  2016-07-28 11:09   ` Guillaume Nault
  1 sibling, 1 reply; 13+ messages in thread
From: nuclearcat @ 2016-07-12 18:03 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On 2016-07-12 20:31, Cong Wang wrote:
> On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
>> Hi
>> 
>> On latest kernel i noticed kernel panic happening 1-2 times per day. 
>> It is
>> also happening on older kernel (at least 4.5.3).
>> 
> ...
>>  [42916.426463] Call Trace:
>>  [42916.426658]  <IRQ>
>> 
>>  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
>>  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
>> [ppp_generic]
>>  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
>>  [42916.427516]  [<ffffffff818530f2>] ?
>> validate_xmit_skb.isra.107.part.108+0x11d/0x238
>>  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
>>  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
>>  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
>>  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
>>  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
>>  [42916.428862]  [<ffffffff8102b8f7>] 
>> smp_apic_timer_interrupt+0x3d/0x48
>>  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
> 
> Interesting, we call a skb_cow_head() before skb_push() in 
> ppp_start_xmit(),
> I have no idea why this could happen.
> 
> Do you have any tc qdisc, filter or actions on this ppp device?
Yes, I have policing filters for incoming traffic (ingress), and on
egress htb + pfifo + filters.
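For reference, a per-interface setup of that shape looks roughly like this (interface name taken from the panic above; the rates, handles and classids are placeholders, not the actual production values):

```shell
DEV=ppp2828   # example interface name from the panic above

# Ingress: police incoming traffic
tc qdisc add dev $DEV handle ffff: ingress
tc filter add dev $DEV parent ffff: protocol ip u32 match u32 0 0 \
    police rate 10mbit burst 100k drop

# Egress: HTB shaper with a pfifo leaf and a classifying filter
tc qdisc add dev $DEV root handle 1: htb default 10
tc class add dev $DEV parent 1: classid 1:10 htb rate 10mbit ceil 10mbit
tc qdisc add dev $DEV parent 1:10 handle 10: pfifo limit 100
tc filter add dev $DEV parent 1: protocol ip u32 match u32 0 0 flowid 1:10
```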


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-12 18:03   ` nuclearcat
@ 2016-07-12 18:05     ` Cong Wang
  2016-07-12 18:13       ` nuclearcat
  0 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2016-07-12 18:05 UTC (permalink / raw)
  To: nuclearcat; +Cc: Linux Kernel Network Developers

On Tue, Jul 12, 2016 at 11:03 AM,  <nuclearcat@nuclearcat.com> wrote:
> On 2016-07-12 20:31, Cong Wang wrote:
>>
>> On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
>>>
>>> Hi
>>>
>>> On latest kernel i noticed kernel panic happening 1-2 times per day. It
>>> is
>>> also happening on older kernel (at least 4.5.3).
>>>
>> ...
>>>
>>>  [42916.426463] Call Trace:
>>>  [42916.426658]  <IRQ>
>>>
>>>  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
>>>  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
>>> [ppp_generic]
>>>  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
>>>  [42916.427516]  [<ffffffff818530f2>] ?
>>> validate_xmit_skb.isra.107.part.108+0x11d/0x238
>>>  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
>>>  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
>>>  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
>>>  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
>>>  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
>>>  [42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
>>>  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
>>
>>
>> Interesting, we call a skb_cow_head() before skb_push() in
>> ppp_start_xmit(),
>> I have no idea why this could happen.
>>
>> Do you have any tc qdisc, filter or actions on this ppp device?
>
> Yes, i have policing filters for incoming traffic (ingress), and also on
> egress htb + pfifo + filters.

Does it make any difference if you remove the egress qdisc and/or
filters? If it does, please share the output of `tc qd show...` and `tc filter show ...`.

Thanks!


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-12 18:05     ` Cong Wang
@ 2016-07-12 18:13       ` nuclearcat
  0 siblings, 0 replies; 13+ messages in thread
From: nuclearcat @ 2016-07-12 18:13 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On 2016-07-12 21:05, Cong Wang wrote:
> On Tue, Jul 12, 2016 at 11:03 AM,  <nuclearcat@nuclearcat.com> wrote:
>> On 2016-07-12 20:31, Cong Wang wrote:
>>> 
>>> On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
>>>> 
>>>> Hi
>>>> 
>>>> On latest kernel i noticed kernel panic happening 1-2 times per day. 
>>>> It
>>>> is
>>>> also happening on older kernel (at least 4.5.3).
>>>> 
>>> ...
>>>> 
>>>>  [42916.426463] Call Trace:
>>>>  [42916.426658]  <IRQ>
>>>> 
>>>>  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
>>>>  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
>>>> [ppp_generic]
>>>>  [42916.427314]  [<ffffffff81853467>] 
>>>> dev_hard_start_xmit+0x25a/0x2d3
>>>>  [42916.427516]  [<ffffffff818530f2>] ?
>>>> validate_xmit_skb.isra.107.part.108+0x11d/0x238
>>>>  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
>>>>  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
>>>>  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
>>>>  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
>>>>  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
>>>>  [42916.428862]  [<ffffffff8102b8f7>] 
>>>> smp_apic_timer_interrupt+0x3d/0x48
>>>>  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
>>> 
>>> 
>>> Interesting, we call a skb_cow_head() before skb_push() in
>>> ppp_start_xmit(),
>>> I have no idea why this could happen.
>>> 
>>> Do you have any tc qdisc, filter or actions on this ppp device?
>> 
>> Yes, i have policing filters for incoming traffic (ingress), and also 
>> on
>> egress htb + pfifo + filters.
> 
> Does it make any difference if you remove the egress qdisc and/or
> filters? If yes, please share the `tc qd show...` and `tc filter show 
> ...`?
> 
> Thanks!

It is not easy, because this is a NAS with approximately 5000 users
connected (and they are constantly connecting/disconnecting), and the
crash can't be reproduced easily. If I remove the qdiscs/filters, users
will get unlimited speed, and this will cause serious service degradation.
But maybe I can add some debug lines and run a test kernel if necessary
(as long as it does not cause serious performance overhead).


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-12 17:31 ` Cong Wang
  2016-07-12 18:03   ` nuclearcat
@ 2016-07-28 11:09   ` Guillaume Nault
  2016-07-28 11:28     ` Denys Fedoryshchenko
  1 sibling, 1 reply; 13+ messages in thread
From: Guillaume Nault @ 2016-07-28 11:09 UTC (permalink / raw)
  To: Cong Wang; +Cc: nuclearcat, Linux Kernel Network Developers

On Tue, Jul 12, 2016 at 10:31:18AM -0700, Cong Wang wrote:
> On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
> > Hi
> >
> > On latest kernel i noticed kernel panic happening 1-2 times per day. It is
> > also happening on older kernel (at least 4.5.3).
> >
> ...
> >  [42916.426463] Call Trace:
> >  [42916.426658]  <IRQ>
> >
> >  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
> >  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
> > [ppp_generic]
> >  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
> >  [42916.427516]  [<ffffffff818530f2>] ?
> > validate_xmit_skb.isra.107.part.108+0x11d/0x238
> >  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
> >  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
> >  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
> >  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
> >  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
> >  [42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
> >  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
> 
> Interesting, we call a skb_cow_head() before skb_push() in ppp_start_xmit(),
> I have no idea why this could happen.
>
The skb is corrupted: head is at ffff8800b0bf2800 while data is at
ffa00500b0bf284c.

Figuring out how this corruption happened is going to be hard without a
way to reproduce the problem.

Denys, can you confirm you're using a vanilla kernel?
Also I guess the ppp devices and tc settings are handled by accel-ppp.
If so, can you share more info about your setup (accel-ppp.conf, radius
attributes, iptables...) so that I can try to reproduce it on my
machines?

Regards

Guillaume


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-28 11:09   ` Guillaume Nault
@ 2016-07-28 11:28     ` Denys Fedoryshchenko
  2016-08-01 20:54       ` Guillaume Nault
  2016-08-01 20:59       ` Guillaume Nault
  0 siblings, 2 replies; 13+ messages in thread
From: Denys Fedoryshchenko @ 2016-07-28 11:28 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers

On 2016-07-28 14:09, Guillaume Nault wrote:
> On Tue, Jul 12, 2016 at 10:31:18AM -0700, Cong Wang wrote:
>> On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
>> > Hi
>> >
>> > On latest kernel i noticed kernel panic happening 1-2 times per day. It is
>> > also happening on older kernel (at least 4.5.3).
>> >
>> ...
>> >  [42916.426463] Call Trace:
>> >  [42916.426658]  <IRQ>
>> >
>> >  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
>> >  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
>> > [ppp_generic]
>> >  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
>> >  [42916.427516]  [<ffffffff818530f2>] ?
>> > validate_xmit_skb.isra.107.part.108+0x11d/0x238
>> >  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
>> >  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
>> >  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
>> >  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
>> >  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
>> >  [42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
>> >  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
>> 
>> Interesting, we call a skb_cow_head() before skb_push() in 
>> ppp_start_xmit(),
>> I have no idea why this could happen.
>> 
> The skb is corrupted: head is at ffff8800b0bf2800 while data is at
> ffa00500b0bf284c.
> 
> Figuring out how this corruption happened is going to be hard without a
> way to reproduce the problem.
> 
> Denys, can you confirm you're using a vanilla kernel?
> Also I guess the ppp devices and tc settings are handled by accel-ppp.
> If so, can you share more info about your setup (accel-ppp.conf, radius
> attributes, iptables...) so that I can try to reproduce it on my
> machines?

I have a slight modification from vanilla:

--- linux/net/sched/sch_htb.c	2016-06-08 01:23:53.000000000 +0000
+++ linux-new/net/sched/sch_htb.c	2016-06-21 14:03:08.398486593 +0000
@@ -1495,10 +1495,10 @@
  				cl->common.classid);
  			cl->quantum = 1000;
  		}
-		if (!hopt->quantum && cl->quantum > 200000) {
+		if (!hopt->quantum && cl->quantum > 2000000) {
  			pr_warn("HTB: quantum of class %X is big. Consider r2q change.\n",
  				cl->common.classid);
-			cl->quantum = 200000;
+			cl->quantum = 2000000;
  		}
  		if (hopt->quantum)
  			cl->quantum = hopt->quantum;

But I guess it should not be the reason for the crash (it relates to
another issue: without it I was unable to shape over 7Gbps; maybe with
the latest kernel I will not need this patch).

I'm trying to find reproducible conditions for the crash, because right
now it happens only on some servers in large networks (at completely
different ISPs, so I have excluded a hardware fault on a specific
server). It is a complex config: I have accel-ppp, plus my own "shaping
daemon" that applies several shapers on ppp interfaces. Worst of all, it
happens only with live customers; I am unable to reproduce it in stress
tests. Also, until recent kernels I was getting different panic messages
(but all related to ppp).

I think at least one cause of the crashes was fixed by "ppp: defer
netns reference release for ppp channel" in 4.7.0 (maybe that's why I am
getting fewer crashes recently).
I have also tried various kernel debug options that don't cause major
performance degradation (lock checking, freed-memory poisoning, etc.),
without any luck yet. Is it useful if I post panics that occur at least
twice? (An example I got recently is below.)
As soon as I can establish reproducible conditions, I will send them
immediately.


<server19> [ 5449.900988] general protection fault: 0000 [#1] SMP
<server19> [ 5449.901263] Modules linked in: cls_fw act_police cls_u32 sch_ingress sch_sfq sch_htb pppoe pppox ppp_generic slhc netconsole configfs xt_nat ts_bm xt_string xt_connmark xt_TCPMSS xt_tcpudp xt_mark iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
<server19> [ 5449.904989] CPU: 1 PID: 6359 Comm: ip Not tainted 4.7.0-build-0109 #2
<server19> [ 5449.905255] Hardware name: Supermicro X10SLM+-LN4F/X10SLM+-LN4F, BIOS 3.0 04/24/2015
<server19> [ 5449.905712] task: ffff8803eef40000 ti: ffff8803fd754000 task.ti: ffff8803fd754000
<server19> [ 5449.906168] RIP: 0010:[<ffffffff818a994d>]  [<ffffffff818a994d>] inet_fill_ifaddr+0x5a/0x264
<server19> [ 5449.906710] RSP: 0018:ffff8803fd757b98  EFLAGS: 00010286
<server19> [ 5449.906976] RAX: ffff8803ef65cb90 RBX: ffff8803f7d2cd00 RCX: 0000000000000000
<server19> [ 5449.907248] RDX: 0000000800000002 RSI: ffff8803ef65cb90 RDI: ffff8803ef65cba8
<server19> [ 5449.907519] RBP: ffff8803fd757be0 R08: 0000000000000008 R09: 0000000000000002
<server19> [ 5449.907792] R10: ffa005040269f480 R11: ffffffff820a1c00 R12: ffa005040269f480
<server19> [ 5449.908067] R13: ffff8803ef65cb90 R14: 0000000000000000 R15: ffff8803f7d2cd00
<server19> [ 5449.908339] FS:  00007f660674d700(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
<server19> [ 5449.908796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<server19> [ 5449.909067] CR2: 00000000008b9018 CR3: 00000003f2a11000 CR4: 00000000001406e0
<server19> [ 5449.909339] Stack:
<server19> [ 5449.909598]  0163a8c0869711ac 0000008000000000 ffffffffffffffff 0003e1d50003e1d5
<server19> [ 5449.910329]  ffff8800d54c0ac8 ffff8803f0d90000 0000000000000005 0000000000000000
<server19> [ 5449.911066]  ffff8803f7d2cd00 ffff8803fd757c40 ffffffff818a9f73 ffffffff820a1c00
<server19> [ 5449.911803] Call Trace:
<server19> [ 5449.912061]  [<ffffffff818a9f73>] inet_dump_ifaddr+0xfb/0x185
<server19> [ 5449.912332]  [<ffffffff8185de4b>] rtnl_dump_all+0xa9/0xc2
<server19> [ 5449.912601]  [<ffffffff818756d8>] netlink_dump+0xf0/0x25c
<server19> [ 5449.912873]  [<ffffffff818759ed>] netlink_recvmsg+0x1a9/0x2d3
<server19> [ 5449.913142]  [<ffffffff81838412>] sock_recvmsg+0x14/0x16
<server19> [ 5449.913407]  [<ffffffff8183a743>] ___sys_recvmsg+0xea/0x1a1
<server19> [ 5449.913675]  [<ffffffff811658e6>] ? alloc_pages_vma+0x167/0x1a0
<server19> [ 5449.913945]  [<ffffffff81159a8b>] ? page_add_new_anon_rmap+0xb4/0xbd
<server19> [ 5449.914212]  [<ffffffff8113b0d0>] ? lru_cache_add_active_or_unevictable+0x31/0x9d
<server19> [ 5449.914664]  [<ffffffff81151762>] ? handle_mm_fault+0x632/0x112d
<server19> [ 5449.914940]  [<ffffffff811550fe>] ? vma_merge+0x27e/0x2b1
<server19> [ 5449.915208]  [<ffffffff8183b4db>] __sys_recvmsg+0x3d/0x5e
<server19> [ 5449.915478]  [<ffffffff8183b4db>] ? __sys_recvmsg+0x3d/0x5e
<server19> [ 5449.915747]  [<ffffffff8183b509>] SyS_recvmsg+0xd/0x17
<server19> [ 5449.916017]  [<ffffffff818cb85f>] entry_SYSCALL_64_fastpath+0x17/0x93
<server19> [ 5449.916287] Code: e5 41 57 41 56 41 55 41 54 49 89 f4 53 89 c6 48 89 fb 48 83 ec 20 e8 be b0 fc ff 48 85 c0 49 89 c5 0f 84 f4 01 00 00 c6 40 10 02 8a 44 24 41 41 83 ce ff 45 89 f7 41 88 45 11 41 8b 44 24 44
<server19> [ 5449.921684] RIP  [<ffffffff818a994d>] inet_fill_ifaddr+0x5a/0x264
<server19> [ 5449.922028]  RSP <ffff8803fd757b98>
<server19> [ 5449.922547] ---[ end trace 18580d58f51e3038 ]---
<server19> [ 5449.923705] Kernel panic - not syncing: Fatal exception
<server19> [ 5449.923979] Kernel Offset: disabled
<server19> [ 5449.925873] Rebooting in 5 seconds..



<server19> [43221.432450] general protection fault: 0000 [#1] SMP
<server19> [43221.432656] Modules linked in: intel_ips intel_smartconnect intel_rst cls_fw act_police cls_u32 sch_ingress sch_sfq sch_htb pppoe pppox ppp_generic slhc netconsole configfs xt_nat ts_bm xt_string xt_connmark xt_TCPMSS xt_tcpudp xt_mark iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
<server19> [43221.433815] CPU: 3 PID: 29196 Comm: accel-cmd Not tainted 4.7.0-build-0110 #2
<server19> [43221.434024] Hardware name: Supermicro X10SLM+-LN4F/X10SLM+-LN4F, BIOS 3.0 04/24/2015
<server19> [43221.434414] task: ffff8803dcc39780 ti: ffff8800cdb18000 task.ti: ffff8800cdb18000
<server19> [43221.434805] RIP: 0010:[<ffffffff818a7fd0>]  [<ffffffff818a7fd0>] inet_fill_ifaddr+0x5a/0x264
<server19> [43221.435202] RSP: 0018:ffff8800cdb1bb98  EFLAGS: 00010282
<server19> [43221.435406] RAX: ffff8803fe89efb0 RBX: ffff8803de661500 RCX: 0000000000000000
<server19> [43221.435616] RDX: 0000000800000002 RSI: ffff8803fe89efb0 RDI: ffff8803fe89efc8
<server19> [43221.435823] RBP: ffff8800cdb1bbe0 R08: 0000000000000008 R09: 0000000000000002
<server19> [43221.436030] R10: ffa0050402880f80 R11: ffffffff820a1680 R12: ffa0050402880f80
<server19> [43221.436234] R13: ffff8803fe89efb0 R14: 0000000000000000 R15: ffff8803de661500
<server19> [43221.436436] FS:  00007f25a2539700(0000) GS:ffff88041fcc0000(0000) knlGS:0000000000000000
<server19> [43221.436821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<server19> [43221.437023] CR2: 000000000060f000 CR3: 00000000cd2e8000 CR4: 00000000001406e0
<server19> [43221.437227] Stack:
<server19> [43221.437419]  0163a8c0818411ac 0000008000000000 ffffffffffffffff 003a44db003a44db
<server19> [43221.437827]  ffff8803fe5992c8 ffff8803f5b04000 0000000000000003 0000000000000000
<server19> [43221.438230]  ffff8803de661500 ffff8800cdb1bc40 ffffffff818a85f6 ffffffff820a1680
<server19> [43221.438636] Call Trace:
<server19> [43221.438834]  [<ffffffff818a85f6>] inet_dump_ifaddr+0xfb/0x185
<server19> [43221.439035]  [<ffffffff8185c4ce>] rtnl_dump_all+0xa9/0xc2
<server19> [43221.439241]  [<ffffffff81873d5b>] netlink_dump+0xf0/0x25c
<server19> [43221.439441]  [<ffffffff81874070>] netlink_recvmsg+0x1a9/0x2d3
<server19> [43221.439641]  [<ffffffff81836a95>] sock_recvmsg+0x14/0x16
<server19> [43221.439841]  [<ffffffff81838dc6>] ___sys_recvmsg+0xea/0x1a1
<server19> [43221.440043]  [<ffffffff8116765f>] ? alloc_pages_vma+0x167/0x1a0
<server19> [43221.440247]  [<ffffffff8115b804>] ? page_add_new_anon_rmap+0xb4/0xbd
<server19> [43221.440449]  [<ffffffff8113ce49>] ? lru_cache_add_active_or_unevictable+0x31/0x9d
<server19> [43221.440837]  [<ffffffff811534db>] ? handle_mm_fault+0x632/0x112d
<server19> [43221.441038]  [<ffffffff81839636>] ? SyS_sendto+0xef/0x120
<server19> [43221.441241]  [<ffffffff81839b5e>] __sys_recvmsg+0x3d/0x5e
<server19> [43221.441443]  [<ffffffff81839b5e>] ? __sys_recvmsg+0x3d/0x5e
<server19> [43221.441644]  [<ffffffff81839b8c>] SyS_recvmsg+0xd/0x17
<server19> [43221.441849]  [<ffffffff818c9edf>] entry_SYSCALL_64_fastpath+0x17/0x93
<server19> [43221.442055] Code: e5 41 57 41 56 41 55 41 54 49 89 f4 53 89 c6 48 89 fb 48 83 ec 20 e8 be b0 fc ff 48 85 c0 49 89 c5 0f 84 f4 01 00 00 c6 40 10 02 8a 44 24 41 41 83 ce ff 45 89 f7 41 88 45 11 41 8b 44 24 44
<server19> [43221.442945] RIP  [<ffffffff818a7fd0>] inet_fill_ifaddr+0x5a/0x264
<server19> [43221.443151]  RSP <ffff8800cdb1bb98>
<server19> [43221.445125] ---[ end trace 99273d413e56a193 ]---
<server19> [43221.446262] Kernel panic - not syncing: Fatal exception
<server19> [43221.446536] Kernel Offset: disabled
<server19> [43221.448446] Rebooting in 5 seconds..
Jul 27 23:41:44 10.0.253.19 [43226.451328] ACPI MEMORY or I/O RESET_REG.


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-28 11:28     ` Denys Fedoryshchenko
@ 2016-08-01 20:54       ` Guillaume Nault
  2016-08-01 20:59       ` Guillaume Nault
  1 sibling, 0 replies; 13+ messages in thread
From: Guillaume Nault @ 2016-08-01 20:54 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Cong Wang, Linux Kernel Network Developers

On Thu, Jul 28, 2016 at 02:28:23PM +0300, Denys Fedoryshchenko wrote:
> On 2016-07-28 14:09, Guillaume Nault wrote:
> > On Tue, Jul 12, 2016 at 10:31:18AM -0700, Cong Wang wrote:
> > > On Mon, Jul 11, 2016 at 12:45 PM,  <nuclearcat@nuclearcat.com> wrote:
> > > > Hi
> > > >
> > > > On latest kernel i noticed kernel panic happening 1-2 times per day. It is
> > > > also happening on older kernel (at least 4.5.3).
> > > >
> > > ...
> > > >  [42916.426463] Call Trace:
> > > >  [42916.426658]  <IRQ>
> > > >
> > > >  [42916.426719]  [<ffffffff81843786>] skb_push+0x36/0x37
> > > >  [42916.427111]  [<ffffffffa00e8ce5>] ppp_start_xmit+0x10f/0x150
> > > > [ppp_generic]
> > > >  [42916.427314]  [<ffffffff81853467>] dev_hard_start_xmit+0x25a/0x2d3
> > > >  [42916.427516]  [<ffffffff818530f2>] ?
> > > > validate_xmit_skb.isra.107.part.108+0x11d/0x238
> > > >  [42916.427858]  [<ffffffff8186dee3>] sch_direct_xmit+0x89/0x1b5
> > > >  [42916.428060]  [<ffffffff8186e142>] __qdisc_run+0x133/0x170
> > > >  [42916.428261]  [<ffffffff81850034>] net_tx_action+0xe3/0x148
> > > >  [42916.428462]  [<ffffffff810c401a>] __do_softirq+0xb9/0x1a9
> > > >  [42916.428663]  [<ffffffff810c4251>] irq_exit+0x37/0x7c
> > > >  [42916.428862]  [<ffffffff8102b8f7>] smp_apic_timer_interrupt+0x3d/0x48
> > > >  [42916.429063]  [<ffffffff818cb15c>] apic_timer_interrupt+0x7c/0x90
> > > 
> > > Interesting, we call a skb_cow_head() before skb_push() in
> > > ppp_start_xmit(),
> > > I have no idea why this could happen.
> > > 
> > The skb is corrupted: head is at ffff8800b0bf2800 while data is at
> > ffa00500b0bf284c.
> > 
> > Figuring out how this corruption happened is going to be hard without a
> > way to reproduce the problem.
> > 
> > Denys, can you confirm you're using a vanilla kernel?
> > Also I guess the ppp devices and tc settings are handled by accel-ppp.
> > If so, can you share more info about your setup (accel-ppp.conf, radius
> > attributes, iptables...) so that I can try to reproduce it on my
> > machines?
> 
> I have slight modification from vanilla:
> 
> --- linux/net/sched/sch_htb.c	2016-06-08 01:23:53.000000000 +0000
> +++ linux-new/net/sched/sch_htb.c	2016-06-21 14:03:08.398486593 +0000
> @@ -1495,10 +1495,10 @@
>  				cl->common.classid);
>  			cl->quantum = 1000;
>  		}
> -		if (!hopt->quantum && cl->quantum > 200000) {
> +		if (!hopt->quantum && cl->quantum > 2000000) {
>  			pr_warn("HTB: quantum of class %X is big. Consider r2q change.\n",
>  				cl->common.classid);
> -			cl->quantum = 200000;
> +			cl->quantum = 2000000;
>  		}
>  		if (hopt->quantum)
>  			cl->quantum = hopt->quantum;
> 
> But I guess it should not be the reason for the crash (it is related to
> another system; without it I was unable to shape over 7Gbps, and maybe
> with the latest kernel I will not need this patch).
>
I guess such a big quantum is probably going to add some stress on HTB
because of longer dequeues, but that shouldn't make the kernel panic.
Anyway, I'm certainly not an HTB expert, so I can't comment further.
BTW, if you really need values this big, what about setting ->quantum
directly on the classes and dropping this patch?
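To illustrate that last point: HTB lets you set the quantum explicitly per class with tc, which bypasses the r2q-derived default (and its cap in sch_htb.c) entirely, so no kernel patch is needed. This is only a sketch; the device name, classid, and rates below are made up for illustration, not taken from the real setup:

```shell
# Hypothetical example: set the quantum explicitly on the classes that
# need a large value, instead of raising the global cap in sch_htb.c.
tc class add dev ppp2828 parent 1: classid 1:10 htb \
    rate 8gbit ceil 8gbit quantum 2000000

# ...or adjust an already existing class in place:
tc class change dev ppp2828 parent 1: classid 1:10 htb \
    rate 8gbit ceil 8gbit quantum 2000000
```

An explicitly configured quantum (`hopt->quantum`) takes precedence over the r2q computation, so the "quantum of class X is big" warning path is never reached for these classes.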

> I'm trying to create reproducible conditions for the crash, because right
> now it happens only on some servers in large networks (completely
> different ISPs, so I have excluded a hardware fault on a specific server).
> It is a complex config: I have accel-ppp, plus my own "shaping daemon"
> that applies several shapers on ppp interfaces. Worst of all, it happens
> only with live customers; I am unable to reproduce it in stress tests.
> Also, until a recent kernel I was getting different panic messages (but
> all related to ppp).
> 
In the log I commented on earlier, the skb was probably corrupted before
the ppp_start_xmit() call. The PPP module hadn't done anything at that
stage, unless the packet was forwarded from another PPP interface.
In short, the corruption could have happened anywhere, so we really need
to narrow down the scope or find a way to reproduce the problem.

> I think at least one cause of the crashes was also fixed by "ppp: defer
> netns reference release for ppp channel" in 4.7.0 (maybe that's why I am
> getting fewer crashes recently).
> I have also tried various kernel debug options that don't cause major
> performance degradation (lock checking, freed-memory poisoning, etc.),
> without any luck yet.
> Would it be useful if I posted panics that occur
> at least twice? (I will post an example below, received recently.)
Do you mean that you have many more different panic traces?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-07-28 11:28     ` Denys Fedoryshchenko
  2016-08-01 20:54       ` Guillaume Nault
@ 2016-08-01 20:59       ` Guillaume Nault
  2016-08-01 22:52         ` Denys Fedoryshchenko
  2016-08-08 11:25         ` Denys Fedoryshchenko
  1 sibling, 2 replies; 13+ messages in thread
From: Guillaume Nault @ 2016-08-01 20:59 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Cong Wang, Linux Kernel Network Developers

On Thu, Jul 28, 2016 at 02:28:23PM +0300, Denys Fedoryshchenko wrote:
> <server19> [ 5449.904989] CPU: 1 PID: 6359 Comm: ip Not tainted
> 4.7.0-build-0109 #2
> <server19> [ 5449.905255] Hardware name: Supermicro
> X10SLM+-LN4F/X10SLM+-LN4F, BIOS 3.0 04/24/2015
> <server19> [ 5449.905712] task: ffff8803eef40000 ti: ffff8803fd754000
> task.ti: ffff8803fd754000
> <server19> [ 5449.906168] RIP: 0010:[<ffffffff818a994d>]
> <server19> [<ffffffff818a994d>] inet_fill_ifaddr+0x5a/0x264
> <server19> [ 5449.906710] RSP: 0018:ffff8803fd757b98  EFLAGS: 00010286
> <server19> [ 5449.906976] RAX: ffff8803ef65cb90 RBX: ffff8803f7d2cd00 RCX:
> 0000000000000000
> <server19> [ 5449.907248] RDX: 0000000800000002 RSI: ffff8803ef65cb90 RDI:
> ffff8803ef65cba8
> <server19> [ 5449.907519] RBP: ffff8803fd757be0 R08: 0000000000000008 R09:
> 0000000000000002
> <server19> [ 5449.907792] R10: ffa005040269f480 R11: ffffffff820a1c00 R12:
> ffa005040269f480
> <server19> [ 5449.908067] R13: ffff8803ef65cb90 R14: 0000000000000000 R15:
> ffff8803f7d2cd00
> <server19> [ 5449.908339] FS:  00007f660674d700(0000)
> GS:ffff88041fc40000(0000) knlGS:0000000000000000
> <server19> [ 5449.908796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <server19> [ 5449.909067] CR2: 00000000008b9018 CR3: 00000003f2a11000 CR4:
> 00000000001406e0
> <server19> [ 5449.909339] Stack:
> <server19> [ 5449.909598]  0163a8c0869711ac
> <server19> 0000008000000000
> <server19> ffffffffffffffff
> <server19> 0003e1d50003e1d5
> <server19>
> <server19> [ 5449.910329]  ffff8800d54c0ac8
> <server19> ffff8803f0d90000
> <server19> 0000000000000005
> <server19> 0000000000000000
> <server19>
> <server19> [ 5449.911066]  ffff8803f7d2cd00
> <server19> ffff8803fd757c40
> <server19> ffffffff818a9f73
> <server19> ffffffff820a1c00
> <server19>
> <server19> [ 5449.911803] Call Trace:
> <server19> [ 5449.912061]  [<ffffffff818a9f73>] inet_dump_ifaddr+0xfb/0x185
> <server19> [ 5449.912332]  [<ffffffff8185de4b>] rtnl_dump_all+0xa9/0xc2
> <server19> [ 5449.912601]  [<ffffffff818756d8>] netlink_dump+0xf0/0x25c
> <server19> [ 5449.912873]  [<ffffffff818759ed>] netlink_recvmsg+0x1a9/0x2d3
> <server19> [ 5449.913142]  [<ffffffff81838412>] sock_recvmsg+0x14/0x16
> <server19> [ 5449.913407]  [<ffffffff8183a743>] ___sys_recvmsg+0xea/0x1a1
> <server19> [ 5449.913675]  [<ffffffff811658e6>] ?
> alloc_pages_vma+0x167/0x1a0
> <server19> [ 5449.913945]  [<ffffffff81159a8b>] ?
> page_add_new_anon_rmap+0xb4/0xbd
> <server19> [ 5449.914212]  [<ffffffff8113b0d0>] ?
> lru_cache_add_active_or_unevictable+0x31/0x9d
> <server19> [ 5449.914664]  [<ffffffff81151762>] ?
> handle_mm_fault+0x632/0x112d
> <server19> [ 5449.914940]  [<ffffffff811550fe>] ? vma_merge+0x27e/0x2b1
> <server19> [ 5449.915208]  [<ffffffff8183b4db>] __sys_recvmsg+0x3d/0x5e
> <server19> [ 5449.915478]  [<ffffffff8183b4db>] ? __sys_recvmsg+0x3d/0x5e
> <server19> [ 5449.915747]  [<ffffffff8183b509>] SyS_recvmsg+0xd/0x17
> <server19> [ 5449.916017]  [<ffffffff818cb85f>]
> entry_SYSCALL_64_fastpath+0x17/0x93
> 
Do you still have the vmlinux file with debug symbols that generated
this panic?


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-08-01 20:59       ` Guillaume Nault
@ 2016-08-01 22:52         ` Denys Fedoryshchenko
  2016-08-08 11:25         ` Denys Fedoryshchenko
  1 sibling, 0 replies; 13+ messages in thread
From: Denys Fedoryshchenko @ 2016-08-01 22:52 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers

On 2016-08-01 23:59, Guillaume Nault wrote:
> On Thu, Jul 28, 2016 at 02:28:23PM +0300, Denys Fedoryshchenko wrote:
>> <server19> [ 5449.904989] CPU: 1 PID: 6359 Comm: ip Not tainted
>> 4.7.0-build-0109 #2
>> <server19> [ 5449.905255] Hardware name: Supermicro
>> X10SLM+-LN4F/X10SLM+-LN4F, BIOS 3.0 04/24/2015
>> <server19> [ 5449.905712] task: ffff8803eef40000 ti: ffff8803fd754000
>> task.ti: ffff8803fd754000
>> <server19> [ 5449.906168] RIP: 0010:[<ffffffff818a994d>]
>> <server19> [<ffffffff818a994d>] inet_fill_ifaddr+0x5a/0x264
>> <server19> [ 5449.906710] RSP: 0018:ffff8803fd757b98  EFLAGS: 00010286
>> <server19> [ 5449.906976] RAX: ffff8803ef65cb90 RBX: ffff8803f7d2cd00 
>> RCX:
>> 0000000000000000
>> <server19> [ 5449.907248] RDX: 0000000800000002 RSI: ffff8803ef65cb90 
>> RDI:
>> ffff8803ef65cba8
>> <server19> [ 5449.907519] RBP: ffff8803fd757be0 R08: 0000000000000008 
>> R09:
>> 0000000000000002
>> <server19> [ 5449.907792] R10: ffa005040269f480 R11: ffffffff820a1c00 
>> R12:
>> ffa005040269f480
>> <server19> [ 5449.908067] R13: ffff8803ef65cb90 R14: 0000000000000000 
>> R15:
>> ffff8803f7d2cd00
>> <server19> [ 5449.908339] FS:  00007f660674d700(0000)
>> GS:ffff88041fc40000(0000) knlGS:0000000000000000
>> <server19> [ 5449.908796] CS:  0010 DS: 0000 ES: 0000 CR0: 
>> 0000000080050033
>> <server19> [ 5449.909067] CR2: 00000000008b9018 CR3: 00000003f2a11000 
>> CR4:
>> 00000000001406e0
>> <server19> [ 5449.909339] Stack:
>> <server19> [ 5449.909598]  0163a8c0869711ac
>> <server19> 0000008000000000
>> <server19> ffffffffffffffff
>> <server19> 0003e1d50003e1d5
>> <server19>
>> <server19> [ 5449.910329]  ffff8800d54c0ac8
>> <server19> ffff8803f0d90000
>> <server19> 0000000000000005
>> <server19> 0000000000000000
>> <server19>
>> <server19> [ 5449.911066]  ffff8803f7d2cd00
>> <server19> ffff8803fd757c40
>> <server19> ffffffff818a9f73
>> <server19> ffffffff820a1c00
>> <server19>
>> <server19> [ 5449.911803] Call Trace:
>> <server19> [ 5449.912061]  [<ffffffff818a9f73>] 
>> inet_dump_ifaddr+0xfb/0x185
>> <server19> [ 5449.912332]  [<ffffffff8185de4b>] 
>> rtnl_dump_all+0xa9/0xc2
>> <server19> [ 5449.912601]  [<ffffffff818756d8>] 
>> netlink_dump+0xf0/0x25c
>> <server19> [ 5449.912873]  [<ffffffff818759ed>] 
>> netlink_recvmsg+0x1a9/0x2d3
>> <server19> [ 5449.913142]  [<ffffffff81838412>] sock_recvmsg+0x14/0x16
>> <server19> [ 5449.913407]  [<ffffffff8183a743>] 
>> ___sys_recvmsg+0xea/0x1a1
>> <server19> [ 5449.913675]  [<ffffffff811658e6>] ?
>> alloc_pages_vma+0x167/0x1a0
>> <server19> [ 5449.913945]  [<ffffffff81159a8b>] ?
>> page_add_new_anon_rmap+0xb4/0xbd
>> <server19> [ 5449.914212]  [<ffffffff8113b0d0>] ?
>> lru_cache_add_active_or_unevictable+0x31/0x9d
>> <server19> [ 5449.914664]  [<ffffffff81151762>] ?
>> handle_mm_fault+0x632/0x112d
>> <server19> [ 5449.914940]  [<ffffffff811550fe>] ? 
>> vma_merge+0x27e/0x2b1
>> <server19> [ 5449.915208]  [<ffffffff8183b4db>] 
>> __sys_recvmsg+0x3d/0x5e
>> <server19> [ 5449.915478]  [<ffffffff8183b4db>] ? 
>> __sys_recvmsg+0x3d/0x5e
>> <server19> [ 5449.915747]  [<ffffffff8183b509>] SyS_recvmsg+0xd/0x17
>> <server19> [ 5449.916017]  [<ffffffff818cb85f>]
>> entry_SYSCALL_64_fastpath+0x17/0x93
>> 
> Do you still have the vmlinux file with debug symbols that generated
> this panic?

I have a slightly different build now (I tried to enable slightly
different kernel options), but I also got a new panic in inet_fill_ifaddr
on the new build. I will prepare everything tomorrow (it is all at the
office) and provide a link with the sources and vmlinux, and of course
the new panic message from this build.
The new panic happened at a completely different location and ISP.


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-08-01 20:59       ` Guillaume Nault
  2016-08-01 22:52         ` Denys Fedoryshchenko
@ 2016-08-08 11:25         ` Denys Fedoryshchenko
  2016-08-08 21:05           ` Guillaume Nault
  1 sibling, 1 reply; 13+ messages in thread
From: Denys Fedoryshchenko @ 2016-08-08 11:25 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers

On 2016-08-01 23:59, Guillaume Nault wrote:
> Do you still have the vmlinux file with debug symbols that generated
> this panic?
Sorry for the delay; I didn't have the same image on all servers, but I
have probably found the cause of the panic and am still testing on
several servers.
If I remove the SFQ qdisc from the ppp shapers, the servers don't reboot
anymore. But I still need around 2 days to make sure that's the reason.
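One way such an A/B test could be done without tearing down the shapers is to swap only the leaf qdisc under the HTB class at runtime. This is a hedged sketch: the device, handles, and classid are hypothetical, not taken from the real accel-ppp setup:

```shell
# Hypothetical: replace the sfq leaf under HTB class 1:10 with a plain
# pfifo, leaving the HTB shaping hierarchy itself untouched.
tc qdisc replace dev ppp2828 parent 1:10 handle 110: pfifo limit 1000

# ...and put sfq back later to compare crash rates:
tc qdisc replace dev ppp2828 parent 1:10 handle 110: sfq perturb 10
```

`tc qdisc replace` swaps the leaf atomically, so the test can be flipped back and forth on live interfaces while watching for the panic.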


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-08-08 11:25         ` Denys Fedoryshchenko
@ 2016-08-08 21:05           ` Guillaume Nault
  2016-08-17 11:54             ` Denys Fedoryshchenko
  0 siblings, 1 reply; 13+ messages in thread
From: Guillaume Nault @ 2016-08-08 21:05 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Cong Wang, Linux Kernel Network Developers

On Mon, Aug 08, 2016 at 02:25:00PM +0300, Denys Fedoryshchenko wrote:
> On 2016-08-01 23:59, Guillaume Nault wrote:
> > Do you still have the vmlinux file with debug symbols that generated
> > this panic?
> Sorry for the delay; I didn't have the same image on all servers, but I
> have probably found the cause of the panic and am still testing on
> several servers.
> If I remove the SFQ qdisc from the ppp shapers, the servers don't reboot
> anymore.
> 
Thanks for the feedback. I wonder which interactions between SFQ and
PPP can lead to this problem. I'll take a look.

> But I still need around 2 days to make sure that's the reason.
> 
Okay, just let me know if you can confirm that removing SFQ really
solves the problem.


* Re: 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit
  2016-08-08 21:05           ` Guillaume Nault
@ 2016-08-17 11:54             ` Denys Fedoryshchenko
  0 siblings, 0 replies; 13+ messages in thread
From: Denys Fedoryshchenko @ 2016-08-17 11:54 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: Cong Wang, Linux Kernel Network Developers, netdev-owner

On 2016-08-09 00:05, Guillaume Nault wrote:
> On Mon, Aug 08, 2016 at 02:25:00PM +0300, Denys Fedoryshchenko wrote:
>> On 2016-08-01 23:59, Guillaume Nault wrote:
>> > Do you still have the vmlinux file with debug symbols that generated
>> > this panic?
>> Sorry for the delay; I didn't have the same image on all servers, but I
>> have probably found the cause of the panic and am still testing on
>> several servers.
>> If I remove the SFQ qdisc from the ppp shapers, the servers don't
>> reboot anymore.
>> 
> Thanks for the feedback. I wonder which interactions between SFQ and
> PPP can lead to this problem. I'll take a look.
> 
>> But I still need around 2 days to make sure that's the reason.
>> 
> Okay, just let me know if you can confirm that removing SFQ really
> solves the problem.
After long testing, I can confirm that removing sfq from the rules
greatly reduced the panic reboots; this is tested on many different
servers.
Today I will try some stress tests: apply sfq qdiscs to a live system at
night, then remove them.
Then I will also try to disconnect all users with sfq qdiscs attached.
I'm not sure this will help to reproduce the bug, but it's worth a try.

I am still hitting a different conntrack bug about once per week, and
that's why I was confused: I was getting panics clearly in conntrack and
then something else, and I was not sure whether these were different
bugs, a hardware glitch, or something else.


end of thread, other threads:[~2016-08-17 11:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-11 19:45 4.6.3, pppoe + shaper workload, skb_panic / skb_push / ppp_start_xmit nuclearcat
2016-07-12 17:31 ` Cong Wang
2016-07-12 18:03   ` nuclearcat
2016-07-12 18:05     ` Cong Wang
2016-07-12 18:13       ` nuclearcat
2016-07-28 11:09   ` Guillaume Nault
2016-07-28 11:28     ` Denys Fedoryshchenko
2016-08-01 20:54       ` Guillaume Nault
2016-08-01 20:59       ` Guillaume Nault
2016-08-01 22:52         ` Denys Fedoryshchenko
2016-08-08 11:25         ` Denys Fedoryshchenko
2016-08-08 21:05           ` Guillaume Nault
2016-08-17 11:54             ` Denys Fedoryshchenko
