netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ipv4_dst_destroy panic regression after 3.10.15
@ 2014-01-18  1:25 dormando
  2014-01-18  6:49 ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: dormando @ 2014-01-18  1:25 UTC (permalink / raw)
  To: netdev, linux-kernel

Hi,

Upgraded a few kernels to the latest 3.10 stable tree while tracking down
a rare kernel panic, seems to have introduced a much more frequent kernel
panic. Takes anywhere from 4 hours to 2 days to trigger:

<4>[196727.311203] general protection fault: 0000 [#1] SMP
<4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
<4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
<4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
<4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
<4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
<4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
<4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
<4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
<4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
<4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[196727.311713] Stack:
<4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
<4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
<4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
<4>[196727.311885] Call Trace:
<4>[196727.311907]  <IRQ>
<4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
<4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
<4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
<4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
<4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
<4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
<4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
<4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
<4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
<4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
<4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
<4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
<4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
<4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
<4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
<4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
<4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
<4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
<4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
<4>[196727.312722]  <EOI>
<4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
<4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
<4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
<4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
<4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
<1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.313100]  RSP <ffff885effd23a70>
<4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
<0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt


... bisecting it's going to be a pain... I tried eyeballing the diffs and
am trying a revert or two.

We've hit it in .25, .26 so far. I have .27 running but not sure if it
crashed, so the change exists between .15 and .25.

Thanks,
-Dormando

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18  1:25 ipv4_dst_destroy panic regression after 3.10.15 dormando
@ 2014-01-18  6:49 ` Eric Dumazet
  2014-01-18  7:09   ` Eric Dumazet
  2014-01-20  2:56   ` dormando
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2014-01-18  6:49 UTC (permalink / raw)
  To: dormando; +Cc: netdev, linux-kernel

On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> Hi,
> 
> Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> a rare kernel panic, seems to have introduced a much more frequent kernel
> panic. Takes anywhere from 4 hours to 2 days to trigger:
> 
> <4>[196727.311203] general protection fault: 0000 [#1] SMP
> <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>[196727.311713] Stack:
> <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> <4>[196727.311885] Call Trace:
> <4>[196727.311907]  <IRQ>
> <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> <4>[196727.312722]  <EOI>
> <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> <4>[196727.313100]  RSP <ffff885effd23a70>
> <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> am trying a revert or two.
> 
> We've hit it in .25, .26 so far. I have .27 running but not sure if it
> crashed, so the change exists between .15 and .25.

Please try following fix, thanks for the report !

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 25071b48921c..78a50a22298a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
*dst)
 
 	if (!list_empty(&rt->rt_uncached)) {
 		spin_lock_bh(&rt_uncached_lock);
-		list_del(&rt->rt_uncached);
+		list_del_init(&rt->rt_uncached);
 		spin_unlock_bh(&rt_uncached_lock);
 	}
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18  6:49 ` Eric Dumazet
@ 2014-01-18  7:09   ` Eric Dumazet
  2014-01-18  7:16     ` dormando
  2014-01-20  2:56   ` dormando
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2014-01-18  7:09 UTC (permalink / raw)
  To: dormando; +Cc: netdev, linux-kernel, Alexei Starovoitov

On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > Hi,
> > 
> > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > a rare kernel panic, seems to have introduced a much more frequent kernel
> > panic. Takes anywhere from 4 hours to 2 days to trigger:
> > 
> > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > <4>[196727.311713] Stack:
> > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > <4>[196727.311885] Call Trace:
> > <4>[196727.311907]  <IRQ>
> > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > <4>[196727.312722]  <EOI>
> > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > <4>[196727.313100]  RSP <ffff885effd23a70>
> > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > 
> > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > am trying a revert or two.
> > 
> > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > crashed, so the change exists between .15 and .25.
> 
> Please try following fix, thanks for the report !
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 25071b48921c..78a50a22298a 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> *dst)
>  
>  	if (!list_empty(&rt->rt_uncached)) {
>  		spin_lock_bh(&rt_uncached_lock);
> -		list_del(&rt->rt_uncached);
> +		list_del_init(&rt->rt_uncached);
>  		spin_unlock_bh(&rt_uncached_lock);
>  	}
>  }
> 

Problem could come from this commit, in linux 3.10.23,
you also could try to revert it
 
commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
Author: Alexei Starovoitov <ast@plumgrid.com>
Date:   Tue Nov 19 19:12:34 2013 -0800

    ipv4: fix race in concurrent ip_route_input_slow()
    
    [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
    
    CPUs can ask for local route via ip_route_input_noref() concurrently.
    if nh_rth_input is not cached yet, CPUs will proceed to allocate
    equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
    via rt_cache_route()
    Most of the time they succeed, but on occasion the following two lines:
        orig = *p;
        prev = cmpxchg(p, orig, rt);
    in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
    But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
    so dst is leaking. dst_destroy() is never called and 'lo' device
    refcnt doesn't go to zero, which can be seen in the logs as:
        unregister_netdevice: waiting for lo to become free. Usage count = 1
    Adding mdelay() between above two lines makes it easily reproducible.
    Fix it similar to nh_pcpu_rth_output case.
    
    Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
    Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18  7:09   ` Eric Dumazet
@ 2014-01-18  7:16     ` dormando
  2014-01-18 17:18       ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: dormando @ 2014-01-18  7:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, Alexei Starovoitov

> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > > Hi,
> > >
> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> > >
> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > <4>[196727.311713] Stack:
> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > > <4>[196727.311885] Call Trace:
> > > <4>[196727.311907]  <IRQ>
> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > > <4>[196727.312722]  <EOI>
> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> > >
> > >
> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > > am trying a revert or two.
> > >
> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > > crashed, so the change exists between .15 and .25.
> >
> > Please try following fix, thanks for the report !
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 25071b48921c..78a50a22298a 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> > *dst)
> >
> >  	if (!list_empty(&rt->rt_uncached)) {
> >  		spin_lock_bh(&rt_uncached_lock);
> > -		list_del(&rt->rt_uncached);
> > +		list_del_init(&rt->rt_uncached);
> >  		spin_unlock_bh(&rt_uncached_lock);
> >  	}
> >  }
> >
>
> Problem could come from this commit, in linux 3.10.23,
> you also could try to revert it
>
> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> Author: Alexei Starovoitov <ast@plumgrid.com>
> Date:   Tue Nov 19 19:12:34 2013 -0800
>
>     ipv4: fix race in concurrent ip_route_input_slow()
>
>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
>
>     CPUs can ask for local route via ip_route_input_noref() concurrently.
>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
>     via rt_cache_route()
>     Most of the time they succeed, but on occasion the following two lines:
>         orig = *p;
>         prev = cmpxchg(p, orig, rt);
>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
>     so dst is leaking. dst_destroy() is never called and 'lo' device
>     refcnt doesn't go to zero, which can be seen in the logs as:
>         unregister_netdevice: waiting for lo to become free. Usage count = 1
>     Adding mdelay() between above two lines makes it easily reproducible.
>     Fix it similar to nh_pcpu_rth_output case.
>
>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>

Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
my best guess. I have a kernel running in production with only this
reverted as of ~5 hours ago. Won't know if it helps for a day or two.

I'm building a kernel now with your route patch, but 62713c4 *not*
reverted, which I can throw on a different machine. Does this sound like a
good idea?

Thanks for your quick help as always!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18  7:16     ` dormando
@ 2014-01-18 17:18       ` Alexei Starovoitov
  2014-01-18 19:13         ` dormando
  2014-01-22  1:39         ` dormando
  0 siblings, 2 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2014-01-18 17:18 UTC (permalink / raw)
  To: dormando; +Cc: Eric Dumazet, netdev, linux-kernel, Alexei Starovoitov

On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
>> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
>> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
>> > > Hi,
>> > >
>> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
>> > > a rare kernel panic, seems to have introduced a much more frequent kernel
>> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
>> > >
>> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
>> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
>> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
>> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
>> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
>> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
>> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
>> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
>> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
>> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
>> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
>> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
>> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > > <4>[196727.311713] Stack:
>> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
>> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
>> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
>> > > <4>[196727.311885] Call Trace:
>> > > <4>[196727.311907]  <IRQ>
>> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
>> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
>> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
>> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
>> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
>> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
>> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
>> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
>> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
>> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
>> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
>> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
>> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
>> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
>> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
>> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
>> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
>> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
>> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
>> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
>> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
>> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
>> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
>> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
>> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
>> > > <4>[196727.312722]  <EOI>
>> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
>> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
>> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
>> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
>> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
>> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>> > > <4>[196727.313100]  RSP <ffff885effd23a70>
>> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
>> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
>> > >
>> > >
>> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
>> > > am trying a revert or two.
>> > >
>> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
>> > > crashed, so the change exists between .15 and .25.
>> >
>> > Please try following fix, thanks for the report !
>> >
>> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> > index 25071b48921c..78a50a22298a 100644
>> > --- a/net/ipv4/route.c
>> > +++ b/net/ipv4/route.c
>> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
>> > *dst)
>> >
>> >     if (!list_empty(&rt->rt_uncached)) {
>> >             spin_lock_bh(&rt_uncached_lock);
>> > -           list_del(&rt->rt_uncached);
>> > +           list_del_init(&rt->rt_uncached);
>> >             spin_unlock_bh(&rt_uncached_lock);
>> >     }
>> >  }
>> >
>>
>> Problem could come from this commit, in linux 3.10.23,
>> you also could try to revert it
>>
>> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
>> Author: Alexei Starovoitov <ast@plumgrid.com>
>> Date:   Tue Nov 19 19:12:34 2013 -0800
>>
>>     ipv4: fix race in concurrent ip_route_input_slow()
>>
>>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
>>
>>     CPUs can ask for local route via ip_route_input_noref() concurrently.
>>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
>>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
>>     via rt_cache_route()
>>     Most of the time they succeed, but on occasion the following two lines:
>>         orig = *p;
>>         prev = cmpxchg(p, orig, rt);
>>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
>>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
>>     so dst is leaking. dst_destroy() is never called and 'lo' device
>>     refcnt doesn't go to zero, which can be seen in the logs as:
>>         unregister_netdevice: waiting for lo to become free. Usage count = 1
>>     Adding mdelay() between above two lines makes it easily reproducible.
>>     Fix it similar to nh_pcpu_rth_output case.
>>
>>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
>>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>
>
> Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> my best guess. I have a kernel running in production with only this
> reverted as of ~5 hours ago. Won't know if it helps for a day or two.
>
> I'm building a kernel now with your route patch, but 62713c4 *not*
> reverted, which I can throw on a different machine. Does this sound like a
> good idea?

the traces of your crash don't look similar to dst leak that was fixed by
commit 62713c4...

To trigger the 'fix' logic of the 62713c4 add the following diff:
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f6c6ab1..8972676 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
struct rtable *rt)
                p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
        }
        orig = *p;
-
+       mdelay(100);
        prev = cmpxchg(p, orig, rt);
        if (prev == orig) {
                if (orig)

I've been running with it for a day without issues.
Note that it will stress both 'input' and 'output' ways of adding dsts to
rt_uncached list...
and 'output' was there for ages.

If mdelay() helps to reproduce it in minutes instead of days
then we're on the right path.
Could you share details of your workload?
In our case it was a lot of namespaces with a ton of processes
talking to each other, forcefully killed and restarted.
Do you see the same crash in the latest tree?

PS sorry for double posts. netdev email bounced few times for me...

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18 17:18       ` Alexei Starovoitov
@ 2014-01-18 19:13         ` dormando
  2014-01-22  1:39         ` dormando
  1 sibling, 0 replies; 12+ messages in thread
From: dormando @ 2014-01-18 19:13 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Eric Dumazet, netdev, linux-kernel, Alexei Starovoitov

> On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> >> > > Hi,
> >> > >
> >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> >> > >
> >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> > > <4>[196727.311713] Stack:
> >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> >> > > <4>[196727.311885] Call Trace:
> >> > > <4>[196727.311907]  <IRQ>
> >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> >> > > <4>[196727.312722]  <EOI>
> >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> >> > >
> >> > >
> >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> >> > > am trying a revert or two.
> >> > >
> >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> >> > > crashed, so the change exists between .15 and .25.
> >> >
> >> > Please try following fix, thanks for the report !
> >> >
> >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >> > index 25071b48921c..78a50a22298a 100644
> >> > --- a/net/ipv4/route.c
> >> > +++ b/net/ipv4/route.c
> >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> >> > *dst)
> >> >
> >> >     if (!list_empty(&rt->rt_uncached)) {
> >> >             spin_lock_bh(&rt_uncached_lock);
> >> > -           list_del(&rt->rt_uncached);
> >> > +           list_del_init(&rt->rt_uncached);
> >> >             spin_unlock_bh(&rt_uncached_lock);
> >> >     }
> >> >  }
> >> >
> >>
> >> Problem could come from this commit, in linux 3.10.23,
> >> you also could try to revert it
> >>
> >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> >> Author: Alexei Starovoitov <ast@plumgrid.com>
> >> Date:   Tue Nov 19 19:12:34 2013 -0800
> >>
> >>     ipv4: fix race in concurrent ip_route_input_slow()
> >>
> >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> >>
> >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> >>     via rt_cache_route()
> >>     Most of the time they succeed, but on occasion the following two lines:
> >>         orig = *p;
> >>         prev = cmpxchg(p, orig, rt);
> >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> >>     refcnt doesn't go to zero, which can be seen in the logs as:
> >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>     Adding mdelay() between above two lines makes it easily reproducible.
> >>     Fix it similar to nh_pcpu_rth_output case.
> >>
> >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>
> >
> > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> > my best guess. I have a kernel running in production with only this
> > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> >
> > I'm building a kernel now with your route patch, but 62713c4 *not*
> > reverted, which I can throw on a different machine. Does this sound like a
> > good idea?
>
> the traces of your crash don't look similar to dst leak that was fixed by
> commit 62713c4...
>
> To trigger the 'fix' logic of the 62713c4 add the following diff:
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f6c6ab1..8972676 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> struct rtable *rt)
>                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>         }
>         orig = *p;
> -
> +       mdelay(100);
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig)
>
> I've been running with it for a day without issues.
> Note that it will stress both 'input' and 'output' ways of adding dsts to
> rt_uncached list...
> and 'output' was there for ages.
>
> If mdelay() helps to reproduce it in minutes instead of days
> then we're on the right path.
> Could you share details of your workload?
> In our case it was a lot of namespaces with a ton of processes
> talking to each other, forcefully killed and restarted.
> Do you see the same crash in the latest tree?
>
> PS sorry for double posts. netdev email bounced few times for me...
>

I have two machines running the experiments I noted last night. If
either blow up again I'll try this. If reverting your patch fixes it, it's
likely just a side effect of your fix (somehow...). Nothing else changed
the dst references between .15 and .25 so far as I could see.

We have a fairly bizarre/hard to reproduce workload. Full routing tables
loaded, bgp daemons running, a few hundred thousand threads, many gigabits
of direct-to-internet traffic. All on each machine.

By latest tree, you mean the very latest linux git or latest
released/stable version? Diffs are large enough that I'd likely just hit a
new crash. I might have to do that for the other panic I posted, but this
one is definitely a regression in the 3.10 stable series.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18  6:49 ` Eric Dumazet
  2014-01-18  7:09   ` Eric Dumazet
@ 2014-01-20  2:56   ` dormando
  1 sibling, 0 replies; 12+ messages in thread
From: dormando @ 2014-01-20  2:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel

> On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > Hi,
> >
> > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > a rare kernel panic, seems to have introduced a much more frequent kernel
> > panic. Takes anywhere from 4 hours to 2 days to trigger:
> >
> > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > <4>[196727.311713] Stack:
> > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > <4>[196727.311885] Call Trace:
> > <4>[196727.311907]  <IRQ>
> > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > <4>[196727.312722]  <EOI>
> > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > <4>[196727.313100]  RSP <ffff885effd23a70>
> > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> >
> >
> > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > am trying a revert or two.
> >
> > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > crashed, so the change exists between .15 and .25.
>
> Please try following fix, thanks for the report !
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 25071b48921c..78a50a22298a 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> *dst)
>
>  	if (!list_empty(&rt->rt_uncached)) {
>  		spin_lock_bh(&rt_uncached_lock);
> -		list_del(&rt->rt_uncached);
> +		list_del_init(&rt->rt_uncached);
>  		spin_unlock_bh(&rt_uncached_lock);
>  	}
>  }

This one crashed after a day and a half with a new-to-me trace. The other
machine running 3.10.27 + the reversion of the other patch is still going.
I now have two machines on that kernel to see if they're stable for more
than two days.

Panic#2 Part1
<4>[116124.368440] ------------[ cut here ]------------
<4>[116124.368450] WARNING: at net/core/dst.c:276 dst_release+0x7e/0x80()
<4>[116124.368451] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich isci ipmi_watchdog ipmi_devintf ixgbe microcode igb sb_edac edac_core libsas lpc_ich i2c_algo_bit mfd_core ptp pps_core tpm_tis mdio tpm tpm_bios ipmi_si ipmi_msghandler
<4>[116124.368470] CPU: 1 PID: 166084 Comm: cache-worker Tainted: G        W    3.10.27 #1
<4>[116124.368471] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[116124.368473]  ffffffff81a86f96 ffff885effc23b30 ffffffff816c045e ffff885effc23b70
<4>[116124.368475]  ffffffff81049930 ffff88be64ed4000 ffff88531e227980 00000000fffffff9
<4>[116124.368477]  ffffffff81c876c0 ffff8852ef21ec00 ffff882dffaaa5e2 ffff885effc23b80
<4>[116124.368479] Call Trace:
<4>[116124.368480]  <IRQ>  [<ffffffff816c045e>] dump_stack+0x19/0x1b
<4>[116124.368490]  [<ffffffff81049930>] warn_slowpath_common+0x70/0xa0
<4>[116124.368492]  [<ffffffff8104997a>] warn_slowpath_null+0x1a/0x20
<4>[116124.368494]  [<ffffffff815b870e>] dst_release+0x7e/0x80
<4>[116124.368497]  [<ffffffff816351f9>] inet_sock_destruct+0x129/0x1d0
<4>[116124.368500]  [<ffffffff8159d4bd>] __sk_free+0x1d/0x140
<4>[116124.368502]  [<ffffffff8159d675>] sk_free+0x25/0x30
<4>[116124.368506]  [<ffffffff8161e3fd>] sock_put+0x1d/0x30
<4>[116124.368508]  [<ffffffff81622afb>] tcp_v4_rcv+0x77b/0x820
<4>[116124.368511]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[116124.368515]  [<ffffffff815de03d>] ? nf_hook_slow+0x7d/0x150
<4>[116124.368517]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[116124.368519]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[116124.368521]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[116124.368523]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[116124.368526]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[116124.368529]  [<ffffffff815b0c82>] __netif_receive_skb_core+0x512/0x640
<4>[116124.368531]  [<ffffffff815b0dd1>] __netif_receive_skb+0x21/0x70
<4>[116124.368533]  [<ffffffff815b0ebb>] process_backlog+0x9b/0x170
<4>[116124.368535]  [<ffffffff815b16b1>] net_rx_action+0x111/0x210
<4>[116124.368538]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[116124.368542]  [<ffffffff816cf03c>] call_softirq+0x1c/0x30
<4>[116124.368546]  [<ffffffff81004215>] do_softirq+0x55/0x90
<4>[116124.368548]  [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[116124.368550]  [<ffffffff816cf6c3>] do_IRQ+0x63/0xe0
<4>[116124.368553]  [<ffffffff816c5baa>] common_interrupt+0x6a/0x6a
<4>[116124.368554]  <EOI>  [<ffffffff816cde02>] ? system_call_fastpath+0x16/0x1b
<4>[116124.368558] ---[ end trace 3ed2370f32f77a53 ]---
<7>[116125.177470] Peer 88.153.144.82:50962/80 unexpectedly shrunk window 2770172661:2770430920 (repaired)
<7>[116127.452869] Peer 88.153.144.82:50962/80 unexpectedly shrunk window 2770172661:2770430920 (repaired)
<7>[116132.003660] Peer 88.153.144.82:50962/80 unexpectedly shrunk window 2770172661:2770430920 (repaired)
<4>[116133.987013] general protection fault: 0000 [#1] SMP
<4>[116133.987033] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich isci ipmi_watchdog ipmi_devintf ixgbe microcode igb sb_edac edac_core libsas lpc_ich i2c_algo_bit mfd_core ptp pps_core tpm_tis mdio tpm tpm_bios ipmi_si ipmi_msghandler
<4>[116133.987137] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W    3.10.27 #1
<4>[116133.987150] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[116133.987168] task: ffff885e6f7edc00 ti: ffff885e6f7f4000 task.ti: ffff885e6f7f4000
<4>[116133.987182] RIP: 0010:[<ffffffff8164f120>]  [<ffffffff8164f120>] iptable_raw_hook+0x20/0x70
<4>[116133.987202] RSP: 0018:ffff885effca38f0  EFLAGS: 00010246
<4>[116133.987212] RAX: ffca3aa0ffff885e RBX: ffca3aa0ffff885e RCX: ffca3aa0ffff885e
<4>[116133.987226] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff885e3b447100
<4>[116133.987240] RBP: ffff885effca38f0 R08: ffff884b3183fc00 R09: ffff885e3b447100
<4>[116133.987254] R10: ffff88be6df7be80 R11: 0000000000000013 R12: ffff885effca3990
<4>[116133.987285] R13: 0000000000000000 R14: 0000000000000003 R15: ffff885e3b447100
<4>[116133.987328] FS:  0000000000000000(0000) GS:ffff885effca0000(0000) knlGS:0000000000000000
<4>[116133.987372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[116133.987399] CR2: 00007a306991d000 CR3: 0000005e0e44f000 CR4: 00000000000407e0
<4>[116133.987441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[116133.987484] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[116133.987526] Stack:
<4>[116133.987546]  ffff885effca3950 ffffffff815ddf7f ffff885effca3910 ffffffff81cb4c90
<4>[116133.987596]  ffff885effca3930 ffff885e66024f30 0000000000000020 ffff885e3b447100
<4>[116133.987645]  ffffffff81600e90 0000000000000003 ffff885e3b447100 0000000000000000
<4>[116133.987695] Call Trace:
<4>[116133.987716]  <IRQ>
<4>[116133.987721]  [<ffffffff815ddf7f>] nf_iterate+0x8f/0xd0
<4>[116133.987770]  [<ffffffff81600e90>] ? ip_forward_options+0x200/0x200
<4>[116133.987798]  [<ffffffff815de03d>] nf_hook_slow+0x7d/0x150
<4>[116133.987825]  [<ffffffff81600e90>] ? ip_forward_options+0x200/0x200
<4>[116133.987853]  [<ffffffff81602c22>] __ip_local_out+0x72/0x80
<4>[116133.987880]  [<ffffffff81602c46>] ip_local_out+0x16/0x30
<4>[116133.987906]  [<ffffffff81603f39>] ip_send_skb+0x19/0x50
<4>[116133.987932]  [<ffffffff81603fa7>] ip_push_pending_frames+0x37/0x40
<4>[116133.987960]  [<ffffffff8160438f>] ip_send_unicast_reply+0x29f/0x2f0
<4>[116133.987989]  [<ffffffff8161a5b6>] ? tcp_write_xmit+0x2b6/0xa90
<4>[116133.988019]  [<ffffffff8161e2ba>] tcp_v4_send_ack+0x1ba/0x210
<4>[116133.988048]  [<ffffffff816226f5>] tcp_v4_rcv+0x375/0x820
<4>[116133.988075]  [<ffffffff815de03d>] ? nf_hook_slow+0x7d/0x150
<4>[116133.988103]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[116133.988131]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[116133.988159]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[116133.988187]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[116133.988214]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[116133.988242]  [<ffffffff815b0c82>] __netif_receive_skb_core+0x512/0x640
<4>[116133.988272]  [<ffffffff81079b00>] ? resched_task+0x60/0x70
<4>[116133.988299]  [<ffffffff815b0dd1>] __netif_receive_skb+0x21/0x70
<4>[116133.988326]  [<ffffffff815b0ebb>] process_backlog+0x9b/0x170
<4>[116133.988354]  [<ffffffff8107a1a6>] ? ttwu_do_activate.constprop.91+0x66/0x70
<4>[116133.988384]  [<ffffffff815b16b1>] net_rx_action+0x111/0x210
<4>[116133.988412]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[116133.988441]  [<ffffffff816cf03c>] call_softirq+0x1c/0x30
<4>[116133.988470]  [<ffffffff81004215>] do_softirq+0x55/0x90
<4>[116133.988495]  [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[116133.988521]  [<ffffffff816cf6c3>] do_IRQ+0x63/0xe0
<4>[116133.988548]  [<ffffffff816c5baa>] common_interrupt+0x6a/0x6a
<4>[116133.988574]  <EOI>
<4>[116133.988578]  [<ffffffff8100a160>] ? default_idle+0x20/0xe0
<4>[116133.988624]  [<ffffffff8100a90f>] arch_cpu_idle+0xf/0x20
<4>[116133.988652]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
<4>[116133.988681]  [<ffffffff816b288e>] start_secondary+0x1f9/0x200
<4>[116133.988708] Code: eb f2 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 83 ff 03 49 89 f1 48 89 e5 74 27 48 85 d2 48 89 d0 89 fe 48 0f 44 c1 4c 89 cf <48> 8b 80 e8 03 00 00 4c 8b 80 50 03 00 00 e8 6d d8 ff ff 5d c3
<1>[116133.988905] RIP  [<ffffffff8164f120>] iptable_raw_hook+0x20/0x70
<4>[116133.988934]  RSP <ffff885effca38f0>
<4>[116133.989217] ---[ end trace 3ed2370f32f77a54 ]---
<0>[116134.851050] Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-18 17:18       ` Alexei Starovoitov
  2014-01-18 19:13         ` dormando
@ 2014-01-22  1:39         ` dormando
       [not found]           ` <CAMEtUuzq+KhzZfxzAo=fgYr9umtQCPx-4Jp_twtrM-eh1Bv14w@mail.gmail.com>
  1 sibling, 1 reply; 12+ messages in thread
From: dormando @ 2014-01-22  1:39 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Eric Dumazet, netdev, linux-kernel, Alexei Starovoitov

> On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> >> > > Hi,
> >> > >
> >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> >> > >
> >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> > > <4>[196727.311713] Stack:
> >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> >> > > <4>[196727.311885] Call Trace:
> >> > > <4>[196727.311907]  <IRQ>
> >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> >> > > <4>[196727.312722]  <EOI>
> >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> >> > >
> >> > >
> >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> >> > > am trying a revert or two.
> >> > >
> >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> >> > > crashed, so the change exists between .15 and .25.
> >> >
> >> > Please try following fix, thanks for the report !
> >> >
> >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >> > index 25071b48921c..78a50a22298a 100644
> >> > --- a/net/ipv4/route.c
> >> > +++ b/net/ipv4/route.c
> >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> >> > *dst)
> >> >
> >> >     if (!list_empty(&rt->rt_uncached)) {
> >> >             spin_lock_bh(&rt_uncached_lock);
> >> > -           list_del(&rt->rt_uncached);
> >> > +           list_del_init(&rt->rt_uncached);
> >> >             spin_unlock_bh(&rt_uncached_lock);
> >> >     }
> >> >  }
> >> >
> >>
> >> Problem could come from this commit, in linux 3.10.23,
> >> you also could try to revert it
> >>
> >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> >> Author: Alexei Starovoitov <ast@plumgrid.com>
> >> Date:   Tue Nov 19 19:12:34 2013 -0800
> >>
> >>     ipv4: fix race in concurrent ip_route_input_slow()
> >>
> >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> >>
> >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> >>     via rt_cache_route()
> >>     Most of the time they succeed, but on occasion the following two lines:
> >>         orig = *p;
> >>         prev = cmpxchg(p, orig, rt);
> >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> >>     refcnt doesn't go to zero, which can be seen in the logs as:
> >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>     Adding mdelay() between above two lines makes it easily reproducible.
> >>     Fix it similar to nh_pcpu_rth_output case.
> >>
> >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>
> >
> > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> > my best guess. I have a kernel running in production with only this
> > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> >
> > I'm building a kernel now with your route patch, but 62713c4 *not*
> > reverted, which I can throw on a different machine. Does this sound like a
> > good idea?
>
> the traces of your crash don't look similar to dst leak that was fixed by
> commit 62713c4...
>
> To trigger the 'fix' logic of the 62713c4 add the following diff:
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f6c6ab1..8972676 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> struct rtable *rt)
>                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>         }
>         orig = *p;
> -
> +       mdelay(100);
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig)
>
> I've been running with it for a day without issues.
> Note that it will stress both 'input' and 'output' ways of adding dsts to
> rt_uncached list...
> and 'output' was there for ages.
>
> If mdelay() helps to reproduce it in minutes instead of days
> then we're on the right path.
> Could you share details of your workload?
> In our case it was a lot of namespaces with a ton of processes
> talking to each other, forcefully killed and restarted.
> Do you see the same crash in the latest tree?
>
> PS sorry for double posts. netdev email bounced few times for me...
>

So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
my machines have survived. One four days, another two days. They'd barely
make it a day before.

So this patch is introducing a new problem. Weakening the dst reference
too much in a false positive case?

I'm not entirely sure if the mdelay(100) thing will help much. I can try
it but it's kind of obnoxious to reboot these machines (far away, ipmi
only) too often. Unless you folks have any new ideas before I go off and
do that?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
       [not found]           ` <CAMEtUuzq+KhzZfxzAo=fgYr9umtQCPx-4Jp_twtrM-eh1Bv14w@mail.gmail.com>
@ 2014-01-22  4:10             ` dormando
  2014-01-22  6:02               ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: dormando @ 2014-01-22  4:10 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, Eric Dumazet, netdev, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 11946 bytes --]



On Tue, 21 Jan 2014, Alexei Starovoitov wrote:

> On Tue, Jan 21, 2014 at 5:39 PM, dormando <dormando@rydia.net> wrote:
> >
> > > On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> > > >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> > > >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > > >> > > Hi,
> > > >> > >
> > > >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > > >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> > > >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> > > >> > >
> > > >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > > >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode
> ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp
> pps_core mdio
> > > >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > > >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > > >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > > >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > > >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > > >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > > >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > > >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > > >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > > >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > > >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > >> > > <4>[196727.311713] Stack:
> > > >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > > >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > > >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > > >> > > <4>[196727.311885] Call Trace:
> > > >> > > <4>[196727.311907]  <IRQ>
> > > >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > > >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > > >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > > >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > > >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > > >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > > >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > > >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > > >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > > >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > > >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > > >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > > >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > > >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > > >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > > >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > > >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > > >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > > >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > > >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > > >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > > >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > > >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > > >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > > >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > > >> > > <4>[196727.312722]  <EOI>
> > > >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > > >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > > >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > > >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > > >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01
> 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > > >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> > > >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > > >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> > > >> > >
> > > >> > >
> > > >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > > >> > > am trying a revert or two.
> > > >> > >
> > > >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > > >> > > crashed, so the change exists between .15 and .25.
> > > >> >
> > > >> > Please try following fix, thanks for the report !
> > > >> >
> > > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > >> > index 25071b48921c..78a50a22298a 100644
> > > >> > --- a/net/ipv4/route.c
> > > >> > +++ b/net/ipv4/route.c
> > > >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> > > >> > *dst)
> > > >> >
> > > >> >     if (!list_empty(&rt->rt_uncached)) {
> > > >> >             spin_lock_bh(&rt_uncached_lock);
> > > >> > -           list_del(&rt->rt_uncached);
> > > >> > +           list_del_init(&rt->rt_uncached);
> > > >> >             spin_unlock_bh(&rt_uncached_lock);
> > > >> >     }
> > > >> >  }
> > > >> >
> > > >>
> > > >> Problem could come from this commit, in linux 3.10.23,
> > > >> you also could try to revert it
> > > >>
> > > >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> > > >> Author: Alexei Starovoitov <ast@plumgrid.com>
> > > >> Date:   Tue Nov 19 19:12:34 2013 -0800
> > > >>
> > > >>     ipv4: fix race in concurrent ip_route_input_slow()
> > > >>
> > > >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> > > >>
> > > >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> > > >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> > > >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> > > >>     via rt_cache_route()
> > > >>     Most of the time they succeed, but on occasion the following two lines:
> > > >>         orig = *p;
> > > >>         prev = cmpxchg(p, orig, rt);
> > > >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> > > >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> > > >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> > > >>     refcnt doesn't go to zero, which can be seen in the logs as:
> > > >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> > > >>     Adding mdelay() between above two lines makes it easily reproducible.
> > > >>     Fix it similar to nh_pcpu_rth_output case.
> > > >>
> > > >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> > > >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> > > >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > >>
> > > >
> > > > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> > > > my best guess. I have a kernel running in production with only this
> > > > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> > > >
> > > > I'm building a kernel now with your route patch, but 62713c4 *not*
> > > > reverted, which I can throw on a different machine. Does this sound like a
> > > > good idea?
> > >
> > > the traces of your crash don't look similar to dst leak that was fixed by
> > > commit 62713c4...
> > >
> > > To trigger the 'fix' logic of the 62713c4 add the following diff:
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index f6c6ab1..8972676 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> > > struct rtable *rt)
> > >                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
> > >         }
> > >         orig = *p;
> > > -
> > > +       mdelay(100);
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig)
> > >
> > > I've been running with it for a day without issues.
> > > Note that it will stress both 'input' and 'output' ways of adding dsts to
> > > rt_uncached list...
> > > and 'output' was there for ages.
> > >
> > > If mdelay() helps to reproduce it in minutes instead of days
> > > then we're on the right path.
> > > Could you share details of your workload?
> > > In our case it was a lot of namespaces with a ton of processes
> > > talking to each other, forcefully killed and restarted.
> > > Do you see the same crash in the latest tree?
> > >
> > > PS sorry for double posts. netdev email bounced few times for me...
> > >
> >
> > So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
> > my machines have survived. One four days, another two days. They'd barely
> > make it a day before.
> >
> > So this patch is introducing a new problem. Weakening the dst reference
> > too much in a false positive case?
> >
> > I'm not entirely sure if the mdelay(100) thing will help much. I can try
> > it but it's kind of obnoxious to reboot these machines (far away, ipmi
> > only) too often. Unless you folks have any new ideas before I go off and
> > do that?
>
> mdelay() won't fix things, but it will help debugging. Can you try it on just one machine? It should fail within minutes.
>
> Only afterwards we can try alternative fix like:
> @@ -1722,8 +1722,8 @@ local_input:
>         }
>         if (do_cache) {
>                 if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
> -                       rth->dst.flags |= DST_NOCACHE;
> -                       rt_add_uncached_list(rth);
> +                       rt_free(rth);
> +                       goto local_input;
>                 }
>         }
>         skb_dst_set(skb, &rth->dst);
>
>
>

It might take a day or three to find a good time to do it. I don't have
access to reboot the machines myself. Oh how I wish things broke in the
lab more often :/

To be clear, I add your old patch back in + the mdelay. If that fails
within a few minutes, I add this new patch instead of your old one?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-22  4:10             ` dormando
@ 2014-01-22  6:02               ` Alexei Starovoitov
  2014-01-22 22:13                 ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2014-01-22  6:02 UTC (permalink / raw)
  To: dormando; +Cc: Alexei Starovoitov, Eric Dumazet, netdev, linux-kernel

On Tue, Jan 21, 2014 at 8:10 PM, dormando <dormando@rydia.net> wrote:
>
>
> On Tue, 21 Jan 2014, Alexei Starovoitov wrote:
>
>> On Tue, Jan 21, 2014 at 5:39 PM, dormando <dormando@rydia.net> wrote:
>> >
>> > > On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
>> > > >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
>> > > >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
>> > > >> > > Hi,
>> > > >> > >
>> > > >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
>> > > >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
>> > > >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
>> > > >> > >
>> > > >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
>> > > >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode
>> ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp
>> pps_core mdio
>> > > >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
>> > > >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
>> > > >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
>> > > >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>> > > >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
>> > > >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
>> > > >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
>> > > >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
>> > > >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> > > >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
>> > > >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
>> > > >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > > >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
>> > > >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > > >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > > >> > > <4>[196727.311713] Stack:
>> > > >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
>> > > >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
>> > > >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
>> > > >> > > <4>[196727.311885] Call Trace:
>> > > >> > > <4>[196727.311907]  <IRQ>
>> > > >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
>> > > >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
>> > > >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
>> > > >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
>> > > >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>> > > >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
>> > > >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>> > > >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
>> > > >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
>> > > >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
>> > > >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
>> > > >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
>> > > >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
>> > > >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
>> > > >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
>> > > >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
>> > > >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
>> > > >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
>> > > >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
>> > > >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
>> > > >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
>> > > >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
>> > > >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
>> > > >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
>> > > >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
>> > > >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
>> > > >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
>> > > >> > > <4>[196727.312722]  <EOI>
>> > > >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
>> > > >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
>> > > >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
>> > > >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
>> > > >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01
>> 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
>> > > >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>> > > >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
>> > > >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
>> > > >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
>> > > >> > >
>> > > >> > >
>> > > >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
>> > > >> > > am trying a revert or two.
>> > > >> > >
>> > > >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
>> > > >> > > crashed, so the change exists between .15 and .25.
>> > > >> >
>> > > >> > Please try following fix, thanks for the report !
>> > > >> >
>> > > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> > > >> > index 25071b48921c..78a50a22298a 100644
>> > > >> > --- a/net/ipv4/route.c
>> > > >> > +++ b/net/ipv4/route.c
>> > > >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
>> > > >> > *dst)
>> > > >> >
>> > > >> >     if (!list_empty(&rt->rt_uncached)) {
>> > > >> >             spin_lock_bh(&rt_uncached_lock);
>> > > >> > -           list_del(&rt->rt_uncached);
>> > > >> > +           list_del_init(&rt->rt_uncached);
>> > > >> >             spin_unlock_bh(&rt_uncached_lock);
>> > > >> >     }
>> > > >> >  }
>> > > >> >
>> > > >>
>> > > >> Problem could come from this commit, in linux 3.10.23,
>> > > >> you also could try to revert it
>> > > >>
>> > > >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
>> > > >> Author: Alexei Starovoitov <ast@plumgrid.com>
>> > > >> Date:   Tue Nov 19 19:12:34 2013 -0800
>> > > >>
>> > > >>     ipv4: fix race in concurrent ip_route_input_slow()
>> > > >>
>> > > >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
>> > > >>
>> > > >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
>> > > >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
>> > > >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
>> > > >>     via rt_cache_route()
>> > > >>     Most of the time they succeed, but on occasion the following two lines:
>> > > >>         orig = *p;
>> > > >>         prev = cmpxchg(p, orig, rt);
>> > > >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
>> > > >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
>> > > >>     so dst is leaking. dst_destroy() is never called and 'lo' device
>> > > >>     refcnt doesn't go to zero, which can be seen in the logs as:
>> > > >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
>> > > >>     Adding mdelay() between above two lines makes it easily reproducible.
>> > > >>     Fix it similar to nh_pcpu_rth_output case.
>> > > >>
>> > > >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
>> > > >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> > > >>     Signed-off-by: David S. Miller <davem@davemloft.net>
>> > > >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> > > >>
>> > > >
>> > > > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
>> > > > my best guess. I have a kernel running in production with only this
>> > > > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
>> > > >
>> > > > I'm building a kernel now with your route patch, but 62713c4 *not*
>> > > > reverted, which I can throw on a different machine. Does this sound like a
>> > > > good idea?
>> > >
>> > > the traces of your crash don't look similar to dst leak that was fixed by
>> > > commit 62713c4...
>> > >
>> > > To trigger the 'fix' logic of the 62713c4 add the following diff:
>> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> > > index f6c6ab1..8972676 100644
>> > > --- a/net/ipv4/route.c
>> > > +++ b/net/ipv4/route.c
>> > > @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
>> > > struct rtable *rt)
>> > >                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>> > >         }
>> > >         orig = *p;
>> > > -
>> > > +       mdelay(100);
>> > >         prev = cmpxchg(p, orig, rt);
>> > >         if (prev == orig) {
>> > >                 if (orig)
>> > >
>> > > I've been running with it for a day without issues.
>> > > Note that it will stress both 'input' and 'output' ways of adding dsts to
>> > > rt_uncached list...
>> > > and 'output' was there for ages.
>> > >
>> > > If mdelay() helps to reproduce it in minutes instead of days
>> > > then we're on the right path.
>> > > Could you share details of your workload?
>> > > In our case it was a lot of namespaces with a ton of processes
>> > > talking to each other, forcefully killed and restarted.
>> > > Do you see the same crash in the latest tree?
>> > >
>> > > PS sorry for double posts. netdev email bounced few times for me...
>> > >
>> >
>> > So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
>> > my machines have survived. One four days, another two days. They'd barely
>> > make it a day before.
>> >
>> > So this patch is introducing a new problem. Weakening the dst reference
>> > too much in a false positive case?
>> >
>> > I'm not entirely sure if the mdelay(100) thing will help much. I can try
>> > it but it's kind of obnoxious to reboot these machines (far away, ipmi
>> > only) too often. Unless you folks have any new ideas before I go off and
>> > do that?
>>
>> mdelay() won't fix things, but it will help debugging. Can you try it on just one machine? It should fail within minutes.
>>
>> Only afterwards we can try alternative fix like:
>> @@ -1722,8 +1722,8 @@ local_input:
>>         }
>>         if (do_cache) {
>>                 if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
>> -                       rth->dst.flags |= DST_NOCACHE;
>> -                       rt_add_uncached_list(rth);
>> +                       rt_free(rth);
>> +                       goto local_input;
>>                 }
>>         }
>>         skb_dst_set(skb, &rth->dst);
>>
>>
>>
>
> It might take a day or three to find a good time to do it. I don't have
> access to reboot the machines myself. Oh how I wish things broke in the
> lab more often :/
>
> To be clear, I add your old patch back in + the mdelay. If that fails
> within a few minutes, I add this new patch instead of your old one?

yes. vanilla 3.10.23+ and mdelay() will be triggering rt_add_uncached path.
If that fails quickly, I think we better debug it further to really understand
what's going on. Alternative fix may workaround the problem,
since it will be freeing dst immediately if it fails to be cached and retrying
allocation and caching again, but that would mean that double free problem
is simply hidden instead of fixed.
I would add tracing of rth free/alloc.
If mdelay() makes it fail quickly, the trace won't be too long.
config_debug_kmemleak may be helpful (though unlikely in this situation).
since your crash traces look like double free, we can enable config_slub_debug
and call verify_mem_not_deleted() from dst_destroy()

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-22  6:02               ` Alexei Starovoitov
@ 2014-01-22 22:13                 ` Alexei Starovoitov
  2014-05-13  9:27                   ` dormando
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2014-01-22 22:13 UTC (permalink / raw)
  To: dormando; +Cc: Alexei Starovoitov, Eric Dumazet, netdev, linux-kernel

On Tue, Jan 21, 2014 at 10:02 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Jan 21, 2014 at 8:10 PM, dormando <dormando@rydia.net> wrote:
>>
>>
>> On Tue, 21 Jan 2014, Alexei Starovoitov wrote:
>>
>>> On Tue, Jan 21, 2014 at 5:39 PM, dormando <dormando@rydia.net> wrote:
>>> >
>>> > > On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
>>> > > >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
>>> > > >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
>>> > > >> > > Hi,
>>> > > >> > >
>>> > > >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
>>> > > >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
>>> > > >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
>>> > > >> > >
>>> > > >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
>>> > > >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode
>>> ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp
>>> pps_core mdio
>>> > > >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
>>> > > >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
>>> > > >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
>>> > > >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>>> > > >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
>>> > > >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
>>> > > >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
>>> > > >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
>>> > > >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>> > > >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
>>> > > >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
>>> > > >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> > > >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
>>> > > >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> > > >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> > > >> > > <4>[196727.311713] Stack:
>>> > > >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
>>> > > >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
>>> > > >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
>>> > > >> > > <4>[196727.311885] Call Trace:
>>> > > >> > > <4>[196727.311907]  <IRQ>
>>> > > >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
>>> > > >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
>>> > > >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
>>> > > >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
>>> > > >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>>> > > >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
>>> > > >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
>>> > > >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
>>> > > >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
>>> > > >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
>>> > > >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
>>> > > >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
>>> > > >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
>>> > > >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
>>> > > >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
>>> > > >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
>>> > > >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
>>> > > >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
>>> > > >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
>>> > > >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
>>> > > >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
>>> > > >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
>>> > > >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
>>> > > >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
>>> > > >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
>>> > > >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
>>> > > >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
>>> > > >> > > <4>[196727.312722]  <EOI>
>>> > > >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
>>> > > >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
>>> > > >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
>>> > > >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
>>> > > >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01
>>> 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
>>> > > >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
>>> > > >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
>>> > > >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
>>> > > >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
>>> > > >> > >
>>> > > >> > >
>>> > > >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
>>> > > >> > > am trying a revert or two.
>>> > > >> > >
>>> > > >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
>>> > > >> > > crashed, so the change exists between .15 and .25.
>>> > > >> >
>>> > > >> > Please try following fix, thanks for the report !
>>> > > >> >
>>> > > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>>> > > >> > index 25071b48921c..78a50a22298a 100644
>>> > > >> > --- a/net/ipv4/route.c
>>> > > >> > +++ b/net/ipv4/route.c
>>> > > >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
>>> > > >> > *dst)
>>> > > >> >
>>> > > >> >     if (!list_empty(&rt->rt_uncached)) {
>>> > > >> >             spin_lock_bh(&rt_uncached_lock);
>>> > > >> > -           list_del(&rt->rt_uncached);
>>> > > >> > +           list_del_init(&rt->rt_uncached);
>>> > > >> >             spin_unlock_bh(&rt_uncached_lock);
>>> > > >> >     }
>>> > > >> >  }
>>> > > >> >
>>> > > >>
>>> > > >> Problem could come from this commit, in linux 3.10.23,
>>> > > >> you also could try to revert it
>>> > > >>
>>> > > >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
>>> > > >> Author: Alexei Starovoitov <ast@plumgrid.com>
>>> > > >> Date:   Tue Nov 19 19:12:34 2013 -0800
>>> > > >>
>>> > > >>     ipv4: fix race in concurrent ip_route_input_slow()
>>> > > >>
>>> > > >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
>>> > > >>
>>> > > >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
>>> > > >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
>>> > > >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
>>> > > >>     via rt_cache_route()
>>> > > >>     Most of the time they succeed, but on occasion the following two lines:
>>> > > >>         orig = *p;
>>> > > >>         prev = cmpxchg(p, orig, rt);
>>> > > >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
>>> > > >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
>>> > > >>     so dst is leaking. dst_destroy() is never called and 'lo' device
>>> > > >>     refcnt doesn't go to zero, which can be seen in the logs as:
>>> > > >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
>>> > > >>     Adding mdelay() between above two lines makes it easily reproducible.
>>> > > >>     Fix it similar to nh_pcpu_rth_output case.
>>> > > >>
>>> > > >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
>>> > > >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>> > > >>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>> > > >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> > > >>
>>> > > >
>>> > > > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
>>> > > > my best guess. I have a kernel running in production with only this
>>> > > > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
>>> > > >
>>> > > > I'm building a kernel now with your route patch, but 62713c4 *not*
>>> > > > reverted, which I can throw on a different machine. Does this sound like a
>>> > > > good idea?
>>> > >
>>> > > the traces of your crash don't look similar to dst leak that was fixed by
>>> > > commit 62713c4...
>>> > >
>>> > > To trigger the 'fix' logic of the 62713c4 add the following diff:
>>> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>>> > > index f6c6ab1..8972676 100644
>>> > > --- a/net/ipv4/route.c
>>> > > +++ b/net/ipv4/route.c
>>> > > @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
>>> > > struct rtable *rt)
>>> > >                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>>> > >         }
>>> > >         orig = *p;
>>> > > -
>>> > > +       mdelay(100);
>>> > >         prev = cmpxchg(p, orig, rt);
>>> > >         if (prev == orig) {
>>> > >                 if (orig)
>>> > >
>>> > > I've been running with it for a day without issues.
>>> > > Note that it will stress both 'input' and 'output' ways of adding dsts to
>>> > > rt_uncached list...
>>> > > and 'output' was there for ages.
>>> > >
>>> > > If mdelay() helps to reproduce it in minutes instead of days
>>> > > then we're on the right path.
>>> > > Could you share details of your workload?
>>> > > In our case it was a lot of namespaces with a ton of processes
>>> > > talking to each other, forcefully killed and restarted.
>>> > > Do you see the same crash in the latest tree?
>>> > >
>>> > > PS sorry for double posts. netdev email bounced few times for me...
>>> > >
>>> >
>>> > So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
>>> > my machines have survived. One four days, another two days. They'd barely
>>> > make it a day before.
>>> >
>>> > So this patch is introducing a new problem. Weakening the dst reference
>>> > too much in a false positive case?
>>> >
>>> > I'm not entirely sure if the mdelay(100) thing will help much. I can try
>>> > it but it's kind of obnoxious to reboot these machines (far away, ipmi
>>> > only) too often. Unless you folks have any new ideas before I go off and
>>> > do that?
>>>
>>> mdelay() won't fix things, but it will help debugging. Can you try it on just one machine? It should fail within minutes.
>>>
>>> Only afterwards we can try alternative fix like:
>>> @@ -1722,8 +1722,8 @@ local_input:
>>>         }
>>>         if (do_cache) {
>>>                 if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
>>> -                       rth->dst.flags |= DST_NOCACHE;
>>> -                       rt_add_uncached_list(rth);
>>> +                       rt_free(rth);
>>> +                       goto local_input;
>>>                 }
>>>         }
>>>         skb_dst_set(skb, &rth->dst);

the above 'alternative fix' is the wrong fix. I managed to hit good
old dst leak with it:
unregister_netdevice: waiting for lo to become free. Usage count = 2

will continue running with mdelay() and original fix in place for more
days to see
whether I can spot anything.

>>
>> It might take a day or three to find a good time to do it. I don't have
>> access to reboot the machines myself. Oh how I wish things broke in the
>> lab more often :/
>>
>> To be clear, I add your old patch back in + the mdelay. If that fails
>> within a few minutes, I add this new patch instead of your old one?
>
> yes. vanilla 3.10.23+ and mdelay() will be triggering rt_add_uncached path.
> If that fails quickly, I think we better debug it further to really understand
> what's going on. Alternative fix may workaround the problem,
> since it will be freeing dst immediately if it fails to be cached and retrying
> allocation and caching again, but that would mean that double free problem
> is simply hidden instead of fixed.
> I would add tracing of rth free/alloc.
> If mdelay() makes it fail quickly, the trace won't be too long.
> config_debug_kmemleak may be helpful (though unlikely in this situation).
> since your crash traces look like double free, we can enable config_slub_debug
> and call verify_mem_not_deleted() from dst_destroy()

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ipv4_dst_destroy panic regression after 3.10.15
  2014-01-22 22:13                 ` Alexei Starovoitov
@ 2014-05-13  9:27                   ` dormando
  0 siblings, 0 replies; 12+ messages in thread
From: dormando @ 2014-05-13  9:27 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, Eric Dumazet, netdev, linux-kernel

On Wed, 22 Jan 2014, Alexei Starovoitov wrote:

> On Tue, Jan 21, 2014 at 10:02 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Jan 21, 2014 at 8:10 PM, dormando <dormando@rydia.net> wrote:
> >>
> >>
> >> On Tue, 21 Jan 2014, Alexei Starovoitov wrote:
> >>
> >>> On Tue, Jan 21, 2014 at 5:39 PM, dormando <dormando@rydia.net> wrote:
> >>> >
> >>> > > On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> >>> > > >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> >>> > > >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> >>> > > >> > > Hi,
> >>> > > >> > >
> >>> > > >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> >>> > > >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> >>> > > >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> >>> > > >> > >
> >>> > > >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> >>> > > >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode
> >>> ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp
> >>> pps_core mdio
> >>> > > >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> >>> > > >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> >>> > > >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> >>> > > >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >>> > > >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> >>> > > >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> >>> > > >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> >>> > > >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> >>> > > >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >>> > > >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> >>> > > >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> >>> > > >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> > > >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> >>> > > >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> > > >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>> > > >> > > <4>[196727.311713] Stack:
> >>> > > >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> >>> > > >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> >>> > > >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> >>> > > >> > > <4>[196727.311885] Call Trace:
> >>> > > >> > > <4>[196727.311907]  <IRQ>
> >>> > > >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> >>> > > >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> >>> > > >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> >>> > > >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> >>> > > >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >>> > > >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> >>> > > >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >>> > > >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> >>> > > >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> >>> > > >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> >>> > > >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> >>> > > >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> >>> > > >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> >>> > > >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> >>> > > >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> >>> > > >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> >>> > > >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> >>> > > >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> >>> > > >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> >>> > > >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> >>> > > >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> >>> > > >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> >>> > > >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> >>> > > >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> >>> > > >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> >>> > > >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> >>> > > >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> >>> > > >> > > <4>[196727.312722]  <EOI>
> >>> > > >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> >>> > > >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> >>> > > >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> >>> > > >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> >>> > > >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01
> >>> 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> >>> > > >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >>> > > >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> >>> > > >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> >>> > > >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> >>> > > >> > >
> >>> > > >> > >
> >>> > > >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> >>> > > >> > > am trying a revert or two.
> >>> > > >> > >
> >>> > > >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> >>> > > >> > > crashed, so the change exists between .15 and .25.
> >>> > > >> >
> >>> > > >> > Please try following fix, thanks for the report !
> >>> > > >> >
> >>> > > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >>> > > >> > index 25071b48921c..78a50a22298a 100644
> >>> > > >> > --- a/net/ipv4/route.c
> >>> > > >> > +++ b/net/ipv4/route.c
> >>> > > >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> >>> > > >> > *dst)
> >>> > > >> >
> >>> > > >> >     if (!list_empty(&rt->rt_uncached)) {
> >>> > > >> >             spin_lock_bh(&rt_uncached_lock);
> >>> > > >> > -           list_del(&rt->rt_uncached);
> >>> > > >> > +           list_del_init(&rt->rt_uncached);
> >>> > > >> >             spin_unlock_bh(&rt_uncached_lock);
> >>> > > >> >     }
> >>> > > >> >  }
> >>> > > >> >
> >>> > > >>
> >>> > > >> Problem could come from this commit, in linux 3.10.23,
> >>> > > >> you also could try to revert it
> >>> > > >>
> >>> > > >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> >>> > > >> Author: Alexei Starovoitov <ast@plumgrid.com>
> >>> > > >> Date:   Tue Nov 19 19:12:34 2013 -0800
> >>> > > >>
> >>> > > >>     ipv4: fix race in concurrent ip_route_input_slow()
> >>> > > >>
> >>> > > >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> >>> > > >>
> >>> > > >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> >>> > > >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> >>> > > >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> >>> > > >>     via rt_cache_route()
> >>> > > >>     Most of the time they succeed, but on occasion the following two lines:
> >>> > > >>         orig = *p;
> >>> > > >>         prev = cmpxchg(p, orig, rt);
> >>> > > >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> >>> > > >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> >>> > > >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> >>> > > >>     refcnt doesn't go to zero, which can be seen in the logs as:
> >>> > > >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>> > > >>     Adding mdelay() between above two lines makes it easily reproducible.
> >>> > > >>     Fix it similar to nh_pcpu_rth_output case.
> >>> > > >>
> >>> > > >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> >>> > > >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> >>> > > >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> >>> > > >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>> > > >>
> >>> > > >
> >>> > > > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> >>> > > > my best guess. I have a kernel running in production with only this
> >>> > > > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> >>> > > >
> >>> > > > I'm building a kernel now with your route patch, but 62713c4 *not*
> >>> > > > reverted, which I can throw on a different machine. Does this sound like a
> >>> > > > good idea?
> >>> > >
> >>> > > the traces of your crash don't look similar to dst leak that was fixed by
> >>> > > commit 62713c4...
> >>> > >
> >>> > > To trigger the 'fix' logic of the 62713c4 add the following diff:
> >>> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >>> > > index f6c6ab1..8972676 100644
> >>> > > --- a/net/ipv4/route.c
> >>> > > +++ b/net/ipv4/route.c
> >>> > > @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> >>> > > struct rtable *rt)
> >>> > >                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
> >>> > >         }
> >>> > >         orig = *p;
> >>> > > -
> >>> > > +       mdelay(100);
> >>> > >         prev = cmpxchg(p, orig, rt);
> >>> > >         if (prev == orig) {
> >>> > >                 if (orig)
> >>> > >
> >>> > > I've been running with it for a day without issues.
> >>> > > Note that it will stress both 'input' and 'output' ways of adding dsts to
> >>> > > rt_uncached list...
> >>> > > and 'output' was there for ages.
> >>> > >
> >>> > > If mdelay() helps to reproduce it in minutes instead of days
> >>> > > then we're on the right path.
> >>> > > Could you share details of your workload?
> >>> > > In our case it was a lot of namespaces with a ton of processes
> >>> > > talking to each other, forcefully killed and restarted.
> >>> > > Do you see the same crash in the latest tree?
> >>> > >
> >>> > > PS sorry for double posts. netdev email bounced few times for me...
> >>> > >
> >>> >
> >>> > So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
> >>> > my machines have survived. One four days, another two days. They'd barely
> >>> > make it a day before.
> >>> >
> >>> > So this patch is introducing a new problem. Weakening the dst reference
> >>> > too much in a false positive case?
> >>> >
> >>> > I'm not entirely sure if the mdelay(100) thing will help much. I can try
> >>> > it but it's kind of obnoxious to reboot these machines (far away, ipmi
> >>> > only) too often. Unless you folks have any new ideas before I go off and
> >>> > do that?
> >>>
> >>> mdelay() won't fix things, but it will help debugging. Can you try it on just one machine? It should fail within minutes.
> >>>
> >>> Only afterwards we can try alternative fix like:
> >>> @@ -1722,8 +1722,8 @@ local_input:
> >>>         }
> >>>         if (do_cache) {
> >>>                 if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
> >>> -                       rth->dst.flags |= DST_NOCACHE;
> >>> -                       rt_add_uncached_list(rth);
> >>> +                       rt_free(rth);
> >>> +                       goto local_input;
> >>>                 }
> >>>         }
> >>>         skb_dst_set(skb, &rth->dst);
>
> the above 'alternative fix' is the wrong fix. I managed to hit good
> old dst leak with it:
> unregister_netdevice: waiting for lo to become free. Usage count = 2
>
> will continue running with mdelay() and original fix in place for more
> days to see
> whether I can spot anything.
>
> >>
> >> It might take a day or three to find a good time to do it. I don't have
> >> access to reboot the machines myself. Oh how I wish things broke in the
> >> lab more often :/
> >>
> >> To be clear, I add your old patch back in + the mdelay. If that fails
> >> within a few minutes, I add this new patch instead of your old one?
> >
> > yes. vanilla 3.10.23+ and mdelay() will be triggering rt_add_uncached path.
> > If that fails quickly, I think we better debug it further to really understand
> > what's going on. Alternative fix may workaround the problem,
> > since it will be freeing dst immediately if it fails to be cached and retrying
> > allocation and caching again, but that would mean that double free problem
> > is simply hidden instead of fixed.
> > I would add tracing of rth free/alloc.
> > If mdelay() makes it fail quickly, the trace won't be too long.
> > config_debug_kmemleak may be helpful (though unlikely in this situation).
> > since your crash traces look like double free, we can enable config_slub_debug
> > and call verify_mem_not_deleted() from dst_destroy()
>

I really wanted to fix this, but I'm too dumb. Just tried on 3.10.39 and
I've tried it on 3.12 kernels and crashed as well. Not tried .13 or .14.

We now have a small utility which crashes in very similar code but a
slightly different path. I can hang a machine within a few seconds (or a
minute sometimes).

This is only affecting machines with multiple active network interfaces.
On a single interface (with bgp routes et all) the util doesn't crash it.
I'm also not confident that this isn't three different bugs in the dst
handling subsystem.

Anyway, given:

1) Machine with active full BGP routing table (frequent route obsoletion?)
   - not sure if this is necessary for the crash, as the IP's used with
the test utility are 172.16.* addresses.
2) >1 interface
3) A "connected" UDP socket (only one socket is necessary)
4) Multiple threads using the same socket(s) to send broadcast packets
   - one thread will never crash.
5) death:

ipv4_dst_destroy+0x4f/0x80

<4>[  550.583536] Call Trace:
<4>[  550.583593]  [<ffffffff815b7f62>] dst_destroy+0x32/0xe0
<4>[  550.583652]  [<ffffffff815b86e6>] dst_release+0x56/0x80
<4>[  550.583713]  [<ffffffff8159d789>] sk_dst_check+0x99/0xa0
<4>[  550.583776]  [<ffffffff81629906>] udp_sendmsg+0x236/0x920
...etc

The registers are trying to load LIST_POISON1 and then deref it, so that
certainly seems like a double-free.

This code is pretty beyond me, but I see that:
Under sk_dst_check() sk_dst_get(sk) does a careful RCU deref + atomic
refcount increment.

Then immediately does:
    if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
        sk_dst_reset(sk);
        dst_release(dst);

... could get double checked/freed? sk_dst_reset() internally can call
dst_release(), but I have no idea if the dst inside the sk is ever the
same as the dst in sk_dst_check()... either that or the bug is more vile
and the sk is being reused when it shouldn't.

but nothing makes sense, since in dst_release():
        newrefcnt = atomic_dec_return(&dst->__refcnt);
        WARN_ON(newrefcnt < 0);
        if (unlikely(dst->flags & DST_NOCACHE) && !newrefcnt) {

... but there's never a WARN_ON() print near the oops (using pstore).
Either that doesn't get printed because of the immediate death or it's
somehow passing straight through this function and into dst_destroy
despite the dst->next/prev pointers being poisoned already

static void ipv4_dst_destroy(struct dst_entry *dst)
{
    struct rtable *rt = (struct rtable *) dst;

    if (!list_empty(&rt->rt_uncached)) {
        spin_lock_bh(&rt_uncached_lock);
        list_del(&rt->rt_uncached);
        spin_unlock_bh(&rt_uncached_lock);
    }
}

Dying within the list_del() part. somewhere. trying to deref 8 bytes into
the dst struct, using a poison address.

Given my other reports with kmem_cache_alloc() being corrupt, and from
dst_destroy():
    if (dst->ops->destroy)
        dst->ops->destroy(dst);
    if (dst->dev)
        dev_put(dst->dev);
    kmem_cache_free(dst->ops->kmem_cachep, dst);

... it's possible that this corruption isn't consistent, with some paths
going all the way into a double kmem_cache_free.

I've tried switching to SLAB, no difference.

also with SLUB_DEBUG_ON, it died with no extra information printed.

Apologies for the slew of ignorance as to guessing how the code works
above. I wanted to point out some weirdness. The missing WARN_ON and the
lack of locks given the concurrency requirement to cause the failure.

I can provide the utility privately if requested, but I've probably
described how to write it already...

I guess it's also relatively easy to add code inspection

Thanks,
-Dormando

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-05-13  9:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-18  1:25 ipv4_dst_destroy panic regression after 3.10.15 dormando
2014-01-18  6:49 ` Eric Dumazet
2014-01-18  7:09   ` Eric Dumazet
2014-01-18  7:16     ` dormando
2014-01-18 17:18       ` Alexei Starovoitov
2014-01-18 19:13         ` dormando
2014-01-22  1:39         ` dormando
     [not found]           ` <CAMEtUuzq+KhzZfxzAo=fgYr9umtQCPx-4Jp_twtrM-eh1Bv14w@mail.gmail.com>
2014-01-22  4:10             ` dormando
2014-01-22  6:02               ` Alexei Starovoitov
2014-01-22 22:13                 ` Alexei Starovoitov
2014-05-13  9:27                   ` dormando
2014-01-20  2:56   ` dormando

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).