linux-kernel.vger.kernel.org archive mirror
* [PATCH] vrf: Fix possible NULL pointer oops when delete nic
@ 2019-11-15  6:22 wangxiaogang (F)
  2019-11-15 13:14 ` David Ahern
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: wangxiaogang (F) @ 2019-11-15  6:22 UTC (permalink / raw)
  To: dsahern, shrijeet, davem; +Cc: netdev, linux-kernel, hujunwei4, xuhanbing

From: XiaoGang Wang <wangxiaogang3@huawei.com>

We recently hit a crash caused by an access to an illegal address (0xc0).
It occasionally appears when deleting a physical NIC that is attached to a vrf.

[166603.826737]hinic 0000:43:00.4 eth-s3: Failed to cycle device eth-s3;
route tables might be wrong!
.....
[166603.828018]WARNING: CPU: 135 PID: 15382 at net/core/dev.c:6875
__netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
......
[166603.828169]pc : __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
[166603.828171]lr : __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
[166603.828172]sp : ffff000031efb810
[166603.828173]x29: ffff000031efb810 x28: 0000000000002710
[166603.828175]x27: 0000000000000001 x26: ffffa021f4095d88
[166603.828177]x25: ffff000008d30de8 x24: ffff0000092d75d6
[166603.828179]x23: 0000000000000006 x22: ffffa021e1edc480
[166603.828181]x21: 0000000000000000 x20: ffffa021e1edc530
[166603.828183]x19: ffffa021e1edc518 x18: ffffffffffffffff
[166603.828185]x17: 0000000000000000 x16: 0000000000000000
[166603.828186]x15: ffff0000091d9708 x14: 776620746567206f
[166603.828188]x13: 742064656c696146 x12: ffff800040801004
[166603.828190]x11: ffff80004080100c x10: ffff0000091dbae0
[166603.828192]x9 : 0000000000000001 x8 : 0000000006a60f9c
[166603.828194]x7 : ffff0000093b6fc0 x6 : 0000000000000001
[166603.828196]x5 : 0000000000000001 x4 : ffffa020560383c0
[166603.828198]x3 : ffffa020560383c0 x2 : 371cb5224b539100
[166603.828200]x1 : 0000000000000000 x0 : 0000000000000036
[166603.828202]Call trace:
[166603.828204] __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
[166603.828205] __netdev_adjacent_dev_unlink_neighbour+0x2c/0x48
[166603.828207] netdev_upper_dev_unlink+0x7c/0xe8
[166603.828215] vrf_device_event+0x58/0x80 [vrf]
[166603.828221] notifier_call_chain+0x5c/0xa0
[166603.828222] raw_notifier_call_chain+0x3c/0x50
[166603.828224] call_netdevice_notifiers_info+0x3c/0x80
[166603.828229] rollback_registered_many+0x35c/0x568
[166603.828233] rollback_registered+0x68/0xb0
[166603.828234] unregister_netdevice_queue+0xc0/0x110
[166603.828239] unregister_netdev+0x28/0x38
[166603.828425] nic_remove+0x58/0xc0 [hinic]
[166603.828442] detach_uld+0xd8/0x1a8 [hinic]
[166603.828458] hinic_ulds_deinit+0x54/0x68 [hinic]
[166603.828473] hinic_remove+0x218/0x240 [hinic]
[166603.828481] pci_device_remove+0x48/0xd8
[166603.828490] device_release_driver_internal+0x1b4/0x250
[166603.828492] device_release_driver+0x28/0x38
[166603.828499] pci_stop_bus_device+0x84/0xb8
[166603.828500] pci_stop_bus_device+0x40/0xb8
[166603.828502] pci_stop_bus_device+0x40/0xb8
[166603.828503] pci_stop_and_remove_bus_device+0x20/0x38
[166603.828557] PCIEMGT_KNL_DelPciDev+0xc0/0x198 [pciemgtagent]
[166603.828564] PCIEMGT_KNL_DelDev+0xac/0x1d8 [pciemgtagent]
[166603.828573] PCIEMGT_DelKnlDev+0x50/0x180 [pciemgtagent]
[166603.828579] PCIEMGT_KAGENT_DevEventHandle+0x94/0x168 [pciemgtagent]
[166603.828585] PCIEMGT_KAGENT_EventHandleThread+0xb8/0x1a0 [pciemgtagent]
[166603.828594] kthread+0x134/0x138
[166603.828599] ret_from_fork+0x10/0x18
[166603.828601]---[ end trace 5052903cb62d99f0 ]---
[166603.828612]Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c0
[166603.828613]Mem abort info:
[166603.828614]  ESR = 0x96000006
[166603.828616]  Exception class = DABT (current EL), IL = 32 bits
[166603.828617]  SET = 0, FnV = 0
[166603.828618]  EA = 0, S1PTW = 0
[166603.828618]Data abort info:
[166603.828619]  ISV = 0, ISS = 0x00000006
[166603.828620]  CM = 0, WnR = 0
[166603.828622]user pgtable: 4k pages, 48-bit VAs, pgdp = 000000003c6ab870
[166603.828623][00000000000000c0] pgd=00002022651d1003, pud=000020226bd6a003, pmd=0000000000000000
[166603.828628]Internal error: Oops: 96000006 [#1] SMP
[166603.828630]Process PCIE40:c.0 (pid: 15382, stack limit = 0x00000000d24f8167)
[166603.828632]CPU: 135 PID: 15382 Comm: PCIE40:c.0 Kdump: loaded Tainted: PF       WC OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
[166603.828633]Hardware name: Huawei Technologies Co., Ltd. PANGEA/STL6SPCB, BIOS TA BIOS Pangea3P CS - 11.01.60T31 05/26/2019
[166603.828634]pstate: 40c00009 (nZcv daif +PAN +UAO)
[166603.828636]pc : __netdev_adjacent_dev_remove.constprop.40+0x28/0x1e8
[166603.828638]lr : __netdev_adjacent_dev_unlink_neighbour+0x3c/0x48
[166603.828639]sp : ffff000031efb810
[166603.828639]x29: ffff000031efb810 x28: 0000000000002710
[166603.828641]x27: 0000000000000001 x26: ffffa021f4095d88
[166603.828643]x25: ffff000008d30de8 x24: ffff0000092d75d6
[166603.828645]x23: 0000000000000006 x22: 0000000000000000
[166603.828647]x21: ffffa021e1edc480 x20: 00000000000000c0
[166603.828649]x19: 0000000000000000 x18: ffffffffffffffff
[166603.828651]x17: 0000000000000000 x16: 0000000000000000
[166603.828653]x15: ffff0000091d9708 x14: 776620746567206f
[166603.828654]x13: 742064656c696146 x12: ffff800040801004
[166603.828656]x11: ffff80004080100c x10: ffff0000091dbae0
[166603.828658]x9 : 0000000000000001 x8 : 0000000006a60f9c
[166603.828660]x7 : ffff0000093b6fc0 x6 : 0000000000000001
[166603.828662]x5 : 0000000000000001 x4 : ffffa020560383c0
[166603.828664]x3 : ffffa020560383c0 x2 : 00000000000000c0
[166603.828666]x1 : ffffa021e1edc480 x0 : ffff000008879f7c
[166603.828668]Call trace:
[166603.828669] __netdev_adjacent_dev_remove.constprop.40+0x28/0x1e8
[166603.828670] __netdev_adjacent_dev_unlink_neighbour+0x3c/0x48
[166603.828672] netdev_upper_dev_unlink+0x7c/0xe8
[166603.828674] vrf_device_event+0x58/0x80 [vrf]
[166603.828675] notifier_call_chain+0x5c/0xa0
[166603.828676] raw_notifier_call_chain+0x3c/0x50
[166603.828678] call_netdevice_notifiers_info+0x3c/0x80
[166603.828679] rollback_registered_many+0x35c/0x568
[166603.828681] rollback_registered+0x68/0xb0
[166603.828682] unregister_netdevice_queue+0xc0/0x110
[166603.828684] unregister_netdev+0x28/0x38
[166603.828699] nic_remove+0x58/0xc0 [hinic]
[166603.828714] detach_uld+0xd8/0x1a8 [hinic]
[166603.828729] hinic_ulds_deinit+0x54/0x68 [hinic]
[166603.828743] hinic_remove+0x218/0x240 [hinic]
[166603.828745] pci_device_remove+0x48/0xd8
[166603.828747] device_release_driver_internal+0x1b4/0x250
[166603.828748] device_release_driver+0x28/0x38
[166603.828750] pci_stop_bus_device+0x84/0xb8
[166603.828751] pci_stop_bus_device+0x40/0xb8
[166603.828752] pci_stop_bus_device+0x40/0xb8
[166603.828753] pci_stop_and_remove_bus_device+0x20/0x38
[166603.828760] PCIEMGT_KNL_DelPciDev+0xc0/0x198 [pciemgtagent]
[166603.828765] PCIEMGT_KNL_DelDev+0xac/0x1d8 [pciemgtagent]
[166603.828771] PCIEMGT_DelKnlDev+0x50/0x180 [pciemgtagent]
[166603.828776] PCIEMGT_KAGENT_DevEventHandle+0x94/0x168 [pciemgtagent]
[166603.828782] PCIEMGT_KAGENT_EventHandleThread+0xb8/0x1a0 [pciemgtagent]
[166603.828784] kthread+0x134/0x138
[166603.828785] ret_from_fork+0x10/0x18
[166603.828788]Code: aa0203f4 aa1e03e0 d503201f d503201f (f9400280)
[166603.828789]kernel fault(0x1) notification starting on CPU 135

When the "set nomaster" path (vrf_del_slave()) and the NIC removal path
(vrf_device_event()) execute concurrently, the kernel will occasionally oops.

thread1                     thread2

do_vrf_del_slave
netdev_upper_dev_unlink()

                            vrf_device_event
                            netif_is_l3_slave(dev)
                            //IFF_L3MDEV_SLAVE is not cleared yet,
                            //so the function returns 1
                            netdev_master_upper_dev_get()
                            //returns NULL, so vrf_dev is NULL
                            ....
                            __netdev_adjacent_dev_remove()
                            //adj pointer is NULL, causing the WARN_ON
                            __netdev_adjacent_dev_remove()
                            //down_list points to 0xc0, causing the oops

port_dev->priv_flags &= ~IFF_L3MDEV_SLAVE;
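
The ordering in the diagram above can be modelled in plain user-space C.
This is only a sketch under stated assumptions: struct model_dev,
model_event_would_delete() and model_run_interleaving() are hypothetical
names invented for illustration, not kernel code, and the sketch simply
assumes that the handler can run between the unlink and the clearing of
the slave flag, as the diagram claims.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a net device; not the kernel definition. */
struct model_dev {
	bool l3_slave;              /* models IFF_L3MDEV_SLAVE in priv_flags */
	struct model_dev *master;   /* models the master upper device link */
};

/*
 * Models the decision made in the event handler: returns true if the
 * handler would go on to delete the slave using the master pointer.
 */
static bool model_event_would_delete(const struct model_dev *dev,
				     bool with_null_check)
{
	const struct model_dev *vrf_dev;

	if (!dev->l3_slave)
		return false;       /* not a slave: nothing to do */

	vrf_dev = dev->master;
	if (with_null_check && !vrf_dev)
		return false;       /* the check added by the patch */

	/* Without the check, a NULL vrf_dev would be used from here on. */
	return true;
}

struct model_result {
	bool unpatched_in_window;   /* handler proceeds with NULL master */
	bool patched_in_window;     /* handler proceeds despite the check */
	bool after_flag_cleared;    /* handler proceeds after flag is gone */
};

/* Replays the interleaving from the diagram in a single thread. */
static struct model_result model_run_interleaving(void)
{
	struct model_dev vrf = { .l3_slave = false, .master = NULL };
	struct model_dev nic = { .l3_slave = true, .master = &vrf };
	struct model_result res;

	/* thread1, step 1: unlink the upper device; flag is still set. */
	nic.master = NULL;

	/* thread2 runs in the window before thread1 clears the flag. */
	res.unpatched_in_window = model_event_would_delete(&nic, false);
	res.patched_in_window = model_event_would_delete(&nic, true);

	/* thread1, step 2: clear the slave flag. */
	nic.l3_slave = false;
	res.after_flag_cleared = model_event_would_delete(&nic, false);

	return res;
}
```

In this model only the unpatched handler, run inside the window, continues
with a NULL master. Whether that window can really be reached in the kernel
is a separate question that the model does not answer.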

Why does the oops not already happen when the parameter
"&upper_dev->adj_list.lower" is evaluated for
__netdev_adjacent_dev_unlink_lists()?
Disassembling __netdev_adjacent_dev_unlink_neighbour shows:
.....
 <__netdev_adjacent_dev_unlink_neighbour+44>: add     x2, x19, #0xc0
 <__netdev_adjacent_dev_unlink_neighbour+48>: mov     x1, x20
 <__netdev_adjacent_dev_unlink_neighbour+52>: mov     x0, x19
....
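The compiler reduces &upper_dev->adj_list.lower to the upper_dev pointer
plus offset 0xc0. With upper_dev == NULL nothing is dereferenced at this
point; the resulting address 0xc0 is only accessed later, in
__netdev_adjacent_dev_remove().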
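
The address arithmetic can be sketched in user-space C. The layout below
is an assumption made for illustration: struct model_net_device is a
hypothetical structure padded so that adj_list.lower sits at offset 0xc0,
matching the "add x2, x19, #0xc0" instruction above; it is not the real
struct net_device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-pointer list head, shaped like a kernel list_head. */
struct list_head_model {
	struct list_head_model *next, *prev;
};

/*
 * Hypothetical device layout. The padding is sized so that
 * adj_list.lower lands at offset 0xc0 regardless of pointer size.
 */
struct model_net_device {
	unsigned char pad[0xc0 - sizeof(struct list_head_model)];
	struct {
		struct list_head_model upper;
		struct list_head_model lower;   /* offset 0xc0 */
	} adj_list;
};

/*
 * Mirrors what the compiled code does: taking the address of a member
 * is pointer plus constant offset, with no memory access involved.
 * Integer arithmetic is used so the sketch never forms a pointer
 * from NULL.
 */
static uintptr_t model_lower_list_addr(uintptr_t upper_dev_addr)
{
	return upper_dev_addr +
	       offsetof(struct model_net_device, adj_list.lower);
}
```

Calling model_lower_list_addr(0) yields 0xc0, the same value as the
faulting virtual address in the oops. The fault itself only occurs once
that address is read.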

This patch adds a NULL check on vrf_dev to resolve the above problem.

Signed-off-by: XiaoGang Wang <wangxiaogang3@huawei.com>
Reviewed-by: JunWei Hu <hujunwei4@huawei.com>
---
 drivers/net/vrf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index b8228f5..86c4b8c 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
 			goto out;

 		vrf_dev = netdev_master_upper_dev_get(dev);
+		if (!vrf_dev)
+			goto out;
+
 		vrf_del_slave(vrf_dev, dev);
 	}
 out:
-- 
1.7.12.4



* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-15  6:22 [PATCH] vrf: Fix possible NULL pointer oops when delete nic wangxiaogang (F)
@ 2019-11-15 13:14 ` David Ahern
  2019-11-18  3:15   ` wangxiaogang (F)
  2019-11-15 16:59 ` David Ahern
  2019-11-16 20:53 ` David Miller
  2 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2019-11-15 13:14 UTC (permalink / raw)
  To: wangxiaogang (F), dsahern, shrijeet, davem
  Cc: netdev, linux-kernel, hujunwei4, xuhanbing

On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
> From: XiaoGang Wang <wangxiaogang3@huawei.com>
> 
> Recently we get a crash when access illegal address (0xc0),
> which will occasionally appear when deleting a physical NIC with vrf.
> 

How long have you been running this test?

I am wondering if this is fallout from the recent adjacency changes in
commits 5343da4c1742 through f3b0a18bb6cb.





* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-15  6:22 [PATCH] vrf: Fix possible NULL pointer oops when delete nic wangxiaogang (F)
  2019-11-15 13:14 ` David Ahern
@ 2019-11-15 16:59 ` David Ahern
  2019-11-18  3:16   ` wangxiaogang (F)
  2019-11-16 20:53 ` David Miller
  2 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2019-11-15 16:59 UTC (permalink / raw)
  To: wangxiaogang (F), dsahern, shrijeet, davem
  Cc: netdev, linux-kernel, hujunwei4, xuhanbing

On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> index b8228f5..86c4b8c 100644
> --- a/drivers/net/vrf.c
> +++ b/drivers/net/vrf.c
> @@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
>  			goto out;
> 
>  		vrf_dev = netdev_master_upper_dev_get(dev);
> +		if (!vrf_dev)
> +			goto out;
> +
>  		vrf_del_slave(vrf_dev, dev);
>  	}
>  out:

BTW, I believe this is the wrong fix. A device can not be a VRF slave
AND not have an upper device. Something is fundamentally wrong.


* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-15  6:22 [PATCH] vrf: Fix possible NULL pointer oops when delete nic wangxiaogang (F)
  2019-11-15 13:14 ` David Ahern
  2019-11-15 16:59 ` David Ahern
@ 2019-11-16 20:53 ` David Miller
  2019-11-17  6:17   ` Taehee Yoo
  2 siblings, 1 reply; 9+ messages in thread
From: David Miller @ 2019-11-16 20:53 UTC (permalink / raw)
  To: wangxiaogang3
  Cc: dsahern, shrijeet, netdev, linux-kernel, hujunwei4, xuhanbing, ap420073

From: "wangxiaogang (F)" <wangxiaogang3@huawei.com>
Date: Fri, 15 Nov 2019 14:22:56 +0800

> From: XiaoGang Wang <wangxiaogang3@huawei.com>
> 
> Recently we get a crash when access illegal address (0xc0),
> which will occasionally appear when deleting a physical NIC with vrf.
> 
> [166603.826737]hinic 0000:43:00.4 eth-s3: Failed to cycle device eth-s3;
> route tables might be wrong!
> .....
> [166603.828018]WARNING: CPU: 135 PID: 15382at net/core/dev.c:6875
> __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
> ......

Taehee-ssi, please take a look at this.

It is believed that this may be caused by the adjacency fixes you made
recently.

Thank you.


* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-16 20:53 ` David Miller
@ 2019-11-17  6:17   ` Taehee Yoo
  0 siblings, 0 replies; 9+ messages in thread
From: Taehee Yoo @ 2019-11-17  6:17 UTC (permalink / raw)
  To: David Miller
  Cc: wangxiaogang3, dsahern, shrijeet, Netdev, linux-kernel,
	hujunwei4, xuhanbing

On Sun, 17 Nov 2019 at 05:53, David Miller <davem@davemloft.net> wrote:
>

Hi David,
Thank you for Ccing!

> From: "wangxiaogang (F)" <wangxiaogang3@huawei.com>
> Date: Fri, 15 Nov 2019 14:22:56 +0800
>
> > From: XiaoGang Wang <wangxiaogang3@huawei.com>
> >
> > Recently we get a crash when access illegal address (0xc0),
> > which will occasionally appear when deleting a physical NIC with vrf.
> >
> > [166603.826737]hinic 0000:43:00.4 eth-s3: Failed to cycle device eth-s3;
> > route tables might be wrong!
> > .....
> > [166603.828018]WARNING: CPU: 135 PID: 15382at net/core/dev.c:6875
> > __netdev_adjacent_dev_remove.constprop.40+0x1e0/0x1e8
> > ......
>
> Taehee-ssi, please take a look at this.
>
> It is believed that this may be caused by the adjacency fixes you made
> recently.
>

I will take a look at this.
Thank you!

> Thank you.


* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-15 13:14 ` David Ahern
@ 2019-11-18  3:15   ` wangxiaogang (F)
  2019-11-18  3:22     ` David Ahern
  0 siblings, 1 reply; 9+ messages in thread
From: wangxiaogang (F) @ 2019-11-18  3:15 UTC (permalink / raw)
  To: David Ahern, dsahern, shrijeet, davem
  Cc: netdev, linux-kernel, hujunwei4, xuhanbing



On 2019/11/15 21:14, David Ahern wrote:
> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>> From: XiaoGang Wang <wangxiaogang3@huawei.com>
>>
>> Recently we get a crash when access illegal address (0xc0),
>> which will occasionally appear when deleting a physical NIC with vrf.
>>
> 
> How long have you been running this test?
> 
> I am wondering if this is fallout from the recent adjacency changes in
> commits 5343da4c1742 through f3b0a18bb6cb.
> 
> 
> 
> 
> 
Thank you so much for the reply. Our kernel version is Linux 4.19.
This problem happened once in our production environment.



* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-15 16:59 ` David Ahern
@ 2019-11-18  3:16   ` wangxiaogang (F)
  2019-11-18  3:21     ` David Ahern
  0 siblings, 1 reply; 9+ messages in thread
From: wangxiaogang (F) @ 2019-11-18  3:16 UTC (permalink / raw)
  To: David Ahern, dsahern, shrijeet, davem
  Cc: netdev, linux-kernel, hujunwei4, xuhanbing



On 2019/11/16 0:59, David Ahern wrote:
> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
>> index b8228f5..86c4b8c 100644
>> --- a/drivers/net/vrf.c
>> +++ b/drivers/net/vrf.c
>> @@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
>>  			goto out;
>>
>>  		vrf_dev = netdev_master_upper_dev_get(dev);
>> +		if (!vrf_dev)
>> +			goto out;
>> +
>>  		vrf_del_slave(vrf_dev, dev);
>>  	}
>>  out:
> 
> BTW, I believe this is the wrong fix. A device can not be a VRF slave
> AND not have an upper device. Something is fundamentally wrong.
> 
> 

This problem occurred when our testers deleted the NIC and the vrf in parallel.
I will try to reproduce it later.



* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-18  3:16   ` wangxiaogang (F)
@ 2019-11-18  3:21     ` David Ahern
  0 siblings, 0 replies; 9+ messages in thread
From: David Ahern @ 2019-11-18  3:21 UTC (permalink / raw)
  To: wangxiaogang (F), dsahern, shrijeet, davem
  Cc: netdev, linux-kernel, hujunwei4, xuhanbing

On 11/17/19 8:16 PM, wangxiaogang (F) wrote:
> 
> 
> On 2019/11/16 0:59, David Ahern wrote:
>> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>>> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
>>> index b8228f5..86c4b8c 100644
>>> --- a/drivers/net/vrf.c
>>> +++ b/drivers/net/vrf.c
>>> @@ -1427,6 +1427,9 @@ static int vrf_device_event(struct notifier_block *unused,
>>>  			goto out;
>>>
>>>  		vrf_dev = netdev_master_upper_dev_get(dev);
>>> +		if (!vrf_dev)
>>> +			goto out;
>>> +
>>>  		vrf_del_slave(vrf_dev, dev);
>>>  	}
>>>  out:
>>
>> BTW, I believe this is the wrong fix. A device can not be a VRF slave
>> AND not have an upper device. Something is fundamentally wrong.
>>
>>
> 
> this problem occurs when our testers deleted the NIC and vrf in parallel.
> I will try to recurring this problem later.
> 

The deletes are serial in the kernel due to the rtnl, but dev changes
are under rcu...


* Re: [PATCH] vrf: Fix possible NULL pointer oops when delete nic
  2019-11-18  3:15   ` wangxiaogang (F)
@ 2019-11-18  3:22     ` David Ahern
  0 siblings, 0 replies; 9+ messages in thread
From: David Ahern @ 2019-11-18  3:22 UTC (permalink / raw)
  To: wangxiaogang (F), dsahern, shrijeet, davem
  Cc: netdev, linux-kernel, hujunwei4, xuhanbing, Taehee Yoo

On 11/17/19 8:15 PM, wangxiaogang (F) wrote:
> 
> 
> On 2019/11/15 21:14, David Ahern wrote:
>> On 11/14/19 11:22 PM, wangxiaogang (F) wrote:
>>> From: XiaoGang Wang <wangxiaogang3@huawei.com>
>>>
>>> Recently we get a crash when access illegal address (0xc0),
>>> which will occasionally appear when deleting a physical NIC with vrf.
>>>
>>
>> How long have you been running this test?
>>
>> I am wondering if this is fallout from the recent adjacency changes in
>> commits 5343da4c1742 through f3b0a18bb6cb.
>>
>>
>>
>>
>>
> Thank you so much for the reply, our kernel version is linux 4.19.
> this problem happened once in our production environment.
> 

ok, so the recent adjacency changes would not be at fault here.

