* i40e: crash on NMI by continuous module reload
@ 2015-02-27 13:50 Stefan Assmann
  2015-02-27 14:02 ` nick
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Assmann @ 2015-02-27 13:50 UTC (permalink / raw)
  To: netdev
  Cc: e1000-devel, Brandeburg, Jesse, Kirsher, Jeffrey T, Williams,
	Mitch A, anjali.singhai

When unloading/loading the driver in a loop with
modprobe -r i40e ; modprobe i40e
after a few cycles the driver no longer successfully probes and outputs
the following.
[  160.171944] i40e 0000:07:00.1 eth7: adding 68:05:ca:2a:3a:41 vid=0
[  161.271487] i40e 0000:07:00.1: set phy mask fail, aq_err -54
[  161.685505] i40e 0000:07:00.0 eth6: NIC Link is Down
[  161.873172] i40e 0000:07:00.1: link restart failed, aq_err=0
[  162.401255] i40e 0000:07:00.1: PCI-Express: Speed 8.0GT/s Width x8
[  162.710082] i40e 0000:07:00.0: add filter failed, err -54, aq_err 0
[  162.930801] i40e 0000:07:00.1: get phy abilities failed, aq_err -54, advertised speed settings may not be correct
[  162.977599] i40e 0000:07:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 32 RX: PS RSS FD_ATR FD_SB NTUPLE PTP
[  163.238624] i40e 0000:07:00.0 eth6: NIC Link is Down
[  163.244566] i40e 0000:07:00.2: Initial pf_reset failed: -15
[  163.244607] i40e: probe of 0000:07:00.2 failed with error -15
[  163.464911] i40e 0000:07:00.3: Initial pf_reset failed: -15
[  163.490747] i40e: probe of 0000:07:00.3 failed with error -15
[  163.518932] i40e 0000:07:00.1: i40e_ptp_stop: removed PHC on eth7
[  163.746713] i40e 0000:07:00.1 eth7: NIC Link is Down
[  164.270164] i40e 0000:07:00.1: add filter failed, err -54, aq_err 0
[...]
[  184.462907] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[  184.711290] i40e 0000:07:00.0: Initial pf_reset failed: -15
[  184.736457] i40e: probe of 0000:07:00.0 failed with error -15
[  184.983109] i40e 0000:07:00.1: Initial pf_reset failed: -15
[  185.009354] i40e: probe of 0000:07:00.1 failed with error -15
[  185.256612] i40e 0000:07:00.2: Initial pf_reset failed: -15
[  185.281990] i40e: probe of 0000:07:00.2 failed with error -15
[  185.529085] i40e 0000:07:00.3: Initial pf_reset failed: -15
[  185.555094] i40e: probe of 0000:07:00.3 failed with error -15

Followed by

[  188.178408] NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
[  188.214709] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0+ #81
[  188.245187] Hardware name: HP ProLiant DL360p Gen8, BIOS P71 08/02/2014
[  188.276847] task: ffffffff81e13480 ti: ffffffff81e00000 task.ti: ffffffff81e00000
[  188.313671] RIP: 0010:[<ffffffff8100d45b>]  [<ffffffff8100d45b>] default_idle+0x1b/0xb0
[  188.351779] RSP: 0018:ffffffff81e03ea8  EFLAGS: 00000246
[  188.377118] RAX: 0000000000000000 RBX: ffffffff81e00010 RCX: 0000000000000000
[  188.412311] RDX: ffffffff81e00000 RSI: 0000000000000000 RDI: 0000000000000000
[  188.448563] RBP: ffffffff81e03eb8 R08: 0000000000000000 R09: 00000000fffe4047
[  188.482137] R10: ffffffff81a0e045 R11: 0000000000000000 R12: 0000000000000000
[  188.518089] R13: ffffffff81efd970 R14: ffffffff81e00010 R15: 0000000000000000
[  188.553382] FS:  0000000000000000(0000) GS:ffff880237a00000(0000) knlGS:0000000000000000
[  188.594583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  188.621056] CR2: 00007fbcb561bc88 CR3: 0000000235966000 CR4: 00000000001406f0
[  188.656549] Stack:
[  188.665693]  ffffffff81e00010 ffffffff81e00010 ffffffff81e03ec8 ffffffff8100cc3a
[  188.700062]  ffffffff81e03f48 ffffffff810884b7 ffffffff81e13480 ffff880236538910
[  188.734638]  ffffffff81e00000 ffffffff81e00010 ffffffff81e00010 ffffffff81e00000
[  188.773067] Call Trace:
[  188.784412]  [<ffffffff8100cc3a>] arch_cpu_idle+0xa/0x10
[  188.808717]  [<ffffffff810884b7>] cpu_startup_entry+0x227/0x3b0
[  188.837221]  [<ffffffff819d0a52>] rest_init+0x72/0x80
[  188.860698]  [<ffffffff81f201bd>] start_kernel+0x41b/0x428
[  188.887669]  [<ffffffff81f1fbc0>] ? set_init_arg+0x5d/0x5d
[  188.914359]  [<ffffffff81f1f5ad>] x86_64_start_reservations+0x2a/0x2c
[  188.945125]  [<ffffffff81f1f700>] x86_64_start_kernel+0x151/0x158
[  188.972480] Code: c0 48 83 c8 08 0f 22 c0 eb ce 66 0f 1f 44 00 00 55 8b 05 a1 a8 ec 00 48 89 e5 41 54 65 44 8b 25 cc cc ff 7e 85 c0 53 7f 19 fb f4 <8b> 05 87 a8 ec 00 65 44 8b 25 b7 cc ff 7e 85 c0 7f 44 5b 41 5c


I've tracked this down to the following hunk from this commit.
commit cafa2ee6fbb1bbc2fecdeef990858d56646fc1bd
Author: Anjali Singhai Jain <anjali.singhai@intel.com>
Date:   Sat Sep 13 07:40:45 2014 +0000

    i40e: Fix a bug where Rx would stop after some time
[...]
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f7464e8..ff6d94d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
[...]
@@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);

+	msleep(75);
+	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
+	if (err) {
+		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
+			 pf->hw.aq.asq_last_status);
+	}
+
 	/* The main driver is (mostly) up and happy. We need to set this state
 	 * before setting up the misc vector or we get a race and the vector
 	 * ends up disabled forever.

With this hunk removed, the driver successfully unloaded/reloaded a
couple of hundred times. Would it be safe to just remove this hunk?
I haven't seen any negative effects from removing it yet.

  Stefan
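
A note for reading the log above: the hunk prints `pf->hw.aq.asq_last_status` under the label `aq_err`, whereas the message just before it prints the return value `err` under the same label. That would explain how `set phy mask fail, aq_err -54` and `link restart failed, aq_err=0` can appear side by side. The following userspace sketch shows a message that reports both values, in the style of the `add filter failed, err -54, aq_err 0` lines in the same log. `format_aq_failure` and its parameters are hypothetical names chosen for illustration; they are not part of the i40e driver.

```c
#include <stdio.h>

/* Hypothetical helper (not an i40e function): format one log line that
 * carries both the status returned by the call and the last admin queue
 * status, so that neither value is mislabeled as the other.
 * Returns the snprintf() result, i.e. the length the full message needs. */
int format_aq_failure(char *buf, size_t len, const char *what,
		      int err, int aq_last_status)
{
	return snprintf(buf, len, "%s failed, err %d, aq_err %d",
			what, err, aq_last_status);
}
```

In the driver itself the equivalent change would presumably be a `dev_info()` call that passes both `err` and `pf->hw.aq.asq_last_status`.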


* Re: i40e: crash on NMI by continuous module reload
  2015-02-27 13:50 i40e: crash on NMI by continuous module reload Stefan Assmann
@ 2015-02-27 14:02 ` nick
  2015-02-27 14:16   ` [E1000-devel] " Stefan Assmann
  0 siblings, 1 reply; 9+ messages in thread
From: nick @ 2015-02-27 14:02 UTC (permalink / raw)
  To: Stefan Assmann, netdev; +Cc: e1000-devel, Brandeburg, Jesse



On 2015-02-27 08:50 AM, Stefan Assmann wrote:
> When unloading/loading the driver in a loop with
> modprobe -r i40e ; modprobe i40e
> after a few cycles the driver no longer successfully probes and outputs
> the following.
> [  160.171944] i40e 0000:07:00.1 eth7: adding 68:05:ca:2a:3a:41 vid=0
> [  161.271487] i40e 0000:07:00.1: set phy mask fail, aq_err -54
> [  161.685505] i40e 0000:07:00.0 eth6: NIC Link is Down
> [  161.873172] i40e 0000:07:00.1: link restart failed, aq_err=0
> [  162.401255] i40e 0000:07:00.1: PCI-Express: Speed 8.0GT/s Width x8
> [  162.710082] i40e 0000:07:00.0: add filter failed, err -54, aq_err 0
> [  162.930801] i40e 0000:07:00.1: get phy abilities failed, aq_err -54, advertised speed settings may not be correct
> [  162.977599] i40e 0000:07:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 32 RX: PS RSS FD_ATR FD_SB NTUPLE PTP
> [  163.238624] i40e 0000:07:00.0 eth6: NIC Link is Down
> [  163.244566] i40e 0000:07:00.2: Initial pf_reset failed: -15
> [  163.244607] i40e: probe of 0000:07:00.2 failed with error -15
> [  163.464911] i40e 0000:07:00.3: Initial pf_reset failed: -15
> [  163.490747] i40e: probe of 0000:07:00.3 failed with error -15
> [  163.518932] i40e 0000:07:00.1: i40e_ptp_stop: removed PHC on eth7
> [  163.746713] i40e 0000:07:00.1 eth7: NIC Link is Down
> [  164.270164] i40e 0000:07:00.1: add filter failed, err -54, aq_err 0
> [...]
> [  184.462907] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
> [  184.711290] i40e 0000:07:00.0: Initial pf_reset failed: -15
> [  184.736457] i40e: probe of 0000:07:00.0 failed with error -15
> [  184.983109] i40e 0000:07:00.1: Initial pf_reset failed: -15
> [  185.009354] i40e: probe of 0000:07:00.1 failed with error -15
> [  185.256612] i40e 0000:07:00.2: Initial pf_reset failed: -15
> [  185.281990] i40e: probe of 0000:07:00.2 failed with error -15
> [  185.529085] i40e 0000:07:00.3: Initial pf_reset failed: -15
> [  185.555094] i40e: probe of 0000:07:00.3 failed with error -15
> 
> Followed by
> 
> [  188.178408] NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
> [  188.214709] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0+ #81
> [  188.245187] Hardware name: HP ProLiant DL360p Gen8, BIOS P71 08/02/2014
> [  188.276847] task: ffffffff81e13480 ti: ffffffff81e00000 task.ti: ffffffff81e00000
> [  188.313671] RIP: 0010:[<ffffffff8100d45b>]  [<ffffffff8100d45b>] default_idle+0x1b/0xb0
> [  188.351779] RSP: 0018:ffffffff81e03ea8  EFLAGS: 00000246
> [  188.377118] RAX: 0000000000000000 RBX: ffffffff81e00010 RCX: 0000000000000000
> [  188.412311] RDX: ffffffff81e00000 RSI: 0000000000000000 RDI: 0000000000000000
> [  188.448563] RBP: ffffffff81e03eb8 R08: 0000000000000000 R09: 00000000fffe4047
> [  188.482137] R10: ffffffff81a0e045 R11: 0000000000000000 R12: 0000000000000000
> [  188.518089] R13: ffffffff81efd970 R14: ffffffff81e00010 R15: 0000000000000000
> [  188.553382] FS:  0000000000000000(0000) GS:ffff880237a00000(0000) knlGS:0000000000000000
> [  188.594583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  188.621056] CR2: 00007fbcb561bc88 CR3: 0000000235966000 CR4: 00000000001406f0
> [  188.656549] Stack:
> [  188.665693]  ffffffff81e00010 ffffffff81e00010 ffffffff81e03ec8 ffffffff8100cc3a
> [  188.700062]  ffffffff81e03f48 ffffffff810884b7 ffffffff81e13480 ffff880236538910
> [  188.734638]  ffffffff81e00000 ffffffff81e00010 ffffffff81e00010 ffffffff81e00000
> [  188.773067] Call Trace:
> [  188.784412]  [<ffffffff8100cc3a>] arch_cpu_idle+0xa/0x10
> [  188.808717]  [<ffffffff810884b7>] cpu_startup_entry+0x227/0x3b0
> [  188.837221]  [<ffffffff819d0a52>] rest_init+0x72/0x80
> [  188.860698]  [<ffffffff81f201bd>] start_kernel+0x41b/0x428
> [  188.887669]  [<ffffffff81f1fbc0>] ? set_init_arg+0x5d/0x5d
> [  188.914359]  [<ffffffff81f1f5ad>] x86_64_start_reservations+0x2a/0x2c
> [  188.945125]  [<ffffffff81f1f700>] x86_64_start_kernel+0x151/0x158
> [  188.972480] Code: c0 48 83 c8 08 0f 22 c0 eb ce 66 0f 1f 44 00 00 55 8b 05 a1 a8 ec 00 48 89 e5 41 54 65 44 8b 25 cc cc ff 7e 85 c0 53 7f 19 fb f4 <8b> 05 87 a8 ec 00 65 44 8b 25 b7 cc ff 7e 85 c0 7f 44 5b 41 5c
> 
> 
> I've tracked this down to the following hunk from this commit.
> commit cafa2ee6fbb1bbc2fecdeef990858d56646fc1bd
> Author: Anjali Singhai Jain <anjali.singhai@intel.com>
> Date:   Sat Sep 13 07:40:45 2014 +0000
> 
>     i40e: Fix a bug where Rx would stop after some time
> [...]
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index f7464e8..ff6d94d 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> [...]
> @@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	if (err)
>  		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);
> 
> +	msleep(75);
> +	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
> +	if (err) {
> +		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
> +			 pf->hw.aq.asq_last_status);
> +	}
> +
>  	/* The main driver is (mostly) up and happy. We need to set this state
>  	 * before setting up the misc vector or we get a race and the vector
>  	 * ends up disabled forever.
> 
> With this hunk removed the driver successfully unloaded/reloaded a
> couple of hundred times. Would it be safe to just remove this hunk?
> I haven't seen any negative effects by removing this yet.
> 
>   Stefan
> 
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for all
> things parallel software development, from weekly thought leadership blogs to
> news, videos, case studies, tutorials and more. Take a look and join the 
> conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> E1000-devel mailing list
> E1000-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
> 
Stefan,
I wouldn't remove the hunk yet, as checking whether the link restarts
successfully looks like a valid idea. On the other hand, can you try removing
just the msleep line? It is most likely causing the issue, since sleeping for
so long in a probe function is generally a bad idea.
Thanks,
Nick


* Re: [E1000-devel] i40e: crash on NMI by continuous module reload
  2015-02-27 14:02 ` nick
@ 2015-02-27 14:16   ` Stefan Assmann
  2015-02-27 14:44     ` nick
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Assmann @ 2015-02-27 14:16 UTC (permalink / raw)
  To: nick, netdev; +Cc: e1000-devel, Brandeburg, Jesse

On 27.02.2015 15:02, nick wrote:

[...]

>>     i40e: Fix a bug where Rx would stop after some time
>> [...]
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index f7464e8..ff6d94d 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> [...]
>> @@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>  	if (err)
>>  		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);
>>
>> +	msleep(75);
>> +	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
>> +	if (err) {
>> +		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
>> +			 pf->hw.aq.asq_last_status);
>> +	}
>> +
>>  	/* The main driver is (mostly) up and happy. We need to set this state
>>  	 * before setting up the misc vector or we get a race and the vector
>>  	 * ends up disabled forever.
>>
>> With this hunk removed the driver successfully unloaded/reloaded a
>> couple of hundred times. Would it be safe to just remove this hunk?
>> I haven't seen any negative effects by removing this yet.
>>
>>   Stefan
>>
> Stefan,
> I wouldn't remove them yet as this does look like a valid idea to check to see if the link is 
> restarting successfully. On the other hand can you try removing the msleep line as this one is
> most likely causing the issue due to sleeping for some long in a probe function is generally a
> bad idea.
> Thanks,
> Nick

Thanks Nick for the quick reply. I tested removing the msleep but that
didn't make a difference. You actually need to remove the complete hunk
to get a stable driver reload.

  Stefan


* Re: i40e: crash on NMI by continuous module reload
  2015-02-27 14:16   ` [E1000-devel] " Stefan Assmann
@ 2015-02-27 14:44     ` nick
  2015-02-27 19:42       ` Nelson, Shannon
  0 siblings, 1 reply; 9+ messages in thread
From: nick @ 2015-02-27 14:44 UTC (permalink / raw)
  To: Stefan Assmann, netdev; +Cc: e1000-devel, Brandeburg, Jesse



On 2015-02-27 09:16 AM, Stefan Assmann wrote:
> On 27.02.2015 15:02, nick wrote:
> 
> [...]
> 
>>>     i40e: Fix a bug where Rx would stop after some time
>>> [...]
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> index f7464e8..ff6d94d 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> [...]
>>> @@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>  	if (err)
>>>  		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);
>>>
>>> +	msleep(75);
>>> +	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
>>> +	if (err) {
>>> +		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
>>> +			 pf->hw.aq.asq_last_status);
>>> +	}
>>> +
>>>  	/* The main driver is (mostly) up and happy. We need to set this state
>>>  	 * before setting up the misc vector or we get a race and the vector
>>>  	 * ends up disabled forever.
>>>
>>> With this hunk removed the driver successfully unloaded/reloaded a
>>> couple of hundred times. Would it be safe to just remove this hunk?
>>> I haven't seen any negative effects by removing this yet.
>>>
>>>   Stefan
>>>
>> Stefan,
>> I wouldn't remove them yet as this does look like a valid idea to check to see if the link is 
>> restarting successfully. On the other hand can you try removing the msleep line as this one is
>> most likely causing the issue due to sleeping for some long in a probe function is generally a
>> bad idea.
>> Thanks,
>> Nick
> 
> Thanks Nick for the quick reply. I tested removing the msleep but that
> didn't make a difference. You actually need to remove the complete hunk
> to get a stable driver reload.
> 
>   Stefan
> 
Stefan,
Basically, there are a few things that could be going wrong:
1. You are getting an error return from the function i40e_aq_set_link_restart_an.
2. You are trying to re-enable the device when that is not needed.
3. You are passing NULL for the command-arguments parameter, which takes 0 rather
than NULL to mean no arguments.
Nick


* Re: i40e: crash on NMI by continuous module reload
  2015-02-27 14:44     ` nick
@ 2015-02-27 19:42       ` Nelson, Shannon
  2015-02-27 21:25         ` Nicholas Krause
  2015-03-02  8:08         ` [E1000-devel] " Stefan Assmann
  0 siblings, 2 replies; 9+ messages in thread
From: Nelson, Shannon @ 2015-02-27 19:42 UTC (permalink / raw)
  To: nick, Stefan Assmann, netdev; +Cc: e1000-devel, Brandeburg, Jesse

> From: nick [mailto:xerofoify@gmail.com]
> On 2015-02-27 09:16 AM, Stefan Assmann wrote:
> > On 27.02.2015 15:02, nick wrote:
> >
> > [...]
> >
> >>>     i40e: Fix a bug where Rx would stop after some time
> >>> [...]
> >>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> >>> index f7464e8..ff6d94d 100644
> >>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> >>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> >>> [...]
> >>> @@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> >>>  	if (err)
> >>>  		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);
> >>>
> >>> +	msleep(75);
> >>> +	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
> >>> +	if (err) {
> >>> +		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
> >>> +			 pf->hw.aq.asq_last_status);
> >>> +	}
> >>> +
> >>>  	/* The main driver is (mostly) up and happy. We need to set this state
> >>>  	 * before setting up the misc vector or we get a race and the vector
> >>>  	 * ends up disabled forever.
> >>>
> >>> With this hunk removed the driver successfully unloaded/reloaded a
> >>> couple of hundred times. Would it be safe to just remove this hunk?
> >>> I haven't seen any negative effects by removing this yet.
> >>>
> >>>   Stefan
> >>>
> >> Stefan,
> >> I wouldn't remove them yet as this does look like a valid idea to
> check to see if the link is
> >> restarting successfully. On the other hand can you try removing the
> msleep line as this one is
> >> most likely causing the issue due to sleeping for some long in a
> probe function is generally a
> >> bad idea.
> >> Thanks,
> >> Nick
> >
> > Thanks Nick for the quick reply. I tested removing the msleep but that
> > didn't make a difference. You actually need to remove the complete
> hunk
> > to get a stable driver reload.
> >
> >   Stefan
> >
> Stefan,
> Basically there are a few things that could be going wrong
> 1. You are getting a error return for the
> function,i40e_aq_set_link_restart_an
> 2. You are trying to re able the device again when not needed
> 3. You are sending a NULL value to a field for command arguments that
> takes a 0 and not NULL
> to take no arguments
> Nick

First of all, I would make sure you've got a short sleep in between each load and unload in this stress test.  There's a lot going on under the covers in the Firmware that really should be allowed to settle out before jostling it again with another load/unload command.  
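
The suggestion above can be sketched as a small userspace test loop. This is only an illustration under assumptions: the settle time is a free parameter (no specific value is given here), and `reload_loop`, `run_system` and `run_dry` are hypothetical helper names. The real runner needs root privileges because it invokes `modprobe`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Callback that executes one shell command and returns 0 on success. */
typedef int (*run_fn)(const char *cmd);

/* Real runner: hands the command to the shell (modprobe needs root). */
int run_system(const char *cmd)
{
	return system(cmd);
}

/* Dry-run runner: only prints and counts the commands it is given. */
int dry_run_calls;

int run_dry(const char *cmd)
{
	printf("would run: %s\n", cmd);
	dry_run_calls++;
	return 0;
}

/* Unload and reload i40e `cycles` times, pausing `settle_secs` seconds
 * after each step so the firmware can settle (the duration is an assumed,
 * tunable value). Stops at the first failing command and returns the
 * number of fully completed cycles. */
int reload_loop(int cycles, unsigned int settle_secs, run_fn run)
{
	int i;

	for (i = 0; i < cycles; i++) {
		if (run("modprobe -r i40e") != 0)
			break;
		sleep(settle_secs);
		if (run("modprobe i40e") != 0)
			break;
		sleep(settle_secs);
	}
	return i;
}
```

Calling `reload_loop(200, 2, run_system)` would attempt 200 cycles with a two-second pause after each step; both numbers are arbitrary examples.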

It would help to know what Firmware you have on your NIC - can you give us the output from "ethtool -i <ethX>"?
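
For reference, `ethtool -i` prints one `key: value` pair per line, and the line of interest here is `firmware-version:`. Below is a minimal sketch of pulling that value out of captured output, assuming the output is already held in a string. `extract_fw_version` is a hypothetical helper, not part of ethtool.

```c
#include <stdio.h>
#include <string.h>

/* Copy the value of the "firmware-version:" line from `ethtool -i` output
 * into out. Only matches the key at the start of a line.
 * Returns 0 on success, -1 if the key is absent or out is too small. */
int extract_fw_version(const char *ethtool_out, char *out, size_t outlen)
{
	static const char key[] = "firmware-version:";
	const char *p = ethtool_out;

	while (p && *p) {
		if (strncmp(p, key, sizeof(key) - 1) == 0) {
			size_t n;

			p += sizeof(key) - 1;
			while (*p == ' ' || *p == '\t')
				p++;		/* skip padding after the colon */
			n = strcspn(p, "\n");	/* value runs to end of line */
			if (n + 1 > outlen)
				return -1;
			memcpy(out, p, n);
			out[n] = '\0';
			return 0;
		}
		p = strchr(p, '\n');		/* advance to the next line */
		if (p)
			p++;
	}
	return -1;
}
```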

The out-of-tree driver has just (finally!) been updated on SourceForge, so you might give this version 1.2.37 driver a try to see if it changes your result.  That code still has the hunk in question, but protected by a FW version check.  The related patch will be headed upstream to net-next very soon.
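
A firmware version check of the kind described above might look roughly like the following. This is a hedged userspace sketch, not the code from the out-of-tree driver: the struct, the function names, and the cutoff version are all hypothetical, and it assumes the workaround is needed on firmware older than some first fixed release, which is not stated here.

```c
#include <stdbool.h>

/* Stand-in for the firmware version a driver would read from the adapter;
 * the struct and field names are hypothetical. */
struct fw_version {
	unsigned int major;
	unsigned int minor;
};

/* True when fw is older than first_fixed, i.e. the workaround is assumed
 * to still be needed. */
bool fw_needs_link_restart(struct fw_version fw, struct fw_version first_fixed)
{
	if (fw.major != first_fixed.major)
		return fw.major < first_fixed.major;
	return fw.minor < first_fixed.minor;
}

/* Stub standing in for the msleep(75) + i40e_aq_set_link_restart_an()
 * pair from the hunk; it only counts how often it was invoked. */
int restart_calls;

int stub_link_restart(void)
{
	restart_calls++;
	return 0;
}

/* Run the workaround only on firmware that needs it.
 * Returns the stub's status, or 0 when the workaround is skipped. */
int probe_link_workaround(struct fw_version fw, struct fw_version first_fixed)
{
	if (!fw_needs_link_restart(fw, first_fixed))
		return 0;
	return stub_link_restart();
}
```

Keeping the comparison in its own function means the cutoff can be adjusted in one place once the actual fixed firmware release is known.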

Firmware updates have also just been released, but I'm not sure they've made it to the Intel Downloads site yet.  Updating your FW will make a difference.

sln





* Re: i40e: crash on NMI by continuous module reload
  2015-02-27 19:42       ` Nelson, Shannon
@ 2015-02-27 21:25         ` Nicholas Krause
  2015-02-28  0:45           ` [E1000-devel] " Jeff Kirsher
  2015-03-02  8:08         ` [E1000-devel] " Stefan Assmann
  1 sibling, 1 reply; 9+ messages in thread
From: Nicholas Krause @ 2015-02-27 21:25 UTC (permalink / raw)
  To: Nelson, Shannon, Stefan Assmann, netdev; +Cc: e1000-devel, Brandeburg, Jesse

 

On February 27, 2015 2:42:29 PM EST, "Nelson, Shannon" <shannon.nelson@intel.com> wrote:
>> From: nick [mailto:xerofoify@gmail.com]
>> On 2015-02-27 09:16 AM, Stefan Assmann wrote:
>> > On 27.02.2015 15:02, nick wrote:
>> >
>> > [...]
>> >
>> >>>     i40e: Fix a bug where Rx would stop after some time
>> >>> [...]
>> >>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> >>> index f7464e8..ff6d94d 100644
>> >>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> >>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> >>> [...]
>> >>> @@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>> >>>  	if (err)
>> >>>  		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);
>> >>>
>> >>> +	msleep(75);
>> >>> +	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
>> >>> +	if (err) {
>> >>> +		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
>> >>> +			 pf->hw.aq.asq_last_status);
>> >>> +	}
>> >>> +
>> >>>  	/* The main driver is (mostly) up and happy. We need to set this state
>> >>>  	 * before setting up the misc vector or we get a race and the vector
>> >>>  	 * ends up disabled forever.
>> >>>
>> >>> With this hunk removed the driver successfully unloaded/reloaded a
>> >>> couple of hundred times. Would it be safe to just remove this hunk?
>> >>> I haven't seen any negative effects by removing this yet.
>> >>>
>> >>>   Stefan
>> >>>
>> >> Stefan,
>> >> I wouldn't remove them yet as this does look like a valid idea to
>> check to see if the link is
>> >> restarting successfully. On the other hand can you try removing
>the
>> msleep line as this one is
>> >> most likely causing the issue due to sleeping for some long in a
>> probe function is generally a
>> >> bad idea.
>> >> Thanks,
>> >> Nick
>> >
>> > Thanks Nick for the quick reply. I tested removing the msleep but
>that
>> > didn't make a difference. You actually need to remove the complete
>> hunk
>> > to get a stable driver reload.
>> >
>> >   Stefan
>> >
>> Stefan,
>> Basically there are a few things that could be going wrong
>> 1. You are getting a error return for the
>> function,i40e_aq_set_link_restart_an
>> 2. You are trying to re able the device again when not needed
>> 3. You are sending a NULL value to a field for command arguments that
>> takes a 0 and not NULL
>> to take no arguments
>> Nick
>
>First of all, I would make sure you've got a short sleep in between
>each load and unload in this stress test.  There's a lot going on under
>the covers in the Firmware that really should be allowed to settle out
>before jostling it again with another load/unload command.  
>
>It would help to know what Firmware you have on your NIC - can you give
>us the output from "ethtool -i <ethX>"?
>
>The out-of-tree driver has just (finally!) been updated on SourceForge,
>so you might give this version 1.2.37 driver a try to see if it changes
>your result.  That code still has the hunk in question, but protected
>by a FW version check.  The related patch will be headed upstream to
>net-next very soon.
>
>Firmware updates have also just been released, but I'm not sure they've
>made it to the Intel Downloads site yet.  Updating your FW will make a
>difference.
>
>sln
Thanks, Shannon, for the advice that there is a newer driver and firmware to test on before reporting the bug against this driver. I am curious, though, as to why this newer code has not been upstreamed to linux-next yet.
Nick
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [E1000-devel] i40e: crash on NMI by continuous module reload
  2015-02-27 21:25         ` Nicholas Krause
@ 2015-02-28  0:45           ` Jeff Kirsher
  2015-02-28  2:11             ` Nicholas Krause
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Kirsher @ 2015-02-28  0:45 UTC (permalink / raw)
  To: Nicholas Krause
  Cc: Nelson, Shannon, Stefan Assmann, netdev, e1000-devel, Brandeburg, Jesse

On Fri, 2015-02-27 at 16:25 -0500, Nicholas Krause wrote:
> Thanks Shannon, 
> For the advice on there being a newer driver and firmware to test that
> on before reporting the bug against this driver. I am curious as to
> why this newer code is not up streamed to Linux next yet through.

Because I still have ~40 patches in my queue for i40e that need to go
upstream to get the upstream driver in sync with our out-of-tree driver.

Also Nick, please do not take advantage of me being nice to you and
letting you remain on the e1000-devel mailing list, even though you have
been banned from the @vger.kernel.org mailing lists.  It is clear from your
attempts to help Stefan that you do not know the driver, and your "blind
stabs in the dark" only hinder our ability to assist others.


* Re: i40e: crash on NMI by continuous module reload
  2015-02-28  0:45           ` [E1000-devel] " Jeff Kirsher
@ 2015-02-28  2:11             ` Nicholas Krause
  0 siblings, 0 replies; 9+ messages in thread
From: Nicholas Krause @ 2015-02-28  2:11 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: e1000-devel, Stefan Assmann, Brandeburg, Jesse, netdev



On February 27, 2015 7:45:58 PM EST, Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:
>On Fri, 2015-02-27 at 16:25 -0500, Nicholas Krause wrote:
>> Thanks Shannon, 
>> For the advice on there being a newer driver and firmware to test
>that
>> on before reporting the bug against this driver. I am curious as to
>> why this newer code is not up streamed to Linux next yet through.
>
>Because I still have ~40 patches in my queue for i40e that need to go
>upstream to get the upstream driver in sync with our out-of-tree
>driver.
>
>Also Nick, please do not take advantage of me being nice to you and
>letting you remain on e1000-devel mailing list, even though you have
>been banned on @vger.kernel.org mailing lists.  It is clear from your
>attempts to help Stefan, you do not know the driver and your "blind
>stabs in the dark" to assist, only hinder our abilities to assist
>others.
Jeff,
These seemed to be the most likely culprits after reading the code. However, there seems to be something else going wrong with the function call in the trace code. I do understand you're busy with the upstream merging of the Intel drivers, and I was just wondering when the merge would be finished. Furthermore, I can try tracing this later during my March Break. I do appreciate you being nice to me, and I am sorry about making you lose your patience. I am interested in learning about the Intel network drivers and working with them, if you would like to give me something to start with.
Sorry,
Nick :(


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [E1000-devel] i40e: crash on NMI by continuous module reload
  2015-02-27 19:42       ` Nelson, Shannon
  2015-02-27 21:25         ` Nicholas Krause
@ 2015-03-02  8:08         ` Stefan Assmann
  1 sibling, 0 replies; 9+ messages in thread
From: Stefan Assmann @ 2015-03-02  8:08 UTC (permalink / raw)
  To: Nelson, Shannon, nick, netdev; +Cc: e1000-devel, Brandeburg, Jesse

On 27.02.2015 20:42, Nelson, Shannon wrote:
>> From: nick [mailto:xerofoify@gmail.com]
>> On 2015-02-27 09:16 AM, Stefan Assmann wrote:
>>> On 27.02.2015 15:02, nick wrote:
>>>
>>> [...]
>>>
>>>>>     i40e: Fix a bug where Rx would stop after some time
>>>>> [...]
>>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> index f7464e8..ff6d94d 100644
>>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>>>> [...]
>>>>> @@ -9169,6 +9178,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>>>  	if (err)
>>>>>  		dev_info(&pf->pdev->dev, "set phy mask fail, aq_err %d\n", err);
>>>>>
>>>>> +	msleep(75);
>>>>> +	err = i40e_aq_set_link_restart_an(&pf->hw, true, NULL);
>>>>> +	if (err) {
>>>>> +		dev_info(&pf->pdev->dev, "link restart failed, aq_err=%d\n",
>>>>> +			 pf->hw.aq.asq_last_status);
>>>>> +	}
>>>>> +
>>>>>  	/* The main driver is (mostly) up and happy. We need to set this state
>>>>>  	 * before setting up the misc vector or we get a race and the vector
>>>>>  	 * ends up disabled forever.
>>>>>
>>>>> With this hunk removed the driver successfully unloaded/reloaded a
>>>>> couple of hundred times. Would it be safe to just remove this hunk?
>>>>> I haven't seen any negative effects by removing this yet.
>>>>>
>>>>>   Stefan
>>>>>
>>>> Stefan,
>>>> I wouldn't remove them yet, as it does look like a valid idea to check
>>>> whether the link is restarting successfully. On the other hand, can you
>>>> try removing the msleep line? It is most likely causing the issue, since
>>>> sleeping for so long in a probe function is generally a bad idea.
>>>> Thanks,
>>>> Nick
>>>
>>> Thanks Nick for the quick reply. I tested removing the msleep but that
>>> didn't make a difference. You actually need to remove the complete hunk
>>> to get a stable driver reload.
>>>
>>>   Stefan
>>>
>> Stefan,
>> Basically there are a few things that could be going wrong:
>> 1. You are getting an error return from the function
>>    i40e_aq_set_link_restart_an
>> 2. You are trying to re-enable the device again when not needed
>> 3. You are sending a NULL value to a field for command arguments that
>>    takes a 0, not NULL, to take no arguments
>> Nick
> 
> First of all, I would make sure you've got a short sleep in between each load and unload in this stress test.  There's a lot going on under the covers in the Firmware that really should be allowed to settle out before jostling it again with another load/unload command.  

If a short delay is needed I think this should be implemented by the
driver. Triggering this kind of bug from userspace shouldn't be
possible. I'm using this reload loop regularly on driver backports to
test for regressions.
Btw, I noticed this problem during a normal reboot and used the
reloading while looking for a reproducer.
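
For reference, the reload loop can be sketched roughly as below. This is
only an illustration: the names reload_loop and run, the DRY_RUN switch
and the settle argument are made up for this sketch, and the optional
sleep is the userspace delay suggested above, not something the driver
provides.

```shell
#!/bin/sh
# Sketch of a module reload stress loop (illustrative, not from the driver tree).
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reload_loop MODULE CYCLES SETTLE_SECONDS
# Unloads and reloads MODULE CYCLES times, pausing SETTLE_SECONDS after each
# step. Stops with a non-zero status at the first failing modprobe.
reload_loop() {
    module=$1
    cycles=$2
    settle=$3
    i=1
    while [ "$i" -le "$cycles" ]; do
        run modprobe -r "$module" || return 1
        run sleep "$settle"
        run modprobe "$module" || return 1
        run sleep "$settle"
        i=$((i + 1))
    done
}
```

Run as root, "reload_loop i40e 200 0" gives the back-to-back case, and a
non-zero third argument adds a pause after each step. The loop only
checks the modprobe exit status, so dmesg still has to be inspected for
probe failures.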

> It would help to know what Firmware you have on your NIC - can you give us the output from "ethtool -i <ethX>"?

# ethtool -i eth6
driver: i40e
version: 1.2.9-k
firmware-version: f4.22 a1.1 n04.26 e800014b1
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
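
When collecting this for several ports, the firmware line can be pulled
out of the "ethtool -i" output with a small filter. A sketch, where
fw_version is a name made up here and the input is assumed to have the
"key: value" layout shown above:

```shell
# fw_version: read "ethtool -i" style output on stdin and print the value
# of the firmware-version field (hypothetical helper, for illustration).
fw_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}
```

Usage would be along the lines of "ethtool -i eth6 | fw_version".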

> The out-of-tree driver has just (finally!) been updated on SourceForge, so you might give this version 1.2.37 driver a try to see if it changes your result.  That code still has the hunk in question, but protected by a FW version check.  The related patch will be headed upstream to net-next very soon.

1.2.37 fails the same way.

> Firmware updates have also just been released, but I'm not sure they've made it to the Intel Downloads site yet.  Updating your FW will make a difference.

If you could point me to the firmware updates and instructions I can
perform the update.

Thanks!

  Stefan


end of thread, other threads:[~2015-03-02  8:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
2015-02-27 13:50 i40e: crash on NMI by continuous module reload Stefan Assmann
2015-02-27 14:02 ` nick
2015-02-27 14:16   ` [E1000-devel] " Stefan Assmann
2015-02-27 14:44     ` nick
2015-02-27 19:42       ` Nelson, Shannon
2015-02-27 21:25         ` Nicholas Krause
2015-02-28  0:45           ` [E1000-devel] " Jeff Kirsher
2015-02-28  2:11             ` Nicholas Krause
2015-03-02  8:08         ` [E1000-devel] " Stefan Assmann
