* i40e: driver can't probe device (capabilities discovery error)
From: Guilherme G. Piccoli @ 2017-02-08 16:31 UTC
  To: intel-wired-lan
  Cc: netdev, Brian King, alexander.h.duyck, Kirsher, Jeffrey T,
	Keller, Jacob E, Murilo pIO, maurosr, gpiccoli

[-- Attachment #1: Type: text/plain, Size: 3588 bytes --]

Recently an Intel XL710 adapter failed suddenly: the i40e driver is no
longer able to probe the device. It fails right at the beginning of the
probe process, in the capability discovery procedure. We observed the
following messages in the kernel (v4.10-rc7) log:


i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.6.25-k
i40e: Copyright (c) 2013 - 2014 Intel Corporation.
i40e 0002:01:00.0: Using 64-bit DMA iommu bypass
i40e 0002:01:00.0: fw 5.1.40981 api 1.5 nvm 5.03 0x80002469 1.1313.0
i40e 0002:01:00.0: capability discovery failed, err OK aq_err I40E_AQ_RC_EMODE
i40e 0002:01:00.1: Using 64-bit DMA iommu bypass
i40e 0002:01:00.1: fw 5.1.40981 api 1.5 nvm 5.03 0x80002469 1.1313.0
i40e 0002:01:00.1: capability discovery failed, err OK aq_err I40E_AQ_RC_EMODE

<and the same messages on functions .2 and .3 too>
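
To check quickly which functions hit this failure, the kernel log can be
filtered for the message. The helper below is only a sketch: the function
name failed_cap_discovery is made up for illustration, and it assumes a
grep that supports -o and -E (GNU and BSD grep both do).

```shell
# Print the PCI addresses of the functions whose probe logged the
# capability discovery failure. Reads kernel log text from stdin.
# (failed_cap_discovery is a hypothetical helper name.)
failed_cap_discovery() {
    grep 'capability discovery failed' |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' |
        sort -u
}

# Usage:
#   dmesg | failed_cap_discovery
```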


We were able to "revive" the adapter using one of the following 2
procedures:

i) PowerPC systems have a feature called EEH, which is in essence a PCI
slot reset. It works at the HW/PHB level: the mechanism does a slot
reset, which can be a PCI Hot Reset or a Fundamental Reset (PERST).

The 1st way to recover the adapter was to inject an error on this slot
and force a so-called "hotplug recovery". Basically, we removed the
adapter from the PCI core (echo 1 >
/sys/bus/pci/devices/0002:01:00.*/remove), then froze the PHB
transactions (using a debug facility in the powerpc kernel) and then did
a rescan of the PCI bus (echo 1 > /sys/bus/pci/rescan).

This led to a Hot Reset on the slot, after which the adapter recovered
fine and the i40e driver was able to complete the probe procedure. I can
provide full logs if desired.
I think this approach is too hacky, though...
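
The remove and rescan steps of procedure (i) could be scripted roughly as
follows. This is a sketch, not the exact commands that were run: the
function names and the SYSFS_PCI variable are made up for illustration
(SYSFS_PCI only exists so the script can be dry-run against a fake
directory tree), and the PHB freeze step is left out because it depends
on the platform's debug facility.

```shell
#!/bin/sh
# Sketch of the remove/rescan sequence from procedure (i).
# SYSFS_PCI is a hypothetical override for dry runs; on a real system
# the default /sys/bus/pci is what you want. Must run as root there.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Remove every function of the given slot (e.g. 0002:01:00) from the
# PCI core by writing 1 to each function's "remove" file.
remove_functions() {
    slot="$1"
    for dev in "$SYSFS_PCI"/devices/"$slot".*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$dev/remove" ] || continue
        echo 1 > "$dev/remove"
    done
}

# Ask the PCI core to rescan all buses.
rescan_bus() {
    echo 1 > "$SYSFS_PCI/rescan"
}

# Usage:
#   remove_functions 0002:01:00
#   <freeze the PHB transactions here; platform-specific, not shown>
#   rescan_bus
```

The loop writes to each function's remove file in turn, since a single
redirection cannot target several files at once.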

ii) With the attached patch, we were able to "partially" circumvent the
issue. Basically, the probe procedure worked fine for all device
functions, but on function 3 the eeprom check failed - the following
messages were observed in the kernel log:

[29.1126] i40e 0002:01:00.3: Using 64-bit DMA iommu bypass
[32.3530] i40e 0002:01:00.3: fw 5.1.40981 api 1.5 nvm 5.03 0x24695003 192.0.63
[32.8441] i40e 0002:01:00.3: eeprom check failed (-2), Tx/Rx traffic disabled
[32.8583] i40e 0002:01:00.3: MAC address: 0c:c4:7a:89:f1:c3
[32.8712] i40e 0002:01:00.3: MSI-X vector limit reached, attempting to redistribute vectors
[32.9765] i40e 0002:01:00.3: Added LAN device PF3 bus=0x00 func=0x03
[32.9766] i40e 0002:01:00.3: PCI-Express: Speed 8.0GT/s Width x8
[32.9867] i40e 0002:01:00.3: Features: PF-id[3] VFs: 32 VSIs: 34 QP: 119 RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA


The other 3 functions showed the same messages, except for the eeprom
check failure.
I'm aware the patch needs some rework. In my understanding, the logic
works only for a single adapter, because we need a global reset in only
one function of the adapter, so it fails if the machine has more than 1
physical adapter. It's just a draft/RFC version for now.
--

So, I'd like to request help/feedback from you regarding what's going
on. I'm not sure of the root cause of the sudden adapter failure. One
day it was fine, and the next, after a machine reboot, it entered this
odd state. We have 2 machines presenting this behavior and 5 others
that are fine.

Is there a way to clear this bad state on the adapter, like a special
reset (or even a jumper that we should set physically)? I tried an EMP
reset too, but it seems it's not allowed for some reason (perhaps only
in NVM update mode? Not sure). Also, any pointers on how to understand
the root cause are welcome.
Thanks in advance,


Guilherme

[-- Attachment #2: 0001-i40-force-global-reset-on-adapter-probe.patch --]
[-- Type: text/x-patch, Size: 1695 bytes --]

From 1a49e453816dbab747788b87f9d03edc978cb50b Mon Sep 17 00:00:00 2001
From: "Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com>
Date: Tue, 7 Feb 2017 17:38:04 -0200
Subject: [PATCH] i40e: force global reset on adapter probe

The device might be in a bad state at probe time, making it impossible
for i40e_probe() to complete successfully.

In these cases we observed, for example, the following message in the
kernel log:

  [22.6397] i40e 0002:01:00.0: capability discovery failed, err OK aq_err I40E_AQ_RC_EMODE

This patch forces a global reset to happen on driver probe to avoid
the issue.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ad4cf63..f686c4a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10928,7 +10928,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	static u16 pfs_found;
 	u16 wol_nvm_bits;
 	u16 link_status;
-	int err;
+	int err, globr_probe = 1;
 	u32 val;
 	u32 i;
 	u8 set_fc_aq_fail;
@@ -11009,6 +11009,11 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (debug < -1)
 		pf->hw.debug_mask = debug;
 
+	if (globr_probe) {
+		i40e_do_reset_safe(pf, BIT(__I40E_GLOBAL_RESET_REQUESTED));
+		globr_probe = 0;
+	}
+
 	/* do a special CORER for clearing PXE mode once at init */
 	if (hw->revision_id == 0 &&
 	    (rd32(hw, I40E_GLLAN_RCTL_0) & I40E_GLLAN_RCTL_0_PXE_MODE_MASK)) {
-- 
2.7.4



