From: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
To: "Oliver O'Halloran" <oohall@gmail.com>
Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>
Subject: Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
Date: Mon, 29 Nov 2021 13:44:43 +0530
Message-ID: <20211129081443.fbj6vb77bmmfnpfi@in.ibm.com>
In-Reply-To: <CAOSf1CFRY=VsGenpwqVu_VOUYuBheVUEoX2M_h-XSHk6NdUtwg@mail.gmail.com>

On 2021-11-24 23:01:45 Wed, Oliver O'Halloran wrote:
> On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
> >
> > *snip*
> >
> > This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the pci error has been detected and
> > stop any active operations. Hence, with I/O traffic running, during these 6
> > seconds the network driver continues its operation and hits a timeout
> > (netdev watchdog). On timeout, the network driver goes into FFDC capture
> > mode and its reset path, assuming the PCI device is in a fatal condition. This causes
> > EEH recovery to fail and sometimes it leads to system hang or crash.
> 
> Whatever is causing that crash is the real issue IMO. PCI error

I have seen a crash only once, but that was triggered by the HTX tool and
may not be related. The major concern here is the EEH recovery failure. I
will correct the above statement in my next patch.

> reporting is fundamentally asynchronous and the driver always has to
> tolerate some amount of latency between the error occurring and being
> reported. Six seconds is admittedly an eternity, but it should not
> cause a system crash under any circumstances. Printing a warning due
> to a timeout is annoying, but it's not the end of the world.

Yeah, but because of the timeout the driver sometimes gets into a state
where, by the time EEH recovery kicks in, it is unable to recover the
device. EEH recovery then fails and disconnects the PCI device even
though it could have been recovered. To get the adapter back into working
condition, we need to either reboot the LPAR or re-assign the I/O adapter
from the HMC.

[16532.212197] EEH: PCI-E AER 30: 00000000 00000000
[16532.213207] EEH: Reset without hotplug activity
[16534.229469] bnx2x: [bnx2x_clean_tx_queue:1203(enP22p1s0f1)]timeout waiting for queue[2]: txdata->tx_pkt_prod(37003) != txdata->tx_pkt_cons(36996)
[16534.385484] EEH: Beginning: 'slot_reset'
[16534.385489] PCI 0016:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
[16536.229469] bnx2x: [bnx2x_clean_tx_queue:1203(enP22p1s0f1)]timeout waiting for queue[4]: txdata->tx_pkt_prod(64894) != txdata->tx_pkt_cons(64891)
[...]
[16623.571502] bnx2x: [bnx2x_nic_load_request:2342(enP22p1s0f1)]MCP response failure, aborting
[16623.571507] bnx2x: [bnx2x_acquire_hw_lock:2019(enP22p1s0f1)]lock_status 0xffffffff  resource_bit 0x800
[16623.571881] bnx2x: [bnx2x_io_slot_reset:14359(enP22p1s0f0)]IO slot reset initializing...
[16623.571976] bnx2x 0016:01:00.0: enabling device (0140 -> 0142)
[16623.576169] bnx2x: [bnx2x_io_slot_reset:14375(enP22p1s0f0)]IO slot reset --> driver unload
[16623.576174] PCI 0016:01:00.0#10000: EEH: bnx2x driver reports: 'disconnect'
[16623.576177] PCI 0016:01:00.1#10000: EEH: Invoking bnx2x->slot_reset()
[16623.576179] bnx2x: [bnx2x_io_slot_reset:14359(enP22p1s0f1)]IO slot reset initializing...
[16623.576239] bnx2x 0016:01:00.1: enabling device (0140 -> 0142)
[16623.580241] bnx2x: [bnx2x_io_slot_reset:14375(enP22p1s0f1)]IO slot reset --> driver unload
[16623.580245] PCI 0016:01:00.1#10000: EEH: bnx2x driver reports: 'disconnect'
[16623.580246] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[16623.580250] EEH: Unable to recover from failure from PHB#16-PE#10000.
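
For anyone reading the log above, the time spent between the start and the
end of 'slot_reset' can be checked with a small script. This is only a
sketch: the helper names (parse, first_stamp, elapsed) are made up for
illustration, and the four lines in LOG are copied from the excerpt above
with the bnx2x messages left out.

```python
import re

# EEH lines copied verbatim from the log excerpt above.
LOG = """\
[16532.213207] EEH: Reset without hotplug activity
[16534.385484] EEH: Beginning: 'slot_reset'
[16623.580246] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[16623.580250] EEH: Unable to recover from failure from PHB#16-PE#10000.
"""

# dmesg-style prefix: "[seconds.microseconds] message"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(log):
    """Return a list of (timestamp_in_seconds, message) tuples."""
    events = []
    for line in log.splitlines():
        match = STAMP.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def first_stamp(events, needle):
    """Timestamp of the first message containing `needle`."""
    for stamp, message in events:
        if needle in message:
            return stamp
    raise ValueError("no log line contains %r" % needle)


def elapsed(events, start, end):
    """Seconds between the first `start` line and the first `end` line."""
    return first_stamp(events, end) - first_stamp(events, start)


if __name__ == "__main__":
    events = parse(LOG)
    gap = elapsed(events, "Beginning: 'slot_reset'", "Finished:'slot_reset'")
    print("slot_reset took %.1f seconds" % gap)
```

With the timestamps above, 'slot_reset' runs for roughly 89 seconds before
EEH gives up with 'disconnect'.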

Thanks,
-Mahesh.

-- 
Mahesh J Salgaonkar
