From: poza@codeaurora.org
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sinan Kaya <okaya@kernel.org>, Bjorn Helgaas <helgaas@kernel.org>,
Thomas Tai <thomas.tai@oracle.com>,
bhelgaas@google.com, keith.busch@intel.com,
linux-pci@vger.kernel.org, linux-pci-owner@vger.kernel.org,
Sam Bobroff <sam.bobroff@au1.ibm.com>
Subject: Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed
Date: Tue, 21 Aug 2018 10:44:24 +0530 [thread overview]
Message-ID: <ad23b16cf7ff1f3b51961fda59533892@codeaurora.org> (raw)
In-Reply-To: <512e0e11c3ba462c1d033f8b0e768fa27489731c.camel@kernel.crashing.org>
On 2018-08-21 02:32, Benjamin Herrenschmidt wrote:
> On Mon, 2018-08-20 at 18:56 +0530, poza@codeaurora.org wrote:
>> Hi Ben,
>>
>> I get the idea and get your points. and when I started implementation
>> of
>> this, I did ask these questions to myself.
>> but then we all fell aligned to what DPC was doing, because thats how
>> the driver was dealing with ERR_FATAL.
>>
>> I can put together a patch adhering to the idea, and modify err.c.
>> let me know if you are okay with that.
>> I have both DPC and AER capable root port, so I can easily validate
>> the
>> changes as we improve upon this.
>>
>> besides We have to come to some common ground on policy.
>> 1) if device driver dictates the reset we should agree to do SBR.
>> (irrespective of the type of error)
>
> It's not just "the driver dictates the reset". It's a combination of
> all agents. Either the driver or the framework, *both* can request a
> reset. Also if you have multiple devices involved (for example multiple
> functions), then any of the drivers can request the reset.
>
>> 2) under what circumstances framework will impose the reset, even if
>> device driver did not !
>
> Well ERR_FATAL typically would be one. Make it an argument to the
> framework, it's possible that EEH might always do, I have to look into
> it.
>
>> probably changes might be simpler; although it will require to
>> exercise
>> some tests cases for both DPC and AER.
>> let me know if anything I missed here, but let me attempt to make
>> patch
>> and we can go form there.
>>
>> Let me know you opinion.
>
> I agree. Hopefully Sam will have a chance to chime in (he's caught up
> with something else at the moment) as well.
>
> Cheers,
> Ben.
>
>> Regards,
>> Oza.
>>
>> >
>> > > Also I am very much interested in knowing original intention of DPC
>> > > driver to unplug/plug devices,
>> > > all I remember in some conversation was:
>> > > hotplug capable bridge might have see devices changed, so it is safer
>> > > to
>> > > remove/unplug the devices and during which .shutdown methods of driver
>> > > is called, in case of ERR_FATAL.
>> >
>> > The device being "swapped" during an error recovery operation is ...
>> > unlikely. Do the bridges have a way to latch that the presence detect
>> > changed ?
>> >
>> > > although DPC is HW recovery while AER is sw recovery both should
>> > > fundamentally act in the same way as far as device drivers callbacks
>> > > are
>> > > concerned.
>> > > (again I really dont know real motivation behind this)
>> >
>> > The main problem with unplug/replug (as I mentioned earlier) is that it
>> > just does NOT work for storage controllers (or similar type of
>> > devices). The links between the storage controller and the mounted
>> > filesystems is lost permanently, you'll most likely have to reboot the
>> > machine.
>> >
>> > With our current EEH implementation we can successfully recover from
>> > fatal errors with the storage controller by resetting it.
>> >
>> > Finally as for PERST vs. Hot Reset, this is an implementation detail.
>> > Not all our platforms can control PERST on all devices either, we
>> > generally prefer PERST as it provides a more reliable reset mechanism
>> > in practice (talking from experience) but will fallback to hot reset.
>> >
>> > Cheers,
>> > Ben.
Ok Let me summarize the so far discussed things.
It would be nice if we all (Bjorn, Keith, Ben, Sinan) can hold consensus
on this.
1) Right now AER and DPC both calls pcie_do_fatal_recovery(), I majorly
see DPC as error handling and recovery agent rather than being used for
hotplug.
so in my opinion, both AER and DPC should have same error handling
and recovery mechanism
so if there is a way to figure out that in absence of pcihp, if DPC
is being used to support hotplug then we fall back to original DPC
mechanism (which is remove devices)
otherwise, we fall back to drivers callbacks.
Spec 6.7.5 Async Removal
"
The Surprise Down error resulting from async removal may trigger
Downstream Port Containment (See Section 6.2.10).
Downstream Port Containment following an async removal may be
utilized to hold the Link of a
Downstream Port in the Disabled LTSSM state while host software
recovers from the side effects of an async removal.
"
I think above is implementation specific. but there has to be some
way to kow if we are using DPC for hotplug or not !
otherwise it is assumed to be used for error handling and recovery
pcie_do_fatal_recovery should take care of above. so that we support
both error handling and async removal from DPC point of view.
2) It is left to driver first to determine whether it wants to requests
the reset of device or not based on sanity of link (e.g. if it is
reading 0xffffffff back !)
pci_channel_io_frozen is the one which should be used from framework
to communicate driver that this is ERR_FATAL.
3) @Ben, although I am not very clear that if there is an ERR_FATAL, in
what circumstances driver will deny the reset but framework will impose
reset ?
4) Sinan's patch is going to ignore link states if it finds ERR_FATAL,so
that pcihp will not trigger and unplug the devices.
5) pcie_do_nonfatal_recovery; we sould actually reset the link e.g. SBR
if driver requests the reset of link. (there is already TDO note
anyway).
if (status == PCI_ERS_RESULT_NEED_RESET) {
/*
* TODO: Should call platform-specific
* functions to reset slot before calling
* drivers' slot_reset callbacks?
*/
status = broadcast_error_message(dev,
state,
"slot_reset",
report_slot_reset);
}
Let me know how this sounds, if we all agree on behavior I can go ahead
with the changes.
ps: although I need input on point-1.
Regards,
Oza.
>> >
>> > >
>> > > Regards,
>> > > Oza.
>> > >
>> > > > - Figure out what hooks might be needed to be able to plumb EEH into
>> > > > it, possibly removing a bunch of crap in arch/powerpc (yay !)
>> > > >
>> > > > I don't think having a webex will be that practical with the timezones
>> > > > involved. I'm trying to get approval to go to Plumbers in which case we
>> > > > could setup a BOF but I have no guarantee at this point that I can make
>> > > > it happen.
>> > > >
>> > > > So let's try using email as much possible for now.
>> > > >
>> > > > Cheers,
>> > > > Ben.
next prev parent reply other threads:[~2018-08-21 5:14 UTC|newest]
Thread overview: 91+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-08-13 16:51 [PATCH 0/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed Thomas Tai
2018-08-13 16:51 ` [PATCH 1/1] " Thomas Tai
2018-08-14 9:16 ` poza
2018-08-14 9:22 ` poza
2018-08-14 13:51 ` Thomas Tai
2018-08-15 14:57 ` poza
2018-08-15 15:02 ` Thomas Tai
2018-08-15 15:26 ` poza
2018-08-15 15:43 ` Thomas Tai
2018-08-15 15:59 ` poza
2018-08-15 16:04 ` Thomas Tai
2018-08-15 21:55 ` Benjamin Herrenschmidt
2018-08-15 21:56 ` Benjamin Herrenschmidt
2018-08-16 6:36 ` poza
2018-08-16 6:51 ` Benjamin Herrenschmidt
2018-08-16 6:59 ` Benjamin Herrenschmidt
2018-08-16 8:07 ` poza
2018-08-16 8:12 ` Benjamin Herrenschmidt
2018-08-16 9:03 ` poza
2018-08-16 10:07 ` Benjamin Herrenschmidt
2018-08-16 14:11 ` poza
2018-08-16 23:30 ` Benjamin Herrenschmidt
2018-08-17 10:29 ` poza
2018-08-17 10:44 ` poza
2018-08-18 7:38 ` Benjamin Herrenschmidt
2018-08-16 7:05 ` Benjamin Herrenschmidt
2018-08-16 7:15 ` Benjamin Herrenschmidt
2018-08-16 7:56 ` poza
2018-08-16 8:10 ` Benjamin Herrenschmidt
2018-08-16 8:05 ` poza
2018-08-16 8:15 ` Benjamin Herrenschmidt
2018-08-16 8:22 ` poza
2018-08-16 8:28 ` Benjamin Herrenschmidt
2018-08-16 13:30 ` Thomas Tai
2018-08-16 13:46 ` Sinan Kaya
2018-08-16 23:27 ` Benjamin Herrenschmidt
2018-08-17 6:35 ` poza
2018-08-19 2:24 ` Bjorn Helgaas
2018-08-20 5:09 ` poza
2018-08-20 5:15 ` Benjamin Herrenschmidt
2018-08-20 13:02 ` Thomas Tai
2018-08-20 13:27 ` Benjamin Herrenschmidt
2018-08-19 2:19 ` Bjorn Helgaas
2018-08-19 21:41 ` Sinan Kaya
2018-08-20 2:03 ` Benjamin Herrenschmidt
2018-08-20 5:19 ` poza
2018-08-20 5:33 ` Benjamin Herrenschmidt
2018-08-20 7:56 ` poza
2018-08-20 11:22 ` Benjamin Herrenschmidt
2018-08-20 13:26 ` poza
2018-08-20 21:02 ` Benjamin Herrenschmidt
2018-08-21 5:14 ` poza [this message]
2018-08-21 6:06 ` Benjamin Herrenschmidt
2018-08-21 14:37 ` Keith Busch
2018-08-21 15:07 ` Sinan Kaya
2018-08-21 15:29 ` Keith Busch
2018-08-21 15:50 ` Sinan Kaya
2018-08-21 15:55 ` Sinan Kaya
2018-08-21 16:44 ` Keith Busch
2018-08-21 15:30 ` poza
2018-08-21 21:14 ` Benjamin Herrenschmidt
2018-08-21 22:04 ` Keith Busch
2018-08-21 22:33 ` Keith Busch
2018-08-21 23:06 ` Benjamin Herrenschmidt
2018-08-21 23:13 ` Keith Busch
2018-08-22 0:36 ` Benjamin Herrenschmidt
2018-08-30 0:01 ` Keith Busch
2018-08-30 0:10 ` Sinan Kaya
2018-08-30 0:46 ` Keith Busch
2018-08-30 4:26 ` Benjamin Herrenschmidt
2018-08-20 15:53 ` Keith Busch
2018-08-20 16:13 ` poza
2018-08-20 16:32 ` Keith Busch
2018-08-20 21:05 ` Benjamin Herrenschmidt
2018-08-20 21:21 ` Sinan Kaya
2018-08-20 21:35 ` Keith Busch
2018-08-20 21:53 ` Benjamin Herrenschmidt
2018-08-20 22:02 ` Sinan Kaya
2018-08-20 22:04 ` Benjamin Herrenschmidt
2018-08-20 22:13 ` Sinan Kaya
2018-08-20 22:19 ` Benjamin Herrenschmidt
2018-08-22 9:13 ` Lukas Wunner
2018-08-22 14:38 ` Keith Busch
2018-08-22 14:51 ` Sinan Kaya
2018-08-20 22:13 ` Keith Busch
2018-08-20 22:19 ` Benjamin Herrenschmidt
2018-08-21 1:30 ` Keith Busch
2018-08-20 4:37 ` Benjamin Herrenschmidt
2018-08-20 4:39 ` PATCH] Partial revert of "PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices" Benjamin Herrenschmidt
2018-08-21 19:50 ` Bjorn Helgaas
2018-08-22 4:35 ` poza
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ad23b16cf7ff1f3b51961fda59533892@codeaurora.org \
--to=poza@codeaurora.org \
--cc=benh@kernel.crashing.org \
--cc=bhelgaas@google.com \
--cc=helgaas@kernel.org \
--cc=keith.busch@intel.com \
--cc=linux-pci-owner@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=okaya@kernel.org \
--cc=sam.bobroff@au1.ibm.com \
--cc=thomas.tai@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).