linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: Sean V Kelley <sean.v.kelley@intel.com>,
	"Kuppuswamy,
	Sathyanarayanan"  <sathyanarayanan.kuppuswamy@linux.intel.com>,
	Ethan Zhao <xerces.zhao@gmail.com>,
	Sean V Kelley <seanvk.dev@oregontracks.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	rafael.j.wysocki@intel.com, Ashok Raj <ashok.raj@intel.com>,
	tony.luck@intel.com, qiuxu.zhuo@intel.com,
	linux-pci <linux-pci@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Christoph Hellwig <hch@lst.de>, Sinan Kaya <okaya@kernel.org>,
	Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH v9 12/15] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR
Date: Tue, 20 Oct 2020 11:28:20 -0500	[thread overview]
Message-ID: <20201020162820.GA370938@bjorn-Precision-5520> (raw)
In-Reply-To: <20201020125920.0000399b@Huawei.com>

On Tue, Oct 20, 2020 at 01:59:20PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 13:50:17 -0700
> Sean V Kelley <sean.v.kelley@intel.com> wrote:
> > On 19 Oct 2020, at 11:59, Kuppuswamy, Sathyanarayanan wrote:
> > > On 10/19/20 11:31 AM, Sean V Kelley wrote:  
> > >> On 19 Oct 2020, at 3:49, Ethan Zhao wrote:
> > >>> On Sat, Oct 17, 2020 at 6:29 AM Bjorn Helgaas <helgaas@kernel.org> 
> > >>> wrote:  
> > >>>>
> > >>>> [+cc Christoph, Ethan, Sinan, Keith; sorry should have cc'd you to
> > >>>> begin with since you're looking at this code too. Particularly
> > >>>> interested in your thoughts about whether we should be touching
> > >>>> PCI_ERR_ROOT_COMMAND and PCI_ERR_ROOT_STATUS when we don't own 
> > >>>> AER.]  
> > >>>
> > >>> aer_root_reset() function has a prefix  'aer_', looks like it's a
> > >>> function of aer driver, will
> > >>> only be called by aer driver at runtime. if so it's up to the
> > >>> owner/aer to know if OSPM is
> > >>> granted to init. while actually some of the functions and runtime 
> > >>> service of
> > >>> aer driver is also shared by GHES driver (running time) and DPC 
> > >>> driver
> > >>> (compiling time ?)
> > >>> etc. then it is confused now.
> > >>>
> > >>> Shall we move some of the shared functions and running time service 
> > >>> to
> > >>> pci/err.c ?
> > >>> if so , just like pcie_do_recovery(), it's share by firmware_first  
> > >>> mode GHES
> > >>> ghes_probe()  
> > >>> ->ghes_irq_func  
> > >>>   ->ghes_proc
> > >>>     ->ghes_do_proc()
> > >>>       ->ghes_handle_aer()
> > >>>         ->aer_recover_work_func()
> > >>>           ->pcie_do_recovery()
> > >>>             ->aer_root_reset()
> > >>>
> > >>> and aer driver etc.  if aer wants to do some access might conflict
> > >>> with firmware(or
> > >>> firmware in embedded controller) should check _OSC_ etc first. 
> > >>> blindly issue
> > >>> PCI_ERR_ROOT_COMMAND  or clear PCI_ERR_ROOT_STATUS *likely*
> > >>> cause errors by error handling itself.  
> > >>
> > >> If _OSC negotiation ends up with FW being in control of AER, that 
> > >> means OS is not in charge and should not be messing with AER I guess. 
> > >> That seems appropriate to me then.  
> > >
> > > But APEI based notification is more like a hybrid approach (frimware 
> > > first detects the
> > > error and notifies OS). Since spec does not clarify what OS is allowed 
> > > to do, its bit of a
> > > gray area now. My point is, since firmware allows OS to process the 
> > > error by sending
> > > the notification, I think its OK to clear the status once the error is 
> > > handled.  
> > 
> > I don’t disagree as long as AER is granted to the OS via _OSC. But if 
> > it’s not granted explicitly via _OSC even in the APEI case where 
> > it’s either an SCI or NMI and not an MSI, I’m unsure whether the OS 
> > should be touching those registers.
> 
> My assumption was indeed this.  If AER hasn't been granted to the OS,
> it shouldn't be doing anything involving AER itself.  It should constrain
> itself to dealing with the End Points etc due to the need there for
> driver interaction.
> 
> I fully agree with the comment that the specifications aren't entirely
> clear on these cases.
> 
> It is possible that no one is currently generating the particular
> combination of severity bits in the APEI path to actually hit this.
> It requires the outer record to be marked recoverable, but the inner
> part to be marked fatal.  Kind of an odd mix.
> 
> In the GHES case, you get to this path by having a
> Generic Error Status Block - recoverable (must not be fatal to avoid panic()
> in APEI layer) containing one more more Generic Error Blocks, one of
> which is fatal.
> 
> Response of our firmware team is that this particularly combination is
> probably crazy.

Thanks a lot for researching this and outlining these details.  I
hadn't worked all that out.  It makes me a lot less worried about
breaking something if we tweak this.

> So good to clean up this corner, but it is probably not a problem
> anyone has actually hit so far.

IMO if the OS has not been granted AER control via _OSC, it shouldn't
be touching these registers.  I don't want to speculate based on what
the intent might have been with APEI.  If the intent was that the OS
should write those registers even if it doesn't own the AER
capability, that could easily have been put in the spec.

IIUC APEI enables cases where the device with Root Error Command and
Root Error Status registers (or device-specific equivalents) is not
even visible to the OS.  In those cases the OS *cannot* fiddle with
them.

I assume APEI tells us about an error with an Endpoint.  I do not
think we should be groping around for an an upstream device and poking
things in it.  I'm not even 100% comfortable with the fact that we
find an upstream device and reset all devices below it.  Maybe that's
OK as a "bigger hammer," but I don't know about it being the default
approach.

I was hoping to get this series merged for v5.10, but I don't think
it's really practical to merge it at this stage with four days left in
the merge window.  When we do merge this, I propose tightening up
aer_root_reset() along these lines as preliminary patches, since this
has nothing to do with RCEC/RCiEP, and if they *do* cause any issues,
I don't want these patches to be implicated.

Bjorn

  reply	other threads:[~2020-10-20 16:28 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-16  0:10 [PATCH v9 00/15] Add RCEC handling to PCI/AER Sean V Kelley
2020-10-16  0:10 ` [PATCH v9 01/15] PCI/RCEC: Add RCEC class code and extended capability Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 02/15] PCI/RCEC: Bind RCEC devices to the Root Port driver Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 03/15] PCI/RCEC: Cache RCEC capabilities in pci_init_capabilities() Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 04/15] PCI/ERR: Rename reset_link() to reset_subordinates() Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 05/15] PCI/ERR: Simplify by using pci_upstream_bridge() Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 06/15] PCI/ERR: Simplify by computing pci_pcie_type() once Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 07/15] PCI/ERR: Use "bridge" for clarity in pcie_do_recovery() Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 08/15] PCI/ERR: Avoid negated conditional for clarity Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 09/15] PCI/ERR: Add pci_walk_bridge() to pcie_do_recovery() Sean V Kelley
2020-10-16 17:19   ` Bjorn Helgaas
2020-10-16  0:11 ` [PATCH v9 10/15] PCI/ERR: Limit AER resets in pcie_do_recovery() Sean V Kelley
2020-10-16 17:22   ` Bjorn Helgaas
2020-10-16 17:36     ` Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 11/15] PCI/RCEC: Add pcie_link_rcec() to associate RCiEPs Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 12/15] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR Sean V Kelley
2020-10-16 20:30   ` Bjorn Helgaas
2020-10-16 22:29     ` Bjorn Helgaas
2020-10-17  0:28       ` Kuppuswamy, Sathyanarayanan
2020-10-17  1:45       ` Kuppuswamy, Sathyanarayanan
2020-10-19 10:49       ` Ethan Zhao
2020-10-19 18:31         ` Sean V Kelley
2020-10-19 18:59           ` Kuppuswamy, Sathyanarayanan
2020-10-19 20:50             ` Sean V Kelley
2020-10-20 12:59               ` Jonathan Cameron
2020-10-20 16:28                 ` Bjorn Helgaas [this message]
2020-10-17 16:14     ` Sean V Kelley
2020-10-19 19:07       ` Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 13/15] PCI/AER: Add pcie_walk_rcec() to RCEC AER handling Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 14/15] PCI/PME: Add pcie_walk_rcec() to RCEC PME handling Sean V Kelley
2020-10-16  0:11 ` [PATCH v9 15/15] PCI/AER: Add RCEC AER error injection support Sean V Kelley

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201020162820.GA370938@bjorn-Precision-5520 \
    --to=helgaas@kernel.org \
    --cc=Jonathan.Cameron@Huawei.com \
    --cc=ashok.raj@intel.com \
    --cc=bhelgaas@google.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=okaya@kernel.org \
    --cc=qiuxu.zhuo@intel.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=sean.v.kelley@intel.com \
    --cc=seanvk.dev@oregontracks.org \
    --cc=tony.luck@intel.com \
    --cc=xerces.zhao@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).