linux-kernel.vger.kernel.org archive mirror
* [PATCH] pci-error-recover: doc cleanup
@ 2016-12-08  8:16 Cao jin
  2016-12-08 14:05 ` Jonathan Corbet
  0 siblings, 1 reply; 12+ messages in thread
From: Cao jin @ 2016-12-08  8:16 UTC (permalink / raw)
  To: linux-pci, linux-doc, linux-kernel; +Cc: linasvepstas, bhelgaas, corbet

Includes typo fixes, whitespace cleanup, and one factual correction.

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
---
 Documentation/PCI/pci-error-recovery.txt | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index ac26869c7db4..fcb29cdbeb1b 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -11,7 +11,7 @@
 
 Many PCI bus controllers are able to detect a variety of hardware
 PCI errors on the bus, such as parity errors on the data and address
-busses, as well as SERR and PERR errors.  Some of the more advanced
+buses, as well as SERR and PERR errors.  Some of the more advanced
 chipsets are able to deal with these errors; these include PCI-E chipsets,
 and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
 pSeries boxes. A typical action taken is to disconnect the affected device,
@@ -175,7 +175,7 @@ is STEP 6 (Permanent Failure).
 >>> a value of 0xff on read, and writes will be dropped. If more than
 >>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
 >>> assumes that the device driver has gone into an infinite loop
->>> and prints an error to syslog.  A reboot is then required to 
+>>> and prints an error to syslog.  A reboot is then required to
 >>> get the device working again.
 
 STEP 2: MMIO Enabled
@@ -234,7 +234,7 @@ STEP 3: Link Reset
 ------------------
 The platform resets the link, and then calls the link_reset() callback
 on all affected device drivers.  This is a PCI-Express specific state
-and is done whenever a non-fatal error has been detected that can be
+and is done whenever a fatal error has been detected that can be
 "solved" by resetting the link. This call informs the driver of the
 reset and the driver should check to see if the device appears to be
 in working condition.
@@ -256,7 +256,7 @@ STEP 4: Slot Reset
 ------------------
 
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
-the platform will perform a slot reset on the requesting PCI device(s). 
+the platform will perform a slot reset on the requesting PCI device(s).
 The actual steps taken by a platform to perform a slot reset
 will be platform-dependent. Upon completion of slot reset, the
 platform will call the device slot_reset() callback.
@@ -276,7 +276,7 @@ configuration registers to initialize to their default conditions.
 
 For most PCI devices, a soft reset will be sufficient for recovery.
 Optional fundamental reset is provided to support a limited number
-of PCI Express PCI devices  for which a soft reset is not sufficient
+of PCI Express PCI devices for which a soft reset is not sufficient
 for recovery.
 
 If the platform supports PCI hotplug, then the reset might be
@@ -321,7 +321,7 @@ driver performs device init only from PCI function 0:
 		Same as above.
 
 Drivers for PCI Express cards that require a fundamental reset must
-set the needs_freset bit in the pci_dev structure in their probe function.  
+set the needs_freset bit in the pci_dev structure in their probe function.
 For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
 PCI card types:
 
-- 
2.1.0


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-08  8:16 [PATCH] pci-error-recover: doc cleanup Cao jin
@ 2016-12-08 14:05 ` Jonathan Corbet
  2016-12-08 14:13   ` Cao jin
  0 siblings, 1 reply; 12+ messages in thread
From: Jonathan Corbet @ 2016-12-08 14:05 UTC (permalink / raw)
  To: Cao jin; +Cc: linux-pci, linux-doc, linux-kernel, linasvepstas, bhelgaas

On Thu, 8 Dec 2016 16:16:14 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

>  The platform resets the link, and then calls the link_reset() callback
>  on all affected device drivers.  This is a PCI-Express specific state
> -and is done whenever a non-fatal error has been detected that can be
> +and is done whenever a fatal error has been detected that can be
>  "solved" by resetting the link. This call informs the driver of the

As far as I can tell, the original text was correct here; why do you
think this change needs to be made?

Thanks,

jon


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-08 14:05 ` Jonathan Corbet
@ 2016-12-08 14:13   ` Cao jin
  2016-12-09  6:24     ` Linas Vepstas
  0 siblings, 1 reply; 12+ messages in thread
From: Cao jin @ 2016-12-08 14:13 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-pci, linux-doc, linux-kernel, linasvepstas, bhelgaas



On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
> On Thu, 8 Dec 2016 16:16:14 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> 
>>  The platform resets the link, and then calls the link_reset() callback
>>  on all affected device drivers.  This is a PCI-Express specific state
>> -and is done whenever a non-fatal error has been detected that can be
>> +and is done whenever a fatal error has been detected that can be
>>  "solved" by resetting the link. This call informs the driver of the
> 
> As far as I can tell, the original text was correct here; why do you
> think this change needs to be made?
> 

See do_recovery() in the AER core: reset_link() is called only when a
fatal error is seen.
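
The flow described above can be sketched as a small user-space model in
C. This is not the kernel source: the enum, the struct, and the
sketch_do_recovery() name are hypothetical and exist only to illustrate
the claim that the link reset step is taken for fatal errors only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity levels, loosely named after the AER categories. */
enum sketch_severity {
	SKETCH_CORRECTABLE,
	SKETCH_NONFATAL,
	SKETCH_FATAL
};

/* Hypothetical record of which recovery steps were taken. */
struct sketch_trace {
	bool drivers_notified;	/* an error_detected-style broadcast happened */
	bool link_was_reset;	/* a reset_link-style step happened */
};

/*
 * Model of the claim above: uncorrectable errors are reported to the
 * affected drivers, but the link is reset only when the error is fatal.
 */
void sketch_do_recovery(enum sketch_severity sev, struct sketch_trace *t)
{
	assert(t != NULL);
	t->drivers_notified = false;
	t->link_was_reset = false;

	if (sev == SKETCH_CORRECTABLE)
		return;		/* nothing to recover in this model */

	t->drivers_notified = true;
	if (sev == SKETCH_FATAL)
		t->link_was_reset = true;
}
```

Calling sketch_do_recovery() with SKETCH_NONFATAL leaves link_was_reset
false, which is the behaviour the proposed doc wording is based on.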

-- 
Sincerely,
Cao jin


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-08 14:13   ` Cao jin
@ 2016-12-09  6:24     ` Linas Vepstas
  2016-12-09  6:37       ` Cao jin
  2016-12-09  6:50       ` Andrew Donnellan
  0 siblings, 2 replies; 12+ messages in thread
From: Linas Vepstas @ 2016-12-09  6:24 UTC (permalink / raw)
  To: Cao jin
  Cc: Jonathan Corbet, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas

I suppose I'm confused, but I recall that link resets are non-fatal.
Fatal errors typically require that the pci adapter be completely
reset, any adapter firmware to be reloaded from scratch, the device
driver has to kill all device state and start from scratch. It's huge.
If the fatal error is on a pci device that is under a block device
holding a file system, then (usually) there is no way to recover,
because the block layer (and file system) cannot deal with a block
device that disappeared and then reappeared some few seconds later.
(maybe some future zfs or lvm or btrfs might be able to deal with
this, but not today)

By contrast, link resets are far more gentle: the device driver might
have to discard some half-full FIFO's, or cancel some in-flight
commands, but can otherwise gracefully recover without telling the
higher layers that there were any problems.

--linas


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:24     ` Linas Vepstas
@ 2016-12-09  6:37       ` Cao jin
  2016-12-09  6:44         ` Linas Vepstas
  2016-12-09 14:37         ` Jonathan Corbet
  2016-12-09  6:50       ` Andrew Donnellan
  1 sibling, 2 replies; 12+ messages in thread
From: Cao jin @ 2016-12-09  6:37 UTC (permalink / raw)
  To: linasvepstas
  Cc: Jonathan Corbet, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas



On 12/09/2016 02:24 PM, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. It's huge.
> If the fatal error is on a pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)
> 
> By contrast, link resets are far more gentle: the device driver might
> have to discard some half-full FIFO's, or cancel some in-flight
> commands, but can otherwise gracefully recover without telling the
> higher layers that there were any problems.
> 
> --linas
> 

I am a little confused too, and not even sure we are talking about the
same *fatal error*. I mean the fatal error defined in the PCI Express
spec, section 6.2.2.2.1:

Fatal errors are uncorrectable error conditions which render the
particular Link and related hardware unreliable. For Fatal errors, a
reset of the components on the Link may be required to return to
reliable operation. Platform handling of Fatal errors, and any efforts
to limit the effects of these errors, is platform implementation specific.

Link reset means setting the *secondary bus reset* bit in the PCI bridge
config space, which resets the link and the device simultaneously; it is
the strongest kind of reset as far as I know.
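
As a rough illustration of what setting that bit involves, here is a
minimal C sketch that toggles Secondary Bus Reset in an in-memory copy
of a bridge's config space. It assumes the standard type 1 header
layout (Bridge Control register at offset 0x3E, Secondary Bus Reset at
bit 6); the sketch_* names are hypothetical, and a real implementation
would go through the platform's config accessors and observe the
required reset hold and settle delays, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BRIDGE_CONTROL	0x3E	/* Bridge Control register offset */
#define SKETCH_BRIDGE_CTL_SBR	0x0040	/* Secondary Bus Reset bit */

/* Read a 16-bit little-endian value from a byte-array copy of config space. */
uint16_t sketch_cfg_read16(const uint8_t *cfg, size_t off)
{
	assert(cfg != NULL);
	return (uint16_t)(cfg[off] | ((uint16_t)cfg[off + 1] << 8));
}

/* Write a 16-bit little-endian value into the same byte array. */
void sketch_cfg_write16(uint8_t *cfg, size_t off, uint16_t val)
{
	assert(cfg != NULL);
	cfg[off] = (uint8_t)(val & 0xFF);
	cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Set or clear Secondary Bus Reset, leaving the other control bits alone. */
void sketch_set_secondary_bus_reset(uint8_t *cfg, bool on)
{
	uint16_t ctl = sketch_cfg_read16(cfg, SKETCH_BRIDGE_CONTROL);

	if (on)
		ctl |= SKETCH_BRIDGE_CTL_SBR;
	else
		ctl &= (uint16_t)~SKETCH_BRIDGE_CTL_SBR;
	sketch_cfg_write16(cfg, SKETCH_BRIDGE_CONTROL, ctl);
}
```

Because the bit lives in the bridge, everything below the bridge's
secondary bus is reset together, which is why this is a heavier
operation than anything confined to a single function.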

-- 
Sincerely,
Cao jin


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:37       ` Cao jin
@ 2016-12-09  6:44         ` Linas Vepstas
  2016-12-09  7:59           ` Cao jin
  2016-12-09 16:11           ` Alex Williamson
  2016-12-09 14:37         ` Jonathan Corbet
  1 sibling, 2 replies; 12+ messages in thread
From: Linas Vepstas @ 2016-12-09  6:44 UTC (permalink / raw)
  To: Cao jin
  Cc: Jonathan Corbet, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas

On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>
>
> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>> I suppose I'm confused, but I recall that link resets are non-fatal.
>> Fatal errors typically require that the pci adapter be completely
>> reset, any adapter firmware to be reloaded from scratch, the device
>> driver has to kill all device state and start from scratch. It's huge.
>> If the fatal error is on a pci device that is under a block device
>> holding a file system, then (usually) there is no way to recover,
>> because the block layer (and file system) cannot deal with a block
>> device that disappeared and then reappeared some few seconds later.
>> (maybe some future zfs or lvm or btrfs might be able to deal with
>> this, but not today)
>>
>> By contrast, link resets are far more gentle: the device driver might
>> have to discard some half-full FIFO's, or cancel some in-flight
>> commands, but can otherwise gracefully recover without telling the
>> higher layers that there were any problems.
>>
>> --linas
>>
>
> I am a little confused too, and not even sure we are talking about the
> same *fatal error*. I mean the fatal error defined in the PCI Express
> spec, section 6.2.2.2.1:
>
> Fatal errors are uncorrectable error conditions which render the
> particular Link and related hardware unreliable. For Fatal errors, a
> reset of the components on the Link may be required to return to
> reliable operation. Platform handling of Fatal errors, and any efforts
> to limit the effects of these errors, is platform implementation specific.
>
> Link reset means setting the *secondary bus reset* bit in the PCI bridge
> config space, which resets the link and the device simultaneously; it is
> the strongest kind of reset as far as I know.

OK, well, it's been far too many years, and I don't have the PCI spec
at my fingertips.
Isn't there a link reset that can be performed, without forcing a device reset?

The intent was that some PCI link errors are due to vibration,
ground-bounce, humidity, etc. and that these errors can be detected
and do not corrupt the device state or the device driver state.  Since
they are not associated with data corruption (or rather, the
corruption is local to the link), these can be recovered by resetting
just the link, without resetting the whole adapter. They may require
resetting some device-driver state, but not all of it.

However, this was all decided before the PCI-E spec was written, so
maybe the newer PCI-E specs now say something different.

--linas


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:24     ` Linas Vepstas
  2016-12-09  6:37       ` Cao jin
@ 2016-12-09  6:50       ` Andrew Donnellan
  2016-12-14  2:39         ` Gavin Shan
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Donnellan @ 2016-12-09  6:50 UTC (permalink / raw)
  To: linasvepstas, Cao jin
  Cc: Jonathan Corbet, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas

On 09/12/16 17:24, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. It's huge.

Is there a difference in terminology between an AER fatal error and what 
EEH/IBM people think of as a fatal error?

> If the fatal error is on a pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)

Is this still true? I'm not at all familiar with the block device side 
of it, but the cxlflash driver has reasonably full EEH support, 
including surviving a full PHB fence and complete reset.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:44         ` Linas Vepstas
@ 2016-12-09  7:59           ` Cao jin
  2016-12-09 16:11           ` Alex Williamson
  1 sibling, 0 replies; 12+ messages in thread
From: Cao jin @ 2016-12-09  7:59 UTC (permalink / raw)
  To: linasvepstas
  Cc: Jonathan Corbet, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas



On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the pci adapter be completely
>>> reset, any adapter firmware to be reloaded from scratch, the device
>>> driver has to kill all device state and start from scratch. It's huge.
>>> If the fatal error is on a pci device that is under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared some few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFO's, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am a little confused too, and not even sure we are talking about the
>> same *fatal error*. I mean the fatal error defined in the PCI Express
>> spec, section 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means setting the *secondary bus reset* bit in the PCI bridge
>> config space, which resets the link and the device simultaneously; it is
>> the strongest kind of reset as far as I know.
> 
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
> 

At least I can't find wording in the spec that says so.

-- 
Sincerely,
Cao jin


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:37       ` Cao jin
  2016-12-09  6:44         ` Linas Vepstas
@ 2016-12-09 14:37         ` Jonathan Corbet
  2016-12-19  3:25           ` Cao jin
  1 sibling, 1 reply; 12+ messages in thread
From: Jonathan Corbet @ 2016-12-09 14:37 UTC (permalink / raw)
  To: Cao jin; +Cc: linasvepstas, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas

On Fri, 9 Dec 2016 14:37:47 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> I am a little confused too, and not even sure we are talking about the
> same *fatal error*. I mean the fatal error defined in the PCI Express
> spec, section 6.2.2.2.1:

Therein lies my original discomfort with the change; it didn't seem to
make sense to talk about recovering from a fatal error.  Perhaps making
it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
been detected that can be "solved" by resetting the link" or something
like that to make it clear how the term is being used?

Thanks,

jon


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:44         ` Linas Vepstas
  2016-12-09  7:59           ` Cao jin
@ 2016-12-09 16:11           ` Alex Williamson
  1 sibling, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2016-12-09 16:11 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Cao jin, Jonathan Corbet, linux-pci, linux-doc, linux-kernel,
	Bjorn Helgaas

On Fri, 9 Dec 2016 14:44:25 +0800
Linas Vepstas <linasvepstas@gmail.com> wrote:

> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >
> >
> > On 12/09/2016 02:24 PM, Linas Vepstas wrote:  
> >> I suppose I'm confused, but I recall that link resets are non-fatal.
> >> Fatal errors typically require that the pci adapter be completely
> >> reset, any adapter firmware to be reloaded from scratch, the device
> >> driver has to kill all device state and start from scratch. It's huge.
> >> If the fatal error is on a pci device that is under a block device
> >> holding a file system, then (usually) there is no way to recover,
> >> because the block layer (and file system) cannot deal with a block
> >> device that disappeared and then reappeared some few seconds later.
> >> (maybe some future zfs or lvm or btrfs might be able to deal with
> >> this, but not today)
> >>
> >> By contrast, link resets are far more gentle: the device driver might
> >> have to discard some half-full FIFO's, or cancel some in-flight
> >> commands, but can otherwise gracefully recover without telling the
> >> higher layers that there were any problems.
> >>
> >> --linas
> >>  
> >
> > I am a little confused too, and not even sure we are talking about the
> > same *fatal error*. I mean the fatal error defined in the PCI Express
> > spec, section 6.2.2.2.1:
> >
> > Fatal errors are uncorrectable error conditions which render the
> > particular Link and related hardware unreliable. For Fatal errors, a
> > reset of the components on the Link may be required to return to
> > reliable operation. Platform handling of Fatal errors, and any efforts
> > to limit the effects of these errors, is platform implementation specific.
> >
> > Link reset means setting the *secondary bus reset* bit in the PCI bridge
> > config space, which resets the link and the device simultaneously; it is
> > the strongest kind of reset as far as I know.
> 
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
> 
> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state.  Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by resetting
> just the link, without resetting the whole adapter. They may require
> resetting some device-driver state, but not all of it.
> 
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.

Perhaps you're thinking of link retraining?  That sort of error would
be considered correctable, not fatal.  Fatal errors are uncorrected
errors and a bigger hammer is needed to deal with them, such as a link
reset.  Thanks,

Alex
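
The three classes mentioned in this exchange (correctable, non-fatal,
fatal) can be told apart from the AER status and severity registers.
The following C sketch assumes the usual AER convention that a set bit
in the Uncorrectable Error Severity register marks the corresponding
uncorrectable error as fatal and a clear bit marks it as non-fatal. It
ignores the mask registers, and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification result. */
enum sketch_class {
	SKETCH_CLASS_NONE,
	SKETCH_CLASS_CORRECTABLE,
	SKETCH_CLASS_NONFATAL,
	SKETCH_CLASS_FATAL
};

/*
 * Classify the most severe pending error from raw register values:
 *   cor_status      Correctable Error Status
 *   uncor_status    Uncorrectable Error Status
 *   uncor_severity  Uncorrectable Error Severity (1 = fatal, 0 = non-fatal)
 */
enum sketch_class sketch_classify(uint32_t cor_status,
				  uint32_t uncor_status,
				  uint32_t uncor_severity)
{
	if (uncor_status & uncor_severity)
		return SKETCH_CLASS_FATAL;	/* a pending error is marked fatal */
	if (uncor_status)
		return SKETCH_CLASS_NONFATAL;	/* uncorrectable, but not fatal */
	if (cor_status)
		return SKETCH_CLASS_CORRECTABLE;
	return SKETCH_CLASS_NONE;
}
```

Only the SKETCH_CLASS_FATAL outcome would call for the "bigger hammer"
described above; the other classes leave the link usable.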


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09  6:50       ` Andrew Donnellan
@ 2016-12-14  2:39         ` Gavin Shan
  0 siblings, 0 replies; 12+ messages in thread
From: Gavin Shan @ 2016-12-14  2:39 UTC (permalink / raw)
  To: Andrew Donnellan
  Cc: linasvepstas, Cao jin, Jonathan Corbet, linux-pci, linux-doc,
	linux-kernel, Bjorn Helgaas

On Fri, Dec 09, 2016 at 05:50:17PM +1100, Andrew Donnellan wrote:
>On 09/12/16 17:24, Linas Vepstas wrote:
>>I suppose I'm confused, but I recall that link resets are non-fatal.
>>Fatal errors typically require that the pci adapter be completely
>>reset, any adapter firmware to be reloaded from scratch, the device
>>driver has to kill all device state and start from scratch. It's huge.
>
>Is there a difference in terminology between an AER fatal error and what
>EEH/IBM people think of as a fatal error?
>

They are different things. Depending on the PHB configuration, an AER
fatal error can lead to a frozen PE error, not a fenced PHB error.

>>If the fatal error is on a pci device that is under a block device
>>holding a file system, then (usually) there is no way to recover,
>>because the block layer (and file system) cannot deal with a block
>>device that disappeared and then reappeared some few seconds later.
>>(maybe some future zfs or lvm or btrfs might be able to deal with
>>this, but not today)
>
>Is this still true? I'm not at all familiar with the block device side of it,
>but the cxlflash driver has reasonably full EEH support, including surviving
>a full PHB fence and complete reset.
>

It's still true, especially when the recovery is going to affect the
rootfs. On completion of error recovery, the driver (if necessary)
and the filesystem need to be reloaded, which depends on a script or
daemon, and those are unavailable in this scenario.

Thanks,
Gavin


* Re: [PATCH] pci-error-recover: doc cleanup
  2016-12-09 14:37         ` Jonathan Corbet
@ 2016-12-19  3:25           ` Cao jin
  0 siblings, 0 replies; 12+ messages in thread
From: Cao jin @ 2016-12-19  3:25 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linasvepstas, linux-pci, linux-doc, linux-kernel, Bjorn Helgaas

Sorry for the late reply.

On 12/09/2016 10:37 PM, Jonathan Corbet wrote:
> On Fri, 9 Dec 2016 14:37:47 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> 
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
> 
> Therein lies my original discomfort with the change; it didn't seem to
> make sense to talk about recovering from a fatal error.  Perhaps making
> it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
> been detected that can be "solved" by resetting the link" or something
> like that to make it clear how the term is being used?
> 

I find that the .link_reset callback of struct pci_error_handlers isn't
called by anyone (unless I missed something), that only a few drivers
implement it, and that their implementations seem meaningless.

And the reset_link() provided by the AER driver seems to be a different
thing from the .link_reset callback. So I am guessing this patch is
probably not quite suitable, and the doc may need a complete update.
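
The observation above can be illustrated with a simplified user-space
model in C. The struct below only imitates the shape of a driver's
error-handler table; every sketch_* and demo_* name is hypothetical,
and the dispatcher is an assumption about the behaviour described in
this message, not a copy of the AER code. It shows how a handler table
can carry a link_reset entry that the recovery path never calls.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
	SKETCH_RESULT_RECOVERED,
	SKETCH_RESULT_NEED_RESET,
	SKETCH_RESULT_DISCONNECT
} sketch_result_t;

/* Hypothetical device that just counts callback invocations. */
struct sketch_dev {
	int detected_calls;
	int link_reset_calls;
	int slot_reset_calls;
	int resume_calls;
};

/* Simplified stand-in for a driver's error-handler table. */
struct sketch_error_handlers {
	sketch_result_t (*error_detected)(struct sketch_dev *dev);
	sketch_result_t (*link_reset)(struct sketch_dev *dev);
	sketch_result_t (*slot_reset)(struct sketch_dev *dev);
	void (*resume)(struct sketch_dev *dev);
};

sketch_result_t demo_error_detected(struct sketch_dev *dev)
{
	dev->detected_calls++;
	return SKETCH_RESULT_NEED_RESET;
}

sketch_result_t demo_link_reset(struct sketch_dev *dev)
{
	dev->link_reset_calls++;
	return SKETCH_RESULT_RECOVERED;
}

sketch_result_t demo_slot_reset(struct sketch_dev *dev)
{
	dev->slot_reset_calls++;
	return SKETCH_RESULT_RECOVERED;
}

void demo_resume(struct sketch_dev *dev)
{
	dev->resume_calls++;
}

const struct sketch_error_handlers demo_handlers = {
	.error_detected = demo_error_detected,
	.link_reset = demo_link_reset,
	.slot_reset = demo_slot_reset,
	.resume = demo_resume,
};

/* Recovery path of the model: link_reset is present but never invoked. */
void sketch_recover(const struct sketch_error_handlers *h,
		    struct sketch_dev *dev)
{
	sketch_result_t r = SKETCH_RESULT_RECOVERED;

	assert(h != NULL && dev != NULL);
	if (h->error_detected)
		r = h->error_detected(dev);
	if (r == SKETCH_RESULT_NEED_RESET && h->slot_reset)
		r = h->slot_reset(dev);
	if (r == SKETCH_RESULT_RECOVERED && h->resume)
		h->resume(dev);
}
```

After one pass through sketch_recover(), link_reset_calls is still
zero, so whatever a driver puts in that callback has no effect in this
model.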

-- 
Sincerely,
Cao jin
