linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] How drivers notice a HW error?
@ 2003-11-27  8:28 Hidetoshi Seto
  0 siblings, 0 replies; 2+ messages in thread
From: Hidetoshi Seto @ 2003-11-27  8:28 UTC (permalink / raw)
  To: Linux Kernel mailing list

Hi all,

This is a request for comments, especially comments from driver developers.

On some platform, for example IA64, the chipset detects an error caused by
driver's operation such as I/O read, and reports it to kernel. Linux kernel
analyzes the error and decides to kill the driver or reboot at worst.
I want to convey the error information to the offending driver, and want to
enable the driver to recover the failed operation.

So, just a plan, I think about a readb_check function that has checking ability
enable it to return error value if error is occurred on read. Drivers could use
readb_check instead of usual readb, and could diagnosis whether a retry be
required or not, by the return value of readb_check.

To realize this, I consider following two images:

+ readb_check on driver (with Notifier)
  [Outline]:
    - Hardware error handler (for example in IA64, MCA handler) has a Notifier
      as hook point.
    - Driver may register a hook function to the Notifier.
    - Notifier calls over registered functions when error is occurred.
    - Called hook function checks address of error, and if the error seems
      to be concerned with the parent driver, ups internal error flag and
      stops Notifier by returning OK.
    - Hardware error handler regards state of Notifier, and decides the system
      to resume or not.
    - Restarted driver may refer the error flag after read, and may retry the
      read if flag is up.
  [Issue]:
    - Some interfaces such as register hooks would be required.
    - Coding a hook function would be a bother of developers.

+ readb_check on kernel
  [Outline]:
    - Kernel has readb_check function.
    - Drivers may use readb_check instead of usual readb.
    - Hardware error handler checks address of error, and if it occurs in
      readb_check, changes return value of readb_check and resumes
      interrupted context.
    - Driver may refer the return value to notice an error in last read
      procedure.
  [Issue]:
    - Overhead would be involved. (Possibly, it could say negligible since
      I/O reads are already horribly slow.)

IMO, this is a general-purpose function that should be available on many
platforms. I also hear that Solaris has some similar implementations like this.

If you have any comment about this feature or any idea different from this,
please tell me.


Best regards,

------

H.Seto <seto.hidetoshi@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [RFC] How drivers notice a HW error?
       [not found] <WpR1.1LG.3@gated-at.bofh.it>
@ 2003-11-27 11:37 ` Andi Kleen
  0 siblings, 0 replies; 2+ messages in thread
From: Andi Kleen @ 2003-11-27 11:37 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: linux-kernel

Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:

> On some platform, for example IA64, the chipset detects an error caused by
> driver's operation such as I/O read, and reports it to kernel. Linux kernel
> analyzes the error and decides to kill the driver or reboot at worst.
> I want to convey the error information to the offending driver, and want to
> enable the driver to recover the failed operation.
>A
> So, just a plan, I think about a readb_check function that has checking ability
> enable it to return error value if error is occurred on read. Drivers could use
> readb_check instead of usual readb, and could diagnosis whether a retry be
> required or not, by the return value of readb_check.

I don't think that's an good portable API. On many architectures it is hard to 
associate an MCE with an specific instruction because the MCE 
happnes asynchronously. All the MCE handler gets is an address. Also
adding error checks to every read* would make the driver source quite 
unreadable.

Also I think most drivers would not attempt to specially handle every
access but just implement a generic handler that shutdowns the device
(otherwise it would be a testing nightmare). 

So better would be:

Add a callback to the pci_dev/device. When an error occurs in a mmio
area associated with a driver call that callback.

Add another function to register other memory areas (in case a driver
does mmio not visible in PCI config)  for error handling.

-Andi


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2003-11-27 11:38 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-11-27  8:28 [RFC] How drivers notice a HW error? Hidetoshi Seto
     [not found] <WpR1.1LG.3@gated-at.bofh.it>
2003-11-27 11:37 ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).