[RFC] How drivers notice a HW error?

From: Hidetoshi Seto @ 2003-11-27  8:28 UTC
  To: Linux Kernel mailing list

Hi all,

This is a request for comments, especially comments from driver developers.

On some platforms, for example IA64, the chipset detects an error caused by a
driver's operation, such as an I/O read, and reports it to the kernel. The
Linux kernel analyzes the error and decides to kill the driver or, at worst,
to reboot. I want to convey the error information to the offending driver so
that the driver can recover from the failed operation.

So, as a tentative plan, I am thinking about a readb_check function that can
detect an error during a read and return an error value. Drivers could use
readb_check instead of the usual readb, and could determine from its return
value whether a retry is required.
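
For illustration, driver code using such a function might look like the
user-space sketch below. The signature of readb_check, the -EIO error value,
the fake_dev structure and the read_status helper are all assumptions made for
this example; nothing here is an existing kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated device register: the first `faults_left` reads fail. */
struct fake_dev {
    uint8_t reg;
    int faults_left;
};

/*
 * Hypothetical signature (an assumption, not a real kernel API):
 * returns 0 and stores the byte in *val on success, or -EIO if the
 * platform reported an error for this read.
 */
static int readb_check(struct fake_dev *dev, uint8_t *val)
{
    assert(dev != NULL && val != NULL);
    if (dev->faults_left > 0) {
        dev->faults_left--;
        return -EIO;
    }
    *val = dev->reg;
    return 0;
}

/* Driver side: retry a bounded number of times, then give up. */
static int read_status(struct fake_dev *dev, uint8_t *val, int max_tries)
{
    int err = -EIO;

    for (int i = 0; i < max_tries; i++) {
        err = readb_check(dev, val);
        if (err == 0)
            break;
    }
    return err;
}
```

The retry limit is a driver policy choice: a transient error may clear on the
next read, while a persistent one should be reported upward instead of being
retried forever.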

To realize this, I am considering the following two designs:

+ readb_check in the driver (with a notifier)
  [Outline]:
    - The hardware error handler (on IA64, for example, the MCA handler) has a
      notifier as a hook point.
    - A driver may register a hook function with the notifier.
    - The notifier calls the registered functions when an error occurs.
    - A called hook function checks the address of the error and, if the error
      seems to be related to its own driver, raises an internal error flag and
      stops the notifier by returning OK.
    - The hardware error handler looks at the result of the notifier and
      decides whether the system should resume.
    - The resumed driver may check the error flag after a read, and may retry
      the read if the flag is set.
  [Issue]:
    - Some interfaces, such as one to register hooks, would be required.
    - Writing a hook function would be a burden on driver developers.
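
The notifier flow above can be modeled in user space roughly as follows. This
is only a sketch: hw_err_hook, hw_err_register, hw_err_notify, range_hook and
the HOOK_* values are hypothetical names invented for the example, and a real
handler would also have to deal with locking and with running in machine-check
context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOOK_OK      1   /* error claimed by this driver: stop the chain */
#define HOOK_IGNORED 0   /* not ours: keep walking the chain */
#define MAX_HOOKS    8

/* One registration per driver (hypothetical structure). */
struct hw_err_hook {
    int (*fn)(struct hw_err_hook *self, uintptr_t err_addr);
    uintptr_t base;      /* start of the driver's I/O region */
    size_t len;          /* length of that region in bytes */
    bool err_flag;       /* raised by the hook, checked by the driver */
};

static struct hw_err_hook *hooks[MAX_HOOKS];
static size_t nr_hooks;

/* Driver registers its hook with the notifier. */
static int hw_err_register(struct hw_err_hook *h)
{
    assert(h != NULL && h->fn != NULL);
    if (nr_hooks == MAX_HOOKS)
        return -1;
    hooks[nr_hooks++] = h;
    return 0;
}

/* Called by the error handler; true means the system may resume. */
static bool hw_err_notify(uintptr_t err_addr)
{
    for (size_t i = 0; i < nr_hooks; i++)
        if (hooks[i]->fn(hooks[i], err_addr) == HOOK_OK)
            return true;
    return false;        /* nobody claimed the error */
}

/* Typical hook: claim the error if the address is in our region. */
static int range_hook(struct hw_err_hook *self, uintptr_t err_addr)
{
    if (err_addr < self->base || err_addr - self->base >= self->len)
        return HOOK_IGNORED;
    self->err_flag = true;
    return HOOK_OK;
}

/* Driver side: test and clear the flag after a read. */
static bool driver_read_failed(struct hw_err_hook *h)
{
    bool failed = h->err_flag;

    h->err_flag = false;
    return failed;
}
```

A driver would call driver_read_failed() right after each readb and retry the
read when it returns true. Every driver has to supply its own range check,
which is the burden mentioned under [Issue].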

+ readb_check in the kernel
  [Outline]:
    - The kernel provides a readb_check function.
    - Drivers may use readb_check instead of the usual readb.
    - The hardware error handler checks the address of the error and, if it
      occurred in readb_check, changes the return value of readb_check and
      resumes the interrupted context.
    - The driver may check the return value to notice an error in the last
      read.
  [Issue]:
    - Some overhead would be involved. (Possibly it is negligible, since
      I/O reads are already horribly slow.)
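
The handler-side check in this second design could be modeled like this. It is
a user-space sketch under assumptions: saved_ctx, text_range and hw_err_fixup
are hypothetical names, "address of error" is taken here to mean the address
of the interrupted instruction, and where exactly execution resumes is
platform specific and not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified saved state of the interrupted code (hypothetical). */
struct saved_ctx {
    uintptr_t pc;        /* address of the interrupted instruction */
    long retval;         /* register that carries the function result */
};

/* Address range [start, end) occupied by the body of readb_check. */
struct text_range {
    uintptr_t start;
    uintptr_t end;
};

/*
 * Called by the hardware error handler. If the error hit inside
 * readb_check, patch the return value so the caller sees -EIO and
 * report that the interrupted context may be resumed.
 */
static bool hw_err_fixup(struct saved_ctx *ctx, const struct text_range *rc)
{
    assert(ctx != NULL && rc != NULL);
    if (ctx->pc < rc->start || ctx->pc >= rc->end)
        return false;    /* outside readb_check: not recoverable here */
    ctx->retval = -EIO;
    return true;
}
```

With this approach the driver needs no hook of its own; the only cost on the
normal path is whatever readb_check adds around the plain read.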

IMO, this is a general-purpose function that should be available on many
platforms. I have also heard that Solaris has similar implementations.

If you have any comments on this feature, or a different idea, please tell me.

Best regards,

------

H.Seto <seto.hidetoshi@jp.fujitsu.com>
