All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V4] xfs: Document error handlers behavior
@ 2016-09-13  9:03 Carlos Maiolino
  2016-09-14  1:23 ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Carlos Maiolino @ 2016-09-13  9:03 UTC (permalink / raw)
  To: linux-xfs, xfs

Document the implementation of error handlers into sysfs.


Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
Changelog:

V2:
	- Add a description of the precedence order of each option, focusing on
	  the behavior of "fail_at_unmount" which was not well explained in V1

V3:
	- Fix English spelling mistakes suggested by Eric

V4:
	- Typo mistakes, document ENODEV default value for max_retries, fix
	  directories's hierarchy description

 Documentation/filesystems/xfs.txt | 75 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 8146e9f..374af3b 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -348,3 +348,78 @@ Removed Sysctls
   ----				-------
   fs.xfs.xfsbufd_centisec	v4.0
   fs.xfs.age_buffer_centisecs	v4.0
+
+Error handling
+==============
+
+XFS can act differently according to the type of error found
+during its operation. The implementation introduces the following
+concepts to the error handler:
+
+ -failure speed:
+	Defines how fast XFS should shut down when a specific error is found
+	during the filesystem operation. It can shut down immediately, after a
+	defined number of retries, after a set time period, or simply retry
+	forever. The old "retry forever" behavior is still the default, except
+	during unmount, where any IOs retrying due to errors will be cancelled
+	and unmount will be allowed to proceed.
+
+ -error classes:
+	Specifies the subsystem/location of the error handlers, such as
+	metadata or memory allocation. Only metadata IO errors are handled
+	at this time.
+
+ -error handlers:
+	Defines the behavior for a specific error.
+
+The filesystem behavior during an error can be set via sysfs files, Each
+configuration option works independently, the first condition met for a
+specific configuration will cause the filesystem to shut down.
+
+The configuration files are organized into the following hierarchy:
+
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+Each directory contains:
+
+ /sys/fs/xfs/<dev>/error/
+
+	fail_at_unmount		(Min:  0  Default:  1  Max: 1)
+		Defines the global error behavior at unmount time. If set to the
+		default value of 1, XFS will cancel any pending IO retries, shut
+		down, and unmount. If set to 0, pending IO retries may prevent
+		the filesystem from unmounting.
+
+	<class> subdirectories
+		Contains specific error handlers configuration
+		(Ex: /sys/fs/xfs/<dev>/error/metadata, see below).
+
+ /sys/fs/xfs/<dev>/error/<class>/
+
+	Directory containing configuration for a specific error <class>;
+	currently only the "metadata" <class> is implemented.
+	The contents of this directory are <class> specific, since each <class>
+	might need to handle different types of errors.
+
+ /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+	Contains the failure speed configuration files for specific errors in
+	this <class>, as well as a "default" behavior. Each <error> directory
+	contains the following configuration files:
+
+	max_retries			(Min: -1  Default: -1  Max: INTMAX)
+		Defines the allowed number of retries of a specific error before
+		the filesystem will shut down.  The default value of "-1" will
+		cause XFS to retry forever for this specific error.  Setting it
+		to "0" will cause XFS to fail immediately when the specific
+		error is found, and setting it to "N," where N is greater than 0,
+		will make XFS retry "N" times before shutting down.
+		Default value for ENODEV error is set to '0', once there is no
+		reason to keep retrying if the device is gone.
+
+	retry_timeout_seconds		(Min:  0  Default:  0  Max: 1 day)
+		Define the amount of time (in seconds) that the filesystem is
+		allowed to retry its operations when the specific error is
+		found. The default value of "0" will cause XFS to retry forever.
+
+	
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-13  9:03 [PATCH V4] xfs: Document error handlers behavior Carlos Maiolino
@ 2016-09-14  1:23 ` Dave Chinner
  2016-09-14 10:02   ` Carlos Maiolino
  2016-09-14 15:10   ` Eric Sandeen
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2016-09-14  1:23 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: linux-xfs, xfs

On Tue, Sep 13, 2016 at 05:03:05AM -0400, Carlos Maiolino wrote:
> Document the implementation of error handlers into sysfs.
> 
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
> Changelog:
> 
> V2:
> 	- Add a description of the precedence order of each option, focusing on
> 	  the behavior of "fail_at_unmount" which was not well explained in V1
> 
> V3:
> 	- Fix English spelling mistakes suggested by Eric
> 
> V4:
> 	- Typo mistakes, document ENODEV default value for max_retries, fix
> 	  directories's hierarchy description

Ok, I had to update this for the change in retry timeout values from
Eric, so I went and fixed all the other things I thought needed
fixing, too. New patch below....

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: Document error handlers behavior

From: Carlos Maiolino <cmaiolino@redhat.com>

Document the implementation of error handlers into sysfs.

[dchinner: significant update:
	- removed examples from concept descriptions, placed them in
	  appropriate detailed descriptions instead
	- added explanations for <dev>, <class> and <error> strings
	  in sysfs layout description
	- added specific definition of "global" per-filesystem error
	  configuration parameters.
	- reformatted to remove multiple indents
	- added more information about fail_at_unmount behaviour and
	  constraints
	- added comment that there is a "default" handler to
	  configure behaviour for all errors that don't have
	  specific handlers defined.
	- added specific handler value explanations
	- added note about handlers having context specific
	  defaults with example. ]

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>

---
 Documentation/filesystems/xfs.txt | 125 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 8146e9f..705d064 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -348,3 +348,128 @@ Removed Sysctls
   ----				-------
   fs.xfs.xfsbufd_centisec	v4.0
   fs.xfs.age_buffer_centisecs	v4.0
+
+
+Error handling
+==============
+
+XFS can act differently according to the type of error found during its
+operation. The implementation introduces the following concepts to the error
+handler:
+
+ -failure speed:
+	Defines how fast XFS should propagate an error upwards when a specific
+	error is found during the filesystem operation. It can propagate
+	immediately, after a defined number of retries, after a set time period,
+	or simply retry forever.
+
+ -error classes:
+	Specifies the subsystem the error configuration will apply to, such as
+	metadata IO or memory allocation. Different subsystems will have
+	different error handlers for which behaviour can be configured.
+
+ -error handlers:
+	Defines the behavior for a specific error.
+
+The filesystem behavior during an error can be set via sysfs files, Each
+error handler works independently, the first condition met by and error handler
+for a specific class will cause the error to be propagated rather than reset and
+retried.
+
+The action taken by the filesystem when the error is propagated is context
+dependent - it may cause a shut down in the case of an unrecoverable error,
+it may be reported back to userspace, or it may even be ignored because
+there's nothing useful we can with the error or anyone we can report it to (e.g.
+during unmount).
+
+The configuration files are organized into the following per-mounted filesystem
+hierarchy:
+
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+Where:
+  <dev>
+	The short device name of the mounted filesystem. This is the same device
+	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
+
+  <class>
+	The subsystem the error configuration belongs to. As of 4.9, the defined
+	classes are:
+
+		- "metadata": applies metadata buffer write IO
+
+  <error>
+	The individual error handler configurations.
+
+
+Each filesystem has "global" error configuration options defined in their top
+level directory:
+
+  /sys/fs/xfs/<dev>/error/
+
+  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
+	Defines the filesystem error behavior at unmount time.
+
+	If set to a value of 1, XFS will override all other error configurations
+	during unmount and replace them with "immediate fail" characteristics.
+	i.e. no retries, no retry timeout. This will always allow unmount to
+	succeed when there are persistent errors present.
+
+	If set to 0, the configured retry behaviour will continue until all
+	retries and/or timeouts have been exhausted. This will delay unmount
+	completion when there are persistent errors, and it may prevent the
+	filesystem from ever unmounting fully in the case of "retry forever"
+	handler configurations.
+
+	Note: there is no guarantee that fail_at_unmount can be set whilst an
+	unmount is in progress. It is possible that the sysfs entries are
+	removed by the unmounting filesystem before a "retry forever" error
+	handler configuration causes unmount to hang, and hence the filesystem
+	must be configured appropriately before unmount begins to prevent
+	unmount hangs.
+
+Each filesystem has specific error class handlers that define the error
+propagation behaviour for specific errors. There is also a "default" error
+handler defined, which defines the behaviour for all errors that don't have
+specific handlers defined. The handler configurations are found in the
+directory:
+
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
+	Defines the allowed number of retries of a specific error before
+	the filesystem will propagate the error. The retry count for a given
+	error context (e.g. a specific metadata buffer) is reset ever time there
+	is a successful completion of the operation.
+
+	Setting the value to "-1" will cause XFS to retry forever for this
+	specific error.
+
+	Setting the value to "0" will cause XFS to fail immediately when the
+	specific error is reported.
+
+	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
+	operation "N" times before propagating the error.
+
+  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
+	Define the amount of time (in seconds) that the filesystem is
+	allowed to retry its operations when the specific error is
+	found.
+
+	Setting the value to "-1" will set an infinite timeout, causing
+	error propagation behaviour to be determined solely by the "max_retries"
+	parameter.
+
+	Setting the value to "0" will cause XFS to fail immediately when the
+	specific error is reported.
+
+	Setting the value to  "N" (where 0 < N < Max) will propagate the error
+	on the first retry that fails at least "N" seconds after the first error
+	was detected, unless the number of retries defined by max_retries
+	expires first.
+
+Note: The default behaviour for a specific error handler is dependent on both
+the class and error context. For example, the default values for
+"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
+to "fail immediately" behaviour. This is done because ENODEV is a fatal,
+unrecoverable error no matter how many times the metadata IO is retried.

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-14  1:23 ` Dave Chinner
@ 2016-09-14 10:02   ` Carlos Maiolino
  2016-09-14 22:09     ` Dave Chinner
  2016-09-14 15:10   ` Eric Sandeen
  1 sibling, 1 reply; 8+ messages in thread
From: Carlos Maiolino @ 2016-09-14 10:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, xfs

On Wed, Sep 14, 2016 at 11:23:34AM +1000, Dave Chinner wrote:
> Ok, I had to update this for the change in retry timeout values from
> Eric, so I went and fixed all the other things I thought needed
> fixing, too. New patch below....
> 

Hi, thanks, this looks good to me, with one exception described below.

> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: Document error handlers behavior
> 
> From: Carlos Maiolino <cmaiolino@redhat.com>
> 
> + -error handlers:
> +	Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files, Each
> +error handler works independently, the first condition met by and error handler
> +for a specific class will cause the error to be propagated rather than reset and
> +retried.
> +
> +The action taken by the filesystem when the error is propagated is context
> +dependent - it may cause a shut down in the case of an unrecoverable error,
> +it may be reported back to userspace, or it may even be ignored because
> +there's nothing useful we can with the error or anyone we can report it to (e.g.

"there's nothing useful we can do with the error"

> +during unmount).

Also, I apologize if I misunderstand it, but being ignored doesn't look a proper
description here, it sounds to me something like 'we ignore the error and tell
nobody about it", in unmount example, we shut down the filesystem if any error
happens, for me it doesn't sound like ignoring an error, but I might be
interpreting it in the wrong way.


> +
> +The configuration files are organized into the following per-mounted filesystem
> +hierarchy:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +Where:
> +  <dev>
> +	The short device name of the mounted filesystem. This is the same device
> +	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
> +
> +  <class>
> +	The subsystem the error configuration belongs to. As of 4.9, the defined
> +	classes are:
> +
> +		- "metadata": applies metadata buffer write IO
> +
> +  <error>
> +	The individual error handler configurations.
> +
> +
> +Each filesystem has "global" error configuration options defined in their top
> +level directory:
> +
> +  /sys/fs/xfs/<dev>/error/
> +
> +  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> +	Defines the filesystem error behavior at unmount time.
> +
> +	If set to a value of 1, XFS will override all other error configurations
> +	during unmount and replace them with "immediate fail" characteristics.
> +	i.e. no retries, no retry timeout. This will always allow unmount to
> +	succeed when there are persistent errors present.
> +
> +	If set to 0, the configured retry behaviour will continue until all
> +	retries and/or timeouts have been exhausted. This will delay unmount
> +	completion when there are persistent errors, and it may prevent the
> +	filesystem from ever unmounting fully in the case of "retry forever"
> +	handler configurations.
> +
> +	Note: there is no guarantee that fail_at_unmount can be set whilst an
> +	unmount is in progress. It is possible that the sysfs entries are
> +	removed by the unmounting filesystem before a "retry forever" error
> +	handler configuration causes unmount to hang, and hence the filesystem
> +	must be configured appropriately before unmount begins to prevent
> +	unmount hangs.
> +
> +Each filesystem has specific error class handlers that define the error
> +propagation behaviour for specific errors. There is also a "default" error
> +handler defined, which defines the behaviour for all errors that don't have
> +specific handlers defined. The handler configurations are found in the
> +directory:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
> +	Defines the allowed number of retries of a specific error before
> +	the filesystem will propagate the error. The retry count for a given
> +	error context (e.g. a specific metadata buffer) is reset ever time there
> +	is a successful completion of the operation.
> +
> +	Setting the value to "-1" will cause XFS to retry forever for this
> +	specific error.
> +
> +	Setting the value to "0" will cause XFS to fail immediately when the
> +	specific error is reported.
> +
> +	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> +	operation "N" times before propagating the error.
> +
> +  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
> +	Define the amount of time (in seconds) that the filesystem is
> +	allowed to retry its operations when the specific error is
> +	found.
> +
> +	Setting the value to "-1" will set an infinite timeout, causing
> +	error propagation behaviour to be determined solely by the "max_retries"
> +	parameter.
> +
> +	Setting the value to "0" will cause XFS to fail immediately when the
> +	specific error is reported.
> +
> +	Setting the value to  "N" (where 0 < N < Max) will propagate the error
> +	on the first retry that fails at least "N" seconds after the first error
> +	was detected, unless the number of retries defined by max_retries
> +	expires first.
> +
> +Note: The default behaviour for a specific error handler is dependent on both
> +the class and error context. For example, the default values for
> +"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
> +to "fail immediately" behaviour. This is done because ENODEV is a fatal,
> +unrecoverable error no matter how many times the metadata IO is retried.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

-- 
Carlos

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-14  1:23 ` Dave Chinner
  2016-09-14 10:02   ` Carlos Maiolino
@ 2016-09-14 15:10   ` Eric Sandeen
  2016-09-14 22:22     ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2016-09-14 15:10 UTC (permalink / raw)
  To: Dave Chinner, Carlos Maiolino; +Cc: linux-xfs, xfs

On 9/13/16 8:23 PM, Dave Chinner wrote:

> Ok, I had to update this for the change in retry timeout values from
> Eric, so I went and fixed all the other things I thought needed
> fixing, too. New patch below....
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: Document error handlers behavior
> 
> From: Carlos Maiolino <cmaiolino@redhat.com>
> 
> Document the implementation of error handlers into sysfs.
> 
> [dchinner: significant update:
> 	- removed examples from concept descriptions, placed them in
> 	  appropriate detailed descriptions instead
> 	- added explanations for <dev>, <class> and <error> strings
> 	  in sysfs layout description
> 	- added specific definition of "global" per-filesystem error
> 	  configuration parameters.
> 	- reformatted to remove multiple indents
> 	- added more information about fail_at_unmount behaviour and
> 	  constraints
> 	- added comment that there is a "default" handler to
> 	  configure behaviour for all errors that don't have
> 	  specific handlers defined.
> 	- added specific handler value explanations
> 	- added note about handlers having context specific
> 	  defaults with example. ]
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> ---
>  Documentation/filesystems/xfs.txt | 125 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 125 insertions(+)
> 
> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> index 8146e9f..705d064 100644
> --- a/Documentation/filesystems/xfs.txt
> +++ b/Documentation/filesystems/xfs.txt
> @@ -348,3 +348,128 @@ Removed Sysctls
>    ----				-------
>    fs.xfs.xfsbufd_centisec	v4.0
>    fs.xfs.age_buffer_centisecs	v4.0
> +
> +
> +Error handling
> +==============
> +
> +XFS can act differently according to the type of error found during its
> +operation. The implementation introduces the following concepts to the error
> +handler:
> +
> + -failure speed:
> +	Defines how fast XFS should propagate an error upwards when a specific
> +	error is found during the filesystem operation. It can propagate
> +	immediately, after a defined number of retries, after a set time period,
> +	or simply retry forever.
> +
> + -error classes:
> +	Specifies the subsystem the error configuration will apply to, such as
> +	metadata IO or memory allocation. Different subsystems will have
> +	different error handlers for which behaviour can be configured.
> +
> + -error handlers:
> +	Defines the behavior for a specific error.
> +
> +The filesystem behavior during an error can be set via sysfs files, Each

files.  Each error ...

> +error handler works independently, the first condition met by and error handler

works independently - the firest condition met by /an/ error handler ...

> +for a specific class will cause the error to be propagated rather than reset and
> +retried.
> +
> +The action taken by the filesystem when the error is propagated is context
> +dependent - it may cause a shut down in the case of an unrecoverable error,
> +it may be reported back to userspace, or it may even be ignored because
> +there's nothing useful we can with the error or anyone we can report it to (e.g.
> +during unmount).
> +
> +The configuration files are organized into the following per-mounted filesystem
> +hierarchy:

... into the following hierarchy for each mounted filesystem:

> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +Where:
> +  <dev>
> +	The short device name of the mounted filesystem. This is the same device
> +	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
> +
> +  <class>
> +	The subsystem the error configuration belongs to. As of 4.9, the defined
> +	classes are:
> +
> +		- "metadata": applies metadata buffer write IO
> +
> +  <error>
> +	The individual error handler configurations.
> +
> +
> +Each filesystem has "global" error configuration options defined in their top
> +level directory:
> +
> +  /sys/fs/xfs/<dev>/error/
> +
> +  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
> +	Defines the filesystem error behavior at unmount time.
> +
> +	If set to a value of 1, XFS will override all other error configurations
> +	during unmount and replace them with "immediate fail" characteristics.
> +	i.e. no retries, no retry timeout. This will always allow unmount to
> +	succeed when there are persistent errors present.
> +
> +	If set to 0, the configured retry behaviour will continue until all
> +	retries and/or timeouts have been exhausted. This will delay unmount
> +	completion when there are persistent errors, and it may prevent the
> +	filesystem from ever unmounting fully in the case of "retry forever"
> +	handler configurations.
> +
> +	Note: there is no guarantee that fail_at_unmount can be set whilst an
> +	unmount is in progress. It is possible that the sysfs entries are
> +	removed by the unmounting filesystem before a "retry forever" error
> +	handler configuration causes unmount to hang, and hence the filesystem
> +	must be configured appropriately before unmount begins to prevent
> +	unmount hangs.
> +
> +Each filesystem has specific error class handlers that define the error
> +propagation behaviour for specific errors. There is also a "default" error
> +handler defined, which defines the behaviour for all errors that don't have
> +specific handlers defined. The handler configurations are found in the
> +directory:
> +
> +  /sys/fs/xfs/<dev>/error/<class>/<error>/
> +
> +  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
> +	Defines the allowed number of retries of a specific error before
> +	the filesystem will propagate the error. The retry count for a given
> +	error context (e.g. a specific metadata buffer) is reset ever time there

every time there ...

> +	is a successful completion of the operation.
> +
> +	Setting the value to "-1" will cause XFS to retry forever for this
> +	specific error.
> +
> +	Setting the value to "0" will cause XFS to fail immediately when the
> +	specific error is reported.
> +
> +	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> +	operation "N" times before propagating the error.
> +
> +  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
> +	Define the amount of time (in seconds) that the filesystem is
> +	allowed to retry its operations when the specific error is
> +	found.
> +
> +	Setting the value to "-1" will set an infinite timeout, causing
> +	error propagation behaviour to be determined solely by the "max_retries"
> +	parameter.

This is asymmetric; if you want this, then max_retries should probably say that
-1 will cause the behavior to be determined solely by retry_timeout_seconds...

Could also say "removing any time limits on retries."  (and above, "removing any
count limits on retries.)

But that's already covered by "the first condition met by ..., " really.

> +
> +	Setting the value to "0" will cause XFS to fail immediately when the
> +	specific error is reported.
> +
> +	Setting the value to  "N" (where 0 < N < Max) will propagate the error
> +	on the first retry that fails at least "N" seconds after the first error
> +	was detected, unless the number of retries defined by max_retries
> +	expires first.

Same issue here, really; they are symmetric, right?  First condition met for
propagation propagates the error, period.  This sounds overly complex, unless
I'm missing something. Seems like:

+	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
+	operation for "N" seconds before propagating the error.

would suffice, no?

> +
> +Note: The default behaviour for a specific error handler is dependent on both
> +the class and error context. For example, the default values for
> +"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
> +to "fail immediately" behaviour. This is done because ENODEV is a fatal,
> +unrecoverable error no matter how many times the metadata IO is retried.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-14 10:02   ` Carlos Maiolino
@ 2016-09-14 22:09     ` Dave Chinner
  2016-09-15  9:18       ` Carlos Maiolino
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2016-09-14 22:09 UTC (permalink / raw)
  To: linux-xfs, xfs

On Wed, Sep 14, 2016 at 12:02:19PM +0200, Carlos Maiolino wrote:
> On Wed, Sep 14, 2016 at 11:23:34AM +1000, Dave Chinner wrote:
> > Ok, I had to update this for the change in retry timeout values from
> > Eric, so I went and fixed all the other things I thought needed
> > fixing, too. New patch below....
> > 
> 
> Hi, thanks, this looks good to me, with one exception described below.
> 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > xfs: Document error handlers behavior
> > 
> > From: Carlos Maiolino <cmaiolino@redhat.com>
> > 
> > + -error handlers:
> > +	Defines the behavior for a specific error.
> > +
> > +The filesystem behavior during an error can be set via sysfs files, Each
> > +error handler works independently, the first condition met by and error handler
> > +for a specific class will cause the error to be propagated rather than reset and
> > +retried.
> > +
> > +The action taken by the filesystem when the error is propagated is context
> > +dependent - it may cause a shut down in the case of an unrecoverable error,
> > +it may be reported back to userspace, or it may even be ignored because
> > +there's nothing useful we can with the error or anyone we can report it to (e.g.
> 
> "there's nothing useful we can do with the error"
> 
> > +during unmount).
> 
> Also, I apologize if I misunderstand it, but being ignored doesn't look a proper
> description here, it sounds to me something like 'we ignore the error and tell
> nobody about it", in unmount example, we shut down the filesystem if any error
> happens, for me it doesn't sound like ignoring an error, but I might be
> interpreting it in the wrong way.

I think you're making the assumption that the only way we handle
errors once retries are exhausted is to trigger a filesystem shutdown.
That assumption was repeated throughout the documentation.

While that may be true for /metadata write IO errors/, it is not
true for the generic error handling case. e.g. if we extend it to
memory allocation contexts, we may end up returning ENOMEM to
userspace. Or, in certain contexts, we might be able to fall back to
doing a single operation at a time using the stack for storage, in
which case there is no reason at all to report the allocation
failure to anyone.

The infrastructure is generic, as is the documentation, and so it
shouldn't assume anything about what is going to happen once the
retries are exhausted and the error is propagated upwards. What
happens with that error after it is returned is a subsystem and
context dependent behaviour, not something that is defined by the
error retry configuration infrastructure....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-14 15:10   ` Eric Sandeen
@ 2016-09-14 22:22     ` Dave Chinner
  2016-09-14 22:31       ` Eric Sandeen
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2016-09-14 22:22 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Carlos Maiolino, linux-xfs, xfs

On Wed, Sep 14, 2016 at 10:10:05AM -0500, Eric Sandeen wrote:
> On 9/13/16 8:23 PM, Dave Chinner wrote:
....
> > +  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
> > +	Defines the allowed number of retries of a specific error before
> > +	the filesystem will propagate the error. The retry count for a given
> > +	error context (e.g. a specific metadata buffer) is reset ever time there
> 
> every time there ...
> 
> > +	is a successful completion of the operation.
> > +
> > +	Setting the value to "-1" will cause XFS to retry forever for this
> > +	specific error.
> > +
> > +	Setting the value to "0" will cause XFS to fail immediately when the
> > +	specific error is reported.
> > +
> > +	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> > +	operation "N" times before propagating the error.
> > +
> > +  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
> > +	Define the amount of time (in seconds) that the filesystem is
> > +	allowed to retry its operations when the specific error is
> > +	found.
> > +
> > +	Setting the value to "-1" will set an infinite timeout, causing
> > +	error propagation behaviour to be determined solely by the "max_retries"
> > +	parameter.
> 
> This is asymmetric; if you want this, then max_retries should probably say that
> -1 will cause the behavior to be determined solely by retry_timeout_seconds...

I could, but...

> Could also say "removing any time limits on retries."  (and above, "removing any
> count limits on retries.)
> 
> But that's already covered by "the first condition met by ..., " really.

this covers it so I can probably just remove it.
> 
> > +
> > +	Setting the value to "0" will cause XFS to fail immediately when the
> > +	specific error is reported.
> > +
> > +	Setting the value to  "N" (where 0 < N < Max) will propagate the error
> > +	on the first retry that fails at least "N" seconds after the first error
> > +	was detected, unless the number of retries defined by max_retries
> > +	expires first.
> 
> Same issue here, really; they are symmetric, right?  First condition met for
> propagation propagates the error, period.  This sounds overly complex, unless
> I'm missing something. Seems like:
> 
> +	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
> +	operation for "N" seconds before propagating the error.
> 
> would suffice, no?

No, because that's not what the implementation does:

	if (retries expired)
		fail
	if (retry timer expired)
		fail

IOWs, the retry count has precedence over the retry timer. if you
set both retry_timeout and max_retries, the timeout only takes
effect if max retries is set high enough that they aren't exhausted
before the timeout fires.

This is for the case where an failure might take a variable time to
report.  (Think interactions with errors that TLER would address).
Normally you might say 10 retries, but if it is taking 5 minutes to
then fail when this specific error condition is hit, you might set a retry
timeout of 1 minute. In that case, we might get an immediate IO
error and retry several times before failing. However, if we hit the
"slow to report" error, we still get failure in the same time frame
as the immediate failures that have been retried many times before
giving up.

It's hard to explain complex stuff like with a simple, concise
description. I'll try again....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-14 22:22     ` Dave Chinner
@ 2016-09-14 22:31       ` Eric Sandeen
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2016-09-14 22:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Carlos Maiolino, xfs



On 9/14/16 5:22 PM, Dave Chinner wrote:
>> Same issue here, really; they are symmetric, right?  First condition met for
>> > propagation propagates the error, period.  This sounds overly complex, unless
>> > I'm missing something. Seems like:
>> > 
>> > +	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
>> > +	operation for "N" seconds before propagating the error.
>> > 
>> > would suffice, no?
> No, because that's not what the implementation does:
> 
> 	if (retries expired)
> 		fail
> 	if (retry timer expired)
> 		fail
> 
> IOWs, the retry count has precedence over the retry timer. if you
> set both retry_timeout and max_retries, the timeout only takes
> effect if max retries is set high enough that they aren't exhausted
> before the timeout fires.

Then:

+	Setting the value to "N" (where 0 < N < Max) will /allow/ XFS to retry
+	the operation for /up to/ "N" seconds before propagating the error.

?  i.e. it could, but only if the retries don't expire first :)

> This is for the case where an failure might take a variable time to
> report.  (Think interactions with errors that TLER would address).
> Normally you might say 10 retries, but if it is taking 5 minutes to
> then fail when this specific error condition is hit, you might set a retry
> timeout of 1 minute. In that case, we might get an immediate IO
> error and retry several times before failing. However, if we hit the
> "slow to report" error, we still get failure in the same time frame
> as the immediate failures that have been retried many times before
> giving up.

Either way, it's still the /first/ condition satisfied which will
bubble it up.

So "first condition satisfied will propagate up," plus
one condition is "retry up to N times," plus
one condition is "retry up for up to N seconds"

seems to cover it all, no?

> It's hard to explain complex stuff like with a simple, concise
> description. I'll try again....
> 
> Cheers,
> 
> Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4] xfs: Document error handlers behavior
  2016-09-14 22:09     ` Dave Chinner
@ 2016-09-15  9:18       ` Carlos Maiolino
  0 siblings, 0 replies; 8+ messages in thread
From: Carlos Maiolino @ 2016-09-15  9:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, xfs

> > Also, I apologize if I misunderstand it, but being ignored doesn't look a proper
> > description here, it sounds to me something like 'we ignore the error and tell
> > nobody about it", in unmount example, we shut down the filesystem if any error
> > happens, for me it doesn't sound like ignoring an error, but I might be
> > interpreting it in the wrong way.
> 
> I think you're making the assumption that the only way we handle
> errors once retries are exhausted is to trigger a filesystem shutdown.
> That assumption was repeated throughout the documentation.
> 
> While that may be true for /metadata write IO errors/, it is not
> true for the generic error handling case. e.g. if we extend it to
> memory allocation contexts, we may end up returning ENOMEM to
> userspace. Or, in certain contexts, we might be able to fall back to
> doing a single operation at a time using the stack for storage, in
> which case there is no reason at all to report the allocation
> failure to anyone.
> 
> The infrastructure is generic, as is the documentation, and so it
> shouldn't assume anything about what is going to happen once the
> retries are exhausted and the error is propagated upwards. What
> happens with that error after it is returned is a subsystem and
> context dependent behaviour, not something that is defined by the
> error retry configuration infrastructure....
> 
> Cheers,

Thanks for the very detailed explanation Dave
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

-- 
Carlos

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-09-15  9:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-13  9:03 [PATCH V4] xfs: Document error handlers behavior Carlos Maiolino
2016-09-14  1:23 ` Dave Chinner
2016-09-14 10:02   ` Carlos Maiolino
2016-09-14 22:09     ` Dave Chinner
2016-09-15  9:18       ` Carlos Maiolino
2016-09-14 15:10   ` Eric Sandeen
2016-09-14 22:22     ` Dave Chinner
2016-09-14 22:31       ` Eric Sandeen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.