* [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery
@ 2022-08-03  4:22 Darrick J. Wong
  2022-08-03  4:22 ` [PATCH 1/3] common/xfs: fix _reset_xfs_sysfs_error_handling reset to actual defaults Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Darrick J. Wong @ 2022-08-03  4:22 UTC (permalink / raw)
  To: djwong, guaneryu, zlang; +Cc: Christoph Hellwig, linux-xfs, fstests, guan

Hi all,

There are several tests in fstests (generic/019, generic/388,
generic/475, xfs/057, etc.) that test filesystem crash recovery by
starting a loop that kicks off a filesystem exerciser, waits a few
seconds, and offlines the filesystem somehow.  Some of them use the
block layer's error injector, some use dm-error, and some use the
shutdown ioctl.

The crash tests that employ error injection have the unfortunate trait
of causing occasional livelocks when tested against XFS because XFS
allows administrators to configure the filesystem to retry some failed
writes indefinitely.  If the offlining races with a full log trying to
update the filesystem, the fs will hang forever.  Fix this by allowing
XFS to go offline immediately.

While we're at it, fix the dmesg scrapers so they don't trip over XFS
reporting these IO errors as internal errors.
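
For the curious, the retry knobs in question live under
/sys/fs/xfs/<dev>/error/ on kernels that expose them.  A rough sketch of
flipping them by hand (the sda3 device name and the SYSFS_BASE variable
are made up for illustration; this is not an fstests helper):

```shell
# Sketch: make one XFS filesystem give up immediately on EIO instead of
# retrying metadata writes forever.  SYSFS_BASE is an assumption; on a
# real system it would be /sys/fs/xfs/<short-dev-name>.
SYSFS_BASE="${SYSFS_BASE:-/sys/fs/xfs/sda3}"

set_knob() {
	local knob="$1" value="$2"

	# Only touch knobs this kernel actually exposes.
	if [ -e "$SYSFS_BASE/$knob" ]; then
		echo "$value" > "$SYSFS_BASE/$knob"
		echo "set $knob=$value"
	else
		echo "skip $knob (not present)"
	fi
}

# 0 retries / 0 seconds means "give up immediately" on a failed write.
set_knob error/metadata/EIO/max_retries 0
set_knob error/metadata/EIO/retry_timeout_seconds 0
# Don't let the (presumably post-shutdown) unmount retry forever either.
set_knob error/fail_at_unmount 1
```

This is roughly what patch 2 wires into the error-injection helpers.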

v2: add hch reviews

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fix-shutdown-test-hangs
---
 check                    |    1 +
 common/dmerror           |    4 ++++
 common/fail_make_request |    1 +
 common/rc                |   50 +++++++++++++++++++++++++++++++++++++++++-----
 common/xfs               |   38 ++++++++++++++++++++++++++++++++++-
 tests/xfs/006.out        |    6 +++---
 tests/xfs/264.out        |   12 ++++++-----
 7 files changed, 97 insertions(+), 15 deletions(-)



* [PATCH 1/3] common/xfs: fix _reset_xfs_sysfs_error_handling reset to actual defaults
  2022-08-03  4:22 [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Darrick J. Wong
@ 2022-08-03  4:22 ` Darrick J. Wong
  2022-08-03  4:22 ` [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2022-08-03  4:22 UTC (permalink / raw)
  To: djwong, guaneryu, zlang; +Cc: Christoph Hellwig, linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

There's a slight mistake in _reset_xfs_sysfs_error_handling: it sets
retry_timeout_seconds to 0, which is not the current default (-1) in
upstream Linux.  Fix this.
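
A quick way to see what a given kernel reports for these attributes is
to walk the error/metadata tree; a minimal sketch in the spirit of
xfs/006 (XFS_ERR_BASE and the sda3 name are illustrative assumptions,
not fstests variables):

```shell
# Sketch: dump the XFS metadata error-handling knobs.  Upstream defaults
# are -1 (retry forever) for both attributes.  XFS_ERR_BASE is a stand-in
# for /sys/fs/xfs/<short-dev-name>/error/metadata.
XFS_ERR_BASE="${XFS_ERR_BASE:-/sys/fs/xfs/sda3/error/metadata}"

dump_xfs_error_knobs() {
	local class knob f

	for class in default EIO ENOSPC; do
		for knob in max_retries retry_timeout_seconds; do
			f="$XFS_ERR_BASE/$class/$knob"
			# Skip knobs this kernel doesn't expose.
			[ -e "$f" ] && echo "error/metadata/$class/$knob=$(cat "$f")"
		done
	done
	return 0
}

dump_xfs_error_knobs
```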

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 common/xfs        |    2 +-
 tests/xfs/006.out |    6 +++---
 tests/xfs/264.out |   12 ++++++------
 3 files changed, 10 insertions(+), 10 deletions(-)


diff --git a/common/xfs b/common/xfs
index f6f4cdd2..92c281c6 100644
--- a/common/xfs
+++ b/common/xfs
@@ -804,7 +804,7 @@ _reset_xfs_sysfs_error_handling()
 		_get_fs_sysfs_attr $dev error/metadata/${e}/max_retries
 
 		_set_fs_sysfs_attr $dev \
-				   error/metadata/${e}/retry_timeout_seconds 0
+				   error/metadata/${e}/retry_timeout_seconds -1
 		echo -n "error/metadata/${e}/retry_timeout_seconds="
 		_get_fs_sysfs_attr $dev \
 				   error/metadata/${e}/retry_timeout_seconds
diff --git a/tests/xfs/006.out b/tests/xfs/006.out
index 3260b3a2..433b0bc3 100644
--- a/tests/xfs/006.out
+++ b/tests/xfs/006.out
@@ -1,8 +1,8 @@
 QA output created by 006
 error/fail_at_unmount=1
 error/metadata/default/max_retries=-1
-error/metadata/default/retry_timeout_seconds=0
+error/metadata/default/retry_timeout_seconds=-1
 error/metadata/EIO/max_retries=-1
-error/metadata/EIO/retry_timeout_seconds=0
+error/metadata/EIO/retry_timeout_seconds=-1
 error/metadata/ENOSPC/max_retries=-1
-error/metadata/ENOSPC/retry_timeout_seconds=0
+error/metadata/ENOSPC/retry_timeout_seconds=-1
diff --git a/tests/xfs/264.out b/tests/xfs/264.out
index 502e72d3..e45ac5a5 100644
--- a/tests/xfs/264.out
+++ b/tests/xfs/264.out
@@ -2,20 +2,20 @@ QA output created by 264
 === Test EIO/max_retries ===
 error/fail_at_unmount=1
 error/metadata/default/max_retries=-1
-error/metadata/default/retry_timeout_seconds=0
+error/metadata/default/retry_timeout_seconds=-1
 error/metadata/EIO/max_retries=-1
-error/metadata/EIO/retry_timeout_seconds=0
+error/metadata/EIO/retry_timeout_seconds=-1
 error/metadata/ENOSPC/max_retries=-1
-error/metadata/ENOSPC/retry_timeout_seconds=0
+error/metadata/ENOSPC/retry_timeout_seconds=-1
 error/fail_at_unmount=0
 error/metadata/EIO/max_retries=1
 === Test EIO/retry_timeout_seconds ===
 error/fail_at_unmount=1
 error/metadata/default/max_retries=-1
-error/metadata/default/retry_timeout_seconds=0
+error/metadata/default/retry_timeout_seconds=-1
 error/metadata/EIO/max_retries=-1
-error/metadata/EIO/retry_timeout_seconds=0
+error/metadata/EIO/retry_timeout_seconds=-1
 error/metadata/ENOSPC/max_retries=-1
-error/metadata/ENOSPC/retry_timeout_seconds=0
+error/metadata/ENOSPC/retry_timeout_seconds=-1
 error/fail_at_unmount=0
 error/metadata/EIO/retry_timeout_seconds=1



* [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests
  2022-08-03  4:22 [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Darrick J. Wong
  2022-08-03  4:22 ` [PATCH 1/3] common/xfs: fix _reset_xfs_sysfs_error_handling reset to actual defaults Darrick J. Wong
@ 2022-08-03  4:22 ` Darrick J. Wong
  2022-08-15  9:42   ` Zorro Lang
  2022-08-03  4:22 ` [PATCH 3/3] common: filter internal errors during io error testing Darrick J. Wong
  2022-09-04 13:18 ` [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Zorro Lang
  3 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2022-08-03  4:22 UTC (permalink / raw)
  To: djwong, guaneryu, zlang; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

This patch fixes a rather hard-to-hit livelock in the tests that
exercise how xfs handles shutdown behavior when the device suddenly dies
and starts returning EIO all the time.  The livelock happens if the AIL is
stuck retrying failed metadata updates forever, the log itself is not
being written, and there is no more log grant space, which prevents the
frontend from shutting down the log due to EIO errors during
transactions.

While most users probably want the default retry-forever behavior
because EIO can be transient, the circumstances are different here.  The
tests are designed to flip the device back to working status only after
the unmount succeeds, so we know there's no point in the filesystem
retrying writes until after the unmount.

This fixes some of the periodic hangs in generic/019 and generic/475.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/dmerror           |    4 ++++
 common/fail_make_request |    1 +
 common/rc                |   31 +++++++++++++++++++++++++++----
 common/xfs               |   29 +++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+), 4 deletions(-)


diff --git a/common/dmerror b/common/dmerror
index 0934d220..54122b12 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -138,6 +138,10 @@ _dmerror_load_error_table()
 		suspend_opt="$*"
 	fi
 
+	# If the full environment is set up, configure ourselves for shutdown
+	type _prepare_for_eio_shutdown &>/dev/null && \
+		_prepare_for_eio_shutdown $DMERROR_DEV
+
 	# Suspend the scratch device before the log and realtime devices so
 	# that the kernel can freeze and flush the filesystem if the caller
 	# wanted a freeze.
diff --git a/common/fail_make_request b/common/fail_make_request
index 9f8ea500..b5370ba6 100644
--- a/common/fail_make_request
+++ b/common/fail_make_request
@@ -44,6 +44,7 @@ _start_fail_scratch_dev()
 {
     echo "Force SCRATCH_DEV device failure"
 
+    _prepare_for_eio_shutdown $SCRATCH_DEV
     _bdev_fail_make_request $SCRATCH_DEV 1
     [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
         _bdev_fail_make_request $SCRATCH_LOGDEV 1
diff --git a/common/rc b/common/rc
index 63bafb4b..119cc477 100644
--- a/common/rc
+++ b/common/rc
@@ -4205,6 +4205,20 @@ _check_dmesg()
 	fi
 }
 
+# Make whatever configuration changes we need ahead of testing fs shutdowns due
+# to unexpected IO errors while updating metadata.  The sole parameter should
+# be the fs device, e.g.  $SCRATCH_DEV.
+_prepare_for_eio_shutdown()
+{
+	local dev="$1"
+
+	case "$FSTYP" in
+	"xfs")
+		_xfs_prepare_for_eio_shutdown "$dev"
+		;;
+	esac
+}
+
 # capture the kmemleak report
 _capture_kmemleak()
 {
@@ -4467,7 +4481,7 @@ run_fsx()
 #
 # Usage example:
 #   _require_fs_sysfs error/fail_at_unmount
-_require_fs_sysfs()
+_has_fs_sysfs()
 {
 	local attr=$1
 	local dname
@@ -4483,9 +4497,18 @@ _require_fs_sysfs()
 		_fail "Usage: _require_fs_sysfs <sysfs_attr_path>"
 	fi
 
-	if [ ! -e /sys/fs/${FSTYP}/${dname}/${attr} ];then
-		_notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}"
-	fi
+	test -e /sys/fs/${FSTYP}/${dname}/${attr}
+}
+
+# Require the existence of a sysfs entry at /sys/fs/$FSTYP/DEV/$ATTR
+_require_fs_sysfs()
+{
+	_has_fs_sysfs "$@" && return
+
+	local attr=$1
+	local dname=$(_short_dev $TEST_DEV)
+
+	_notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}"
 }
 
 _require_statx()
diff --git a/common/xfs b/common/xfs
index 92c281c6..65234c8b 100644
--- a/common/xfs
+++ b/common/xfs
@@ -823,6 +823,35 @@ _scratch_xfs_unmount_dirty()
 	_scratch_unmount
 }
 
+# Prepare a mounted filesystem for an IO error shutdown test by disabling retry
+# for metadata writes.  This prevents a (rare) log livelock when:
+#
+# - The log has given out all available grant space, preventing any new
+#   writers from tripping over IO errors (and shutting down the fs/log),
+# - All log buffers were written to disk, and
+# - The log tail is pinned because the AIL keeps hitting EIO trying to write
+#   committed changes back into the filesystem.
+#
+# Real users might want the default behavior of the AIL retrying writes forever
+# but for testing purposes we don't want to wait.
+#
+# The sole parameter should be the filesystem data device, e.g. $SCRATCH_DEV.
+_xfs_prepare_for_eio_shutdown()
+{
+	local dev="$1"
+	local ctlfile="error/fail_at_unmount"
+
+	# Don't retry any writes during the (presumably) post-shutdown unmount
+	_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1
+
+	# Disable retry of metadata writes that fail with EIO
+	for ctl in max_retries retry_timeout_seconds; do
+		ctlfile="error/metadata/EIO/$ctl"
+
+		_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 0
+	done
+}
+
 # Skip if we are running an older binary without the stricter input checks.
 # Make multiple checks to be sure that there is no regression on the one
 # selected feature check, which would skew the result.



* [PATCH 3/3] common: filter internal errors during io error testing
  2022-08-03  4:22 [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Darrick J. Wong
  2022-08-03  4:22 ` [PATCH 1/3] common/xfs: fix _reset_xfs_sysfs_error_handling reset to actual defaults Darrick J. Wong
  2022-08-03  4:22 ` [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests Darrick J. Wong
@ 2022-08-03  4:22 ` Darrick J. Wong
  2022-09-04 13:18 ` [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Zorro Lang
  3 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2022-08-03  4:22 UTC (permalink / raw)
  To: djwong, guaneryu, zlang; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

The goal of an EIO shutdown test is to examine the shutdown and recovery
behavior if we make the underlying storage device return EIO.  On XFS,
it's possible that the shutdown will come from a thread that cancels a
dirty transaction due to the EIO.  This is expected behavior, but
_check_dmesg will flag it as a test failure.

Make it so that we can add simple regexps to the default check_dmesg
filter function, then add the "Internal error" string to the filter function
when we invoke an EIO test.  This fixes periodic regressions in
generic/019 and generic/475.
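
The append-once behavior can be sketched as a tiny helper
(add_filter_once is an illustrative name; the real function in the patch
is _add_dmesg_filter):

```shell
# Sketch: append a regular expression to a filter file only if it isn't
# already there.  The filter file stands in for ${RESULT_DIR}/dmesg_filter.
add_filter_once() {
	local regexp="$1" filter_file="$2"

	# The first caller creates the file; repeat calls are no-ops.
	if [ ! -e "$filter_file" ] || ! grep -q -- "$regexp" "$filter_file"; then
		echo "$regexp" >> "$filter_file"
	fi
}
```

The dmesg scraper can then pass the file to `egrep -v -f` so that any
line matching a stored expression is dropped before the WARNING/Oops
scan.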

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 check      |    1 +
 common/rc  |   19 ++++++++++++++++++-
 common/xfs |    7 +++++++
 3 files changed, 26 insertions(+), 1 deletion(-)


diff --git a/check b/check
index 0b2f10ed..000e31cb 100755
--- a/check
+++ b/check
@@ -896,6 +896,7 @@ function run_section()
 			echo "run fstests $seqnum at $date_time" > /dev/kmsg
 			# _check_dmesg depends on this log in dmesg
 			touch ${RESULT_DIR}/check_dmesg
+			rm -f ${RESULT_DIR}/dmesg_filter
 		fi
 		_try_wipe_scratch_devs > /dev/null 2>&1
 
diff --git a/common/rc b/common/rc
index 119cc477..0f233892 100644
--- a/common/rc
+++ b/common/rc
@@ -4164,8 +4164,25 @@ _check_dmesg_for()
 # lockdep.
 _check_dmesg_filter()
 {
+	local extra_filter=
+	local filter_file="${RESULT_DIR}/dmesg_filter"
+
+	test -e "$filter_file" && extra_filter="-f $filter_file"
+
 	egrep -v -e "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low" \
-		-e "BUG: MAX_STACK_TRACE_ENTRIES too low"
+		-e "BUG: MAX_STACK_TRACE_ENTRIES too low" \
+		$extra_filter
+}
+
+# Add a simple expression to the default dmesg filter
+_add_dmesg_filter()
+{
+	local regexp="$1"
+	local filter_file="${RESULT_DIR}/dmesg_filter"
+
+	if [ ! -e "$filter_file" ] || ! grep -q "$regexp" "$filter_file"; then
+		echo "$regexp" >> "${RESULT_DIR}/dmesg_filter"
+	fi
 }
 
 # check dmesg log for WARNING/Oops/etc.
diff --git a/common/xfs b/common/xfs
index 65234c8b..ae81b3fe 100644
--- a/common/xfs
+++ b/common/xfs
@@ -841,6 +841,13 @@ _xfs_prepare_for_eio_shutdown()
 	local dev="$1"
 	local ctlfile="error/fail_at_unmount"
 
+	# Once we enable IO errors, it's possible that a writer thread will
+	# trip over EIO, cancel the transaction, and shut down the system.
+	# This is expected behavior, so we need to remove the "Internal error"
+	# message from the list of things that can cause the test to be marked
+	# as failed.
+	_add_dmesg_filter "Internal error"
+
 	# Don't retry any writes during the (presumably) post-shutdown unmount
 	_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1
 



* Re: [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests
  2022-08-03  4:22 ` [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests Darrick J. Wong
@ 2022-08-15  9:42   ` Zorro Lang
  2022-08-16  1:47     ` Darrick J. Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Zorro Lang @ 2022-08-15  9:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Tue, Aug 02, 2022 at 09:22:35PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This patch fixes a rather hard-to-hit livelock in the tests that
> exercise how xfs handles shutdown behavior when the device suddenly dies
> and starts returning EIO all the time.  The livelock happens if the AIL is
> stuck retrying failed metadata updates forever, the log itself is not
> being written, and there is no more log grant space, which prevents the
> frontend from shutting down the log due to EIO errors during
> transactions.
> 
> While most users probably want the default retry-forever behavior
> because EIO can be transient, the circumstances are different here.  The
> tests are designed to flip the device back to working status only after
> the unmount succeeds, so we know there's no point in the filesystem
> retrying writes until after the unmount.
> 
> This fixes some of the periodic hangs in generic/019 and generic/475.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  common/dmerror           |    4 ++++
>  common/fail_make_request |    1 +
>  common/rc                |   31 +++++++++++++++++++++++++++----
>  common/xfs               |   29 +++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/common/dmerror b/common/dmerror
> index 0934d220..54122b12 100644
> --- a/common/dmerror
> +++ b/common/dmerror
> @@ -138,6 +138,10 @@ _dmerror_load_error_table()
>  		suspend_opt="$*"
>  	fi
>  
> +	# If the full environment is set up, configure ourselves for shutdown
> +	type _prepare_for_eio_shutdown &>/dev/null && \

I'm wondering why we need to check whether _prepare_for_eio_shutdown() is
defined here.  This patch defines the function, so once we merge this
patch, the function will exist, right?

> +		_prepare_for_eio_shutdown $DMERROR_DEV

Hmm... what about someone who loads the error table but isn't testing fs
shutdown?




* Re: [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests
  2022-08-15  9:42   ` Zorro Lang
@ 2022-08-16  1:47     ` Darrick J. Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2022-08-16  1:47 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Mon, Aug 15, 2022 at 05:42:46PM +0800, Zorro Lang wrote:
> On Tue, Aug 02, 2022 at 09:22:35PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > This patch fixes a rather hard-to-hit livelock in the tests that
> > exercise how xfs handles shutdown behavior when the device suddenly dies
> > and starts returning EIO all the time.  The livelock happens if the AIL is
> > stuck retrying failed metadata updates forever, the log itself is not
> > being written, and there is no more log grant space, which prevents the
> > frontend from shutting down the log due to EIO errors during
> > transactions.
> > 
> > While most users probably want the default retry-forever behavior
> > because EIO can be transient, the circumstances are different here.  The
> > tests are designed to flip the device back to working status only after
> > the unmount succeeds, so we know there's no point in the filesystem
> > retrying writes until after the unmount.
> > 
> > This fixes some of the periodic hangs in generic/019 and generic/475.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  common/dmerror           |    4 ++++
> >  common/fail_make_request |    1 +
> >  common/rc                |   31 +++++++++++++++++++++++++++----
> >  common/xfs               |   29 +++++++++++++++++++++++++++++
> >  4 files changed, 61 insertions(+), 4 deletions(-)
> > 
> > 
> > diff --git a/common/dmerror b/common/dmerror
> > index 0934d220..54122b12 100644
> > --- a/common/dmerror
> > +++ b/common/dmerror
> > @@ -138,6 +138,10 @@ _dmerror_load_error_table()
> >  		suspend_opt="$*"
> >  	fi
> >  
> > +	# If the full environment is set up, configure ourselves for shutdown
> > +	type _prepare_for_eio_shutdown &>/dev/null && \
> 
> I'm wondering why we need to check whether _prepare_for_eio_shutdown() is
> defined here.  This patch defines the function, so once we merge this
> patch, the function will exist, right?

This mess exists because of src/dmerror, which sources common/dmerror
but does *not* source common/rc.  src/dmerror, in turn, is a helper
script that is called as a subprocess of src/fsync-err.c, which (as a C
program) doesn't inherit any of the shell data contained within its
caller (e.g. generic/441).

I probably should have documented that better, but TBH I'm not fully
convinced that I understand all this nested re-entry stuff that some of
the dmerror tests invoke.
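
For what it's worth, the guard is just the standard shell idiom for
invoking a helper only when the current environment happens to define
it; a rough sketch (call_if_defined is a made-up name, not an fstests
function):

```shell
# Sketch: call a shell function only if somebody has defined it.  "type"
# succeeds when its argument resolves to a function/builtin/command in
# this shell, so sourcing common/dmerror without common/rc stays safe.
call_if_defined() {
	local fn="$1"
	shift

	if type "$fn" &>/dev/null; then
		"$fn" "$@"
	else
		echo "note: $fn not defined, skipping"
	fi
}

# e.g.: call_if_defined _prepare_for_eio_shutdown "$DMERROR_DEV"
```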

> > +		_prepare_for_eio_shutdown $DMERROR_DEV
> 
> > Hmm... what about someone who loads the error table but isn't testing fs
> > shutdown?

Even the tests that use dmerror but aren't looking for an fs shutdown
could trigger one anyway, because of (a) timestamp updates or (b) inodegc
running in the background.  Either way, an EIO shutdown is possible, so
we should turn off infinite retries to avoid hanging fstests.

--D



* Re: [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery
  2022-08-03  4:22 [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-08-03  4:22 ` [PATCH 3/3] common: filter internal errors during io error testing Darrick J. Wong
@ 2022-09-04 13:18 ` Zorro Lang
  3 siblings, 0 replies; 7+ messages in thread
From: Zorro Lang @ 2022-09-04 13:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Tue, Aug 02, 2022 at 09:22:24PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> There are several tests in fstests (generic/019, generic/388,
> generic/475, xfs/057, etc.) that test filesystem crash recovery by
> starting a loop that kicks off a filesystem exerciser, waits a few
> seconds, and offlines the filesystem somehow.  Some of them use the
> block layer's error injector, some use dm-error, and some use the
> shutdown ioctl.
> 
> The crash tests that employ error injection have the unfortunate trait
> of causing occasional livelocks when tested against XFS because XFS
> allows administrators to configure the filesystem to retry some failed
> writes indefinitely.  If the offlining races with a full log trying to
> update the filesystem, the fs will hang forever.  Fix this by allowing
> XFS to go offline immediately.
> 
> While we're at it, fix the dmesg scrapers so they don't trip over XFS
> reporting these IO errors as internal errors.
> 
> v2: add hch reviews
> 
> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> 
> This is an extraordinary way to destroy everything.  Enjoy!
> Comments and questions are, as always, welcome.
> 
> --D

I'd like to merge this patchset; after a long time of testing, it doesn't
look like it brings in any regressions.  It's been stuck for a long time,
let's keep it moving :)

Reviewed-by: Zorro Lang <zlang@redhat.com>



