All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Zorro Lang <zlang@redhat.com>
Cc: linux-xfs@vger.kernel.org, fstests@vger.kernel.org
Subject: Re: [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests
Date: Mon, 15 Aug 2022 18:47:31 -0700	[thread overview]
Message-ID: <Yvr3M8V9Tg9LQE6E@magnolia> (raw)
In-Reply-To: <20220815094246.iqrbfdkwzvrjbjb3@zlang-mailbox>

On Mon, Aug 15, 2022 at 05:42:46PM +0800, Zorro Lang wrote:
> On Tue, Aug 02, 2022 at 09:22:35PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > This patch fixes a rather hard to hit livelock in the tests that test
> > how xfs handles shutdown behavior when the device suddenly dies and
> > starts returing EIO all the time.  The livelock happens if the AIL is
> > stuck retrying failed metadata updates forever, the log itself is not
> > being written, and there is no more log grant space, which prevents the
> > frontend from shutting down the log due to EIO errors during
> > transactions.
> > 
> > While most users probably want the default retry-forever behavior
> > because EIO can be transient, the circumstances are different here.  The
> > tests are designed to flip the device back to working status only after
> > the unmount succeeds, so we know there's no point in the filesystem
> > retrying writes until after the unmount.
> > 
> > This fixes some of the periodic hangs in generic/019 and generic/475.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  common/dmerror           |    4 ++++
> >  common/fail_make_request |    1 +
> >  common/rc                |   31 +++++++++++++++++++++++++++----
> >  common/xfs               |   29 +++++++++++++++++++++++++++++
> >  4 files changed, 61 insertions(+), 4 deletions(-)
> > 
> > 
> > diff --git a/common/dmerror b/common/dmerror
> > index 0934d220..54122b12 100644
> > --- a/common/dmerror
> > +++ b/common/dmerror
> > @@ -138,6 +138,10 @@ _dmerror_load_error_table()
> >  		suspend_opt="$*"
> >  	fi
> >  
> > +	# If the full environment is set up, configure ourselves for shutdown
> > +	type _prepare_for_eio_shutdown &>/dev/null && \
> 
> I'm wondering why we need to check if _prepare_for_eio_shutdown() is defined
> at here? This patch define this function, so if we merge this patch, this
> function is exist, right?

This mess exists because of src/dmerror, which sources common/dmerror
but does *not* source common/rc.  src/dmerror, in turn, is a helper
script that is called as a subprocess of src/fsync-err.c, which (as a C
program) doesn't inherit any of the shell data contained within its
caller (e.g. generic/441).

I probably should have documented that better, but TBH I'm not fully
convinced that I understand all this nested re-entry stuff that some of
the dmerror tests invoke.

> > +		_prepare_for_eio_shutdown $DMERROR_DEV
> 
> Hmm... what about someone load error table, but not for testing fs shutdown?

Even the tests that use dmerror but aren't looking for an fs shutdown
could trigger one anyway, because (a) timestamp updates, or (b) inodegc
running in the background.  Either way, an EIO shutdown is possible, so
we should turn off infinite retries to avoid hanging fstests.

--D

> > +
> >  	# Suspend the scratch device before the log and realtime devices so
> >  	# that the kernel can freeze and flush the filesystem if the caller
> >  	# wanted a freeze.
> > diff --git a/common/fail_make_request b/common/fail_make_request
> > index 9f8ea500..b5370ba6 100644
> > --- a/common/fail_make_request
> > +++ b/common/fail_make_request
> > @@ -44,6 +44,7 @@ _start_fail_scratch_dev()
> >  {
> >      echo "Force SCRATCH_DEV device failure"
> >  
> > +    _prepare_for_eio_shutdown $SCRATCH_DEV
> >      _bdev_fail_make_request $SCRATCH_DEV 1
> >      [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
> >          _bdev_fail_make_request $SCRATCH_LOGDEV 1
> > diff --git a/common/rc b/common/rc
> > index 63bafb4b..119cc477 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -4205,6 +4205,20 @@ _check_dmesg()
> >  	fi
> >  }
> >  
> > +# Make whatever configuration changes we need ahead of testing fs shutdowns due
> > +# to unexpected IO errors while updating metadata.  The sole parameter should
> > +# be the fs device, e.g.  $SCRATCH_DEV.
> > +_prepare_for_eio_shutdown()
> > +{
> > +	local dev="$1"
> > +
> > +	case "$FSTYP" in
> > +	"xfs")
> > +		_xfs_prepare_for_eio_shutdown "$dev"
> > +		;;
> > +	esac
> > +}
> > +
> >  # capture the kmemleak report
> >  _capture_kmemleak()
> >  {
> > @@ -4467,7 +4481,7 @@ run_fsx()
> >  #
> >  # Usage example:
> >  #   _require_fs_sysfs error/fail_at_unmount
> > -_require_fs_sysfs()
> > +_has_fs_sysfs()
> >  {
> >  	local attr=$1
> >  	local dname
> > @@ -4483,9 +4497,18 @@ _require_fs_sysfs()
> >  		_fail "Usage: _require_fs_sysfs <sysfs_attr_path>"
> >  	fi
> >  
> > -	if [ ! -e /sys/fs/${FSTYP}/${dname}/${attr} ];then
> > -		_notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}"
> > -	fi
> > +	test -e /sys/fs/${FSTYP}/${dname}/${attr}
> > +}
> > +
> > +# Require the existence of a sysfs entry at /sys/fs/$FSTYP/DEV/$ATTR
> > +_require_fs_sysfs()
> > +{
> > +	_has_fs_sysfs "$@" && return
> > +
> > +	local attr=$1
> > +	local dname=$(_short_dev $TEST_DEV)
> > +
> > +	_notrun "This test requires /sys/fs/${FSTYP}/${dname}/${attr}"
> >  }
> >  
> >  _require_statx()
> > diff --git a/common/xfs b/common/xfs
> > index 92c281c6..65234c8b 100644
> > --- a/common/xfs
> > +++ b/common/xfs
> > @@ -823,6 +823,35 @@ _scratch_xfs_unmount_dirty()
> >  	_scratch_unmount
> >  }
> >  
> > +# Prepare a mounted filesystem for an IO error shutdown test by disabling retry
> > +# for metadata writes.  This prevents a (rare) log livelock when:
> > +#
> > +# - The log has given out all available grant space, preventing any new
> > +#   writers from tripping over IO errors (and shutting down the fs/log),
> > +# - All log buffers were written to disk, and
> > +# - The log tail is pinned because the AIL keeps hitting EIO trying to write
> > +#   committed changes back into the filesystem.
> > +#
> > +# Real users might want the default behavior of the AIL retrying writes forever
> > +# but for testing purposes we don't want to wait.
> > +#
> > +# The sole parameter should be the filesystem data device, e.g. $SCRATCH_DEV.
> > +_xfs_prepare_for_eio_shutdown()
> > +{
> > +	local dev="$1"
> > +	local ctlfile="error/fail_at_unmount"
> > +
> > +	# Don't retry any writes during the (presumably) post-shutdown unmount
> > +	_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1
> > +
> > +	# Disable retry of metadata writes that fail with EIO
> > +	for ctl in max_retries retry_timeout_seconds; do
> > +		ctlfile="error/metadata/EIO/$ctl"
> > +
> > +		_has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 0
> > +	done
> > +}
> > +
> >  # Skip if we are running an older binary without the stricter input checks.
> >  # Make multiple checks to be sure that there is no regression on the one
> >  # selected feature check, which would skew the result.
> > 
> 

  reply	other threads:[~2022-08-16  6:46 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-03  4:22 [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Darrick J. Wong
2022-08-03  4:22 ` [PATCH 1/3] common/xfs: fix _reset_xfs_sysfs_error_handling reset to actual defaults Darrick J. Wong
2022-08-03  4:22 ` [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests Darrick J. Wong
2022-08-15  9:42   ` Zorro Lang
2022-08-16  1:47     ` Darrick J. Wong [this message]
2022-08-03  4:22 ` [PATCH 3/3] common: filter internal errors during io error testing Darrick J. Wong
2022-09-04 13:18 ` [PATCHSET v2 0/3] fstests: fix some hangs in crash recovery Zorro Lang
  -- strict thread matches above, loose matches on Subject: below --
2022-07-26 19:49 [PATCHSET " Darrick J. Wong
2022-07-26 19:49 ` [PATCH 2/3] common: disable infinite IO error retry for EIO shutdown tests Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yvr3M8V9Tg9LQE6E@magnolia \
    --to=djwong@kernel.org \
    --cc=fstests@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=zlang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.