* [PATCH v3] xfs: test agfl reset on bad list wrapping
From: Darrick J. Wong @ 2018-03-21 16:57 UTC
  To: Eryu Guan; +Cc: Brian Foster, linux-xfs, david, fstests

From: Darrick J. Wong <darrick.wong@oracle.com>

From the kernel patch that this test examines ("xfs: detect agfl count
corruption and reset agfl"):

"The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

"This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.

"Instead, avoid unnecessarily complex detection logic and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency."

This test exercises the reset code by mutating a fresh filesystem to
contain an agfl with various list configurations of correctly wrapped,
incorrectly wrapped, not wrapped, and actually corrupt free lists; then
checks the success of the reset operation by fragmenting the free space
btrees to exercise the agfl.  Kernels without this reset fix will shut
down the filesystem with corruption errors.
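
The slot arithmetic behind the mismatch can be sketched as follows. This is
an illustration only: agfl_slots and agfl_active are hypothetical helper
names, not fstests or kernel functions, and the 4-entry list is an assumed
example rather than what mkfs produces.

```shell
#!/bin/bash
# Sketch only: the helper names below are hypothetical, not fstests or
# kernel functions.

# Number of 4-byte AGFL slots in one sector for a given header size.
agfl_slots() {
	local sectsize="$1" hdrsize="$2"
	echo $(( (sectsize - hdrsize) / 4 ))
}

# Length of the active range [flfirst, fllast] in a circular list of
# $size slots.
agfl_active() {
	local flfirst="$1" fllast="$2" size="$3"
	if [ "$flfirst" -le "$fllast" ]; then
		echo $(( fllast - flfirst + 1 ))
	else
		echo $(( size - flfirst + fllast + 1 ))
	fi
}

sectsize=512
padded=$(agfl_slots "$sectsize" 40)	# header with the stray padding
packed=$(agfl_slots "$sectsize" 36)	# packed header

# An assumed 4-entry list written by a padded-header kernel, wrapping
# around the end of the list.
flcount=4
flfirst=$(( padded - 2 ))
fllast=$(( (flfirst + flcount - 1) % padded ))

echo "padded=$padded packed=$packed flfirst=$flfirst fllast=$fllast"
echo "active (padded view): $(agfl_active "$flfirst" "$fllast" "$padded")"
echo "active (packed view): $(agfl_active "$flfirst" "$fllast" "$packed")"
```

With 512-byte sectors the padded view has 118 slots and the packed view has
119.  The same flfirst/fllast pair covers four slots in the padded view but
five in the packed view, so the active range no longer matches flcount.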

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v3: use fallocate instead of dd write, more factoring of common code
v2: remove unnecessary umounts, refactor long lines into helpers
---
 common/rc         |   23 ++++-
 tests/xfs/709     |  258 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/709.out |   13 +++
 tests/xfs/group   |    1 
 4 files changed, 293 insertions(+), 2 deletions(-)
 create mode 100755 tests/xfs/709
 create mode 100644 tests/xfs/709.out

diff --git a/common/rc b/common/rc
index 2c29d55..f7eb72d 100644
--- a/common/rc
+++ b/common/rc
@@ -3440,6 +3440,26 @@ _get_device_size()
 	grep `_short_dev $1` /proc/partitions | awk '{print $3}'
 }
 
+# Make sure we actually have dmesg checking set up.
+_require_check_dmesg() {
+	test -w /dev/kmsg || \
+		_notrun "Test requires writable /dev/kmsg."
+}
+
+# Return the dmesg log since the start of this test.  Caller must ensure that
+# /dev/kmsg was writable when the test was started so that we can find the
+# beginning of this test's log messages; _require_check_dmesg does this.
+_dmesg_since_test_start() {
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
+		tac
+}
+
+# check dmesg log for a specific string, subject to the same requirements as
+# _dmesg_since_test_start.
+_check_dmesg_for() {
+	_dmesg_since_test_start | egrep -q "$1"
+}
+
 # check dmesg log for WARNING/Oops/etc.
 _check_dmesg()
 {
@@ -3455,8 +3475,7 @@ _check_dmesg()
 
 	# search the dmesg log of last run of $seqnum for possible failures
 	# use sed \cregexpc address type, since $seqnum contains "/"
-	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
-		tac | $filter >$seqres.dmesg
+	_dmesg_since_test_start | $filter >$seqres.dmesg
 	egrep -q -e "kernel BUG at" \
 	     -e "WARNING:" \
 	     -e "BUG:" \
diff --git a/tests/xfs/709 b/tests/xfs/709
new file mode 100755
index 0000000..78cefe5
--- /dev/null
+++ b/tests/xfs/709
@@ -0,0 +1,258 @@
+#! /bin/bash
+# FS QA Test No. 709
+#
+# Make sure XFS can fix a v5 AGFL that wraps over the last block.
+# Refer to commit 96f859d52bcb ("libxfs: pack the agfl header structure so
+# XFS_AGFL_SIZE is correct") for details on the original on-disk format error
+# and the patch ("xfs: detect agfl count corruption and reset agfl") for details
+# about the fix.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2018 Oracle, Inc.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1
+trap "_cleanup; rm -f $tmp.*; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+rm -f $seqres.full
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs xfs
+_supported_os Linux
+
+_require_check_dmesg
+_require_scratch
+_require_test_program "punch-alternating"
+
+# This is only a v5 filesystem problem
+_require_scratch_xfs_crc
+
+mount_loop() {
+	if ! _try_scratch_mount >> $seqres.full 2>&1; then
+		echo "scratch mount failed" >> $seqres.full
+		return
+	fi
+
+	# Trigger agfl fixing by fragmenting free space enough to cause
+	# a bnobt split
+	blksz=$(_get_file_block_size ${SCRATCH_MNT})
+	bno_maxrecs=$(( blksz / 8 ))
+	filesz=$((bno_maxrecs * 3 * blksz))
+	rm -rf $SCRATCH_MNT/a
+	$XFS_IO_PROG -f -c "falloc 0 $filesz" $SCRATCH_MNT/a
+	test -e $SCRATCH_MNT/a && ./src/punch-alternating $SCRATCH_MNT/a
+	rm -rf $SCRATCH_MNT/a
+
+	_scratch_unmount 2>&1 | _filter_scratch
+}
+
+dump_ag0() {
+	_scratch_xfs_db -c 'sb 0' -c 'p' -c 'agf 0' -c 'p' -c 'agfl 0' -c 'p'
+}
+
+runtest() {
+	cmd="$1"
+
+	# Format filesystem
+	echo "TEST $cmd" | tee /dev/ttyprintk
+	echo "TEST $cmd" >> $seqres.full
+	_scratch_mkfs >> $seqres.full
+
+	# Record what was here before
+	echo "FS BEFORE" >> $seqres.full
+	dump_ag0 > $tmp.before
+	cat $tmp.before >> $seqres.full
+
+	sectsize=$(_scratch_xfs_get_metadata_field "sectsize" "sb 0")
+	flfirst=$(_scratch_xfs_get_metadata_field "flfirst" "agf 0")
+	fllast=$(_scratch_xfs_get_metadata_field "fllast" "agf 0")
+	flcount=$(_scratch_xfs_get_metadata_field "flcount" "agf 0")
+
+	# Due to a padding bug in the original v5 struct xfs_agfl, the
+	# header size could be 36 bytes on 32-bit or 40 bytes on 64-bit.  On
+	# a system with 512b sectors, this means that the AGFL length could be
+	# ((512 - 36) / 4) = 119 entries on 32-bit or ((512 - 40) / 4) = 118
+	# entries on 64-bit.
+	#
+	# We now have code to figure out if the AGFL list wraps incorrectly
+	# according to the kernel's agfl size and fix it by resetting the agfl
+	# to zero length.  Mutate ag 0's agfl to be in various configurations
+	# and see if we can trigger the reset.
+	#
+	# Don't hardcode the numbers, calculate them.
+
+	# Have to have at least three agfl items to test full wrap
+	test "$flcount" -ge 3 || _notrun "insufficient agfl flcount"
+
+	# mkfs should be able to make us a nice neat flfirst < fllast setup
+	test "$flfirst" -lt "$fllast" || _notrun "fresh agfl already wrapped?"
+
+	bad_agfl_size=$(( (sectsize - 40) / 4 ))
+	good_agfl_size=$(( (sectsize - 36) / 4 ))
+	agfl_size=
+	case "$1" in
+	"fix_end")	# fllast points to the end w/ 40-byte padding
+		new_flfirst=$(( bad_agfl_size - flcount ))
+		agfl_size=$bad_agfl_size;;
+	"fix_start")	# flfirst points to the end w/ 40-byte padding
+		new_flfirst=$(( bad_agfl_size - 1))
+		agfl_size=$bad_agfl_size;;
+	"fix_wrap")	# list wraps around end w/ 40-byte padding
+		new_flfirst=$(( bad_agfl_size - (flcount / 2) ))
+		agfl_size=$bad_agfl_size;;
+	"start_zero")	# flfirst points to the start
+		new_flfirst=0
+		agfl_size=$good_agfl_size;;
+	"good_end")	# fllast points to the end w/ 36-byte padding
+		new_flfirst=$(( good_agfl_size - flcount ))
+		agfl_size=$good_agfl_size;;
+	"good_start")	# flfirst points to the end w/ 36-byte padding
+		new_flfirst=$(( good_agfl_size - 1 ))
+		agfl_size=$good_agfl_size;;
+	"good_wrap")	# list wraps around end w/ 36-byte padding
+		new_flfirst=$(( good_agfl_size - (flcount / 2) ))
+		agfl_size=$good_agfl_size;;
+	"bad_start")	# flfirst points off the end
+		new_flfirst=$good_agfl_size
+		agfl_size=$good_agfl_size;;
+	"no_move")	# whatever mkfs formats (flfirst points to start)
+		new_flfirst=$flfirst
+		agfl_size=$good_agfl_size;;
+	"simple_move")	# move list arbitrarily
+		new_flfirst=$((fllast + 1))
+		agfl_size=$good_agfl_size;;
+	*)
+		_fail "Internal test error";;
+	esac
+	new_fllast=$(( (new_flfirst + flcount - 1) % agfl_size ))
+
+	# Log what we're doing...
+	cat >> $seqres.full << ENDL
+sector size: $sectsize
+bad_agfl_size: $bad_agfl_size [0 - $((bad_agfl_size - 1))]
+good_agfl_size: $good_agfl_size [0 - $((good_agfl_size - 1))]
+agfl_size: $agfl_size
+flfirst: $flfirst
+fllast: $fllast
+flcount: $flcount
+new_flfirst: $new_flfirst
+new_fllast: $new_fllast
+ENDL
+
+	# Remap the agfl blocks
+	echo "$((good_agfl_size - 1)) 0xffffffff" > $tmp.remap
+	seq "$flfirst" "$fllast" | while read f; do
+		list_pos=$((f - flfirst))
+		dest_pos=$(( (new_flfirst + list_pos) % agfl_size ))
+		bno=$(_scratch_xfs_get_metadata_field "bno[$f]" "agfl 0")
+		echo "$dest_pos $bno" >> $tmp.remap
+	done
+
+	cat $tmp.remap | while read dest_pos bno junk; do
+		_scratch_xfs_set_metadata_field "bno[$dest_pos]" "$bno" \
+				"agfl 0" >> $seqres.full
+	done
+
+	# Set new flfirst/fllast
+	_scratch_xfs_set_metadata_field "fllast" "$new_fllast" \
+			"agf 0" >> $seqres.full
+	_scratch_xfs_set_metadata_field "flfirst" "$new_flfirst" \
+			"agf 0" >> $seqres.full
+
+	echo "FS AFTER" >> $seqres.full
+	dump_ag0 > $tmp.corrupt 2> /dev/null
+	diff -u $tmp.before $tmp.corrupt >> $seqres.full
+
+	# Mount and see what happens
+	mount_loop
+
+	# Did we end up with a non-wrapped list?
+	flfirst=$(_scratch_xfs_get_metadata_field "flfirst" "agf 0" 2>/dev/null)
+	fllast=$(_scratch_xfs_get_metadata_field "fllast" "agf 0" 2>/dev/null)
+	echo "flfirst=${flfirst} fllast=${fllast}" >> $seqres.full
+	if [ "${flfirst}" -ge "$((good_agfl_size - 1))" ]; then
+		echo "ASSERT flfirst < good_agfl_size - 1" | tee -a $seqres.full
+	fi
+	if [ "${fllast}" -ge "$((good_agfl_size - 1))" ]; then
+		echo "ASSERT fllast < good_agfl_size - 1" | tee -a $seqres.full
+	fi
+	if [ "${flfirst}" -ge "${fllast}" ]; then
+		echo "ASSERT flfirst < fllast" | tee -a $seqres.full
+	fi
+
+	echo "FS MOUNTLOOP" >> $seqres.full
+	dump_ag0 > $tmp.mountloop 2> /dev/null
+	diff -u $tmp.corrupt $tmp.mountloop >> $seqres.full
+
+	# Let's see what repair thinks
+	echo "REPAIR" >> $seqres.full
+	_scratch_xfs_repair >> $seqres.full 2>&1
+
+	echo "FS REPAIR" >> $seqres.full
+	dump_ag0 > $tmp.repair 2> /dev/null
+	diff -u $tmp.mountloop $tmp.repair >> $seqres.full
+
+	# Exercise the filesystem again to make sure there aren't any lasting
+	# ill effects from either the agfl reset or the recommended subsequent
+	# repair run.
+	mount_loop
+
+	echo "FS REMOUNT" >> $seqres.full
+	dump_ag0 > $tmp.remount 2> /dev/null
+	diff -u $tmp.repair $tmp.remount >> $seqres.full
+}
+
+runtest fix_end
+runtest fix_start
+runtest fix_wrap
+runtest start_zero
+runtest good_end
+runtest good_start
+runtest good_wrap
+runtest bad_start
+runtest no_move
+runtest simple_move
+
+# Did we get the kernel warning too?
+warn_str='WARNING: Reset corrupted AGFL'
+_check_dmesg_for "${warn_str}" || echo "Missing dmesg string \"${warn_str}\"."
+
+# Now run the regular dmesg check, filtering out the agfl warning
+filter_agfl_reset_printk() {
+	grep -v "${warn_str}"
+}
+_check_dmesg filter_agfl_reset_printk
+
+status=0
+exit 0
diff --git a/tests/xfs/709.out b/tests/xfs/709.out
new file mode 100644
index 0000000..f1fa9a3
--- /dev/null
+++ b/tests/xfs/709.out
@@ -0,0 +1,13 @@
+QA output created by 709
+TEST fix_end
+TEST fix_start
+TEST fix_wrap
+TEST start_zero
+TEST good_end
+TEST good_start
+TEST good_wrap
+TEST bad_start
+ASSERT flfirst < good_agfl_size - 1
+ASSERT flfirst < fllast
+TEST no_move
+TEST simple_move
diff --git a/tests/xfs/group b/tests/xfs/group
index e2397fe..472120e 100644
--- a/tests/xfs/group
+++ b/tests/xfs/group
@@ -441,3 +441,4 @@
 441 auto quick clone quota
 442 auto stress clone quota
 443 auto quick ioctl fsr
+709 auto quick

* [PATCH] common/xfs: don't call xfs_scrub on a block device
From: Darrick J. Wong @ 2018-03-22  2:46 UTC
  To: Eryu Guan; +Cc: david, fstests

From: Darrick J. Wong <darrick.wong@oracle.com>

xfs_scrub takes an xfs mountpoint as its argument, not a block device.
Therefore, fix _check_xfs_filesystem to call it correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 common/xfs |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/common/xfs b/common/xfs
index 5dbd81e..1d98ba1 100644
--- a/common/xfs
+++ b/common/xfs
@@ -358,7 +358,7 @@ _check_xfs_filesystem()
 	# Run online scrub if we can.
 	mntpt="$(_is_dev_mounted $device)"
 	if [ -n "$mntpt" ] && _supports_xfs_scrub "$mntpt" "$device"; then
-		"$XFS_SCRUB_PROG" $scrubflag -v -d -n $device > $tmp.scrub 2>&1
+		"$XFS_SCRUB_PROG" $scrubflag -v -d -n $mntpt > $tmp.scrub 2>&1
 		if [ $? -ne 0 ]; then
 			_log_err "_check_xfs_filesystem: filesystem on $device failed scrub"
 			echo "*** xfs_scrub $scrubflag -v -d -n output ***" >> $seqres.full

* [PATCH] common/xfs: fix various problems with _supports_xfs_scrub
From: Darrick J. Wong @ 2018-03-22  2:48 UTC
  To: Eryu Guan; +Cc: david, fstests

From: Darrick J. Wong <darrick.wong@oracle.com>

The _supports_xfs_scrub helper is called with a mountpoint (a working
mountpoint is required for scrub) and a block device (used to detect
norecovery mounts).  If either of these conditions isn't satisfied we
should return failure status to the caller, not unilaterally decide to
bail out of the test.  In particular, the -b test doesn't work if the
fs has already shut down on us.
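
The difference between returning and exiting can be sketched with a
stand-alone helper.  This is a minimal illustration, not the fstests code:
supports_feature is a hypothetical name, and /dev/null merely stands in for
a path that exists but is not a block device.

```shell
#!/bin/bash
# Hypothetical helper, modelled loosely on the pattern described above.
supports_feature() {
	local mountpoint="$1" device="$2"

	# Missing arguments are a caller bug, but still only a failed check.
	if [ -z "$device" ] || [ -z "$mountpoint" ]; then
		echo "Usage: supports_feature mountpoint device" >&2
		return 1
	fi

	# "return" hands a failure status back to the caller; "exit" here
	# would terminate the whole script that called this function.
	if [ ! -b "$device" ] || [ ! -e "$mountpoint" ]; then
		return 1
	fi

	return 0
}

# /dev/null is a character device, so the -b test fails and we skip.
if supports_feature /tmp /dev/null; then
	echo "running the optional check"
else
	echo "skipping the optional check"
fi
echo "caller still running"
```

Because the helper returns instead of exiting, the caller decides what a
failed check means and can carry on with the rest of its work.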

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 common/xfs |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/common/xfs b/common/xfs
index 1d98ba1..3b71e02 100644
--- a/common/xfs
+++ b/common/xfs
@@ -305,9 +305,13 @@ _supports_xfs_scrub()
 	local mountpoint="$1"
 	local device="$2"
 
-	if [ ! -b "$device" ] || [ ! -e "$mountpoint" ]; then
+	if [ -z "$device" ] || [ -z "$mountpoint" ]; then
 		echo "Usage: _supports_xfs_scrub mountpoint device"
-		exit 1
+		return 1
+	fi
+
+	if [ ! -b "$device" ] || [ ! -e "$mountpoint" ]; then
+		return 1
 	fi
 
 	test "$FSTYP" = "xfs" || return 1

* Re: [PATCH v3] xfs: test agfl reset on bad list wrapping
From: Eryu Guan @ 2018-03-23  5:26 UTC
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, david, fstests

On Wed, Mar 21, 2018 at 09:57:16AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> From the kernel patch that this test examines ("xfs: detect agfl count
> corruption and reset agfl"):
> 
> "The struct xfs_agfl v5 header was originally introduced with
> unexpected padding that caused the AGFL to operate with one less
> slot than intended. The header has since been packed, but the fix
> left an incompatibility for users who upgrade from an old kernel
> with the unpacked header to a newer kernel with the packed header
> while the AGFL happens to wrap around the end. The newer kernel
> recognizes one extra slot at the physical end of the AGFL that the
> previous kernel did not. The new kernel will eventually attempt to
> allocate a block from that slot, which contains invalid data, and
> cause a crash.
> 
> "This condition can be detected by comparing the active range of the
> AGFL to the count. While this detects a padding mismatch, it can
> also trigger false positives for unrelated flcount corruption. Since
> we cannot distinguish a size mismatch due to padding from unrelated
> corruption, we can't trust the AGFL enough to simply repopulate the
> empty slot.
> 
> "Instead, avoid unnecessarily complex detection logic and use a
> solution that can handle any form of flcount corruption that slips
> through read verifiers: distrust the entire AGFL and reset it to an
> empty state. Any valid blocks within the AGFL are intentionally
> leaked. This requires xfs_repair to rectify (which was already
> necessary based on the state the AGFL was found in). The reset
> mitigates the side effect of the padding mismatch problem from a
> filesystem crash to a free space accounting inconsistency."
> 
> This test exercises the reset code by mutating a fresh filesystem to
> contain an agfl with various list configurations of correctly wrapped,
> incorrectly wrapped, not wrapped, and actually corrupt free lists; then
> checks the success of the reset operation by fragmenting the free space
> btrees to exercise the agfl.  Kernels without this reset fix will shut
> down the filesystem with corruption errors.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v3: use fallocate instead of dd write, more factoring of common code
> v2: remove unnecessary umounts, refactor long lines into helpers
> ---
>  common/rc         |   23 ++++-
>  tests/xfs/709     |  258 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/709.out |   13 +++
>  tests/xfs/group   |    1 
>  4 files changed, 293 insertions(+), 2 deletions(-)
>  create mode 100755 tests/xfs/709
>  create mode 100644 tests/xfs/709.out
> 
> diff --git a/common/rc b/common/rc
> index 2c29d55..f7eb72d 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3440,6 +3440,26 @@ _get_device_size()
>  	grep `_short_dev $1` /proc/partitions | awk '{print $3}'
>  }
>  
> +# Make sure we actually have dmesg checking set up.
> +_require_check_dmesg() {
> +	test -w /dev/kmsg || \
> +		_notrun "Test requires writable /dev/kmsg."
> +}
> +
> +# Return the dmesg log since the start of this test.  Caller must ensure that
> +# /dev/kmsg was writable when the test was started so that we can find the
> +# beginning of this test's log messages; _require_check_dmesg does this.
> +_dmesg_since_test_start() {
> +	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
> +		tac
> +}
> +
> +# check dmesg log for a specific string, subject to the same requirements as
> +# _dmesg_since_test_start.
> +_check_dmesg_for() {
> +	_dmesg_since_test_start | egrep -q "$1"
> +}
> +
>  # check dmesg log for WARNING/Oops/etc.
>  _check_dmesg()
>  {
> @@ -3455,8 +3475,7 @@ _check_dmesg()
>  
>  	# search the dmesg log of last run of $seqnum for possible failures
>  	# use sed \cregexpc address type, since $seqnum contains "/"

The comments about sed usage probably should go to
_dmesg_since_test_start() too.
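
For reference, the tac/sed idiom in question can be tried on canned input.
This is a sketch under assumptions: log_since_marker and sample_log are
hypothetical names, the sample log lines are invented for illustration, and
the "0,/re/" address form needs GNU sed.

```shell
#!/bin/bash
# Print everything from the LAST line matching $1 to the end of stdin.
# Reversing with tac lets sed's "0,/re/" range stop at the first match,
# which is the most recent one in the original order.  The \#...# form
# picks "#" as the regexp delimiter because the marker contains "/".
log_since_marker() {
	local marker="$1"
	tac | sed -ne "0,\#${marker}#p" | tac
}

# Invented sample log; only the marker format mirrors the helper above.
sample_log() {
	cat <<'EOF'
run fstests xfs/708 at 2018-03-23 12:00:00
XFS (sdb): Unmounting Filesystem
run fstests xfs/709 at 2018-03-23 13:00:00
XFS (sdb): WARNING: Reset corrupted AGFL
EOF
}

sample_log | log_since_marker "run fstests xfs/709 at 2018-03-23 13:00:00"
```

Only the xfs/709 marker and the line after it are printed; everything from
earlier tests is dropped.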

> -	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
> -		tac | $filter >$seqres.dmesg
> +	_dmesg_since_test_start | $filter >$seqres.dmesg
>  	egrep -q -e "kernel BUG at" \
>  	     -e "WARNING:" \
>  	     -e "BUG:" \
> diff --git a/tests/xfs/709 b/tests/xfs/709
> new file mode 100755
> index 0000000..78cefe5
> --- /dev/null
> +++ b/tests/xfs/709
> @@ -0,0 +1,258 @@
> +#! /bin/bash
> +# FS QA Test No. 709
> +#
> +# Make sure XFS can fix a v5 AGFL that wraps over the last block.
> +# Refer to commit 96f859d52bcb ("libxfs: pack the agfl header structure so
> +# XFS_AGFL_SIZE is correct") for details on the original on-disk format error
> +# and the patch ("xfs: detect agfl count corruption and reset agfl") for details
> +# about the fix.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2018 Oracle, Inc.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1
> +trap "_cleanup; rm -f $tmp.*; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +rm -f $seqres.full
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_supported_os Linux
> +
> +_require_check_dmesg
> +_require_scratch
> +_require_test_program "punch-alternating"
> +
> +# This is only a v5 filesystem problem
> +_require_scratch_xfs_crc
> +
> +mount_loop() {
> +	if ! _try_scratch_mount >> $seqres.full 2>&1; then
> +		echo "scratch mount failed" >> $seqres.full
> +		return
> +	fi
> +
> +	# Trigger agfl fixing by fragmenting free space enough to cause
> +	# a bnobt split
> +	blksz=$(_get_file_block_size ${SCRATCH_MNT})
> +	bno_maxrecs=$(( blksz / 8 ))
> +	filesz=$((bno_maxrecs * 3 * blksz))
> +	rm -rf $SCRATCH_MNT/a
> +	$XFS_IO_PROG -f -c "falloc 0 $filesz" $SCRATCH_MNT/a

I also noticed a test failure with the patch "xfs: detect agfl count
corruption and reset agfl" applied on top of a 4.16-rc5 kernel. It looks
like we should redirect the xfs_io output to $seqres.full, as the v2 patch
did for dd:

dd if=/dev/zero of=$SCRATCH_MNT/a bs=8192k >> $seqres.full 2>&1

--- tests/xfs/709.out   2018-03-23 12:45:16.831011711 +0800
+++ /root/workspace/xfstests/results//xfs_4k_crc/xfs/709.out.bad        2018-03-23 13:12:10.083980820 +0800
@@ -7,6 +7,7 @@
 TEST good_start
 TEST good_wrap
 TEST bad_start
+fallocate: Structure needs cleaning
 ASSERT flfirst < good_agfl_size - 1
 ASSERT flfirst < fllast
 TEST no_move

> +	test -e $SCRATCH_MNT/a && ./src/punch-alternating $SCRATCH_MNT/a
> +	rm -rf $SCRATCH_MNT/a
> +
> +	_scratch_unmount 2>&1 | _filter_scratch
> +}
> +
> +dump_ag0() {
> +	_scratch_xfs_db -c 'sb 0' -c 'p' -c 'agf 0' -c 'p' -c 'agfl 0' -c 'p'
> +}
> +
> +runtest() {
> +	cmd="$1"
> +
> +	# Format filesystem
> +	echo "TEST $cmd" | tee /dev/ttyprintk

What's the purpose of writing to /dev/ttyprintk? I don't see how it's
used in the test.

Thanks,
Eryu

* Re: [PATCH v3] xfs: test agfl reset on bad list wrapping
From: Darrick J. Wong @ 2018-03-23 16:08 UTC
  To: Eryu Guan; +Cc: Brian Foster, linux-xfs, david, fstests

On Fri, Mar 23, 2018 at 01:26:06PM +0800, Eryu Guan wrote:
> On Wed, Mar 21, 2018 at 09:57:16AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > From the kernel patch that this test examines ("xfs: detect agfl count
> > corruption and reset agfl"):
> > 
> > "The struct xfs_agfl v5 header was originally introduced with
> > unexpected padding that caused the AGFL to operate with one less
> > slot than intended. The header has since been packed, but the fix
> > left an incompatibility for users who upgrade from an old kernel
> > with the unpacked header to a newer kernel with the packed header
> > while the AGFL happens to wrap around the end. The newer kernel
> > recognizes one extra slot at the physical end of the AGFL that the
> > previous kernel did not. The new kernel will eventually attempt to
> > allocate a block from that slot, which contains invalid data, and
> > cause a crash.
> > 
> > "This condition can be detected by comparing the active range of the
> > AGFL to the count. While this detects a padding mismatch, it can
> > also trigger false positives for unrelated flcount corruption. Since
> > we cannot distinguish a size mismatch due to padding from unrelated
> > corruption, we can't trust the AGFL enough to simply repopulate the
> > empty slot.
> > 
> > "Instead, avoid unnecessarily complex detection logic and use a
> > solution that can handle any form of flcount corruption that slips
> > through read verifiers: distrust the entire AGFL and reset it to an
> > empty state. Any valid blocks within the AGFL are intentionally
> > leaked. This requires xfs_repair to rectify (which was already
> > necessary based on the state the AGFL was found in). The reset
> > mitigates the side effect of the padding mismatch problem from a
> > filesystem crash to a free space accounting inconsistency."
> > 
> > This test exercises the reset code by mutating a fresh filesystem to
> > contain an agfl with various list configurations of correctly wrapped,
> > incorrectly wrapped, not wrapped, and actually corrupt free lists; then
> > checks the success of the reset operation by fragmenting the free space
> > btrees to exercise the agfl.  Kernels without this reset fix will shut
> > down the filesystem with corruption errors.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v3: use fallocate instead of dd write, more factoring of common code
> > v2: remove unnecessary umounts, refactor long lines into helpers
> > ---
> >  common/rc         |   23 ++++-
> >  tests/xfs/709     |  258 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/709.out |   13 +++
> >  tests/xfs/group   |    1 
> >  4 files changed, 293 insertions(+), 2 deletions(-)
> >  create mode 100755 tests/xfs/709
> >  create mode 100644 tests/xfs/709.out
> > 
> > diff --git a/common/rc b/common/rc
> > index 2c29d55..f7eb72d 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -3440,6 +3440,26 @@ _get_device_size()
> >  	grep `_short_dev $1` /proc/partitions | awk '{print $3}'
> >  }
> >  
> > +# Make sure we actually have dmesg checking set up.
> > +_require_check_dmesg() {
> > +	test -w /dev/kmsg || \
> > +		_notrun "Test requires writable /dev/kmsg."
> > +}
> > +
> > +# Return the dmesg log since the start of this test.  Caller must ensure that
> > +# /dev/kmsg was writable when the test was started so that we can find the
> > +# beginning of this test's log messages; _require_check_dmesg does this.
> > +_dmesg_since_test_start() {
> > +	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
> > +		tac
> > +}
> > +
> > +# check dmesg log for a specific string, subject to the same requirements as
> > +# _dmesg_since_test_start.
> > +_check_dmesg_for() {
> > +	_dmesg_since_test_start | egrep -q "$1"
> > +}
> > +
> >  # check dmesg log for WARNING/Oops/etc.
> >  _check_dmesg()
> >  {
> > @@ -3455,8 +3475,7 @@ _check_dmesg()
> >  
> >  	# search the dmesg log of last run of $seqnum for possible failures
> >  	# use sed \cregexpc address type, since $seqnum contains "/"
> 
> The comments about sed usage probably should go to
> _dmesg_since_test_start() too.

Ok

> > -	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \
> > -		tac | $filter >$seqres.dmesg
> > +	_dmesg_since_test_start | $filter >$seqres.dmesg
> >  	egrep -q -e "kernel BUG at" \
> >  	     -e "WARNING:" \
> >  	     -e "BUG:" \
> > diff --git a/tests/xfs/709 b/tests/xfs/709
> > new file mode 100755
> > index 0000000..78cefe5
> > --- /dev/null
> > +++ b/tests/xfs/709
> > @@ -0,0 +1,258 @@
> > +#! /bin/bash
> > +# FS QA Test No. 709
> > +#
> > +# Make sure XFS can fix a v5 AGFL that wraps over the last block.
> > +# Refer to commit 96f859d52bcb ("libxfs: pack the agfl header structure so
> > +# XFS_AGFL_SIZE is correct") for details on the original on-disk format error
> > +# and the patch ("xfs: detect agfl count corruption and reset agfl") for details
> > +# about the fix.
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2018 Oracle, Inc.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#
> > +#-----------------------------------------------------------------------
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1
> > +trap "_cleanup; rm -f $tmp.*; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -f $tmp.*
> > +}
> > +
> > +rm -f $seqres.full
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_supported_os Linux
> > +
> > +_require_check_dmesg
> > +_require_scratch
> > +_require_test_program "punch-alternating"
> > +
> > +# This is only a v5 filesystem problem
> > +_require_scratch_xfs_crc
> > +
> > +mount_loop() {
> > +	if ! _try_scratch_mount >> $seqres.full 2>&1; then
> > +		echo "scratch mount failed" >> $seqres.full
> > +		return
> > +	fi
> > +
> > +	# Trigger agfl fixing by fragmenting free space enough to cause
> > +	# a bnobt split
> > +	blksz=$(_get_file_block_size ${SCRATCH_MNT})
> > +	bno_maxrecs=$(( blksz / 8 ))
> > +	filesz=$((bno_maxrecs * 3 * blksz))
> > +	rm -rf $SCRATCH_MNT/a
> > +	$XFS_IO_PROG -f -c "falloc 0 $filesz" $SCRATCH_MNT/a
> 
> And I noticed a test failure with the patch "xfs: detect agfl count
> corruption and reset agfl" applied on top of a 4.16-rc5 kernel. Looks
> like we should dump the xfs_io output to $seqres.full, as in the v2 patch:
> 
> dd if=/dev/zero of=$SCRATCH_MNT/a bs=8192k >> $seqres.full 2>&1

Yep.
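
A minimal sketch of the redirect being discussed, for readers following
along. noisy_cmd and full_log are hypothetical stand-ins for the xfs_io
call and $seqres.full; this is not the code from any later revision of
the patch. The point is only that sending both stdout and stderr to the
full log keeps an "fallocate: Structure needs cleaning" message out of
the golden output.

```shell
# Hypothetical stand-in for a command that fails noisily; the real test
# calls xfs_io here.
noisy_cmd() {
	echo "fallocate: Structure needs cleaning" >&2
	return 1
}

# Hypothetical stand-in for $seqres.full.
full_log=$(mktemp)

echo "TEST bad_start"
# Send stdout and stderr to the log; the exit status is deliberately
# ignored so a failure here does not abort the script.
noisy_cmd >> "$full_log" 2>&1 || true
echo "TEST no_move"
```

With the redirect in place, only the two TEST lines reach stdout, and
the error text is still available in the log for debugging.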

In the bad_start case the fs checks every agfl and refuses to mount...
if your kernel has CONFIG_XFS_DEBUG=y.

> --- tests/xfs/709.out   2018-03-23 12:45:16.831011711 +0800
> +++ /root/workspace/xfstests/results//xfs_4k_crc/xfs/709.out.bad        2018-03-23 13:12:10.083980820 +0800
> @@ -7,6 +7,7 @@
>  TEST good_start
>  TEST good_wrap
>  TEST bad_start
> +fallocate: Structure needs cleaning
>  ASSERT flfirst < good_agfl_size - 1
>  ASSERT flfirst < fllast
>  TEST no_move
> 
> > +	test -e $SCRATCH_MNT/a && ./src/punch-alternating $SCRATCH_MNT/a
> > +	rm -rf $SCRATCH_MNT/a
> > +
> > +	_scratch_unmount 2>&1 | _filter_scratch
> > +}
> > +
> > +dump_ag0() {
> > +	_scratch_xfs_db -c 'sb 0' -c 'p' -c 'agf 0' -c 'p' -c 'agfl 0' -c 'p'
> > +}
> > +
> > +runtest() {
> > +	cmd="$1"
> > +
> > +	# Format filesystem
> > +	echo "TEST $cmd" | tee /dev/ttyprintk
> 
> What's the purpose of writing to /dev/ttyprintk? I don't see how it's
> used in the test.

It makes it easy to tell which kernel messages came from which runtest()
invocation so that we can tell if a particular agfl mutation test
actually triggered the fixup.

--D

> Thanks,
> Eryu

* Re: [PATCH v3] xfs: test agfl reset on bad list wrapping
  2018-03-23 16:08   ` Darrick J. Wong
@ 2018-03-26  1:22     ` Eryu Guan
  2018-03-28  1:20       ` Eryu Guan
  0 siblings, 1 reply; 9+ messages in thread
From: Eryu Guan @ 2018-03-26  1:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, david, fstests

On Fri, Mar 23, 2018 at 09:08:17AM -0700, Darrick J. Wong wrote:
> > > +
> > > +	# Format filesystem
> > > +	echo "TEST $cmd" | tee /dev/ttyprintk
> > 
> > What's the purpose of writing to /dev/ttyprintk? I don't see how it's
> > used in the test.
> 
> It makes it easy to tell which kernel messages came from which runtest()
> invocation so that we can tell if a particular agfl mutation test
> actually triggered the fixup.

This could fail the test if /dev/ttyprintk doesn't exist. It seems
writing to /dev/kmsg could tell us the same information, and we've
already made sure /dev/kmsg is writable by _require_check_dmesg. IMHO
/dev/kmsg might be a better choice here.
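
As a rough sketch of that idea (log_marker and its optional second
argument are hypothetical names, not fstests helpers), the marker could
be written only when the target is a writable character device:

```shell
# log_marker MESSAGE [TARGET]
# Hypothetical helper: write MESSAGE to TARGET (default /dev/kmsg) only
# if TARGET is an existing, writable character device. Returns non-zero
# otherwise, and never creates TARGET.
log_marker() {
	msg="$1"
	target="${2:-/dev/kmsg}"
	if [ -c "$target" ] && [ -w "$target" ]; then
		# Group the write so an open failure is also silenced.
		{ echo "$msg" > "$target"; } 2>/dev/null
	else
		return 1
	fi
}

cmd="good_wrap"
echo "TEST $cmd"
log_marker "TEST $cmd" || echo "no kernel log marker written" >&2
```

The [ -c ] check is what keeps the helper from creating a stray regular
file when the device node is missing.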

Thanks,
Eryu

* Re: [PATCH v3] xfs: test agfl reset on bad list wrapping
  2018-03-26  1:22     ` Eryu Guan
@ 2018-03-28  1:20       ` Eryu Guan
  0 siblings, 0 replies; 9+ messages in thread
From: Eryu Guan @ 2018-03-28  1:20 UTC (permalink / raw)
  To: Darrick J. Wong, Brian Foster; +Cc: linux-xfs, david, fstests

On Mon, Mar 26, 2018 at 09:22:53AM +0800, Eryu Guan wrote:
> On Fri, Mar 23, 2018 at 09:08:17AM -0700, Darrick J. Wong wrote:
> > > > +
> > > > +	# Format filesystem
> > > > +	echo "TEST $cmd" | tee /dev/ttyprintk
> > > 
> > > What's the purpose of writing to /dev/ttyprintk? I don't see how it's
> > > used in the test.
> > 
> > It makes it easy to tell which kernel messages came from which runtest()
> > invocation so that we can tell if a particular agfl mutation test
> > actually triggered the fixup.
> 
> This could fail the test if /dev/ttyprintk doesn't exist. It seems

Correction: it doesn't fail the test, but it creates a new
/dev/ttyprintk file. Still, I think this should be addressed.
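
The behaviour is easy to reproduce outside /dev. The sketch below uses a
temporary directory and a made-up path (fake_dev) in place of a missing
/dev/ttyprintk, then shows one possible guard:

```shell
dir=$(mktemp -d)
fake_dev="$dir/ttyprintk"	# stand-in for a missing /dev/ttyprintk

# tee opens its argument for writing, so a missing path is created as a
# regular file instead of producing an error.
echo "TEST good_start" | tee "$fake_dev" > /dev/null
if [ -f "$fake_dev" ] && [ ! -c "$fake_dev" ]; then
	echo "tee created a regular file"
fi

# Guarded variant: only tee into the path when it is a character device.
rm -f "$fake_dev"
if [ -c "$fake_dev" ]; then
	echo "TEST good_start" | tee "$fake_dev"
else
	echo "TEST good_start"
fi
[ -e "$fake_dev" ] || echo "nothing created"

rm -rf "$dir"
```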

Other than that, the test runs fine for me: it fails with the 4.16-rc7
kernel and passes with the mentioned patch applied.

Brian, would you please help review the new version of this patch in
patchset "[PATCH 0/4] misc. fstests changes" (patch 3/4) as well? I'd
really like a Reviewed-by tag from someone who knows all the details of
the test and the fix :) Thanks a lot!

Eryu

> writing to /dev/kmsg could tell us the same information, and we've
> already made sure /dev/kmsg is writable by _require_check_dmesg. IMHO
> /dev/kmsg might be a better choice here.
> 
> Thanks,
> Eryu

* Re: [PATCH] common/xfs: don't call xfs_scrub on a block device
  2018-03-22  2:46 ` [PATCH] common/xfs: don't call xfs_scrub on a block device Darrick J. Wong
@ 2018-03-29 10:25   ` Xiao Yang
  2018-03-29 15:46     ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Xiao Yang @ 2018-03-29 10:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Eryu Guan, fstests

On 2018/03/22 10:46, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> xfs_scrub takes an xfs mountpoint as its argument, not a block device.
> Therefore, fix _check_xfs_filesystem to call it correctly.
Hi Darrick,

According to xfs_scrub manpage, it seems that xfs_scrub can take a mounted block device
as its argument.  Is the xfs_scrub manpage incorrect?

Thanks,
Xiao Yang

> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   common/xfs |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/common/xfs b/common/xfs
> index 5dbd81e..1d98ba1 100644
> --- a/common/xfs
> +++ b/common/xfs
> @@ -358,7 +358,7 @@ _check_xfs_filesystem()
>   	# Run online scrub if we can.
>   	mntpt="$(_is_dev_mounted $device)"
>   	if [ -n "$mntpt" ] && _supports_xfs_scrub "$mntpt" "$device"; then
> -		"$XFS_SCRUB_PROG" $scrubflag -v -d -n $device > $tmp.scrub 2>&1
> +		"$XFS_SCRUB_PROG" $scrubflag -v -d -n $mntpt > $tmp.scrub 2>&1
>   		if [ $? -ne 0 ]; then
>   			_log_err "_check_xfs_filesystem: filesystem on $device failed scrub"
>   			echo "*** xfs_scrub $scrubflag -v -d -n output ***" >> $seqres.full

* Re: [PATCH] common/xfs: don't call xfs_scrub on a block device
  2018-03-29 10:25   ` Xiao Yang
@ 2018-03-29 15:46     ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2018-03-29 15:46 UTC (permalink / raw)
  To: Xiao Yang; +Cc: Eryu Guan, fstests

On Thu, Mar 29, 2018 at 06:25:52PM +0800, Xiao Yang wrote:
> On 2018/03/22 10:46, Darrick J. Wong wrote:
> >From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> >xfs_scrub takes an xfs mountpoint as its argument, not a block device.
> >Therefore, fix _check_xfs_filesystem to call it correctly.
> Hi Darrick,
> 
> According to xfs_scrub manpage, it seems that xfs_scrub can take a mounted block device
> as its argument.  Is the xfs_scrub manpage incorrect?

Yes, the manpage is wrong and will be fixed in xfsprogs 4.16.
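
For anyone scripting this outside fstests, here is a rough sketch of
resolving a device to its mountpoint before calling xfs_scrub.
mntpt_of_dev is a hypothetical helper, not the fstests _is_dev_mounted
implementation, and the mounts table below is made up for the demo.

```shell
# mntpt_of_dev DEVICE [MOUNTS_FILE]
# Hypothetical helper: print the first mountpoint listed for DEVICE in a
# /proc/mounts-style table. Prints nothing if DEVICE is not listed.
# Note: /proc/mounts escapes spaces in paths as \040; this sketch does
# not decode them.
mntpt_of_dev() {
	awk -v dev="$1" '$1 == dev { print $2; exit }' "${2:-/proc/mounts}"
}

# Demo against a made-up mounts table.
mounts=$(mktemp)
cat > "$mounts" <<'EOF'
/dev/sda1 / ext4 rw 0 0
/dev/sdb1 /mnt/scratch xfs rw 0 0
EOF

device=/dev/sdb1
mntpt=$(mntpt_of_dev "$device" "$mounts")
if [ -n "$mntpt" ]; then
	# Pass the mountpoint, not the block device.
	echo "would run: xfs_scrub -v -d -n $mntpt"
fi
```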

--D

> Thanks,
> Xiao Yang
> 
> >Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >---
> >  common/xfs |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/common/xfs b/common/xfs
> >index 5dbd81e..1d98ba1 100644
> >--- a/common/xfs
> >+++ b/common/xfs
> >@@ -358,7 +358,7 @@ _check_xfs_filesystem()
> >  	# Run online scrub if we can.
> >  	mntpt="$(_is_dev_mounted $device)"
> >  	if [ -n "$mntpt" ] && _supports_xfs_scrub "$mntpt" "$device"; then
> >-		"$XFS_SCRUB_PROG" $scrubflag -v -d -n $device > $tmp.scrub 2>&1
> >+		"$XFS_SCRUB_PROG" $scrubflag -v -d -n $mntpt > $tmp.scrub 2>&1
> >  		if [ $? -ne 0 ]; then
> >  			_log_err "_check_xfs_filesystem: filesystem on $device failed scrub"
> >  			echo "*** xfs_scrub $scrubflag -v -d -n output ***" >> $seqres.full

Thread overview: 9+ messages
2018-03-21 16:57 [PATCH v3] xfs: test agfl reset on bad list wrapping Darrick J. Wong
2018-03-22  2:46 ` [PATCH] common/xfs: don't call xfs_scrub on a block device Darrick J. Wong
2018-03-29 10:25   ` Xiao Yang
2018-03-29 15:46     ` Darrick J. Wong
2018-03-22  2:48 ` [PATCH] common/xfs: fix various problems with _supports_xfs_scrub Darrick J. Wong
2018-03-23  5:26 ` [PATCH v3] xfs: test agfl reset on bad list wrapping Eryu Guan
2018-03-23 16:08   ` Darrick J. Wong
2018-03-26  1:22     ` Eryu Guan
2018-03-28  1:20       ` Eryu Guan
