* [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
@ 2017-08-27 10:44 Amir Goldstein
From: Amir Goldstein @ 2017-08-27 10:44 UTC
  To: Theodore Ts'o; +Cc: Eryu Guan, Josef Bacik, fstests, linux-ext4

This test is motivated by a bug found in ext4 during random crash
consistency tests.

This test uses device mapper flakey target to demonstrate the bug
found using device mapper log-writes target.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Ted,

While working on crash consistency xfstests [1], I stumbled on what
appeared to be an ext4 crash consistency bug.

The tests I used rely on the log-writes dm target code written
by Josef Bacik, which, as far as I know, has had little exposure to
the wider community.  I wanted to prove to myself that the
inconsistency I found was not due to a test bug, so I reduced the failing
test to the minimal set of operations that trigger the failure and wrote
a small independent test to reproduce the issue using the dm flakey target.

The following fsck error is reliably reproduced by replaying some fsx ops
on overlapping file regions, then emulating a crash, followed by mount,
umount and fsck -nf:

  ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
  1 write 0x137dd thru    0x21445 (0xdc69 bytes)
  2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
  3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
  4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
  5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
  6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
  All 7 operations completed A-OK!
  _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
  *** fsck.ext4 output ***
  fsck from util-linux 2.27.1
  e2fsck 1.42.13 (17-May-2015)
  Pass 1: Checking inodes, blocks, and sizes
  Inode 12, end of extent exceeds allowed value
          (logical block 33, physical block 33441, len 7)
  Clear? no
  Inode 12, i_blocks is 184, should be 128.  Fix? no

Note that the inconsistency is "applied" by journal replay during mount.
fsck -nf before mount does not report any errors.
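
The numbers in the fsck output above can be cross-checked with a little
arithmetic.  The sketch below assumes a 4 KiB filesystem block size (an
assumption; the mkfs options are not stated in this report) and relies on
i_blocks being counted in 512-byte units.  Under that assumption the
surplus in i_blocks comes to 7 filesystem blocks, the same as the length
of the extent that e2fsck flags:

```shell
# Assumption: 4 KiB filesystem blocks; i_blocks is counted in 512-byte units.
block_size=4096
sector_size=512
reported=184   # i_blocks found in inode 12
expected=128   # i_blocks that e2fsck computed
extra_sectors=$((reported - expected))
extra_blocks=$((extra_sectors * sector_size / block_size))
echo "extra sectors: $extra_sectors, extra fs blocks: $extra_blocks"
```

This only shows that the two complaints are numerically consistent with
each other; it says nothing about the root cause.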

I did not intend for this test to be merged as is, but rather to be used
by ext4 developers to analyze the problem and then re-write the test with
more comments and less arbitrary offset/length values.
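
To make the offset/length values easier to follow, here is a small sketch
that turns an (offset, length) pair into the byte range it touches.  In
the fsx log above, "thru" appears to be the inclusive last byte and "to"
the exclusive end.  The helper name range_of is hypothetical and is not
part of fsx or xfstests:

```shell
# Hypothetical helper: print the byte range touched by an (offset, length) pair.
# "thru" is the inclusive last byte; "to" is the exclusive end.
range_of() {
    off=$(($1))
    len=$(($2))
    printf '0x%x thru 0x%x (to 0x%x)\n' "$off" "$((off + len - 1))" "$((off + len))"
}

range_of 0x137dd 0xdc69   # op 1, write
range_of 0x1c000 0x4000   # op 3, collapse
```

For op 1 this prints a range ending at 0x21445, matching the fsx log, and
an exclusive end of 0x21446.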

P.S.: crash consistency tests also reliably reproduce a btrfs fsck error.
      a detailed report with I/O recording was sent to Josef.
P.S.2: crash consistency tests report file data checksum errors on xfs
       after fsync+crash, but I still need to prove the reliability of
       these reports.
 
[1] https://github.com/amir73il/xfstests/commits/dm-log-writes

 tests/generic/501     | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/501.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 83 insertions(+)
 create mode 100755 tests/generic/501
 create mode 100644 tests/generic/501.out

diff --git a/tests/generic/501 b/tests/generic/501
new file mode 100755
index 0000000..ccb513d
--- /dev/null
+++ b/tests/generic/501
@@ -0,0 +1,80 @@
+#! /bin/bash
+# FS QA Test No. 501
+#
+# This test is motivated by a bug found in ext4 during random crash
+# consistency tests.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2017 CTERA Networks. All Rights Reserved.
+# Author: Amir Goldstein <amir73il@gmail.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+
+_init_flakey
+_mount_flakey
+
+fsxops=$tmp.fsxops
+cat <<EOF > $fsxops
+write 0x137dd 0xdc69 0x0
+fallocate 0xb531 0xb5ad 0x21446
+collapse_range 0x1c000 0x4000 0x21446
+write 0x3e5ec 0x1a14 0x21446
+zero_range 0x20fac 0x6d9c 0x40000 keep_size
+mapwrite 0x216ad 0x274f 0x40000
+EOF
+run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile
+
+_flakey_drop_and_remount
+_unmount_flakey
+_cleanup_flakey
+_check_scratch_fs
+
+echo "Silence is golden"
+
+status=0
+exit
diff --git a/tests/generic/501.out b/tests/generic/501.out
new file mode 100644
index 0000000..00133b6
--- /dev/null
+++ b/tests/generic/501.out
@@ -0,0 +1,2 @@
+QA output created by 501
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 2396b72..bb870f2 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -454,3 +454,4 @@
 449 auto quick acl enospc
 450 auto quick rw
 500 auto log replay
+501 auto quick metadata
-- 
2.7.4


* [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
@ 2017-08-31  1:28 Ashlie Martinez
From: Ashlie Martinez @ 2017-08-31  1:28 UTC
  To: amir73il; +Cc: tytso, eguan, jbacik, vvijay03, fstests, linux-ext4

Amir,

I have been working on CrashMonkey more and I have jury-rigged together
a test in CrashMonkey that calls into `fsx` with the minimal test case
you made. I am able to reproduce the ext4 error that you found along
with a few other potential errors.

One quick point: I run fsck with `-yf` instead of the `-nf` that xfstests
uses. The reason for this is that CrashMonkey would like to report on
fixable and unfixable errors in the future.
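
As a rough sketch of how that distinction could be read from the fsck
exit status: e2fsck(8) documents the status as a bit mask (1 and 2 for
errors corrected, 4 for errors left uncorrected, 8 and above for
operational or usage failures). The helper name `classify_fsck_status`
and its labels are hypothetical; this is not CrashMonkey code.

```shell
# Hypothetical helper: map an e2fsck exit status (a bit mask, per e2fsck(8))
# to a coarse label. Not CrashMonkey code.
classify_fsck_status() {
    status=$1
    if [ $((status & ~7)) -ne 0 ]; then
        echo "fsck-failed"   # 8 and above: operational/usage error, cancelled, ...
    elif [ $((status & 4)) -ne 0 ]; then
        echo "unfixed"       # errors left uncorrected
    elif [ $((status & 3)) -ne 0 ]; then
        echo "fixed"         # errors found and corrected
    else
        echo "clean"         # no errors
    fi
}

classify_fsck_status 1
classify_fsck_status 4
```

With `-n` nothing is ever corrected, so any inconsistency shows up as
"unfixed"; with `-y` a "fixed" result would suggest the errors were
repairable.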

Running the ported test case, I find that CrashMonkey encounters the 
following errors:
1. Incorrect inode size and incorrect free data block and inode counts
(fixable)
2. Incorrect free data block and inode counts (fixable)
3. `Superblock needs_recovery flag is clear, but journal has data`
notice along with the errors present in case 1
4. `Superblock needs_recovery flag is clear, but journal has data`
notice with no other errors

For the incorrect i_size errors, I get the output `Inode 12, i_size is 
147456, should be 163840.` which I can also reproduce with your 501 
xfstests test case.

When the free data block and inode count errors occur, the messages are
`Free blocks count wrong (8795, counted=8714).` and `Free inodes count
wrong (2549, counted=2546).`

I have not had a chance to look into the above errors to find their root 
causes.

In total, CrashMonkey ran 1000 different tests. Of those, 344 passed 
without fsck complaining. The remaining 656 tests saw fsck complain 
about something. All of these tests consisted of unique sequences of 
bios, but may contain equivalent crash states.

The larger range of test results is due to the fact that CrashMonkey
runs many tests from just the single workload you made. These tests
consist of replaying some number of bio write operations, so it tests
states different from those in your 500 xfstest, which I believe only
replays to sync operations (i.e. it never stops replay before a recorded
fsync).
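
To illustrate the difference in coverage, here is a toy sketch that lists
the points in a recorded log at which replay could stop. It only models
stopping after a prefix of the log; the `W`/`S` encoding and the
`crash_points` helper are hypothetical and are not CrashMonkey or
xfstests code.

```shell
# Hypothetical sketch, not CrashMonkey code. Each argument after the mode is
# one recorded op: W = bio write, S = sync/flush marker.
crash_points() {
    mode=$1
    shift
    n=0
    points=""
    for op in "$@"; do
        n=$((n + 1))
        # "all" stops after every op; "sync" stops only after sync markers.
        if [ "$mode" = all ] || [ "$op" = S ]; then
            points="$points${points:+ }$n"
        fi
    done
    echo "$points"
}

crash_points sync W W S W W W S
crash_points all  W W S W W W S
```

For this seven-op log, stopping only at sync markers gives two crash
states, while stopping after any op gives seven.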

If you're interested, you can find the CrashMonkey code (and branch) at
https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug. If you
would like to run it, you should clone and build your xfstests in your
home directory so that the jury-rigged CrashMonkey test case can find
it. Directions for running this test case in CrashMonkey should be at
the top of the README.


Thread overview: 31+ messages (newest: 2017-10-17 23:17 UTC)
2017-08-27 10:44 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Amir Goldstein
2017-09-25  9:49 ` Xiao Yang
2017-09-25  9:49   ` Xiao Yang
2017-09-25 10:53   ` Amir Goldstein
2017-09-26 10:45     ` Xiao Yang
2017-09-26 11:48       ` Amir Goldstein
2017-09-30 14:15     ` Ashlie Martinez
2017-10-05  7:27       ` Xiao Yang
2017-10-05 15:04         ` Ashlie Martinez
2017-10-05 19:10           ` Amir Goldstein
2017-10-06  0:34             ` Ashlie Martinez
2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
2017-10-07  5:54                 ` Amir Goldstein
2017-10-07 18:32                   ` Theodore Ts'o
2017-10-09  0:37                 ` Ashlie Martinez
2017-10-11 11:11                 ` Xiao Yang
2017-10-11 13:17                   ` Ashlie Martinez
2017-10-11 13:34                     ` Amir Goldstein
2017-10-16 19:32                       ` Ashlie Martinez
2017-10-16 21:11                         ` Amir Goldstein
2017-10-17  0:09                           ` Theodore Ts'o
2017-10-17  1:02                             ` Vijay Chidambaram
     [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
2017-10-17  7:15                               ` Amir Goldstein
2017-10-17 14:41                               ` Theodore Ts'o
2017-10-17 23:16                                 ` Vijay Chidambaram
2017-10-12 14:38                 ` Jan Kara
2017-08-31  1:28 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Ashlie Martinez
2017-08-31  4:05 ` Amir Goldstein
2017-08-31  4:06   ` Amir Goldstein
2017-09-01 12:21     ` Ashlie Martinez
2017-09-01 14:59       ` Amir Goldstein
