linux-btrfs.vger.kernel.org archive mirror
* [PATCH] fstests: generic, fsync fuzz tester with fsstress
@ 2019-05-15 15:02 fdmanana
  2019-05-15 15:07 ` Vijay Chidambaram
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: fdmanana @ 2019-05-15 15:02 UTC (permalink / raw)
  To: fstests; +Cc: linux-btrfs, linux-ext4, jack, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

Run fsstress, fsync every file and directory, simulate a power failure and
then verify the all files and directories exist, with the same data and
metadata they had before the power failure.

This tes has found already 2 bugs in btrfs, that caused mtime and ctime of
directories not being preserved after replaying the log/journal and loss
of a directory's attributes (such a UID and GID) after replaying the log.
The patches that fix the btrfs issues are titled:

  "Btrfs: fix wrong ctime and mtime of a directory after log replay"
  "Btrfs: fix fsync not persisting changed attributes of a directory"

Running this test 1000 times:

- on xfs, no issues were found

- on ext4 it has resulted in about a dozen journal checksum errors (on a
  5.0 kernel) that resulted in failure to mount the filesystem after the
  simulated power failure with dmflakey, which produces the following
  error in dmesg/syslog:

    [Mon May 13 12:51:37 2019] JBD2: journal checksum error
    [Mon May 13 12:51:37 2019] EXT4-fs (dm-0): error loading journal

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 tests/generic/547     | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/547.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 75 insertions(+)
 create mode 100755 tests/generic/547
 create mode 100644 tests/generic/547.out

diff --git a/tests/generic/547 b/tests/generic/547
new file mode 100755
index 00000000..577b0e9b
--- /dev/null
+++ b/tests/generic/547
@@ -0,0 +1,72 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test No. 547
+#
+# Run fsstress, fsync every file and directory, simulate a power failure and
+# then verify the all files and directories exist, with the same data and
+# metadata they had before the power failure.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_fssum
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+fssum_files_dir=$TEST_DIR/generic-test-$seq
+rm -fr $fssum_files_dir
+mkdir $fssum_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+mkdir $SCRATCH_MNT/test
+args=`_scale_fsstress_args -p 4 -n 100 $FSSTRESS_AVOID -d $SCRATCH_MNT/test`
+args="$args -f mknod=0 -f symlink=0"
+echo "Running fsstress with arguments: $args" >>$seqres.full
+$FSSTRESS_PROG $args >>$seqres.full
+
+# Fsync every file and directory.
+find $SCRATCH_MNT/test -type f,d -exec $XFS_IO_PROG -c "fsync" {} \;
+
+# Compute a digest of the filesystem (using the test directory only, to skip
+# fs specific directories such as "lost+found" on ext4 for example).
+$FSSUM_PROG -A -f -w $fssum_files_dir/fs_digest $SCRATCH_MNT/test
+
+# Simulate a power failure and mount the filesystem to check that all files and
+# directories exist and have all data and metadata preserved.
+_flakey_drop_and_remount
+
+# Compute a new digest and compare it to the one we created previously, they
+# must match.
+$FSSUM_PROG -r $fssum_files_dir/fs_digest $SCRATCH_MNT/test
+
+_unmount_flakey
+
+status=0
+exit
diff --git a/tests/generic/547.out b/tests/generic/547.out
new file mode 100644
index 00000000..0f6f1131
--- /dev/null
+++ b/tests/generic/547.out
@@ -0,0 +1,2 @@
+QA output created by 547
+OK
diff --git a/tests/generic/group b/tests/generic/group
index 47e81d96..49639fc9 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -549,3 +549,4 @@
 544 auto quick clone
 545 auto quick cap
 546 auto quick clone enospc log
+547 auto quick log
-- 
2.11.0



* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-15 15:02 [PATCH] fstests: generic, fsync fuzz tester with fsstress fdmanana
@ 2019-05-15 15:07 ` Vijay Chidambaram
  2019-05-16  8:09 ` Johannes Thumshirn
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Vijay Chidambaram @ 2019-05-15 15:07 UTC (permalink / raw)
  To: Filipe Manana; +Cc: fstests, linux-btrfs, linux-ext4, Jan Kara, Filipe Manana

On Wed, May 15, 2019 at 10:02 AM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> Run fsstress, fsync every file and directory, simulate a power failure and
> then verify the all files and directories exist, with the same data and
> metadata they had before the power failure.

I'm happy to see this sort of crash testing be merged into the Linux
kernel! I think something like this being run after every
merge/nightly build will make file systems significantly more robust
to crash-recovery bugs.


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-15 15:02 [PATCH] fstests: generic, fsync fuzz tester with fsstress fdmanana
  2019-05-15 15:07 ` Vijay Chidambaram
@ 2019-05-16  8:09 ` Johannes Thumshirn
  2019-05-16  9:28 ` Theodore Ts'o
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2019-05-16  8:09 UTC (permalink / raw)
  To: fdmanana; +Cc: fstests, linux-btrfs, linux-ext4, jack, Filipe Manana

On Wed, May 15, 2019 at 04:02:21PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
Nit:
> This tes has found already 2 bugs in btrfs, that caused mtime and ctime of
  test? ^

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-15 15:02 [PATCH] fstests: generic, fsync fuzz tester with fsstress fdmanana
  2019-05-15 15:07 ` Vijay Chidambaram
  2019-05-16  8:09 ` Johannes Thumshirn
@ 2019-05-16  9:28 ` Theodore Ts'o
  2019-05-16  9:54   ` Filipe Manana
  2019-05-17  3:42 ` Eryu Guan
  2019-05-17 15:34 ` [PATCH v2] " fdmanana
  4 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2019-05-16  9:28 UTC (permalink / raw)
  To: fdmanana; +Cc: fstests, linux-btrfs, linux-ext4, jack, Filipe Manana

On Wed, May 15, 2019 at 04:02:21PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Run fsstress, fsync every file and directory, simulate a power failure and
> then verify the all files and directories exist, with the same data and
> metadata they had before the power failure.
> 
> This tes has found already 2 bugs in btrfs, that caused mtime and ctime of
> directories not being preserved after replaying the log/journal and loss
> of a directory's attributes (such a UID and GID) after replaying the log.
> The patches that fix the btrfs issues are titled:
> 
>   "Btrfs: fix wrong ctime and mtime of a directory after log replay"
>   "Btrfs: fix fsync not persisting changed attributes of a directory"
> 
> Running this test 1000 times:
> 
> - on ext4 it has resulted in about a dozen journal checksum errors (on a
>   5.0 kernel) that resulted in failure to mount the filesystem after the
>   simulated power failure with dmflakey, which produces the following
>   error in dmesg/syslog:
> 
>     [Mon May 13 12:51:37 2019] JBD2: journal checksum error
>     [Mon May 13 12:51:37 2019] EXT4-fs (dm-0): error loading journal

I'm curious what configuration you used when you ran the test.  I
tried to reproduce it, and had no luck:

TESTRUNID: tytso-20190516042341
KERNEL:    kernel 5.1.0-rc3-xfstests-00034-g0c72924ef346 #999 SMP Wed May 15 00:56:08 EDT 2019 x86_64
CMDLINE:   -c 4k -C 1000 generic/547
CPUS:      2
MEM:       7680

ext4/4k: 1000 tests, 1855 seconds
Totals: 1000 tests, 0 skipped, 0 failures, 0 errors, 1855s

FSTESTPRJ: gce-xfstests
FSTESTVER: blktests baccddc (Wed, 13 Mar 2019 00:06:50 -0700)
FSTESTVER: fio  fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
FSTESTVER: fsverity bdebc45 (Wed, 5 Sep 2018 21:32:22 -0700)
FSTESTVER: ima-evm-utils 0267fa1 (Mon, 3 Dec 2018 06:11:35 -0500)
FSTESTVER: nvme-cli v1.7-35-g669d759 (Tue, 12 Mar 2019 11:22:16 -0600)
FSTESTVER: quota  62661bd (Tue, 2 Apr 2019 17:04:37 +0200)
FSTESTVER: stress-ng 7d0353cf (Sun, 20 Jan 2019 03:30:03 +0000)
FSTESTVER: syzkaller bab43553 (Fri, 15 Mar 2019 09:08:49 +0100)
FSTESTVER: xfsprogs v5.0.0 (Fri, 3 May 2019 12:14:36 -0500)
FSTESTVER: xfstests-bld 9582562 (Sun, 12 May 2019 00:38:51 -0400)
FSTESTVER: xfstests linux-v3.8-2390-g64233614 (Thu, 16 May 2019 00:12:52 -0400)
FSTESTCFG: 4k
FSTESTSET: generic/547
FSTESTOPT: count 1000 aex
GCE ID:    8592267165157073108


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-16  9:28 ` Theodore Ts'o
@ 2019-05-16  9:54   ` Filipe Manana
  2019-05-16 16:59     ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2019-05-16  9:54 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: fstests, linux-btrfs, linux-ext4, Jan Kara, Filipe Manana

On Thu, May 16, 2019 at 10:30 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Wed, May 15, 2019 at 04:02:21PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Run fsstress, fsync every file and directory, simulate a power failure and
> > then verify the all files and directories exist, with the same data and
> > metadata they had before the power failure.
> >
> > This tes has found already 2 bugs in btrfs, that caused mtime and ctime of
> > directories not being preserved after replaying the log/journal and loss
> > of a directory's attributes (such a UID and GID) after replaying the log.
> > The patches that fix the btrfs issues are titled:
> >
> >   "Btrfs: fix wrong ctime and mtime of a directory after log replay"
> >   "Btrfs: fix fsync not persisting changed attributes of a directory"
> >
> > Running this test 1000 times:
> >
> > - on ext4 it has resulted in about a dozen journal checksum errors (on a
> >   5.0 kernel) that resulted in failure to mount the filesystem after the
> >   simulated power failure with dmflakey, which produces the following
> >   error in dmesg/syslog:
> >
> >     [Mon May 13 12:51:37 2019] JBD2: journal checksum error
> >     [Mon May 13 12:51:37 2019] EXT4-fs (dm-0): error loading journal
>
> I'm curious what configuration you used when you ran the test.  I

Default configuration, MKFS_OPTIONS="" and MOUNT_OPTIONS="", 5.0 kernel.
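
For readers who want to mirror this setup, a default configuration of this kind roughly corresponds to an fstests local.config like the one below. This is a sketch only: the device paths and mount points are placeholders, not values taken from this thread, and FSTYP=ext4 is assumed because ext4 is the filesystem under discussion.

```shell
# Hypothetical fstests local.config. All device paths and mount points
# below are placeholders chosen for illustration.
export FSTYP=ext4
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch

# Empty options, matching the "default configuration" described above.
export MKFS_OPTIONS=""
export MOUNT_OPTIONS=""
```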

I have logs with all the fsstress seed values kept around.

From one of the failures, the .full file:

Discarding device blocks: done
Creating filesystem with 5242880 4k blocks and 1310720 inodes
Filesystem UUID: 4bb2559c-12ea-45fa-810e-00c513b00dee
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Running fsstress with arguments: -p 4 -n 100 -d /home/fdmanana/btrfs-tests/scratch_1/test -f mknod=0 -f symlink=0
seed = 1558078129
_check_generic_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.ext4 output ***
fsck from util-linux 2.29.2
e2fsck 1.43.4 (31-Jan-2017)
Journal superblock is corrupt.
Fix? no

fsck.ext4: The journal superblock is corrupt while checking journal for /dev/sdc
e2fsck: Cannot proceed with file system check

/dev/sdc: ********** WARNING: Filesystem still has errors **********

*** end fsck.ext4 output
*** mount output ***
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=788996k,mode=755)
/dev/sda1 on / type ext4 (rw,relatime,discard,errors=remount-ro)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=40,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=1624)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=788992k,mode=700,uid=1000,gid=1000)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
*** end mount output


Haven't tried ext4 with 1 process only (instead of 4), but I can try
to see if it happens without concurrency as well.

> tried to reproduce it, and had no luck:
>
> TESTRUNID: tytso-20190516042341
> KERNEL:    kernel 5.1.0-rc3-xfstests-00034-g0c72924ef346 #999 SMP Wed May 15 00:56:08 EDT 2019 x86_64
> CMDLINE:   -c 4k -C 1000 generic/547
> CPUS:      2
> MEM:       7680
>
> ext4/4k: 1000 tests, 1855 seconds
> Totals: 1000 tests, 0 skipped, 0 failures, 0 errors, 1855s
>
> FSTESTPRJ: gce-xfstests
> FSTESTVER: blktests baccddc (Wed, 13 Mar 2019 00:06:50 -0700)
> FSTESTVER: fio  fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
> FSTESTVER: fsverity bdebc45 (Wed, 5 Sep 2018 21:32:22 -0700)
> FSTESTVER: ima-evm-utils 0267fa1 (Mon, 3 Dec 2018 06:11:35 -0500)
> FSTESTVER: nvme-cli v1.7-35-g669d759 (Tue, 12 Mar 2019 11:22:16 -0600)
> FSTESTVER: quota  62661bd (Tue, 2 Apr 2019 17:04:37 +0200)
> FSTESTVER: stress-ng 7d0353cf (Sun, 20 Jan 2019 03:30:03 +0000)
> FSTESTVER: syzkaller bab43553 (Fri, 15 Mar 2019 09:08:49 +0100)
> FSTESTVER: xfsprogs v5.0.0 (Fri, 3 May 2019 12:14:36 -0500)
> FSTESTVER: xfstests-bld 9582562 (Sun, 12 May 2019 00:38:51 -0400)
> FSTESTVER: xfstests linux-v3.8-2390-g64233614 (Thu, 16 May 2019 00:12:52 -0400)
> FSTESTCFG: 4k
> FSTESTSET: generic/547
> FSTESTOPT: count 1000 aex
> GCE ID:    8592267165157073108


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-16  9:54   ` Filipe Manana
@ 2019-05-16 16:59     ` Theodore Ts'o
  2019-05-16 17:18       ` Filipe Manana
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2019-05-16 16:59 UTC (permalink / raw)
  To: Filipe Manana; +Cc: fstests, linux-btrfs, linux-ext4, Jan Kara, Filipe Manana

On Thu, May 16, 2019 at 10:54:57AM +0100, Filipe Manana wrote:
> 
> Haven't tried ext4 with 1 process only (instead of 4), but I can try
> to see if it happens without concurrency as well.

How many CPUs and how much memory were you using?  And I assume this
was using KVM/QEMU?  How was it configured?

Thanks,

					- Ted


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-16 16:59     ` Theodore Ts'o
@ 2019-05-16 17:18       ` Filipe Manana
  2019-05-17 15:33         ` Filipe Manana
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2019-05-16 17:18 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: fstests, linux-btrfs, linux-ext4, Jan Kara, Filipe Manana

On Thu, May 16, 2019 at 5:59 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Thu, May 16, 2019 at 10:54:57AM +0100, Filipe Manana wrote:
> >
> > Haven't tried ext4 with 1 process only (instead of 4), but I can try
> > to see if it happens without concurrency as well.
>
> How many CPU's and how much memory were you using?  And I assume this
> was using KVM/QEMU?  How was it configured?

Yep, kvm and qemu (3.0.0). The qemu config:

https://pastebin.com/KNigeXXq

TEST_DEV is the drive with ID "drive1" and SCRATCH_DEV is the drive
with ID "drive2".

The host has:

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
64 GB of RAM
crappy Seagate HDD:

Device Model:     ST3000DM008-2DM166
Serial Number:    Z5053T2R
LU WWN Device Id: 5 000c50 0a46f7ecb
Firmware Version: CC26
User Capacity:    3,000,592,982,016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

It hosts 3 qemu instances, all with the same configuration.

I left the test running earlier today for about 1 hour on ext4 with
only 1 fsstress process and didn't manage to reproduce it.
With 4 or more processes, those journal checksum failures happen sporadically.
I can leave it running with 1 process this evening and see what we get.
If it happens with 1 process, it should be trivial to reproduce
anywhere.

>
> Thanks,
>
>                                         - Ted


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-15 15:02 [PATCH] fstests: generic, fsync fuzz tester with fsstress fdmanana
                   ` (2 preceding siblings ...)
  2019-05-16  9:28 ` Theodore Ts'o
@ 2019-05-17  3:42 ` Eryu Guan
  2019-05-17 15:34 ` [PATCH v2] " fdmanana
  4 siblings, 0 replies; 10+ messages in thread
From: Eryu Guan @ 2019-05-17  3:42 UTC (permalink / raw)
  To: fdmanana; +Cc: fstests, linux-btrfs, linux-ext4, jack, Filipe Manana

On Wed, May 15, 2019 at 04:02:21PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Run fsstress, fsync every file and directory, simulate a power failure and
> then verify the all files and directories exist, with the same data and
> metadata they had before the power failure.
> 
> This tes has found already 2 bugs in btrfs, that caused mtime and ctime of
> directories not being preserved after replaying the log/journal and loss
> of a directory's attributes (such a UID and GID) after replaying the log.
> The patches that fix the btrfs issues are titled:
> 
>   "Btrfs: fix wrong ctime and mtime of a directory after log replay"
>   "Btrfs: fix fsync not persisting changed attributes of a directory"
> 
> Running this test 1000 times:
> 
> - on xfs, no issues were found
> 
> - on ext4 it has resulted in about a dozen journal checksum errors (on a
>   5.0 kernel) that resulted in failure to mount the filesystem after the
>   simulated power failure with dmflakey, which produces the following
>   error in dmesg/syslog:
> 
>     [Mon May 13 12:51:37 2019] JBD2: journal checksum error
>     [Mon May 13 12:51:37 2019] EXT4-fs (dm-0): error loading journal
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  tests/generic/547     | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/547.out |  2 ++
>  tests/generic/group   |  1 +
>  3 files changed, 75 insertions(+)
>  create mode 100755 tests/generic/547
>  create mode 100644 tests/generic/547.out
> 
> diff --git a/tests/generic/547 b/tests/generic/547
> new file mode 100755
> index 00000000..577b0e9b
> --- /dev/null
> +++ b/tests/generic/547
> @@ -0,0 +1,72 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test No. 547
> +#
> +# Run fsstress, fsync every file and directory, simulate a power failure and
> +# then verify the all files and directories exist, with the same data and
> +# metadata they had before the power failure.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	_cleanup_flakey
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch

As we save fssum file to $TEST_DIR, it'd be better to _require_test too.

> +_require_fssum
> +_require_dm_target flakey
> +
> +rm -f $seqres.full
> +
> +fssum_files_dir=$TEST_DIR/generic-test-$seq
> +rm -fr $fssum_files_dir
> +mkdir $fssum_files_dir
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_require_metadata_journaling $SCRATCH_DEV
> +_init_flakey
> +_mount_flakey
> +
> +mkdir $SCRATCH_MNT/test
> +args=`_scale_fsstress_args -p 4 -n 100 $FSSTRESS_AVOID -d $SCRATCH_MNT/test`
> +args="$args -f mknod=0 -f symlink=0"
> +echo "Running fsstress with arguments: $args" >>$seqres.full
> +$FSSTRESS_PROG $args >>$seqres.full
> +
> +# Fsync every file and directory.
> +find $SCRATCH_MNT/test -type f,d -exec $XFS_IO_PROG -c "fsync" {} \;

My 'find' on Fedora 29 vm (find (GNU findutils) 4.6.0) doesn't support
"-type f,d" syntax

find: Arguments to -type should contain only one letter

I have to change this to

find $SCRATCH_MNT/test \( -type f -o -type d \) -exec $XFS_IO_PROG -c "fsync" {} \;
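
The portable form can be tried outside fstests on a throwaway directory. The sketch below is only an illustration: the helper name list_files_and_dirs and the directory layout are made up for the demo, and it prints the matches instead of calling xfs_io on them.

```shell
#!/bin/bash
# List regular files and directories under a root using the grouped
# "-type f -o -type d" form, which does not rely on "-type f,d" support.
list_files_and_dirs()
{
	local root=$1
	find "$root" \( -type f -o -type d \) -print | sort
}

# Demo on a temporary directory (hypothetical layout, not an fsstress tree).
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
mkdir "$demo_dir/sub"
touch "$demo_dir/file" "$demo_dir/sub/nested"
ln -s file "$demo_dir/link"	# a symlink is neither type f nor type d

list_files_and_dirs "$demo_dir"
```

The symlink is left out of the listing, which matches the test's intent of fsyncing only regular files and directories.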

Otherwise looks good to me, thanks!

Eryu

> +# Compute a digest of the filesystem (using the test directory only, to skip
> +# fs specific directories such as "lost+found" on ext4 for example).
> +$FSSUM_PROG -A -f -w $fssum_files_dir/fs_digest $SCRATCH_MNT/test
> +
> +# Simulate a power failure and mount the filesystem to check that all files and
> +# directories exist and have all data and metadata preserved.
> +_flakey_drop_and_remount
> +
> +# Compute a new digest and compare it to the one we created previously, they
> +# must match.
> +$FSSUM_PROG -r $fssum_files_dir/fs_digest $SCRATCH_MNT/test
> +
> +_unmount_flakey
> +
> +status=0
> +exit
> diff --git a/tests/generic/547.out b/tests/generic/547.out
> new file mode 100644
> index 00000000..0f6f1131
> --- /dev/null
> +++ b/tests/generic/547.out
> @@ -0,0 +1,2 @@
> +QA output created by 547
> +OK
> diff --git a/tests/generic/group b/tests/generic/group
> index 47e81d96..49639fc9 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -549,3 +549,4 @@
>  544 auto quick clone
>  545 auto quick cap
>  546 auto quick clone enospc log
> +547 auto quick log
> -- 
> 2.11.0
> 


* Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress
  2019-05-16 17:18       ` Filipe Manana
@ 2019-05-17 15:33         ` Filipe Manana
  0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2019-05-17 15:33 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: fstests, linux-btrfs, linux-ext4, Jan Kara, Filipe Manana


On Thu, May 16, 2019 at 6:18 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Thu, May 16, 2019 at 5:59 PM Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > On Thu, May 16, 2019 at 10:54:57AM +0100, Filipe Manana wrote:
> > >
> > > Haven't tried ext4 with 1 process only (instead of 4), but I can try
> > > to see if it happens without concurrency as well.
> >
> > How many CPU's and how much memory were you using?  And I assume this
> > was using KVM/QEMU?  How was it configured?
>
> Yep, kvm and qemu (3.0.0). The qemu config:
>
> https://pastebin.com/KNigeXXq
>
> TEST_DEV is the drive with ID "drive1" and SCRATCH_DEV is the drive
> with ID "drive2".
>
> The host has:
>
> Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
> 64Gb of ram
> crappy seagate hdd:
>
> Device Model:     ST3000DM008-2DM166
> Serial Number:    Z5053T2R
> LU WWN Device Id: 5 000c50 0a46f7ecb
> Firmware Version: CC26
> User Capacity:    3,000,592,982,016 bytes [3,00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    7200 rpm
> Form Factor:      3.5 inches
> Device is:        Not in smartctl database [for details use: -P showall]
> ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
> SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
>
> It hosts 3 qemu instances, all with the same configuration.
>
> I left the test running earlier today for about 1 hour on ext4 with
> only 1 fsstress process. Didn't manage to reproduce.
> With 4 or more processes, those journal checksum failures happen sporadically.
> I can leave it running with 1 process during this evening and see what
> we get here, if it happens with 1 process, it should be trivial to
> reproduce anywhere.

OK, so I left it running overnight, for 17,000+ iterations. It failed
102 times with that journal corruption.
I changed the test to randomize the number of fsstress processes
between 1 and 8. I'm attaching the logs (.full, .out.bad and dmesg
files) in case you are interested in the seed values for fsstress.
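
A long soak run like this can be driven by a small wrapper that repeats a command and counts the failing runs. The following is a minimal sketch, not what was used for the numbers above: count_failures is a hypothetical helper, and it assumes the command reports failure through a non-zero exit status.

```shell
#!/bin/bash
# Run a command a fixed number of times and print how many runs failed.
# Assumes the command signals failure through a non-zero exit status.
count_failures()
{
	local iterations=$1
	shift
	local failures=0
	local i=0
	while [ "$i" -lt "$iterations" ]; do
		"$@" >/dev/null 2>&1 || failures=$((failures + 1))
		i=$((i + 1))
	done
	echo "$failures"
}

# Self-contained demo with a command that always fails.
echo "failures: $(count_failures 5 false)"

# Hypothetical use from the fstests directory (not executed here):
#   count_failures 1000 ./check generic/547
```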

So the test now does:

(...)
procs=$(( (RANDOM % 8) + 1 ))
args=`_scale_fsstress_args -p $procs -n 100 $FSSTRESS_AVOID -d $SCRATCH_MNT/test`
args="$args -f mknod=0 -f symlink=0"
echo "Running fsstress with arguments: $args" >>$seqres.full
(...)
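
The process-count expression in the snippet above can be checked in isolation. This is a small sketch assuming bash, whose RANDOM variable yields an integer from 0 to 32767; pick_procs is a hypothetical wrapper around the same arithmetic.

```shell
#!/bin/bash
# Return a pseudo-random integer in the range 1..max (default 8), using
# the same arithmetic as the modified test: (RANDOM % max) + 1.
pick_procs()
{
	local max=${1:-8}
	echo $(( (RANDOM % max) + 1 ))
}

procs=$(pick_procs 8)
echo "Would run fsstress with -p $procs"
```

Because RANDOM % 8 is always between 0 and 7, the result never falls outside 1..8, so a run with a single process stays possible.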

I verified no failures happened with only 1 process, and the more
processes are used, the more likely it is to hit the issue:

$ egrep -r 'Running fsstress with arguments: -p' . | cut -d ' ' -f 6 |
perl -n -e 'use Statistics::Histogram; @data = <>; chomp @data; print
get_histogram(\@data);'
Count: 102
Range:  2.000 -  8.000; Mean:  5.598; Median:  6.000; Stddev:  1.831
Percentiles:  90th:  8.000; 95th:  8.000; 99th:  8.000
   2.000 -    2.348:     5 ##############
   2.348 -    3.171:    13 ####################################
   3.171 -    4.196:    12 #################################
   4.196 -    5.473:    15 ##########################################
   5.473 -    6.225:    19 #####################################################
   6.225 -    7.064:    19 #####################################################
   7.064 -    8.000:    19 #####################################################
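
Where the Statistics::Histogram Perl module is not installed, a plain per-value tally gives a similar picture. The sketch below is an alternative, not the command used above: tally_by_procs is a hypothetical helper and the sample log lines are made up for the demo.

```shell
#!/bin/bash
# Read grep-style log lines on stdin, pull out the number that follows
# "-p" in the fsstress arguments, and count how often each value occurs.
tally_by_procs()
{
	sed -n 's/.*Running fsstress with arguments: -p \([0-9][0-9]*\).*/\1/p' |
		sort -n | uniq -c |
		awk '{ printf "procs=%s failures=%s\n", $2, $1 }'
}

# Made-up sample input in the shape produced by "grep -r" over .full files.
sample_lines='./a.full:Running fsstress with arguments: -p 6 -n 100 -d /mnt/scratch/test
./b.full:Running fsstress with arguments: -p 2 -n 100 -d /mnt/scratch/test
./c.full:Running fsstress with arguments: -p 6 -n 100 -d /mnt/scratch/test'

printf '%s\n' "$sample_lines" | tally_by_procs
```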

I also verified that picking one of the failing seeds, such as 1557322233
for 2 processes, and running the test with that seed 10 times didn't
reproduce the issue, so it does seem to be a race that causes the journal
corruption.
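
For anyone wanting to repeat this kind of seed replay, the seed can be pulled out of a .full file and handed back to fsstress through its -s option. This is a sketch under assumptions: extract_seed is a hypothetical helper, the log content is a shortened sample, and the rerun command is only printed, not executed. With more than one process the interleaving still varies between runs, so the same seed does not guarantee the same on-disk sequence.

```shell
#!/bin/bash
# Print the first fsstress seed recorded in a log file, taken from a
# line of the form "seed = <number>".
extract_seed()
{
	sed -n 's/^seed = \([0-9][0-9]*\)$/\1/p' "$1" | head -n 1
}

# Shortened sample log, modeled on the .full excerpt earlier in the thread.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat >"$log" <<'EOF'
Running fsstress with arguments: -p 4 -n 100 -d /mnt/scratch/test -f mknod=0 -f symlink=0
seed = 1558078129
EOF

seed=$(extract_seed "$log")
# Only print the command; the target directory is a placeholder.
echo "fsstress -s $seed -p 4 -n 100 -d /mnt/scratch/test -f mknod=0 -f symlink=0"
```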

I forgot to include it previously, but here is my kernel config in case it helps:
https://pastebin.com/LKvRcAW1

Thanks.

>
> >
> > Thanks,
> >
> >                                         - Ted

[-- Attachment #2: ext4_generic_547_logs.tar.xz --]
[-- Type: application/x-xz, Size: 14024 bytes --]


* [PATCH v2] fstests: generic, fsync fuzz tester with fsstress
  2019-05-15 15:02 [PATCH] fstests: generic, fsync fuzz tester with fsstress fdmanana
                   ` (3 preceding siblings ...)
  2019-05-17  3:42 ` Eryu Guan
@ 2019-05-17 15:34 ` fdmanana
  4 siblings, 0 replies; 10+ messages in thread
From: fdmanana @ 2019-05-17 15:34 UTC (permalink / raw)
  To: fstests; +Cc: linux-btrfs, linux-ext4, jack, guaneryu, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

Run fsstress, fsync every file and directory, simulate a power failure and
then verify that all files and directories exist, with the same data and
metadata they had before the power failure.

This test has already found 2 bugs in btrfs, which caused the mtime and
ctime of directories not to be preserved after replaying the log/journal,
and the loss of a directory's attributes (such as UID and GID) after
replaying the log.
The patches that fix the btrfs issues are titled:

  "Btrfs: fix wrong ctime and mtime of a directory after log replay"
  "Btrfs: fix fsync not persisting changed attributes of a directory"

Running this test 1000 times:

- on xfs, no issues were found

- on ext4 it has resulted in about a dozen journal checksum errors (on a
  5.0 kernel) that resulted in failure to mount the filesystem after the
  simulated power failure with dmflakey, which produces the following
  error in dmesg/syslog:

    [Mon May 13 12:51:37 2019] JBD2: journal checksum error
    [Mon May 13 12:51:37 2019] EXT4-fs (dm-0): error loading journal

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Fixed a few typos in the changelog, added the missing _require_test and
    changed the find command to replace '-type f,d' with
    '\( -type f -o -type d \)', since not all versions of the find utility
    accept the former syntax.

 tests/generic/547     | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/547.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 76 insertions(+)
 create mode 100755 tests/generic/547
 create mode 100644 tests/generic/547.out

diff --git a/tests/generic/547 b/tests/generic/547
new file mode 100755
index 00000000..bcce8fe8
--- /dev/null
+++ b/tests/generic/547
@@ -0,0 +1,73 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test No. 547
+#
+# Run fsstress, fsync every file and directory, simulate a power failure and
+# then verify that all files and directories exist, with the same data and
+# metadata they had before the power failure.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+fssum_files_dir=$TEST_DIR/generic-test-$seq
+rm -fr $fssum_files_dir
+mkdir $fssum_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+mkdir $SCRATCH_MNT/test
+args=`_scale_fsstress_args -p 4 -n 100 $FSSTRESS_AVOID -d $SCRATCH_MNT/test`
+args="$args -f mknod=0 -f symlink=0"
+echo "Running fsstress with arguments: $args" >>$seqres.full
+$FSSTRESS_PROG $args >>$seqres.full
+
+# Fsync every file and directory.
+find $SCRATCH_MNT/test \( -type f -o -type d \) -exec $XFS_IO_PROG -c fsync {} \;
+
+# Compute a digest of the filesystem (using the test directory only, to skip
+# fs specific directories such as "lost+found" on ext4 for example).
+$FSSUM_PROG -A -f -w $fssum_files_dir/fs_digest $SCRATCH_MNT/test
+
+# Simulate a power failure and mount the filesystem to check that all files and
+# directories exist and have all data and metadata preserved.
+_flakey_drop_and_remount
+
+# Compute a new digest and compare it to the one we created previously, they
+# must match.
+$FSSUM_PROG -r $fssum_files_dir/fs_digest $SCRATCH_MNT/test
+
+_unmount_flakey
+
+status=0
+exit
diff --git a/tests/generic/547.out b/tests/generic/547.out
new file mode 100644
index 00000000..0f6f1131
--- /dev/null
+++ b/tests/generic/547.out
@@ -0,0 +1,2 @@
+QA output created by 547
+OK
diff --git a/tests/generic/group b/tests/generic/group
index 47e81d96..49639fc9 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -549,3 +549,4 @@
 544 auto quick clone
 545 auto quick cap
 546 auto quick clone enospc log
+547 auto quick log
-- 
2.11.0



