All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
@ 2017-08-27 10:44 Amir Goldstein
  2017-09-25  9:49   ` Xiao Yang
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-08-27 10:44 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Eryu Guan, Josef Bacik, fstests, linux-ext4

This test is motivated by a bug found in ext4 during random crash
consistency tests.

This test uses device mapper flakey target to demonstrate the bug
found using device mapper log-writes target.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Ted,

While working on crash consistency xfstests [1], I stubmled on what
appeared to be an ext4 crash consistency bug.

The tests I used rely on the log-writes dm target code written
by Josef Bacik, which had little exposure to the wide community
as far as I know.  I wanted to prove to myself that the found
inconsistency was not due to a test bug, so I bisected the failed
test to the minimal operations that trigger the failure and wrote
a small independent test to reproduce the issue using dm flakey target.

The following fsck error is reliably reproduced by replaying some fsx ops
on overlapping file regions, then emulating a crash, followed by mount,
umount and fsck -nf:

  ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
  1 write 0x137dd thru    0x21445 (0xdc69 bytes)
  2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
  3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
  4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
  5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
  6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
  All 7 operations completed A-OK!
  _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
  *** fsck.ext4 output ***
  fsck from util-linux 2.27.1
  e2fsck 1.42.13 (17-May-2015)
  Pass 1: Checking inodes, blocks, and sizes
  Inode 12, end of extent exceeds allowed value
          (logical block 33, physical block 33441, len 7)
  Clear? no
  Inode 12, i_blocks is 184, should be 128.  Fix? no

Note that the inconsistency is "applied" by journal replay during mount.
fsck -nf before mount does not report any errors.

I did not intend for this test to be merged as is, but rather to be used
by ext4 developers to analyze the problem and then re-write the test with
more comments and less arbitrary offset/length values.

P.S.: crash consistency tests also reliably reproduce a btrfs fsck error.
      a detailed report with I/O recording was sent to Josef.
P.S.2: crash consistency tests report file data checksum errors on xfs
       after fsync+crash, but I still need to prove the reliability of
       these reports.
 
[1] https://github.com/amir73il/xfstests/commits/dm-log-writes

 tests/generic/501     | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/501.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 83 insertions(+)
 create mode 100755 tests/generic/501
 create mode 100644 tests/generic/501.out

diff --git a/tests/generic/501 b/tests/generic/501
new file mode 100755
index 0000000..ccb513d
--- /dev/null
+++ b/tests/generic/501
@@ -0,0 +1,80 @@
+#! /bin/bash
+# FS QA Test No. 501
+#
+# This test is motivated by a bug found in ext4 during random crash
+# consistency tests.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2017 CTERA Networks. All Rights Reserved.
+# Author: Amir Goldstein <amir73il@gmail.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+
+_init_flakey
+_mount_flakey
+
+fsxops=$tmp.fsxops
+cat <<EOF > $fsxops
+write 0x137dd 0xdc69 0x0
+fallocate 0xb531 0xb5ad 0x21446
+collapse_range 0x1c000 0x4000 0x21446
+write 0x3e5ec 0x1a14 0x21446
+zero_range 0x20fac 0x6d9c 0x40000 keep_size
+mapwrite 0x216ad 0x274f 0x40000
+EOF
+run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile
+
+_flakey_drop_and_remount
+_unmount_flakey
+_cleanup_flakey
+_check_scratch_fs
+
+echo "Silence is golden"
+
+status=0
+exit
diff --git a/tests/generic/501.out b/tests/generic/501.out
new file mode 100644
index 0000000..00133b6
--- /dev/null
+++ b/tests/generic/501.out
@@ -0,0 +1,2 @@
+QA output created by 501
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 2396b72..bb870f2 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -454,3 +454,4 @@
 449 auto quick acl enospc
 450 auto quick rw
 500 auto log replay
+501 auto quick metadata
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-08-27 10:44 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Amir Goldstein
@ 2017-09-25  9:49   ` Xiao Yang
  0 siblings, 0 replies; 31+ messages in thread
From: Xiao Yang @ 2017-09-25  9:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Ts'o, Eryu Guan, Josef Bacik, fstests, linux-ext4

On 2017/08/27 18:44, Amir Goldstein wrote:
> This test is motivated by a bug found in ext4 during random crash
> consistency tests.
>
> This test uses device mapper flakey target to demonstrate the bug
> found using device mapper log-writes target.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Ted,
>
> While working on crash consistency xfstests [1], I stubmled on what
> appeared to be an ext4 crash consistency bug.
>
> The tests I used rely on the log-writes dm target code written
> by Josef Bacik, which had little exposure to the wide community
> as far as I know.  I wanted to prove to myself that the found
> inconsistency was not due to a test bug, so I bisected the failed
> test to the minimal operations that trigger the failure and wrote
> a small independent test to reproduce the issue using dm flakey target.
>
> The following fsck error is reliably reproduced by replaying some fsx ops
> on overlapping file regions, then emulating a crash, followed by mount,
> umount and fsck -nf:
>
>   ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>   1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>   2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>   3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>   4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>   5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>   6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>   All 7 operations completed A-OK!
>   _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
>   *** fsck.ext4 output ***
>   fsck from util-linux 2.27.1
>   e2fsck 1.42.13 (17-May-2015)
>   Pass 1: Checking inodes, blocks, and sizes
>   Inode 12, end of extent exceeds allowed value
>           (logical block 33, physical block 33441, len 7)
>   Clear? no
>   Inode 12, i_blocks is 184, should be 128.  Fix? no
Hi Amir,

I always get the following output when running your xfstests test case 501.
---------------------------------------------------------------------------
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 12, i_size is 147456, should be 163840. Fix? no
---------------------------------------------------------------------------

Could you tell me how to get the expected output as you reported?

Thanks,
Xiao Yang
> Note that the inconsistency is "applied" by journal replay during mount.
> fsck -nf before mount does not report any errors.
>
> I did not intend for this test to be merged as is, but rather to be used
> by ext4 developers to analyze the problem and then re-write the test with
> more comments and less arbitrary offset/length values.
>
> P.S.: crash consistency tests also reliably reproduce a btrfs fsck error.
>       a detailed report with I/O recording was sent to Josef.
> P.S.2: crash consistency tests report file data checksum errors on xfs
>        after fsync+crash, but I still need to prove the reliability of
>        these reports.
>  
> [1] https://github.com/amir73il/xfstests/commits/dm-log-writes
>
>  tests/generic/501     | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/501.out |  2 ++
>  tests/generic/group   |  1 +
>  3 files changed, 83 insertions(+)
>  create mode 100755 tests/generic/501
>  create mode 100644 tests/generic/501.out
>
> diff --git a/tests/generic/501 b/tests/generic/501
> new file mode 100755
> index 0000000..ccb513d
> --- /dev/null
> +++ b/tests/generic/501
> @@ -0,0 +1,80 @@
> +#! /bin/bash
> +# FS QA Test No. 501
> +#
> +# This test is motivated by a bug found in ext4 during random crash
> +# consistency tests.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (C) 2017 CTERA Networks. All Rights Reserved.
> +# Author: Amir Goldstein <amir73il@gmail.com>
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +
> +_cleanup()
> +{
> +	_cleanup_flakey
> +	cd /
> +	rm -f $tmp.*
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +_require_metadata_journaling $SCRATCH_DEV
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >> $seqres.full 2>&1
> +
> +_init_flakey
> +_mount_flakey
> +
> +fsxops=$tmp.fsxops
> +cat <<EOF > $fsxops
> +write 0x137dd 0xdc69 0x0
> +fallocate 0xb531 0xb5ad 0x21446
> +collapse_range 0x1c000 0x4000 0x21446
> +write 0x3e5ec 0x1a14 0x21446
> +zero_range 0x20fac 0x6d9c 0x40000 keep_size
> +mapwrite 0x216ad 0x274f 0x40000
> +EOF
> +run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile
> +
> +_flakey_drop_and_remount
> +_unmount_flakey
> +_cleanup_flakey
> +_check_scratch_fs
> +
> +echo "Silence is golden"
> +
> +status=0
> +exit
> diff --git a/tests/generic/501.out b/tests/generic/501.out
> new file mode 100644
> index 0000000..00133b6
> --- /dev/null
> +++ b/tests/generic/501.out
> @@ -0,0 +1,2 @@
> +QA output created by 501
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index 2396b72..bb870f2 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -454,3 +454,4 @@
>  449 auto quick acl enospc
>  450 auto quick rw
>  500 auto log replay
> +501 auto quick metadata




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
@ 2017-09-25  9:49   ` Xiao Yang
  0 siblings, 0 replies; 31+ messages in thread
From: Xiao Yang @ 2017-09-25  9:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Ts'o, Eryu Guan, Josef Bacik, fstests, linux-ext4

On 2017/08/27 18:44, Amir Goldstein wrote:
> This test is motivated by a bug found in ext4 during random crash
> consistency tests.
>
> This test uses device mapper flakey target to demonstrate the bug
> found using device mapper log-writes target.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Ted,
>
> While working on crash consistency xfstests [1], I stubmled on what
> appeared to be an ext4 crash consistency bug.
>
> The tests I used rely on the log-writes dm target code written
> by Josef Bacik, which had little exposure to the wide community
> as far as I know.  I wanted to prove to myself that the found
> inconsistency was not due to a test bug, so I bisected the failed
> test to the minimal operations that trigger the failure and wrote
> a small independent test to reproduce the issue using dm flakey target.
>
> The following fsck error is reliably reproduced by replaying some fsx ops
> on overlapping file regions, then emulating a crash, followed by mount,
> umount and fsck -nf:
>
>   ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>   1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>   2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>   3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>   4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>   5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>   6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>   All 7 operations completed A-OK!
>   _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
>   *** fsck.ext4 output ***
>   fsck from util-linux 2.27.1
>   e2fsck 1.42.13 (17-May-2015)
>   Pass 1: Checking inodes, blocks, and sizes
>   Inode 12, end of extent exceeds allowed value
>           (logical block 33, physical block 33441, len 7)
>   Clear? no
>   Inode 12, i_blocks is 184, should be 128.  Fix? no
Hi Amir,

I always get the following output when running your xfstests test case 501.
---------------------------------------------------------------------------
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 12, i_size is 147456, should be 163840. Fix? no
---------------------------------------------------------------------------

Could you tell me how to get the expected output as you reported?

Thanks,
Xiao Yang
> Note that the inconsistency is "applied" by journal replay during mount.
> fsck -nf before mount does not report any errors.
>
> I did not intend for this test to be merged as is, but rather to be used
> by ext4 developers to analyze the problem and then re-write the test with
> more comments and less arbitrary offset/length values.
>
> P.S.: crash consistency tests also reliably reproduce a btrfs fsck error.
>       a detailed report with I/O recording was sent to Josef.
> P.S.2: crash consistency tests report file data checksum errors on xfs
>        after fsync+crash, but I still need to prove the reliability of
>        these reports.
>  
> [1] https://github.com/amir73il/xfstests/commits/dm-log-writes
>
>  tests/generic/501     | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/501.out |  2 ++
>  tests/generic/group   |  1 +
>  3 files changed, 83 insertions(+)
>  create mode 100755 tests/generic/501
>  create mode 100644 tests/generic/501.out
>
> diff --git a/tests/generic/501 b/tests/generic/501
> new file mode 100755
> index 0000000..ccb513d
> --- /dev/null
> +++ b/tests/generic/501
> @@ -0,0 +1,80 @@
> +#! /bin/bash
> +# FS QA Test No. 501
> +#
> +# This test is motivated by a bug found in ext4 during random crash
> +# consistency tests.
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (C) 2017 CTERA Networks. All Rights Reserved.
> +# Author: Amir Goldstein <amir73il@gmail.com>
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +
> +_cleanup()
> +{
> +	_cleanup_flakey
> +	cd /
> +	rm -f $tmp.*
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +_require_metadata_journaling $SCRATCH_DEV
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >> $seqres.full 2>&1
> +
> +_init_flakey
> +_mount_flakey
> +
> +fsxops=$tmp.fsxops
> +cat <<EOF > $fsxops
> +write 0x137dd 0xdc69 0x0
> +fallocate 0xb531 0xb5ad 0x21446
> +collapse_range 0x1c000 0x4000 0x21446
> +write 0x3e5ec 0x1a14 0x21446
> +zero_range 0x20fac 0x6d9c 0x40000 keep_size
> +mapwrite 0x216ad 0x274f 0x40000
> +EOF
> +run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile
> +
> +_flakey_drop_and_remount
> +_unmount_flakey
> +_cleanup_flakey
> +_check_scratch_fs
> +
> +echo "Silence is golden"
> +
> +status=0
> +exit
> diff --git a/tests/generic/501.out b/tests/generic/501.out
> new file mode 100644
> index 0000000..00133b6
> --- /dev/null
> +++ b/tests/generic/501.out
> @@ -0,0 +1,2 @@
> +QA output created by 501
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index 2396b72..bb870f2 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -454,3 +454,4 @@
>  449 auto quick acl enospc
>  450 auto quick rw
>  500 auto log replay
> +501 auto quick metadata

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-09-25  9:49   ` Xiao Yang
  (?)
@ 2017-09-25 10:53   ` Amir Goldstein
  2017-09-26 10:45     ` Xiao Yang
  2017-09-30 14:15     ` Ashlie Martinez
  -1 siblings, 2 replies; 31+ messages in thread
From: Amir Goldstein @ 2017-09-25 10:53 UTC (permalink / raw)
  To: Xiao Yang; +Cc: Theodore Ts'o, Eryu Guan, Josef Bacik, fstests, Ext4

On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
> On 2017/08/27 18:44, Amir Goldstein wrote:
>> This test is motivated by a bug found in ext4 during random crash
>> consistency tests.
>>
>> This test uses device mapper flakey target to demonstrate the bug
>> found using device mapper log-writes target.
>>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>
>> Ted,
>>
>> While working on crash consistency xfstests [1], I stubmled on what
>> appeared to be an ext4 crash consistency bug.
>>
>> The tests I used rely on the log-writes dm target code written
>> by Josef Bacik, which had little exposure to the wide community
>> as far as I know.  I wanted to prove to myself that the found
>> inconsistency was not due to a test bug, so I bisected the failed
>> test to the minimal operations that trigger the failure and wrote
>> a small independent test to reproduce the issue using dm flakey target.
>>
>> The following fsck error is reliably reproduced by replaying some fsx ops
>> on overlapping file regions, then emulating a crash, followed by mount,
>> umount and fsck -nf:
>>
>>   ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>   1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>   2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>   3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>   4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>   5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>   6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>   All 7 operations completed A-OK!
>>   _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
>>   *** fsck.ext4 output ***
>>   fsck from util-linux 2.27.1
>>   e2fsck 1.42.13 (17-May-2015)
>>   Pass 1: Checking inodes, blocks, and sizes
>>   Inode 12, end of extent exceeds allowed value
>>           (logical block 33, physical block 33441, len 7)
>>   Clear? no
>>   Inode 12, i_blocks is 184, should be 128.  Fix? no
> Hi Amir,
>
> I always get the following output when running your xfstests test case 501.

Now merged as test generic/456

> ---------------------------------------------------------------------------
> e2fsck 1.42.9 (28-Dec-2013)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 12, i_size is 147456, should be 163840. Fix? no
> ---------------------------------------------------------------------------
>
> Could you tell me how to get the expected output as you reported?

I can't say I am doing anything special, but I can say that I get the
same output as you did when running the test inside kvm-xfstests.
Actually, I could not reproduce ANY of the the crash consistency bugs
inside kvm-xfstests. Must be something to do with different timing of
IO with KVM+virtio disks??

When running on my laptop (Ubuntu 16.04 with latest kernel)
on a 10G SSD volume, I always get the error reported above.
I just re-verified with latest stable e2fsprogs (1.43.6).

Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-09-25 10:53   ` Amir Goldstein
@ 2017-09-26 10:45     ` Xiao Yang
  2017-09-26 11:48       ` Amir Goldstein
  2017-09-30 14:15     ` Ashlie Martinez
  1 sibling, 1 reply; 31+ messages in thread
From: Xiao Yang @ 2017-09-26 10:45 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Theodore Ts'o, Eryu Guan, Josef Bacik, fstests, Ext4

On 2017/09/25 18:53, Amir Goldstein wrote:
> On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang<yangx.jy@cn.fujitsu.com>  wrote:
>> On 2017/08/27 18:44, Amir Goldstein wrote:
>>> This test is motivated by a bug found in ext4 during random crash
>>> consistency tests.
>>>
>>> This test uses device mapper flakey target to demonstrate the bug
>>> found using device mapper log-writes target.
>>>
>>> Signed-off-by: Amir Goldstein<amir73il@gmail.com>
>>> ---
>>>
>>> Ted,
>>>
>>> While working on crash consistency xfstests [1], I stubmled on what
>>> appeared to be an ext4 crash consistency bug.
>>>
>>> The tests I used rely on the log-writes dm target code written
>>> by Josef Bacik, which had little exposure to the wide community
>>> as far as I know.  I wanted to prove to myself that the found
>>> inconsistency was not due to a test bug, so I bisected the failed
>>> test to the minimal operations that trigger the failure and wrote
>>> a small independent test to reproduce the issue using dm flakey target.
>>>
>>> The following fsck error is reliably reproduced by replaying some fsx ops
>>> on overlapping file regions, then emulating a crash, followed by mount,
>>> umount and fsck -nf:
>>>
>>>    ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>>    1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>>    2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>>    3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>>    4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>>    5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>>    6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>>    All 7 operations completed A-OK!
>>>    _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
>>>    *** fsck.ext4 output ***
>>>    fsck from util-linux 2.27.1
>>>    e2fsck 1.42.13 (17-May-2015)
>>>    Pass 1: Checking inodes, blocks, and sizes
>>>    Inode 12, end of extent exceeds allowed value
>>>            (logical block 33, physical block 33441, len 7)
>>>    Clear? no
>>>    Inode 12, i_blocks is 184, should be 128.  Fix? no
>> Hi Amir,
>>
>> I always get the following output when running your xfstests test case 501.
> Now merged as test generic/456
>
>> ---------------------------------------------------------------------------
>> e2fsck 1.42.9 (28-Dec-2013)
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 12, i_size is 147456, should be 163840. Fix? no
>> ---------------------------------------------------------------------------
>>
>> Could you tell me how to get the expected output as you reported?
> I can't say I am doing anything special, but I can say that I get the
> same output as you did when running the test inside kvm-xfstests.
> Actually, I could not reproduce ANY of the the crash consistency bugs
> inside kvm-xfstests. Must be something to do with different timing of
> IO with KVM+virtio disks??
>
> When running on my laptop (Ubuntu 16.04 with latest kernel)
> on a 10G SSD volume, I always get the error reported above.
> I just re-verified with latest stable e2fsprogs (1.43.6).
Hi Amir,

I tested generic/456 with KVM+virtio disks and SATA volumes on some 
kernels (including
v3.10.0, the latest kernel), but i still got the same output as i reported.

Could you determine whether the two different outputs are caused by the 
same bug
or not ?

Thanks,
Xiao Yang.
> Amir.
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> .
>




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-09-26 10:45     ` Xiao Yang
@ 2017-09-26 11:48       ` Amir Goldstein
  0 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2017-09-26 11:48 UTC (permalink / raw)
  To: Xiao Yang; +Cc: Theodore Ts'o, Eryu Guan, Josef Bacik, fstests, Ext4

On Tue, Sep 26, 2017 at 1:45 PM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
> On 2017/09/25 18:53, Amir Goldstein wrote:
>>
>> On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang<yangx.jy@cn.fujitsu.com>
>> wrote:
>>>
>>> On 2017/08/27 18:44, Amir Goldstein wrote:
>>>>
>>>> This test is motivated by a bug found in ext4 during random crash
>>>> consistency tests.
>>>>
>>>> This test uses device mapper flakey target to demonstrate the bug
>>>> found using device mapper log-writes target.
>>>>
>>>> Signed-off-by: Amir Goldstein<amir73il@gmail.com>
>>>> ---
>>>>
>>>> Ted,
>>>>
>>>> While working on crash consistency xfstests [1], I stubmled on what
>>>> appeared to be an ext4 crash consistency bug.
>>>>
>>>> The tests I used rely on the log-writes dm target code written
>>>> by Josef Bacik, which had little exposure to the wide community
>>>> as far as I know.  I wanted to prove to myself that the found
>>>> inconsistency was not due to a test bug, so I bisected the failed
>>>> test to the minimal operations that trigger the failure and wrote
>>>> a small independent test to reproduce the issue using dm flakey target.
>>>>
>>>> The following fsck error is reliably reproduced by replaying some fsx
>>>> ops
>>>> on overlapping file regions, then emulating a crash, followed by mount,
>>>> umount and fsck -nf:
>>>>
>>>>    ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>>>    1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>>>    2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>>>    3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>>>    4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>>>    5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>>>    6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>>>    All 7 operations completed A-OK!
>>>>    _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is
>>>> inconsistent
>>>>    *** fsck.ext4 output ***
>>>>    fsck from util-linux 2.27.1
>>>>    e2fsck 1.42.13 (17-May-2015)
>>>>    Pass 1: Checking inodes, blocks, and sizes
>>>>    Inode 12, end of extent exceeds allowed value
>>>>            (logical block 33, physical block 33441, len 7)
>>>>    Clear? no
>>>>    Inode 12, i_blocks is 184, should be 128.  Fix? no
>>>
>>> Hi Amir,
>>>
>>> I always get the following output when running your xfstests test case
>>> 501.
>>
>> Now merged as test generic/456
>>
>>>
>>> ---------------------------------------------------------------------------
>>> e2fsck 1.42.9 (28-Dec-2013)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Inode 12, i_size is 147456, should be 163840. Fix? no
>>>
>>> ---------------------------------------------------------------------------
>>>
>>> Could you tell me how to get the expected output as you reported?
>>
>> I can't say I am doing anything special, but I can say that I get the
>> same output as you did when running the test inside kvm-xfstests.
>> Actually, I could not reproduce ANY of the the crash consistency bugs
>> inside kvm-xfstests. Must be something to do with different timing of
>> IO with KVM+virtio disks??
>>
>> When running on my laptop (Ubuntu 16.04 with latest kernel)
>> on a 10G SSD volume, I always get the error reported above.
>> I just re-verified with latest stable e2fsprogs (1.43.6).
>
> Hi Amir,
>
> I tested generic/456 with KVM+virtio disks and SATA volumes on some kernels

I don't understand. Did you also test without KVM?
Otherwise I suggest that you test without KVM/virtio.

> (including
> v3.10.0, the latest kernel), but i still got the same output as i reported.
>
> Could you determine whether the two different outputs are caused by the same
> bug
> or not ?

No idea if those are 2 symptoms of the same bug or 2 different bugs
I did not investigate the root cause.

Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-09-25 10:53   ` Amir Goldstein
  2017-09-26 10:45     ` Xiao Yang
@ 2017-09-30 14:15     ` Ashlie Martinez
  2017-10-05  7:27       ` Xiao Yang
  1 sibling, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-09-30 14:15 UTC (permalink / raw)
  To: Xiao Yang
  Cc: Amir Goldstein, Theodore Ts'o, Eryu Guan, Josef Bacik,
	fstests, Ext4, Vijay Chidambaram

Hi Xiao,

I am a student at the University of Texas at Austin. Some researchers
in the computer science department at UT, myself included, have
recently been working to develop a file system crash consistency test
harness called CrashMonkey [1][2]. I have been working on the
CrashMonkey project since it was started late last year. With
CrashMonkey we have also been able to reproduce the incorrect i_size
error you noted but we have not been able to reproduce the other
output that Amir found. CrashMonkey works by logging and replaying
operations for a workload, so it should not be sensitive to
differences in timing that could be caused by things like KVM+virtio.
I also did a few experiments with Amir's new xfstests test 456 (both
with and without KVM and virtio) and I was unable to reproduce the
output noted in the xfstest. I have not spent a lot of time looking
into the cause of the bug that Amir found and it is rather unfortunate
that I was unable to reproduce it with either xfstests or CrashMonkey.

At any rate, CrashMonkey is still under development, so it does have
some caveats. First, we are running with a fixed random seed in our
default RandomPermuter (used to generate crash states) to aid
development. Second, the branch with the reproduction of this ext4
regression bug in CrashMonkey [3] will yield a few false positives due
to the way CrashMonkey works and how fsx runs. These false positives
are due to CrashMonkey generating crash states where the directories
for files used for the test have not be fsync-ed in the file system.
The top of the README in the CrashMonkey branch with this bug
reproduction outlines how we determined these were false positives

[1] https://github.com/utsaslab/crashmonkey
[2] https://www.usenix.org/conference/hotstorage17/program/presentation/martinez
[3] https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug


On Mon, Sep 25, 2017 at 5:53 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
>> On 2017/08/27 18:44, Amir Goldstein wrote:
>>> This test is motivated by a bug found in ext4 during random crash
>>> consistency tests.
>>>
>>> This test uses device mapper flakey target to demonstrate the bug
>>> found using device mapper log-writes target.
>>>
>>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>>> ---
>>>
>>> Ted,
>>>
>>> While working on crash consistency xfstests [1], I stubmled on what
>>> appeared to be an ext4 crash consistency bug.
>>>
>>> The tests I used rely on the log-writes dm target code written
>>> by Josef Bacik, which had little exposure to the wide community
>>> as far as I know.  I wanted to prove to myself that the found
>>> inconsistency was not due to a test bug, so I bisected the failed
>>> test to the minimal operations that trigger the failure and wrote
>>> a small independent test to reproduce the issue using dm flakey target.
>>>
>>> The following fsck error is reliably reproduced by replaying some fsx ops
>>> on overlapping file regions, then emulating a crash, followed by mount,
>>> umount and fsck -nf:
>>>
>>>   ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>>   1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>>   2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>>   3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>>   4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>>   5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>>   6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>>   All 7 operations completed A-OK!
>>>   _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
>>>   *** fsck.ext4 output ***
>>>   fsck from util-linux 2.27.1
>>>   e2fsck 1.42.13 (17-May-2015)
>>>   Pass 1: Checking inodes, blocks, and sizes
>>>   Inode 12, end of extent exceeds allowed value
>>>           (logical block 33, physical block 33441, len 7)
>>>   Clear? no
>>>   Inode 12, i_blocks is 184, should be 128.  Fix? no
>> Hi Amir,
>>
>> I always get the following output when running your xfstests test case 501.
>
> Now merged as test generic/456
>
>> ---------------------------------------------------------------------------
>> e2fsck 1.42.9 (28-Dec-2013)
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 12, i_size is 147456, should be 163840. Fix? no
>> ---------------------------------------------------------------------------
>>
>> Could you tell me how to get the expected output as you reported?
>
> I can't say I am doing anything special, but I can say that I get the
> same output as you did when running the test inside kvm-xfstests.
> Actually, I could not reproduce ANY of the the crash consistency bugs
> inside kvm-xfstests. Must be something to do with different timing of
> IO with KVM+virtio disks??
>
> When running on my laptop (Ubuntu 16.04 with latest kernel)
> on a 10G SSD volume, I always get the error reported above.
> I just re-verified with latest stable e2fsprogs (1.43.6).
>
> Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-09-30 14:15     ` Ashlie Martinez
@ 2017-10-05  7:27       ` Xiao Yang
  2017-10-05 15:04         ` Ashlie Martinez
  0 siblings, 1 reply; 31+ messages in thread
From: Xiao Yang @ 2017-10-05  7:27 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Amir Goldstein, Theodore Ts'o, Eryu Guan, Josef Bacik,
	fstests, Ext4, Vijay Chidambaram

On 2017/09/30 22:15, Ashlie Martinez wrote:
> Hi Xiao,
>
> I am a student at the University of Texas at Austin. Some researchers
> in the computer science department at UT, myself included, have
> recently been working to develop a file system crash consistency test
> harness called CrashMonkey [1][2]. I have been working on the
> CrashMonkey project since it was started late last year. With
> CrashMonkey we have also been able to reproduce the incorrect i_size
> error you noted but we have not been able to reproduce the other
> output that Amir found. CrashMonkey works by logging and replaying
> operations for a workload, so it should not be sensitive to
> differences in timing that could be caused by things like KVM+virtio.
> I also did a few experiments with Amir's new xfstests test 456 (both
> with and without KVM and virtio) and I was unable to reproduce the
> output noted in the xfstest. I have not spent a lot of time looking
> into the cause of the bug that Amir found and it is rather unfortunate
> that I was unable to reproduce it with either xfstests or CrashMonkey.
Hi Ashlie,

Thanks for your detailed comments.

1) Do you think the output that Amir noted in xfstests is a false positive?

2) About the output that both i and you reproduced,  did you look into 
it and find its root cause?

Thanks,
Xiao Yang
> At any rate, CrashMonkey is still under development, so it does have
> some caveats. First, we are running with a fixed random seed in our
> default RandomPermuter (used to generate crash states) to aid
> development. Second, the branch with the reproduction of this ext4
> regression bug in CrashMonkey [3] will yield a few false positives due
> to the way CrashMonkey works and how fsx runs. These false positives
> are due to CrashMonkey generating crash states where the directories
> for files used for the test have not be fsync-ed in the file system.
> The top of the README in the CrashMonkey branch with this bug
> reproduction outlines how we determined these were false positives
>
> [1] https://github.com/utsaslab/crashmonkey
> [2] https://www.usenix.org/conference/hotstorage17/program/presentation/martinez
> [3] https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug
>
>
> On Mon, Sep 25, 2017 at 5:53 AM, Amir Goldstein<amir73il@gmail.com>  wrote:
>> On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang<yangx.jy@cn.fujitsu.com>  wrote:
>>> On 2017/08/27 18:44, Amir Goldstein wrote:
>>>> This test is motivated by a bug found in ext4 during random crash
>>>> consistency tests.
>>>>
>>>> This test uses device mapper flakey target to demonstrate the bug
>>>> found using device mapper log-writes target.
>>>>
>>>> Signed-off-by: Amir Goldstein<amir73il@gmail.com>
>>>> ---
>>>>
>>>> Ted,
>>>>
>>>> While working on crash consistency xfstests [1], I stubmled on what
>>>> appeared to be an ext4 crash consistency bug.
>>>>
>>>> The tests I used rely on the log-writes dm target code written
>>>> by Josef Bacik, which had little exposure to the wide community
>>>> as far as I know.  I wanted to prove to myself that the found
>>>> inconsistency was not due to a test bug, so I bisected the failed
>>>> test to the minimal operations that trigger the failure and wrote
>>>> a small independent test to reproduce the issue using dm flakey target.
>>>>
>>>> The following fsck error is reliably reproduced by replaying some fsx ops
>>>> on overlapping file regions, then emulating a crash, followed by mount,
>>>> umount and fsck -nf:
>>>>
>>>>    ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>>>    1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>>>    2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>>>    3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>>>    4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>>>    5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>>>    6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>>>    All 7 operations completed A-OK!
>>>>    _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is inconsistent
>>>>    *** fsck.ext4 output ***
>>>>    fsck from util-linux 2.27.1
>>>>    e2fsck 1.42.13 (17-May-2015)
>>>>    Pass 1: Checking inodes, blocks, and sizes
>>>>    Inode 12, end of extent exceeds allowed value
>>>>            (logical block 33, physical block 33441, len 7)
>>>>    Clear? no
>>>>    Inode 12, i_blocks is 184, should be 128.  Fix? no
>>> Hi Amir,
>>>
>>> I always get the following output when running your xfstests test case 501.
>> Now merged as test generic/456
>>
>>> ---------------------------------------------------------------------------
>>> e2fsck 1.42.9 (28-Dec-2013)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Inode 12, i_size is 147456, should be 163840. Fix? no
>>> ---------------------------------------------------------------------------
>>>
>>> Could you tell me how to get the expected output as you reported?
>> I can't say I am doing anything special, but I can say that I get the
>> same output as you did when running the test inside kvm-xfstests.
>> Actually, I could not reproduce ANY of the the crash consistency bugs
>> inside kvm-xfstests. Must be something to do with different timing of
>> IO with KVM+virtio disks??
>>
>> When running on my laptop (Ubuntu 16.04 with latest kernel)
>> on a 10G SSD volume, I always get the error reported above.
>> I just re-verified with latest stable e2fsprogs (1.43.6).
>>
>> Amir.
>
> .
>




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-10-05  7:27       ` Xiao Yang
@ 2017-10-05 15:04         ` Ashlie Martinez
  2017-10-05 19:10           ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-10-05 15:04 UTC (permalink / raw)
  To: Xiao Yang
  Cc: Amir Goldstein, Theodore Ts'o, Eryu Guan, Josef Bacik,
	fstests, Ext4, Vijay Chidambaram

On Thu, Oct 5, 2017 at 2:27 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
> On 2017/09/30 22:15, Ashlie Martinez wrote:
>>
>> Hi Xiao,
>>
>> I am a student at the University of Texas at Austin. Some researchers
>> in the computer science department at UT, myself included, have
>> recently been working to develop a file system crash consistency test
>> harness called CrashMonkey [1][2]. I have been working on the
>> CrashMonkey project since it was started late last year. With
>> CrashMonkey we have also been able to reproduce the incorrect i_size
>> error you noted but we have not been able to reproduce the other
>> output that Amir found. CrashMonkey works by logging and replaying
>> operations for a workload, so it should not be sensitive to
>> differences in timing that could be caused by things like KVM+virtio.
>> I also did a few experiments with Amir's new xfstests test 456 (both
>> with and without KVM and virtio) and I was unable to reproduce the
>> output noted in the xfstest. I have not spent a lot of time looking
>> into the cause of the bug that Amir found and it is rather unfortunate
>> that I was unable to reproduce it with either xfstests or CrashMonkey.
>
> Hi Ashlie,
>
> Thanks for your detailed comments.
>
> 1) Do you think the output that Amir noted in xfstests is a false positive?

It almost seems that way, though to be honest, I don't think I know
enough about 1. the setup used by Amir and 2. all the internal working
of KVM+virtio to say for sure. One thing I have identified as being a
potential source of false positives is code in the kernel to remap the
block number sent to device drives to something relative to the start
of a block device, not the start of a partition. I'm thinking this
could cause trouble if 1. a partition higher than the first partition
is monitored by dm-log-writes, 2. the block numbers are recorded
verbatim in dm-log-writes, and 3. the dm-log-writes log is replayed on
a different device with different partitions (or possibly the same
device, but a different partition). I know some other undergrad
students at UT on the CrashMonkey team are looking into this right
now, but I have not had time to test this myself. The offending code
in the kernel is in the completely misnamed
`generic_make_request_checks` function which has given the CrashMonkey
team trouble in the past. Links to that function and the remapping
function it calls are below.

http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L2041
http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L1902

>
> 2) About the output that both i and you reproduced,  did you look into it
> and find its root cause?

I have not looked to find the root cause of the issue that we both
reliably reproduce.

>
> Thanks,
> Xiao Yang
>>
>> At any rate, CrashMonkey is still under development, so it does have
>> some caveats. First, we are running with a fixed random seed in our
>> default RandomPermuter (used to generate crash states) to aid
>> development. Second, the branch with the reproduction of this ext4
>> regression bug in CrashMonkey [3] will yield a few false positives due
>> to the way CrashMonkey works and how fsx runs. These false positives
>> are due to CrashMonkey generating crash states where the directories
>> for files used for the test have not be fsync-ed in the file system.
>> The top of the README in the CrashMonkey branch with this bug
>> reproduction outlines how we determined these were false positives
>>
>> [1] https://github.com/utsaslab/crashmonkey
>> [2]
>> https://www.usenix.org/conference/hotstorage17/program/presentation/martinez
>> [3] https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug
>>
>>
>> On Mon, Sep 25, 2017 at 5:53 AM, Amir Goldstein<amir73il@gmail.com>
>> wrote:
>>>
>>> On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang<yangx.jy@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> On 2017/08/27 18:44, Amir Goldstein wrote:
>>>>>
>>>>> This test is motivated by a bug found in ext4 during random crash
>>>>> consistency tests.
>>>>>
>>>>> This test uses device mapper flakey target to demonstrate the bug
>>>>> found using device mapper log-writes target.
>>>>>
>>>>> Signed-off-by: Amir Goldstein<amir73il@gmail.com>
>>>>> ---
>>>>>
>>>>> Ted,
>>>>>
>>>>> While working on crash consistency xfstests [1], I stubmled on what
>>>>> appeared to be an ext4 crash consistency bug.
>>>>>
>>>>> The tests I used rely on the log-writes dm target code written
>>>>> by Josef Bacik, which had little exposure to the wide community
>>>>> as far as I know.  I wanted to prove to myself that the found
>>>>> inconsistency was not due to a test bug, so I bisected the failed
>>>>> test to the minimal operations that trigger the failure and wrote
>>>>> a small independent test to reproduce the issue using dm flakey target.
>>>>>
>>>>> The following fsck error is reliably reproduced by replaying some fsx
>>>>> ops
>>>>> on overlapping file regions, then emulating a crash, followed by mount,
>>>>> umount and fsck -nf:
>>>>>
>>>>>    ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>>>>    1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>>>>    2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>>>>    3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>>>>    4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>>>>    5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>>>>    6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>>>>    All 7 operations completed A-OK!
>>>>>    _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is
>>>>> inconsistent
>>>>>    *** fsck.ext4 output ***
>>>>>    fsck from util-linux 2.27.1
>>>>>    e2fsck 1.42.13 (17-May-2015)
>>>>>    Pass 1: Checking inodes, blocks, and sizes
>>>>>    Inode 12, end of extent exceeds allowed value
>>>>>            (logical block 33, physical block 33441, len 7)
>>>>>    Clear? no
>>>>>    Inode 12, i_blocks is 184, should be 128.  Fix? no
>>>>
>>>> Hi Amir,
>>>>
>>>> I always get the following output when running your xfstests test case
>>>> 501.
>>>
>>> Now merged as test generic/456
>>>
>>>>
>>>> ---------------------------------------------------------------------------
>>>> e2fsck 1.42.9 (28-Dec-2013)
>>>> Pass 1: Checking inodes, blocks, and sizes
>>>> Inode 12, i_size is 147456, should be 163840. Fix? no
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>> Could you tell me how to get the expected output as you reported?
>>>
>>> I can't say I am doing anything special, but I can say that I get the
>>> same output as you did when running the test inside kvm-xfstests.
>>> Actually, I could not reproduce ANY of the the crash consistency bugs
>>> inside kvm-xfstests. Must be something to do with different timing of
>>> IO with KVM+virtio disks??
>>>
>>> When running on my laptop (Ubuntu 16.04 with latest kernel)
>>> on a 10G SSD volume, I always get the error reported above.
>>> I just re-verified with latest stable e2fsprogs (1.43.6).
>>>
>>> Amir.
>>
>>
>> .
>>
>
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-10-05 15:04         ` Ashlie Martinez
@ 2017-10-05 19:10           ` Amir Goldstein
  2017-10-06  0:34             ` Ashlie Martinez
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-10-05 19:10 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Xiao Yang, Theodore Ts'o, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

[-- Attachment #1: Type: text/plain, Size: 4228 bytes --]

On Thu, Oct 5, 2017 at 6:04 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> On Thu, Oct 5, 2017 at 2:27 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
>> On 2017/09/30 22:15, Ashlie Martinez wrote:
>>>
>>> Hi Xiao,
>>>
>>> I am a student at the University of Texas at Austin. Some researchers
>>> in the computer science department at UT, myself included, have
>>> recently been working to develop a file system crash consistency test
>>> harness called CrashMonkey [1][2]. I have been working on the
>>> CrashMonkey project since it was started late last year. With
>>> CrashMonkey we have also been able to reproduce the incorrect i_size
>>> error you noted but we have not been able to reproduce the other
>>> output that Amir found. CrashMonkey works by logging and replaying
>>> operations for a workload, so it should not be sensitive to
>>> differences in timing that could be caused by things like KVM+virtio.
>>> I also did a few experiments with Amir's new xfstests test 456 (both
>>> with and without KVM and virtio) and I was unable to reproduce the
>>> output noted in the xfstest. I have not spent a lot of time looking
>>> into the cause of the bug that Amir found and it is rather unfortunate
>>> that I was unable to reproduce it with either xfstests or CrashMonkey.
>>
>> Hi Ashlie,
>>
>> Thanks for your detailed comments.
>>
>> 1) Do you think the output that Amir noted in xfstests is a false positive?
>
> It almost seems that way, though to be honest, I don't think I know
> enough about 1. the setup used by Amir and 2. all the internal working
> of KVM+virtio to say for sure.

I believe you misread my email.
The problem was NOT reproduced on KVM+virtio.
The problem *is* reproduced on a 10G LVM volume over SSD on
Ubuntu 16.04 with latest kernel and latest e2fsprogs.

There is no use in speculating why the problem doesn't happen on your
systems. If the issue reproduces on A system (2 systems actually that I tested)
that it is a problem.

Attached is an e2image dump of the inconsistent file system following test
generic/456 on my system.  I would have attached the image sooner,
but since on my system problem reproduces 100% on the time, I did not think
that was a need. Without an ext4 developer looking into this image, I don't
really see the point in further discussion.

I would be interesting to get more test samples from people running the test
on other systems. If only some people see the error
"end of extent exceeds allowed value"
it would be interesting to find the commonality between those setups.

> One thing I have identified as being a
> potential source of false positives is code in the kernel to remap the
> block number sent to device drives to something relative to the start
> of a block device, not the start of a partition. I'm thinking this
> could cause trouble if 1. a partition higher than the first partition
> is monitored by dm-log-writes, 2. the block numbers are recorded
> verbatim in dm-log-writes, and 3. the dm-log-writes log is replayed on
> a different device with different partitions (or possibly the same
> device, but a different partition).

You do realize that the test generic/456 is not using dm-log-writes at all.
I intentionally took it out of the equation and used the even dumber dm-flakey
target to demonstrate the crash inconsistency.

> I know some other undergrad
> students at UT on the CrashMonkey team are looking into this right
> now, but I have not had time to test this myself. The offending code
> in the kernel is in the completely misnamed
> `generic_make_request_checks` function which has given the CrashMonkey
> team trouble in the past. Links to that function and the remapping
> function it calls are below.
>
> http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L2041
> http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L1902
>
>>
>> 2) About the output that both i and you reproduced,  did you look into it
>> and find its root cause?
>
> I have not looked to find the root cause of the issue that we both
> reliably reproduce.
>

Now you have a broken file system image and the exact set of operations
that led to it. That's should be a pretty big lead for investigation.

Amir.

[-- Attachment #2: ext4-crash.qcow2.bz2 --]
[-- Type: application/x-bzip2, Size: 12094 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-10-05 19:10           ` Amir Goldstein
@ 2017-10-06  0:34             ` Ashlie Martinez
  2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
  0 siblings, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-10-06  0:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Xiao Yang, Theodore Ts'o, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On Thu, Oct 5, 2017 at 2:10 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 6:04 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
>> On Thu, Oct 5, 2017 at 2:27 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
>>> On 2017/09/30 22:15, Ashlie Martinez wrote:
>>>>
>>>> Hi Xiao,
>>>>
>>>> I am a student at the University of Texas at Austin. Some researchers
>>>> in the computer science department at UT, myself included, have
>>>> recently been working to develop a file system crash consistency test
>>>> harness called CrashMonkey [1][2]. I have been working on the
>>>> CrashMonkey project since it was started late last year. With
>>>> CrashMonkey we have also been able to reproduce the incorrect i_size
>>>> error you noted but we have not been able to reproduce the other
>>>> output that Amir found. CrashMonkey works by logging and replaying
>>>> operations for a workload, so it should not be sensitive to
>>>> differences in timing that could be caused by things like KVM+virtio.
>>>> I also did a few experiments with Amir's new xfstests test 456 (both
>>>> with and without KVM and virtio) and I was unable to reproduce the
>>>> output noted in the xfstest. I have not spent a lot of time looking
>>>> into the cause of the bug that Amir found and it is rather unfortunate
>>>> that I was unable to reproduce it with either xfstests or CrashMonkey.
>>>
>>> Hi Ashlie,
>>>
>>> Thanks for your detailed comments.
>>>
>>> 1) Do you think the output that Amir noted in xfstests is a false positive?
>>
>> It almost seems that way, though to be honest, I don't think I know
>> enough about 1. the setup used by Amir and 2. all the internal working
>> of KVM+virtio to say for sure.
>
> I believe you misread my email.
> The problem was NOT reproduced on KVM+virtio.
> The problem *is* reproduced on a 10G LVM volume over SSD on
> Ubuntu 16.04 with latest kernel and latest e2fsprogs.

I have also tried running generic/456 on non-KVM, non-virtio machines
and I was unable to reproduce it. Without information on test setups
individuals are using, it is very hard to tell where I, or other
people, are going wrong in testing other than maybe using a virtual
machine.

As an aside, it does not appear to be simply a timing issue due to
KVM+virtio. I would hope that CrashMonkey would still be able to find
the extent error you saw, even on a KVM VM since it is not dependent
on timing.

>
> There is no use in speculating why the problem doesn't happen on your
> systems. If the issue reproduces on A system (2 systems actually that I tested)
> that it is a problem.
>
> Attached is an e2image dump of the inconsistent file system following test
> generic/456 on my system.  I would have attached the image sooner,
> but since on my system problem reproduces 100% on the time, I did not think
> that was a need. Without an ext4 developer looking into this image, I don't
> really see the point in further discussion.
>
> I would be interesting to get more test samples from people running the test
> on other systems. If only some people see the error
> "end of extent exceeds allowed value"
> it would be interesting to find the commonality between those setups.
>

I agree it would be interesting to see why this error appears only for
some people.


>> One thing I have identified as being a
>> potential source of false positives is code in the kernel to remap the
>> block number sent to device drives to something relative to the start
>> of a block device, not the start of a partition. I'm thinking this
>> could cause trouble if 1. a partition higher than the first partition
>> is monitored by dm-log-writes, 2. the block numbers are recorded
>> verbatim in dm-log-writes, and 3. the dm-log-writes log is replayed on
>> a different device with different partitions (or possibly the same
>> device, but a different partition).
>
> You do realize that the test generic/456 is not using dm-log-writes at all.
> I intentionally took it out of the equation and used the even dumber dm-flakey
> target to demonstrate the crash inconsistency.

Oh, yes, I recall that now. The past few weeks have been very busy for
me with all of my other school work, so getting all of my i's dotted
and t's crossed has been a challenge. At any rate, I don't think it is
a real minus to mention the potential remapping of blocks in the cited
function. If people are going to use any sort of block device module
(be dm-log-writes or otherwise) then I think it would be good for
developers to be aware of.

>
>> I know some other undergrad
>> students at UT on the CrashMonkey team are looking into this right
>> now, but I have not had time to test this myself. The offending code
>> in the kernel is in the completely misnamed
>> `generic_make_request_checks` function which has given the CrashMonkey
>> team trouble in the past. Links to that function and the remapping
>> function it calls are below.
>>
>> http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L2041
>> http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L1902
>>
>>>
>>> 2) About the output that both i and you reproduced,  did you look into it
>>> and find its root cause?
>>
>> I have not looked to find the root cause of the issue that we both
>> reliably reproduce.
>>
>
> Now you have a broken file system image and the exact set of operations
> that led to it. That's should be a pretty big lead for investigation.
>
> Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-06  0:34             ` Ashlie Martinez
@ 2017-10-07  3:29               ` Theodore Ts'o
  2017-10-07  5:54                 ` Amir Goldstein
                                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Theodore Ts'o @ 2017-10-07  3:29 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Amir Goldstein, Xiao Yang, Eryu Guan, Josef Bacik, fstests, Ext4,
	Vijay Chidambaram

On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
> >> It almost seems that way, though to be honest, I don't think I know
> >> enough about 1. the setup used by Amir and 2. all the internal working
> >> of KVM+virtio to say for sure.
> >
> > I believe you misread my email.
> > The problem was NOT reproduced on KVM+virtio.
> > The problem *is* reproduced on a 10G LVM volume over SSD on
> > Ubuntu 16.04 with latest kernel and latest e2fsprogs.

I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
[2] https://thunk.org/gce-xfstests

It did turn out to be timing related, and it requires that a journal
commit take place after fsstress runs, but it can *not* be triggered
by a sync/fsync (as this would cause the delayed allocation writes to
be forced out to disk, and that makes the problem go away).

I initially tried using xfs_io as a replacement for fsstress (since it
is more flexible and would allow me to more easily run experiments),
but it turns out xfs_io was too fast/efficient, and so using xfs_io to
execute the same system calls (verified by strace) would not replicate
the problem.

> > Now you have a broken file system image and the exact set of operations
> > that led to it. That's should be a pretty big lead for investigation.

It was indeed a big lead for investigation (thanks, Amir!), but it
still took me several hours before I was finally able to figure out
the problem.  The patch and the commit description should explain what
was going on.

I'll leave it to Ashlie and Vijay to investigate how to improve Crash
Monkey so it can better find problems like this automatically.  Since
you now have a clear reproducer (you can use generic/456 and run it on
gce-xfstests, using is a standard cloud VM configuration) and an
explanation of the bug and the four-line fix, I suspect this might be
good grist for follow-on research after your Hot Storage '17 workshop
paper.  :-)

Best regards,

					- Ted


commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri Oct 6 23:09:55 2017 -0400

    ext4: fix interaction between i_size, fallocate, and delalloc after a crash
    
    If there are pending writes subject to delayed allocation, then i_size
    will show size after the writes have completed, while i_disksize
    contains the value of i_size on the disk (since the writes have not
    been persisted to disk).
    
    If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
    with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
    after the fallocate(2) is between i_size and i_disksize, then after a
    crash, if a journal commit has resulted in the changes made by the
    fallocate() call to be persisted after a crash, but the delayed
    allocation write has not resolved itself, i_size would not be updated,
    and this would cause the following e2fsck complaint:
    
    Inode 12, end of extent exceeds allowed value
            (logical block 33, physical block 33441, len 7)
    
    This can only take place on a sparse file, where the fallocate(2) call
    is allocating blocks in a range which is before a pending delayed
    allocation write which is extending i_size.  Since this situation is
    quite rare, and the window in which the crash must take place is
    typically < 30 seconds, in practice this condition will rarely happen.
    
    Nevertheless, it can be triggered in testing, and in particular by
    xfstests generic/456.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Reported-by: Amir Goldstein <amir73il@gmail.com>
    Cc: stable@vger.kernel.org

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 97f0fd06728d..07bca11749d4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-	     offset + len > i_size_read(inode)) {
+	    (offset + len > i_size_read(inode) ||
+	     offset + len > EXT4_I(inode)->i_disksize)) {
 		new_size = offset + len;
 		ret = inode_newsize_ok(inode, new_size);
 		if (ret)
@@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	}
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-	     offset + len > i_size_read(inode)) {
+	    (offset + len > i_size_read(inode) ||
+	     offset + len > EXT4_I(inode)->i_disksize)) {
 		new_size = offset + len;
 		ret = inode_newsize_ok(inode, new_size);
 		if (ret)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
@ 2017-10-07  5:54                 ` Amir Goldstein
  2017-10-07 18:32                   ` Theodore Ts'o
  2017-10-09  0:37                 ` Ashlie Martinez
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-10-07  5:54 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ashlie Martinez, Xiao Yang, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On Sat, Oct 7, 2017 at 6:29 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
>> >> It almost seems that way, though to be honest, I don't think I know
>> >> enough about 1. the setup used by Amir and 2. all the internal working
>> >> of KVM+virtio to say for sure.
>> >
>> > I believe you misread my email.
>> > The problem was NOT reproduced on KVM+virtio.
>> > The problem *is* reproduced on a 10G LVM volume over SSD on
>> > Ubuntu 16.04 with latest kernel and latest e2fsprogs.
>
> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
> [2] https://thunk.org/gce-xfstests
>
> It did turn out to be timing related, and it requires that a journal
> commit take place after fsstress runs, but it can *not* be triggered
> by a sync/fsync (as this would cause the delayed allocation writes to
> be forced out to disk, and that makes the problem go away).
>
> I initially tried using xfs_io as a replacement for fsstress (since it
> is more flexible and would allow me to more easily run experiments),
> but it turns out xfs_io was too fast/efficient, and so using xfs_io to
> execute the same system calls (verified by strace) would not replicate
> the problem.
>
>> > Now you have a broken file system image and the exact set of operations
>> > that led to it. That's should be a pretty big lead for investigation.
>
> It was indeed a big lead for investigation (thanks, Amir!), but it
> still took me several hours before I was finally able to figure out
> the problem.  The patch and the commit description should explain what
> was going on.
>
> I'll leave it to Ashlie and Vijay to investigate how to improve Crash
> Monkey so it can better find problems like this automatically.  Since
> you now have a clear reproducer (you can use generic/456 and run it on
> gce-xfstests, using is a standard cloud VM configuration) and an
> explanation of the bug and the four-line fix, I suspect this might be
> good grist for follow-on research after your Hot Storage '17 workshop
> paper.  :-)
>
> Best regards,
>
>                                         - Ted
>
>
> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Fri Oct 6 23:09:55 2017 -0400
>
>     ext4: fix interaction between i_size, fallocate, and delalloc after a crash
>
>     If there are pending writes subject to delayed allocation, then i_size
>     will show size after the writes have completed, while i_disksize
>     contains the value of i_size on the disk (since the writes have not
>     been persisted to disk).
>
>     If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either

According to patch without keep size flag?

BTW looking at the patch, does truncate racing with fallocate that
would increase size beyond truncated size works correctly?

Didn't check the code just wondering..

Thanks!

>     with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
>     after the fallocate(2) is between i_size and i_disksize, then after a
>     crash, if a journal commit has resulted in the changes made by the
>     fallocate() call to be persisted after a crash, but the delayed
>     allocation write has not resolved itself, i_size would not be updated,
>     and this would cause the following e2fsck complaint:
>
>     Inode 12, end of extent exceeds allowed value
>             (logical block 33, physical block 33441, len 7)
>
>     This can only take place on a sparse file, where the fallocate(2) call
>     is allocating blocks in a range which is before a pending delayed
>     allocation write which is extending i_size.  Since this situation is
>     quite rare, and the window in which the crash must take place is
>     typically < 30 seconds, in practice this condition will rarely happen.
>
>     Nevertheless, it can be triggered in testing, and in particular by
>     xfstests generic/456.
>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>     Reported-by: Amir Goldstein <amir73il@gmail.com>
>     Cc: stable@vger.kernel.org
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 97f0fd06728d..07bca11749d4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>         }
>
>         if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> -            offset + len > i_size_read(inode)) {
> +           (offset + len > i_size_read(inode) ||
> +            offset + len > EXT4_I(inode)->i_disksize)) {
>                 new_size = offset + len;
>                 ret = inode_newsize_ok(inode, new_size);
>                 if (ret)
> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>         }
>
>         if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> -            offset + len > i_size_read(inode)) {
> +           (offset + len > i_size_read(inode) ||
> +            offset + len > EXT4_I(inode)->i_disksize)) {
>                 new_size = offset + len;
>                 ret = inode_newsize_ok(inode, new_size);
>                 if (ret)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-07  5:54                 ` Amir Goldstein
@ 2017-10-07 18:32                   ` Theodore Ts'o
  0 siblings, 0 replies; 31+ messages in thread
From: Theodore Ts'o @ 2017-10-07 18:32 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ashlie Martinez, Xiao Yang, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On Sat, Oct 07, 2017 at 08:54:06AM +0300, Amir Goldstein wrote:
> >     If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
> 
> According to patch without keep size flag?

Oops, thanks for catching that.  I'll fix the commit description.

> BTW looking at the patch, does truncate racing with fallocate that
> would increase size beyond truncated size works correctly?

Yes, because we the inode is locked via i_mutex (well, now an write
lock on i_rwsem) during both truncate and fallocate operations.

     		 	     	      	  - Ted

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
  2017-10-07  5:54                 ` Amir Goldstein
@ 2017-10-09  0:37                 ` Ashlie Martinez
  2017-10-11 11:11                 ` Xiao Yang
  2017-10-12 14:38                 ` Jan Kara
  3 siblings, 0 replies; 31+ messages in thread
From: Ashlie Martinez @ 2017-10-09  0:37 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, Xiao Yang, Eryu Guan, Josef Bacik, fstests, Ext4,
	Vijay Chidambaram

Ted,

Thanks for looking into this more. The information about what was
causing the output Amir saw will certainly help us improve CrashMonkey
so we can detect bugs better in the future!

Best,
Ashlie

On Fri, Oct 6, 2017 at 10:29 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
>> >> It almost seems that way, though to be honest, I don't think I know
>> >> enough about 1. the setup used by Amir and 2. all the internal working
>> >> of KVM+virtio to say for sure.
>> >
>> > I believe you misread my email.
>> > The problem was NOT reproduced on KVM+virtio.
>> > The problem *is* reproduced on a 10G LVM volume over SSD on
>> > Ubuntu 16.04 with latest kernel and latest e2fsprogs.
>
> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
> [2] https://thunk.org/gce-xfstests
>
> It did turn out to be timing related, and it requires that a journal
> commit take place after fsstress runs, but it can *not* be triggered
> by a sync/fsync (as this would cause the delayed allocation writes to
> be forced out to disk, and that makes the problem go away).
>
> I initially tried using xfs_io as a replacement for fsstress (since it
> is more flexible and would allow me to more easily run experiments),
> but it turns out xfs_io was too fast/efficient, and so using xfs_io to
> execute the same system calls (verified by strace) would not replicate
> the problem.
>
>> > Now you have a broken file system image and the exact set of operations
>> > that led to it. That's should be a pretty big lead for investigation.
>
> It was indeed a big lead for investigation (thanks, Amir!), but it
> still took me several hours before I was finally able to figure out
> the problem.  The patch and the commit description should explain what
> was going on.
>
> I'll leave it to Ashlie and Vijay to investigate how to improve Crash
> Monkey so it can better find problems like this automatically.  Since
> you now have a clear reproducer (you can use generic/456 and run it on
> gce-xfstests, using is a standard cloud VM configuration) and an
> explanation of the bug and the four-line fix, I suspect this might be
> good grist for follow-on research after your Hot Storage '17 workshop
> paper.  :-)
>
> Best regards,
>
>                                         - Ted
>
>
> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Fri Oct 6 23:09:55 2017 -0400
>
>     ext4: fix interaction between i_size, fallocate, and delalloc after a crash
>
>     If there are pending writes subject to delayed allocation, then i_size
>     will show size after the writes have completed, while i_disksize
>     contains the value of i_size on the disk (since the writes have not
>     been persisted to disk).
>
>     If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
>     with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
>     after the fallocate(2) is between i_size and i_disksize, then after a
>     crash, if a journal commit has resulted in the changes made by the
>     fallocate() call to be persisted after a crash, but the delayed
>     allocation write has not resolved itself, i_size would not be updated,
>     and this would cause the following e2fsck complaint:
>
>     Inode 12, end of extent exceeds allowed value
>             (logical block 33, physical block 33441, len 7)
>
>     This can only take place on a sparse file, where the fallocate(2) call
>     is allocating blocks in a range which is before a pending delayed
>     allocation write which is extending i_size.  Since this situation is
>     quite rare, and the window in which the crash must take place is
>     typically < 30 seconds, in practice this condition will rarely happen.
>
>     Nevertheless, it can be triggered in testing, and in particular by
>     xfstests generic/456.
>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>     Reported-by: Amir Goldstein <amir73il@gmail.com>
>     Cc: stable@vger.kernel.org
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 97f0fd06728d..07bca11749d4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>         }
>
>         if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> -            offset + len > i_size_read(inode)) {
> +           (offset + len > i_size_read(inode) ||
> +            offset + len > EXT4_I(inode)->i_disksize)) {
>                 new_size = offset + len;
>                 ret = inode_newsize_ok(inode, new_size);
>                 if (ret)
> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>         }
>
>         if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> -            offset + len > i_size_read(inode)) {
> +           (offset + len > i_size_read(inode) ||
> +            offset + len > EXT4_I(inode)->i_disksize)) {
>                 new_size = offset + len;
>                 ret = inode_newsize_ok(inode, new_size);
>                 if (ret)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
  2017-10-07  5:54                 ` Amir Goldstein
  2017-10-09  0:37                 ` Ashlie Martinez
@ 2017-10-11 11:11                 ` Xiao Yang
  2017-10-11 13:17                   ` Ashlie Martinez
  2017-10-12 14:38                 ` Jan Kara
  3 siblings, 1 reply; 31+ messages in thread
From: Xiao Yang @ 2017-10-11 11:11 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ashlie Martinez, Amir Goldstein, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On 2017/10/07 11:29, Theodore Ts'o wrote:
> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
>>>> It almost seems that way, though to be honest, I don't think I know
>>>> enough about 1. the setup used by Amir and 2. all the internal working
>>>> of KVM+virtio to say for sure.
>>> I believe you misread my email.
>>> The problem was NOT reproduced on KVM+virtio.
>>> The problem *is* reproduced on a 10G LVM volume over SSD on
>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs.
> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
> [2] https://thunk.org/gce-xfstests
>
> It did turn out to be timing related, and it requires that a journal
> commit take place after fsstress runs, but it can *not* be triggered
> by a sync/fsync (as this would cause the delayed allocation writes to
> be forced out to disk, and that makes the problem go away).
>
> I initially tried using xfs_io as a replacement for fsstress (since it
> is more flexible and would allow me to more easily run experiments),
> but it turns out xfs_io was too fast/efficient, and so using xfs_io to
> execute the same system calls (verified by strace) would not replicate
> the problem.
>
>>> Now you have a broken file system image and the exact set of operations
>>> that led to it. That's should be a pretty big lead for investigation.
> It was indeed a big lead for investigation (thanks, Amir!), but it
> still took me several hours before I was finally able to figure out
> the problem.  The patch and the commit description should explain what
> was going on.
>
> I'll leave it to Ashlie and Vijay to investigate how to improve Crash
> Monkey so it can better find problems like this automatically.  Since
> you now have a clear reproducer (you can use generic/456 and run it on
> gce-xfstests, using is a standard cloud VM configuration) and an
> explanation of the bug and the four-line fix, I suspect this might be
> good grist for follow-on research after your Hot Storage '17 workshop
> paper.  :-)
>
> Best regards,
>
> 					- Ted
>
>
> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
> Author: Theodore Ts'o<tytso@mit.edu>
> Date:   Fri Oct 6 23:09:55 2017 -0400
>
>      ext4: fix interaction between i_size, fallocate, and delalloc after a crash
Hi Theodore,

After applying your patch, generic/456 always passes on my system which 
just triggers the output[2].
So i could believe this two different outputs[1][2] are triggered by 
different environments, but they
are caused by the same bug which your patch fixes.  Is this right?

[1] Inode 12, end of extent exceeds allowed value(logical block 33, 
physical block 33441, len 7)Clear? no
       Inode 12, i_blocks is 184, should be 128. Fix? no
[2] Inode 12, i_size is 147456, should be 163840. Fix? no

Sorry, i am not familiar with ext4.

Thanks,
Xiao Yang
>
>      If there are pending writes subject to delayed allocation, then i_size
>      will show size after the writes have completed, while i_disksize
>      contains the value of i_size on the disk (since the writes have not
>      been persisted to disk).
>
>      If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
>      with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
>      after the fallocate(2) is between i_size and i_disksize, then after a
>      crash, if a journal commit has resulted in the changes made by the
>      fallocate() call to be persisted after a crash, but the delayed
>      allocation write has not resolved itself, i_size would not be updated,
>      and this would cause the following e2fsck complaint:
>
>      Inode 12, end of extent exceeds allowed value
>              (logical block 33, physical block 33441, len 7)
>
>      This can only take place on a sparse file, where the fallocate(2) call
>      is allocating blocks in a range which is before a pending delayed
>      allocation write which is extending i_size.  Since this situation is
>      quite rare, and the window in which the crash must take place is
>      typically<  30 seconds, in practice this condition will rarely happen.
>
>      Nevertheless, it can be triggered in testing, and in particular by
>      xfstests generic/456.
>
>      Signed-off-by: Theodore Ts'o<tytso@mit.edu>
>      Reported-by: Amir Goldstein<amir73il@gmail.com>
>      Cc: stable@vger.kernel.org
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 97f0fd06728d..07bca11749d4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>   	}
>
>   	if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
> -	     offset + len>  i_size_read(inode)) {
> +	    (offset + len>  i_size_read(inode) ||
> +	     offset + len>  EXT4_I(inode)->i_disksize)) {
>   		new_size = offset + len;
>   		ret = inode_newsize_ok(inode, new_size);
>   		if (ret)
> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>   	}
>
>   	if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
> -	     offset + len>  i_size_read(inode)) {
> +	    (offset + len>  i_size_read(inode) ||
> +	     offset + len>  EXT4_I(inode)->i_disksize)) {
>   		new_size = offset + len;
>   		ret = inode_newsize_ok(inode, new_size);
>   		if (ret)
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> .
>




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-11 11:11                 ` Xiao Yang
@ 2017-10-11 13:17                   ` Ashlie Martinez
  2017-10-11 13:34                     ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-10-11 13:17 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Xiao Yang, Theodore Ts'o, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On Wed, Oct 11, 2017 at 6:11 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
> On 2017/10/07 11:29, Theodore Ts'o wrote:
>>
>> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
>>>>>
>>>>> It almost seems that way, though to be honest, I don't think I know
>>>>> enough about 1. the setup used by Amir and 2. all the internal working
>>>>> of KVM+virtio to say for sure.
>>>>
>>>> I believe you misread my email.
>>>> The problem was NOT reproduced on KVM+virtio.
>>>> The problem *is* reproduced on a 10G LVM volume over SSD on
>>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs.
>>
>> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].
>>
>> [1]
>> https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
>> [2] https://thunk.org/gce-xfstests
>>
>> It did turn out to be timing related, and it requires that a journal
>> commit take place after fsstress runs, but it can *not* be triggered
>> by a sync/fsync (as this would cause the delayed allocation writes to
>> be forced out to disk, and that makes the problem go away).

Amir,

I was actually looking into why only some machines can reproduce the
output from generic/456 and I found that I was a little unsure about
exactly how the test itself was operating. I'm not an expert on how
device mapper works, so I was wondering if you could help shed some
light on it.

>From what I've surmised from the test case, it replays the workload
you specified with fsx and then immediately inserts the dm_flakey
table and device in place of the hard disk *while requests are still
in flight* (see nolockfs flag in dmsetup which is used by dm_flakey
calls). The flakey device drops all requests sent to it. The flakey
device is then unmounted (to force all pending writes out), the dm
table changed once again back to the original backing block device,
and then mounted again.

Is this summary of the test case correct? If it is, then perhaps the
reason why it is so hard for some people to reproduce the exact bug
output you saw is because there does not appear to be anything in the
test case that specifies the timing for when dm_flakey should begin
dropping requests. Since the flakey device could be inserted any time
after fsx completes, but writes are still pending, it seems like the
test creates a race condition between inserting dm_flakey and a kernel
thread waking up to persist writes. This race condition would likely
be affected by things like having > 1 core on the VM or hardware (so
that dm_flakey could be inserted at the same time a kernel thread on
another core is trying to persist writes) the test is running on,
system load, dirty writeback times, etc.

Do you have any thoughts on the above?

>>
>> I initially tried using xfs_io as a replacement for fsstress (since it
>> is more flexible and would allow me to more easily run experiments),
>> but it turns out xfs_io was too fast/efficient, and so using xfs_io to
>> execute the same system calls (verified by strace) would not replicate
>> the problem.
>>
>>>> Now you have a broken file system image and the exact set of operations
>>>> that led to it. That's should be a pretty big lead for investigation.
>>
>> It was indeed a big lead for investigation (thanks, Amir!), but it
>> still took me several hours before I was finally able to figure out
>> the problem.  The patch and the commit description should explain what
>> was going on.
>>
>> I'll leave it to Ashlie and Vijay to investigate how to improve Crash
>> Monkey so it can better find problems like this automatically.  Since
>> you now have a clear reproducer (you can use generic/456 and run it on
>> gce-xfstests, using is a standard cloud VM configuration) and an
>> explanation of the bug and the four-line fix, I suspect this might be
>> good grist for follow-on research after your Hot Storage '17 workshop
>> paper.  :-)
>>
>> Best regards,
>>
>>                                         - Ted
>>
>>
>> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
>> Author: Theodore Ts'o<tytso@mit.edu>
>> Date:   Fri Oct 6 23:09:55 2017 -0400
>>
>>      ext4: fix interaction between i_size, fallocate, and delalloc after a
>> crash
>
> Hi Theodore,
>
> After applying your patch, generic/456 always passes on my system which just
> triggers the output[2].
> So i could believe this two different outputs[1][2] are triggered by
> different environments, but they
> are caused by the same bug which your patch fixes.  Is this right?
>
> [1] Inode 12, end of extent exceeds allowed value(logical block 33, physical
> block 33441, len 7)Clear? no
>       Inode 12, i_blocks is 184, should be 128. Fix? no
> [2] Inode 12, i_size is 147456, should be 163840. Fix? no
>
> Sorry, i am not familiar with ext4.
>
> Thanks,
> Xiao Yang
>>
>>
>>      If there are pending writes subject to delayed allocation, then
>> i_size
>>      will show size after the writes have completed, while i_disksize
>>      contains the value of i_size on the disk (since the writes have not
>>      been persisted to disk).
>>
>>      If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
>>      with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
>>      after the fallocate(2) is between i_size and i_disksize, then after a
>>      crash, if a journal commit has resulted in the changes made by the
>>      fallocate() call to be persisted after a crash, but the delayed
>>      allocation write has not resolved itself, i_size would not be
>> updated,
>>      and this would cause the following e2fsck complaint:
>>
>>      Inode 12, end of extent exceeds allowed value
>>              (logical block 33, physical block 33441, len 7)
>>
>>      This can only take place on a sparse file, where the fallocate(2)
>> call
>>      is allocating blocks in a range which is before a pending delayed
>>      allocation write which is extending i_size.  Since this situation is
>>      quite rare, and the window in which the crash must take place is
>>      typically<  30 seconds, in practice this condition will rarely
>> happen.
>>
>>      Nevertheless, it can be triggered in testing, and in particular by
>>      xfstests generic/456.
>>
>>      Signed-off-by: Theodore Ts'o<tytso@mit.edu>
>>      Reported-by: Amir Goldstein<amir73il@gmail.com>
>>      Cc: stable@vger.kernel.org
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 97f0fd06728d..07bca11749d4 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file,
>> loff_t offset,
>>         }
>>
>>         if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
>> -            offset + len>  i_size_read(inode)) {
>> +           (offset + len>  i_size_read(inode) ||
>> +            offset + len>  EXT4_I(inode)->i_disksize)) {
>>                 new_size = offset + len;
>>                 ret = inode_newsize_ok(inode, new_size);
>>                 if (ret)
>> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode,
>> loff_t offset, loff_t len)
>>         }
>>
>>         if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
>> -            offset + len>  i_size_read(inode)) {
>> +           (offset + len>  i_size_read(inode) ||
>> +            offset + len>  EXT4_I(inode)->i_disksize)) {
>>                 new_size = offset + len;
>>                 ret = inode_newsize_ok(inode, new_size);
>>                 if (ret)
>> --
>> To unsubscribe from this list: send the line "unsubscribe fstests" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>> .
>>
>
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-11 13:17                   ` Ashlie Martinez
@ 2017-10-11 13:34                     ` Amir Goldstein
  2017-10-16 19:32                       ` Ashlie Martinez
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-10-11 13:34 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Xiao Yang, Theodore Ts'o, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On Wed, Oct 11, 2017 at 4:17 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> On Wed, Oct 11, 2017 at 6:11 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
>> On 2017/10/07 11:29, Theodore Ts'o wrote:
>>>
>>> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
>>>>>>
>>>>>> It almost seems that way, though to be honest, I don't think I know
>>>>>> enough about 1. the setup used by Amir and 2. all the internal working
>>>>>> of KVM+virtio to say for sure.
>>>>>
>>>>> I believe you misread my email.
>>>>> The problem was NOT reproduced on KVM+virtio.
>>>>> The problem *is* reproduced on a 10G LVM volume over SSD on
>>>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs.
>>>
>>> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].
>>>
>>> [1]
>>> https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
>>> [2] https://thunk.org/gce-xfstests
>>>
>>> It did turn out to be timing related, and it requires that a journal
>>> commit take place after fsstress runs, but it can *not* be triggered
>>> by a sync/fsync (as this would cause the delayed allocation writes to
>>> be forced out to disk, and that makes the problem go away).
>
> Amir,
>
> I was actually looking into why only some machines can reproduce the
> output from generic/456 and I found that I was a little unsure about
> exactly how the test itself was operating. I'm not an expert on how
> device mapper works, so I was wondering if you could help shed some
> light on it.
>
> From what I've surmised from the test case, it replays the workload
> you specified with fsx and then immediately inserts the dm_flakey
> table and device in place of the hard disk *while requests are still
> in flight* (see nolockfs flag in dmsetup which is used by dm_flakey
> calls). The flakey device drops all requests sent to it. The flakey
> device is then unmounted (to force all pending writes out), the dm
> table changed once again back to the original backing block device,
> and then mounted again.
>
> Is this summary of the test case correct? If it is, then perhaps the
> reason why it is so hard for some people to reproduce the exact bug
> output you saw is because there does not appear to be anything in the
> test case that specifies the timing for when dm_flakey should begin
> dropping requests. Since the flakey device could be inserted any time
> after fsx completes, but writes are still pending, it seems like the
> test creates a race condition between inserting dm_flakey and a kernel
> thread waking up to persist writes. This race condition would likely
> be affected by things like having > 1 core on the VM or hardware (so
> that dm_flakey could be inserted at the same time a kernel thread on
> another core is trying to persist writes) the test is running on,
> system load, dirty writeback times, etc.
>
> Do you have any thoughts on the above?

Sounds about right.
Having more cores to the VM was my first guess as the reason.
Note that all the bugs I found so far depended on some timing between
processes or kernel threads, so crash monkey trying all io completion
combinations would not have helped. Maybe it would help to find other
bugs though.

>
>>>
>>> I initially tried using xfs_io as a replacement for fsstress (since it
>>> is more flexible and would allow me to more easily run experiments),
>>> but it turns out xfs_io was too fast/efficient, and so using xfs_io to
>>> execute the same system calls (verified by strace) would not replicate
>>> the problem.
>>>
>>>>> Now you have a broken file system image and the exact set of operations
>>>>> that led to it. That's should be a pretty big lead for investigation.
>>>
>>> It was indeed a big lead for investigation (thanks, Amir!), but it
>>> still took me several hours before I was finally able to figure out
>>> the problem.  The patch and the commit description should explain what
>>> was going on.
>>>
>>> I'll leave it to Ashlie and Vijay to investigate how to improve Crash
>>> Monkey so it can better find problems like this automatically.  Since
>>> you now have a clear reproducer (you can use generic/456 and run it on
>>> gce-xfstests, using is a standard cloud VM configuration) and an
>>> explanation of the bug and the four-line fix, I suspect this might be
>>> good grist for follow-on research after your Hot Storage '17 workshop
>>> paper.  :-)
>>>
>>> Best regards,
>>>
>>>                                         - Ted
>>>
>>>
>>> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
>>> Author: Theodore Ts'o<tytso@mit.edu>
>>> Date:   Fri Oct 6 23:09:55 2017 -0400
>>>
>>>      ext4: fix interaction between i_size, fallocate, and delalloc after a
>>> crash
>>
>> Hi Theodore,
>>
>> After applying your patch, generic/456 always passes on my system which just
>> triggers the output[2].
>> So i could believe this two different outputs[1][2] are triggered by
>> different environments, but they
>> are caused by the same bug which your patch fixes.  Is this right?
>>
>> [1] Inode 12, end of extent exceeds allowed value(logical block 33, physical
>> block 33441, len 7)Clear? no
>>       Inode 12, i_blocks is 184, should be 128. Fix? no
>> [2] Inode 12, i_size is 147456, should be 163840. Fix? no
>>
>> Sorry, i am not familiar with ext4.
>>
>> Thanks,
>> Xiao Yang
>>>
>>>
>>>      If there are pending writes subject to delayed allocation, then
>>> i_size
>>>      will show size after the writes have completed, while i_disksize
>>>      contains the value of i_size on the disk (since the writes have not
>>>      been persisted to disk).
>>>
>>>      If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
>>>      with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
>>>      after the fallocate(2) is between i_size and i_disksize, then after a
>>>      crash, if a journal commit has resulted in the changes made by the
>>>      fallocate() call to be persisted after a crash, but the delayed
>>>      allocation write has not resolved itself, i_size would not be
>>> updated,
>>>      and this would cause the following e2fsck complaint:
>>>
>>>      Inode 12, end of extent exceeds allowed value
>>>              (logical block 33, physical block 33441, len 7)
>>>
>>>      This can only take place on a sparse file, where the fallocate(2)
>>> call
>>>      is allocating blocks in a range which is before a pending delayed
>>>      allocation write which is extending i_size.  Since this situation is
>>>      quite rare, and the window in which the crash must take place is
>>>      typically<  30 seconds, in practice this condition will rarely
>>> happen.
>>>
>>>      Nevertheless, it can be triggered in testing, and in particular by
>>>      xfstests generic/456.
>>>
>>>      Signed-off-by: Theodore Ts'o<tytso@mit.edu>
>>>      Reported-by: Amir Goldstein<amir73il@gmail.com>
>>>      Cc: stable@vger.kernel.org
>>>
>>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>>> index 97f0fd06728d..07bca11749d4 100644
>>> --- a/fs/ext4/extents.c
>>> +++ b/fs/ext4/extents.c
>>> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file,
>>> loff_t offset,
>>>         }
>>>
>>>         if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
>>> -            offset + len>  i_size_read(inode)) {
>>> +           (offset + len>  i_size_read(inode) ||
>>> +            offset + len>  EXT4_I(inode)->i_disksize)) {
>>>                 new_size = offset + len;
>>>                 ret = inode_newsize_ok(inode, new_size);
>>>                 if (ret)
>>> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode,
>>> loff_t offset, loff_t len)
>>>         }
>>>
>>>         if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
>>> -            offset + len>  i_size_read(inode)) {
>>> +           (offset + len>  i_size_read(inode) ||
>>> +            offset + len>  EXT4_I(inode)->i_disksize)) {
>>>                 new_size = offset + len;
>>>                 ret = inode_newsize_ok(inode, new_size);
>>>                 if (ret)
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe fstests" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> .
>>>
>>
>>
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
                                   ` (2 preceding siblings ...)
  2017-10-11 11:11                 ` Xiao Yang
@ 2017-10-12 14:38                 ` Jan Kara
  3 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2017-10-12 14:38 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ashlie Martinez, Amir Goldstein, Xiao Yang, Eryu Guan,
	Josef Bacik, fstests, Ext4, Vijay Chidambaram

> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Fri Oct 6 23:09:55 2017 -0400
> 
>     ext4: fix interaction between i_size, fallocate, and delalloc after a crash
>     
>     If there are pending writes subject to delayed allocation, then i_size
>     will show size after the writes have completed, while i_disksize
>     contains the value of i_size on the disk (since the writes have not
>     been persisted to disk).
>     
>     If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
>     with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
>     after the fallocate(2) is between i_size and i_disksize, then after a
>     crash, if a journal commit has resulted in the changes made by the
>     fallocate() call to be persisted after a crash, but the delayed
>     allocation write has not resolved itself, i_size would not be updated,
>     and this would cause the following e2fsck complaint:
>     
>     Inode 12, end of extent exceeds allowed value
>             (logical block 33, physical block 33441, len 7)
>     
>     This can only take place on a sparse file, where the fallocate(2) call
>     is allocating blocks in a range which is before a pending delayed
>     allocation write which is extending i_size.  Since this situation is
>     quite rare, and the window in which the crash must take place is
>     typically < 30 seconds, in practice this condition will rarely happen.
>     
>     Nevertheless, it can be triggered in testing, and in particular by
>     xfstests generic/456.
>     
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>     Reported-by: Amir Goldstein <amir73il@gmail.com>
>     Cc: stable@vger.kernel.org

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 97f0fd06728d..07bca11749d4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> -	     offset + len > i_size_read(inode)) {
> +	    (offset + len > i_size_read(inode) ||
> +	     offset + len > EXT4_I(inode)->i_disksize)) {
>  		new_size = offset + len;
>  		ret = inode_newsize_ok(inode, new_size);
>  		if (ret)
> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  	}
>  
>  	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> -	     offset + len > i_size_read(inode)) {
> +	    (offset + len > i_size_read(inode) ||
> +	     offset + len > EXT4_I(inode)->i_disksize)) {
>  		new_size = offset + len;
>  		ret = inode_newsize_ok(inode, new_size);
>  		if (ret)
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-11 13:34                     ` Amir Goldstein
@ 2017-10-16 19:32                       ` Ashlie Martinez
  2017-10-16 21:11                         ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-10-16 19:32 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Xiao Yang, Theodore Tso, Eryu Guan, Josef Bacik, fstests, Ext4,
	Vijay Chidambaram

Amir,

I know this is a bit late, but I've spent some time working through
the disk image that you provided (so that I could determine how/if I
could modify CrashMonkey to catch errors like this) and I don't think
I understand what state the disk image reflects. After digging around
the journal of the disk image you provided, I found that the first 10
journal blocks are used, with the journal superblock being placed in
the very first block of the journal. The journal superblock says that
the first journal transaction ID that should be in the journal is
transaction ID 4. However, dumping the other journal blocks, I found
that the next block is a descriptor block for transaction ID 2. The
rest of the journal blocks are data blocks for that transaction plus a
transaction commit block. This seems a little odd considering that the
journal refers to a 4th transaction, which I have not been able to
find (I quickly dumped the first 50 blocks in debugfs and found the
rest to contain only zeros).

With this in mind, I looked back at the xfstests code for controlling
the dm_flakey device. What I realized is the `nolockfs` flag is
provided both when it switches from the real device to the flakey
device that drops writes and when it switches from the flakey device
back to the real device. I know there is a call to umount once the
flakey device that drops writes is inserted, but do you think it is
possible that the flakey device is swapped back to the real device
before all the writes forced out by umount have made it to the flakey
device? Unfortunately I still don't have a local machine that is
capable of reproducing your test results and I have not made any gce
test appliance images to test this yet, so I'm not sure if this is a
valid theory.

On Wed, Oct 11, 2017 at 8:34 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Oct 11, 2017 at 4:17 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> > On Wed, Oct 11, 2017 at 6:11 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
> >> On 2017/10/07 11:29, Theodore Ts'o wrote:
> >>>
> >>> On Thu, Oct 05, 2017 at 07:34:10PM -0500, Ashlie Martinez wrote:
> >>>>>>
> >>>>>> It almost seems that way, though to be honest, I don't think I know
> >>>>>> enough about 1. the setup used by Amir and 2. all the internal working
> >>>>>> of KVM+virtio to say for sure.
> >>>>>
> >>>>> I believe you misread my email.
> >>>>> The problem was NOT reproduced on KVM+virtio.
> >>>>> The problem *is* reproduced on a 10G LVM volume over SSD on
> >>>>> Ubuntu 16.04 with latest kernel and latest e2fsprogs.
> >>>
> >>> I was able to reproduce it using both kvm-xfstests[1] and gce-xfstests[2].
> >>>
> >>> [1]
> >>> https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
> >>> [2] https://thunk.org/gce-xfstests
> >>>
> >>> It did turn out to be timing related, and it requires that a journal
> >>> commit take place after fsstress runs, but it can *not* be triggered
> >>> by a sync/fsync (as this would cause the delayed allocation writes to
> >>> be forced out to disk, and that makes the problem go away).
> >
> > Amir,
> >
> > I was actually looking into why only some machines can reproduce the
> > output from generic/456 and I found that I was a little unsure about
> > exactly how the test itself was operating. I'm not an expert on how
> > device mapper works, so I was wondering if you could help shed some
> > light on it.
> >
> > From what I've surmised from the test case, it replays the workload
> > you specified with fsx and then immediately inserts the dm_flakey
> > table and device in place of the hard disk *while requests are still
> > in flight* (see nolockfs flag in dmsetup which is used by dm_flakey
> > calls). The flakey device drops all requests sent to it. The flakey
> > device is then unmounted (to force all pending writes out), the dm
> > table changed once again back to the original backing block device,
> > and then mounted again.
> >
> > Is this summary of the test case correct? If it is, then perhaps the
> > reason why it is so hard for some people to reproduce the exact bug
> > output you saw is because there does not appear to be anything in the
> > test case that specifies the timing for when dm_flakey should begin
> > dropping requests. Since the flakey device could be inserted any time
> > after fsx completes, but writes are still pending, it seems like the
> > test creates a race condition between inserting dm_flakey and a kernel
> > thread waking up to persist writes. This race condition would likely
> > be affected by things like having > 1 core on the VM or hardware (so
> > that dm_flakey could be inserted at the same time a kernel thread on
> > another core is trying to persist writes) the test is running on,
> > system load, dirty writeback times, etc.
> >
> > Do you have any thoughts on the above?
>
> Sounds about right.
> Having more cores to the VM was my first guess as the reason.
> Note that all the bugs I found so far depended on some timing between
> processes or kernel threads, so crash monkey trying all io completion
> combinations would not have helped. Maybe it would help to find other
> bugs though.
>
> >
> >>>
> >>> I initially tried using xfs_io as a replacement for fsstress (since it
> >>> is more flexible and would allow me to more easily run experiments),
> >>> but it turns out xfs_io was too fast/efficient, and so using xfs_io to
> >>> execute the same system calls (verified by strace) would not replicate
> >>> the problem.
> >>>
> >>>>> Now you have a broken file system image and the exact set of operations
> >>>>> that led to it. That's should be a pretty big lead for investigation.
> >>>
> >>> It was indeed a big lead for investigation (thanks, Amir!), but it
> >>> still took me several hours before I was finally able to figure out
> >>> the problem.  The patch and the commit description should explain what
> >>> was going on.
> >>>
> >>> I'll leave it to Ashlie and Vijay to investigate how to improve Crash
> >>> Monkey so it can better find problems like this automatically.  Since
> >>> you now have a clear reproducer (you can use generic/456 and run it on
> >>> gce-xfstests, using is a standard cloud VM configuration) and an
> >>> explanation of the bug and the four-line fix, I suspect this might be
> >>> good grist for follow-on research after your Hot Storage '17 workshop
> >>> paper.  :-)
> >>>
> >>> Best regards,
> >>>
> >>>                                         - Ted
> >>>
> >>>
> >>> commit 3912e7b44cf77e9452d4d0cb6c1da9c7043bb7f1
> >>> Author: Theodore Ts'o<tytso@mit.edu>
> >>> Date:   Fri Oct 6 23:09:55 2017 -0400
> >>>
> >>>      ext4: fix interaction between i_size, fallocate, and delalloc after a
> >>> crash
> >>
> >> Hi Theodore,
> >>
> >> After applying your patch, generic/456 always passes on my system which just
> >> triggers the output[2].
> >> So i could believe this two different outputs[1][2] are triggered by
> >> different environments, but they
> >> are caused by the same bug which your patch fixes.  Is this right?
> >>
> >> [1] Inode 12, end of extent exceeds allowed value(logical block 33, physical
> >> block 33441, len 7)Clear? no
> >>       Inode 12, i_blocks is 184, should be 128. Fix? no
> >> [2] Inode 12, i_size is 147456, should be 163840. Fix? no
> >>
> >> Sorry, i am not familiar with ext4.
> >>
> >> Thanks,
> >> Xiao Yang
> >>>
> >>>
> >>>      If there are pending writes subject to delayed allocation, then
> >>> i_size
> >>>      will show size after the writes have completed, while i_disksize
> >>>      contains the value of i_size on the disk (since the writes have not
> >>>      been persisted to disk).
> >>>
> >>>      If fallocate(2) is called with the FALLOC_FL_KEEP_SIZE flag, either
> >>>      with or without the FALLOC_FL_ZERO_RANGE flag set, and the new size
> >>>      after the fallocate(2) is between i_size and i_disksize, then after a
> >>>      crash, if a journal commit has resulted in the changes made by the
> >>>      fallocate() call to be persisted after a crash, but the delayed
> >>>      allocation write has not resolved itself, i_size would not be
> >>> updated,
> >>>      and this would cause the following e2fsck complaint:
> >>>
> >>>      Inode 12, end of extent exceeds allowed value
> >>>              (logical block 33, physical block 33441, len 7)
> >>>
> >>>      This can only take place on a sparse file, where the fallocate(2)
> >>> call
> >>>      is allocating blocks in a range which is before a pending delayed
> >>>      allocation write which is extending i_size.  Since this situation is
> >>>      quite rare, and the window in which the crash must take place is
> >>>      typically<  30 seconds, in practice this condition will rarely
> >>> happen.
> >>>
> >>>      Nevertheless, it can be triggered in testing, and in particular by
> >>>      xfstests generic/456.
> >>>
> >>>      Signed-off-by: Theodore Ts'o<tytso@mit.edu>
> >>>      Reported-by: Amir Goldstein<amir73il@gmail.com>
> >>>      Cc: stable@vger.kernel.org
> >>>
> >>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >>> index 97f0fd06728d..07bca11749d4 100644
> >>> --- a/fs/ext4/extents.c
> >>> +++ b/fs/ext4/extents.c
> >>> @@ -4794,7 +4794,8 @@ static long ext4_zero_range(struct file *file,
> >>> loff_t offset,
> >>>         }
> >>>
> >>>         if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
> >>> -            offset + len>  i_size_read(inode)) {
> >>> +           (offset + len>  i_size_read(inode) ||
> >>> +            offset + len>  EXT4_I(inode)->i_disksize)) {
> >>>                 new_size = offset + len;
> >>>                 ret = inode_newsize_ok(inode, new_size);
> >>>                 if (ret)
> >>> @@ -4965,7 +4966,8 @@ long ext4_fallocate(struct file *file, int mode,
> >>> loff_t offset, loff_t len)
> >>>         }
> >>>
> >>>         if (!(mode&  FALLOC_FL_KEEP_SIZE)&&
> >>> -            offset + len>  i_size_read(inode)) {
> >>> +           (offset + len>  i_size_read(inode) ||
> >>> +            offset + len>  EXT4_I(inode)->i_disksize)) {
> >>>                 new_size = offset + len;
> >>>                 ret = inode_newsize_ok(inode, new_size);
> >>>                 if (ret)
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe fstests" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>>
> >>> .
> >>>
> >>
> >>
> >>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-16 19:32                       ` Ashlie Martinez
@ 2017-10-16 21:11                         ` Amir Goldstein
  2017-10-17  0:09                           ` Theodore Ts'o
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-10-16 21:11 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Xiao Yang, Theodore Tso, Eryu Guan, Josef Bacik, fstests, Ext4,
	Vijay Chidambaram

On Mon, Oct 16, 2017 at 10:32 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> Amir,
>
> I know this is a bit late, but I've spent some time working through
> the disk image that you provided (so that I could determine how/if I
> could modify CrashMonkey to catch errors like this) and I don't think
> I understand what state the disk image reflects.

The disk image SHOULD reflect a state on a disk after the power was
cut in the middle of mounted fs. Then power came back on, filesystem
was mounted, journal recovered, then filesystem was cleanly unmounted.
At this stage, I don't expect there should be anything interesting in the
journal.

> After digging around
> the journal of the disk image you provided, I found that the first 10
> journal blocks are used, with the journal superblock being placed in
> the very first block of the journal. The journal superblock says that
> the first journal transaction ID that should be in the journal is
> transaction ID 4. However, dumping the other journal blocks, I found
> that the next block is a descriptor block for transaction ID 2. The
> rest of the journal blocks are data blocks for that transaction plus a
> transaction commit block. This seems a little odd considering that the
> journal refers to a 4th transaction, which I have not been able to
> find (I quickly dumped the first 50 blocks in debugfs and found the
> rest to contain only zeros).
>

I did not spend time analyzing the image, so I'll take your word for it,
but I can't help you understand your findings.

> With this in mind, I looked back at the xfstests code for controlling
> the dm_flakey device. What I realized is the `nolockfs` flag is
> provided both when it switches from the real device to the flakey
> device that drops writes and when it switches from the flakey device
> back to the real device. I know there is a call to umount once the
> flakey device that drops writes is inserted, but do you think it is
> possible that the flakey device is swapped back to the real device
> before all the writes forced out by umount have made it to the flakey
> device?

I believe umount call should be blocked until all writes have been flushed
out to flakey device.

> Unfortunately I still don't have a local machine that is
> capable of reproducing your test results and I have not made any gce
> test appliance images to test this yet, so I'm not sure if this is a
> valid theory.
>

Ted explained that the bug related to very specific timing of flusher
thread vs. fallocate thread.
I was under the impression that CrashMonkey can only reorder writes
between recorded FLUSH requests, so I am not really sure how you intent to
modify CrashMonkey to catch this bug.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-16 21:11                         ` Amir Goldstein
@ 2017-10-17  0:09                           ` Theodore Ts'o
  2017-10-17  1:02                             ` Vijay Chidambaram
       [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
  0 siblings, 2 replies; 31+ messages in thread
From: Theodore Ts'o @ 2017-10-17  0:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Ashlie Martinez, Xiao Yang, Eryu Guan, Josef Bacik, fstests,
	Ext4, Vijay Chidambaram

On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote:
> 
> The disk image SHOULD reflect a state on a disk after the power was
> cut in the middle of mounted fs. Then power came back on, filesystem
> was mounted, journal recovered, then filesystem was cleanly unmounted.
> At this stage, I don't expect there should be anything interesting in the
> journal.

I suspect what Ashlie was hoping for was a file system image *before*
the file system was remounted and the journal replayed (and then
truncated).  That would allow for an analysis of image right after the
simulated power cut, so it could be seen what was in the journal.

The only way to get that is to modify the test so that it aborts
before the file system is remounted.  I did some investigations where
I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the
file system was remounted to gather debugging information.  That's how
I tracked down the problem.  Unfortunately I never bothered to grab
full file system snapshot, so I can't give Ashlie what she's hoping
for.

> I believe umount call should be blocked until all writes have been flushed
> out to flakey device.

That is correct.

> Ted explained that the bug related to very specific timing of flusher
> thread vs. fallocate thread.
> I was under the impression that CrashMonkey can only reorder writes
> between recorded FLUSH requests, so I am not really sure how you intent to
> modify CrashMonkey to catch this bug.

The real issue is that what CrashMonkey is testing is given a test
trace with N CACHE FLUSH operations, given a random X such that:

       1 <= X < N

If of the writes before the Xth CACHE FLUSH are completed, and a
random set of writes between the Xth and (X+1)th CACHE FLUSH are
completed, is the file system still consistent after a journal replay.

That's a fine thing to test, although you can probably do that more
efficiently by simply looking at all of the metadata writes between
the Xth and X+1th CACHE FLUSH.  Those writes must be effective no-ops
after the journal is replayed up to the Xth cache flush.  Which is to
say, the writes must either (a) be to a data block, or (b) the
contents of the writes must match either (a) the most recent journal
entry for that block (up to the Xth cache flush), or (b) the current
state of the disk.

So if you are willing to assume knowledge of what is stored in the
journal and how ext4 works, it should be possible to implement
CrashMonkey much more effectively.


The problem that this bug exposed is different sort of problem.  To
find this bug, given the I/O stream, you can simply examine the file
system state after each journal commit.  (e.g., after each CACHE
FLUSH).  And just make sure the file system state is consistent.
There is no need to include some random set of writes from the last
commit epoch.

The sort of searching of the test space new CrashMonkey' would have to
test can't be done just by looking at the io traces.  Instead, a
workload consists of a series of micro-transactions (jbd2 handles)
which are assigned to a set of journal transactions.  Normally, which
handles get assigned to a given transaction is based on timing (we
close a transaction every 5 seconds), or based on the size of the
transaction (we limit the number of blocks in a transaction), or based
file system operations --- e.g., a fsync() will cause a transaction to
close.

This CrashMoney' would have to explore a different set of transaction
boundaries (e.g., which handles are assigned to a transaction before a
transaction closes), and whether the file system is consistent at each
transaction boundary given a each possible assignment of handles to
transactions.

It's doable, but it would have to be done by logging the data passed
to the jbd2 logging layer, and checking file system consistency at
each handle boundary.

							- Ted

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-17  0:09                           ` Theodore Ts'o
@ 2017-10-17  1:02                             ` Vijay Chidambaram
       [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
  1 sibling, 0 replies; 31+ messages in thread
From: Vijay Chidambaram @ 2017-10-17  1:02 UTC (permalink / raw)
  To: fstests, Ext4

Hi Ted,

This was what we were suspecting as well. If the bug being exposed or
not only depends on the number of transactions in the journal before
the test workload starts, that’s something we could easily capture in
CrashMonkey. It would be one more parameter to tweak, and we would
potentially have to get multiple traces of the test workload with the
journal being X percent full.

It does expand our already-large search space, but our first order of
business is making sure CrashMonkey can reproduce every
crash-consistency bug reported in recent times (mostly by Amir :) ).

So for now we were just analyzing the bug and trying to understand it,
but it looks like the post-recovery image is not very useful for this.
We do want CrashMonkey to remain file-system agnostic.

(resending since earlier email failed)

Thanks,
Vijay

On Mon, Oct 16, 2017 at 7:09 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote:
>>
>> The disk image SHOULD reflect a state on a disk after the power was
>> cut in the middle of mounted fs. Then power came back on, filesystem
>> was mounted, journal recovered, then filesystem was cleanly unmounted.
>> At this stage, I don't expect there should be anything interesting in the
>> journal.
>
> I suspect what Ashlie was hoping for was a file system image *before*
> the file system was remounted and the journal replayed (and then
> truncated).  That would allow for an analysis of image right after the
> simulated power cut, so it could be seen what was in the journal.
>
> The only way to get that is to modify the test so that it aborts
> before the file system is remounted.  I did some investigations where
> I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the
> file system was remounted to gather debugging information.  That's how
> I tracked down the problem.  Unfortunately I never bothered to grab
> full file system snapshot, so I can't give Ashlie what she's hoping
> for.
>
>> I believe umount call should be blocked until all writes have been flushed
>> out to flakey device.
>
> That is correct.
>
>> Ted explained that the bug related to very specific timing of flusher
>> thread vs. fallocate thread.
>> I was under the impression that CrashMonkey can only reorder writes
>> between recorded FLUSH requests, so I am not really sure how you intent to
>> modify CrashMonkey to catch this bug.
>
> The real issue is that what CrashMonkey is testing is given a test
> trace with N CACHE FLUSH operations, given a random X such that:
>
>        1 <= X < N
>
> If of the writes before the Xth CACHE FLUSH are completed, and a
> random set of writes between the Xth and (X+1)th CACHE FLUSH are
> completed, is the file system still consistent after a journal replay.
>
> That's a fine thing to test, although you can probably do that more
> efficiently by simply looking at all of the metadata writes between
> the Xth and X+1th CACHE FLUSH.  Those writes must be effective no-ops
> after the journal is replayed up to the Xth cache flush.  Which is to
> say, the writes must either (a) be to a data block, or (b) the
> contents of the writes must match either (a) the most recent journal
> entry for that block (up to the Xth cache flush), or (b) the current
> state of the disk.
>
> So if you are willing to assume knowledge of what is stored in the
> journal and how ext4 works, it should be possible to implement
> CrashMonkey much more effectively.
>
>
> The problem that this bug exposed is different sort of problem.  To
> find this bug, given the I/O stream, you can simply examine the file
> system state after each journal commit.  (e.g., after each CACHE
> FLUSH).  And just make sure the file system state is consistent.
> There is no need to include some random set of writes from the last
> commit epoch.
>
> The sort of searching of the test space new CrashMonkey' would have to
> test can't be done just by looking at the io traces.  Instead, a
> workload consists of a series of micro-transactions (jbd2 handles)
> which are assigned to a set of journal transactions.  Normally, which
> handles get assigned to a given transaction is based on timing (we
> close a transaction every 5 seconds), or based on the size of the
> transaction (we limit the number of blocks in a transaction), or based
> file system operations --- e.g., a fsync() will cause a transaction to
> close.
>
> This CrashMoney' would have to explore a different set of transaction
> boundaries (e.g., which handles are assigned to a transaction before a
> transaction closes), and whether the file system is consistent at each
> transaction boundary given a each possible assignment of handles to
> transactions.
>
> It's doable, but it would have to be done by logging the data passed
> to the jbd2 logging layer, and checking file system consistency at
> each handle boundary.
>
>                                                         - Ted

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
       [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
@ 2017-10-17  7:15                               ` Amir Goldstein
  2017-10-17 14:41                               ` Theodore Ts'o
  1 sibling, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2017-10-17  7:15 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Theodore Ts'o, Ashlie Martinez, Eryu Guan, Ext4, Josef Bacik,
	Xiao Yang, fstests

[-- Attachment #1: Type: text/plain, Size: 4701 bytes --]

On Tue, Oct 17, 2017 at 3:43 AM, Vijay Chidambaram <vvijay03@gmail.com> wrote:
> Hi Ted,
>
> This was what we were suspecting as well. If the bug being exposed or not
> only depends on the number of transactions in the journal before the test
> workload starts, that’s something we could easily capture in CrashMonkey. It
> would be one more parameter to tweak, and we would potentially have to get
> multiple traces of the test workload with the journal being X percent full.
>
> It does expand our already-large search space, but our first order of
> business is making sure CrashMonkey can reproduce every crash-consistency
> bug reported in recent times (mostly by Amir :) ). So for now we were just
> analyzing the bug and trying to understand it, but it looks like the
> post-recovery image is not very useful for this.
>
> On Mon, Oct 16, 2017 at 7:09 PM Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote:
>> >
>> > The disk image SHOULD reflect a state on a disk after the power was
>> > cut in the middle of mounted fs. Then power came back on, filesystem
>> > was mounted, journal recovered, then filesystem was cleanly unmounted.
>> > At this stage, I don't expect there should be anything interesting in
>> > the
>> > journal.
>>
>> I suspect what Ashlie was hoping for was a file system image *before*
>> the file system was remounted and the journal replayed (and then
>> truncated).  That would allow for an analysis of image right after the
>> simulated power cut, so it could be seen what was in the journal.
>>
>> The only way to get that is to modify the test so that it aborts
>> before the file system is remounted.  I did some investigations where
>> I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the
>> file system was remounted to gather debugging information.  That's how
>> I tracked down the problem.  Unfortunately I never bothered to grab
>> full file system snapshot, so I can't give Ashlie what she's hoping
>> for.
>>

I can provide the desired file system snapshot after crash, but I will do more
than that. I will provide an io recording of the test and will explain how to
how to replay the recording to any given point before the crash.

To produce the recording, I copied this hunk from test 456:
fsxops=$tmp.fsxops
cat <<EOF > $fsxops
write 0x137dd 0xdc69 0x0
fallocate 0xb531 0xb5ad 0x21446
collapse_range 0x1c000 0x4000 0x21446
write 0x3e5ec 0x1a14 0x21446
zero_range 0x20fac 0x6d9c 0x40000 keep_size
mapwrite 0x216ad 0x274f 0x40000
EOF
run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile

to test 455 before line NUM_FILES=4 and changed the line to
NUM_FILES=0 to suppress the randomized fsx runs.

Now test 455 fails and leaves us a recording of all io on the
$LOGWRITES_DEV device.

As we can see in 455.full, the consistency check fails on "pre umount"
checkpoint, which test 455 has marked in the log device as mark "last":

  checking pre umount ("_log_writes_replay_log last" in the test)
  _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is
inconsistent

To export the recording from my log device I ran:
# ./src/log-writes/replay-log --log $LOGWRITES_DEV --find --end-mark last
17098@153852

Meaning that the "last" mark is record #17098 at sector offset 153852
on the log device, so I can export content of log up to and including
this offset:
# dd if=$LOGWRITES_DEV of=ext4-crash.log bs=512 count=153853

The exported log is 76M compressed to 34K and attached to this message.

To replay this log on your device and reproduce the fsck error, you need
a $SCRATCH_DEV that is at least 10G and run:

# bunzip2 ext4-crash.log.bz2
# ./src/log-writes/replay-log --log ext4-crash.log --replay
$SCRATCH_DEV --end-mark last

Now you can investigate state of file system right after the crash.
More importantly, you can use the --check [<number>|flush|fua] command line
options to replay-log to run an arbitrary check utility on replay device at any
numbers of checkpoints before the crash.
You can use --limit and --start-entry to fast forward to a given point
in recording
and replay with consistency checks from that point on, and so forth.

Finally, see that replaying the recording to "last" reproduces the problem:
# mount $SCRATCH_DEV $SCRATCH_MNT
# umount $SCRATCH_MNT
# e2fsck -nf $SCRATCH_DEV

Hope this helps.
I guess I could have saved Ted some work had I provided this guide
and log sooner, but I thought the bug may be trivial enough to understand
with just the dm-flakey reproducer -  I was wrong about that.

Cheers,
Amir.

[-- Attachment #2: ext4-crash.log.bz2 --]
[-- Type: application/x-bzip2, Size: 34309 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
       [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
  2017-10-17  7:15                               ` Amir Goldstein
@ 2017-10-17 14:41                               ` Theodore Ts'o
  2017-10-17 23:16                                 ` Vijay Chidambaram
  1 sibling, 1 reply; 31+ messages in thread
From: Theodore Ts'o @ 2017-10-17 14:41 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Amir Goldstein, Ashlie Martinez, Eryu Guan, Ext4, Josef Bacik,
	Xiao Yang, fstests

On Tue, Oct 17, 2017 at 12:43:20AM +0000, Vijay Chidambaram wrote:
> It does expand our already-large search space, but our first order of
> business is making sure CrashMonkey can reproduce every crash-consistency
> bug reported in recent times (mostly by Amir :) ). So for now we were just
> analyzing the bug and trying to understand it, but it looks like the
> post-recovery image is not very useful for this.

Right, the post-recovery (after the journal replayed) is not very
useful.  Unfortunately, the pre-recovery (after the power cut, but
before the journal replay) I suspect won't be terribly interesting
either.  It will show that the corruption is baked into the journal
--- which is to say, the problem wasn't in whether the calls to the
jbd2 layer were correct --- but rather, that one of the file system
mutations in a specific jbd2 handle's "micro-transaction" left the
file system is an inconsistent state.

Not a terrible inconsistency, and it would be quickly papered over in
a follow-up handle --- but one where if the handle which left the file
system in an inconsistent state, and the handle which cleaned it up
were in different transactions, and the power cut happened after the
first transaction, the file system be left in a state where e2fsck
would complain.

So if you have the I/O trace where the handles in question were
assigned to the right (wrong) set of transactions, then yes, you'll
see the problem, just as the xfstest will see the problem.

But if you want to improve the CrashMonkey's search of the problem
space, it will require higher-level logging, because this is really a
different sort of bug.  CrashMonkey will find (a) bugs in jbd2, and
(b) bugs in how the jbd2 layer is called.  This bug is really a bug in
ext4 implementation, because it is in *how* the file system was
mutated that temporarily left it in an inconsistent state, and that's
a different thing from (a) or (b).  Which is great --- it's arguably
additional research work that can be segregated into a different
"Minimum Publishable Unit".  :-)

					- Ted

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  2017-10-17 14:41                               ` Theodore Ts'o
@ 2017-10-17 23:16                                 ` Vijay Chidambaram
  0 siblings, 0 replies; 31+ messages in thread
From: Vijay Chidambaram @ 2017-10-17 23:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, Ashlie Martinez, Eryu Guan, Ext4, Josef Bacik,
	Xiao Yang, fstests

Amir, thank you for providing io recording, this is really useful!
Most of our work in recent weeks has driven by your input and your
posts to the mailing list!

Ted, I agree with your characterization of the bug. Its true that
CrashMonkey will be missing context for certain bugs, and for those
kinds of bugs, we will probably have a different project :)

But for now, we are trying to widen the scope of CrashMonkey as much
as we can. I think for the current bug, we can handle it in the
following way:
1. run CrashMonkey with the smallest journal possible
2. in the setup phase of CrashMonkey, do N random metadata operations
(that will fill up some percentage of the journal space)
3. run the test workload and collect the IO trace

We will do steps 2 and 3 with different N. For this bug, there must be
a particular N that will force the micro-transaction to be at the end
of the journal tx, revealing the bug.

In general, it might be good for CrashMonkey to force the metadata
from the test workload to be distributed across different journal
transactions. The trick is to do it without modifying the file system
itself.

Thanks,
Vijay

On Tue, Oct 17, 2017 at 9:41 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Oct 17, 2017 at 12:43:20AM +0000, Vijay Chidambaram wrote:
>> It does expand our already-large search space, but our first order of
>> business is making sure CrashMonkey can reproduce every crash-consistency
>> bug reported in recent times (mostly by Amir :) ). So for now we were just
>> analyzing the bug and trying to understand it, but it looks like the
>> post-recovery image is not very useful for this.
>
> Right, the post-recovery (after the journal replayed) is not very
> useful.  Unfortunately, the pre-recovery (after the power cut, but
> before the journal replay) I suspect won't be terribly interesting
> either.  It will show that the corruption is baked into the journal
> --- which is to say, the problem wasn't in whether the calls to the
> jbd2 layer were correct --- but rather, that one of the file system
> mutations in a specific jbd2 handle's "micro-transaction" left the
> file system is an inconsistent state.
>
> Not a terrible inconsistency, and it would be quickly papered over in
> a follow-up handle --- but one where if the handle which left the file
> system in an inconsistent state, and the handle which cleaned it up
> were in different transactions, and the power cut happened after the
> first transaction, the file system be left in a state where e2fsck
> would complain.
>
> So if you have the I/O trace where the handles in question were
> assigned to the right (wrong) set of transactions, then yes, you'll
> see the problem, just as the xfstest will see the problem.
>
> But if you want to improve the CrashMonkey's search of the problem
> space, it will require higher-level logging, because this is really a
> different sort of bug.  CrashMonkey will find (a) bugs in jbd2, and
> (b) bugs in how the jbd2 layer is called.  This bug is really a bug in
> ext4 implementation, because it is in *how* the file system was
> mutated that temporarily left it in an inconsistent state, and that's
> a different thing from (a) or (b).  Which is great --- it's arguably
> additional research work that can be segregated into a different
> "Minimum Publishable Unit".  :-)
>
>                                         - Ted

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-09-01 12:21     ` Ashlie Martinez
@ 2017-09-01 14:59       ` Amir Goldstein
  0 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2017-09-01 14:59 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Eryu Guan, Josef Bacik, Vijay Chidambaram, fstests, Ext4, Theodore Tso

On Fri, Sep 1, 2017 at 3:21 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> Apologies for spam, resending plain text so it appears in the archives.
>
>
...
>>>> When free data blocks and inode errors occur, the message is `Free blocks
>>>> count wrong (8795, counted=8714).` and `Free inodes count wrong (2549,
>>>> counted=2546).`
>>>>
>>>> I have not had a chance to look into the above errors to find their root
>>>> causes.
>>>>
>>>
>>> I believe this is what you get when you fsck -yf before trying to mount when
>>> the orphan list is not empty. You should avoid doing that.
>
> Do you know what the kernel does when a file system like that is mounted? The
> link you posted below has a comment about the kernel checking the file system
> faster than fsck can. Do you know how it accomplishes this? Should the
> mount -o errors=remount-ro; unmount sequence be done for all file systems or
> just a select few for which the kernel knows how to handle? Does following the
> mount/unmount, check sequence just mean that the kernel will silently (from the
> perspective of normal users) fix some issues with the file system much like
> fsck does when it is run (though perhaps less thoroughly than fsck)?
>

mount/umount does several things.
among other things, its starts with replaying the journal.
with journalling file systems this is essential to the point that it is really
not safe to fsck before replaying the journal. So e2fsck -y will in fact
replay the journal, but it will do a much worse job (much slower) the
the kernel does when replaying the journal (too hard to explain why).
BTW e2fsck -n will not replay the journal so it may report phantom
errors or refuse to run?

HOWEVER, the comment I referred you to in Android code refers
to the efficiency of the kernel with cleaning up the "orphan inode list".
If you open N files, unlink them sync and crash the kernel will cleanup
the disk space used by those N files on next mount.
e2fsck will "zap" those files and then fix all the inode and block counter
and report they were wrong, like you reported above.

Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-08-31  4:06   ` Amir Goldstein
@ 2017-09-01 12:21     ` Ashlie Martinez
  2017-09-01 14:59       ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-09-01 12:21 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Eryu Guan, Josef Bacik, Vijay Chidambaram, fstests, Ext4, Theodore Tso

Apologies for spam, resending plain text so it appears in the archives.


On Wed, Aug 30, 2017 at 11:06 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> [now really CC Ted]
>
> On Thu, Aug 31, 2017 at 7:05 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Thu, Aug 31, 2017 at 4:28 AM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
>>> Amir,
>>>
>>> I have been working on CrashMonkey more and I have jerry-rigged together a
>>> test in CrashMonkey that calls into `fsx` with the minimal test case you
>>> made. I am able to reproduce the ext4 error that you found along with a few
>>> other potential errors.
>>>
>>> A quick point, I run fsck with `-yf` instead of `-nf` that xfstests runs
>>> with. The reason for this is that CrashMonkey would like to report on
>>> fixable and unfixable errors in the future.
>>>
>>
>> That makes sense, but keep in mind that 'fixable' error may still loose data
>> when fixing them with -y. Perhaps you should consider running fsck is auto
>> fixing mode (i.e. e2fsck -p) when available, to classify errors as
>> 'safely fixable'
>> I believe the error these test encountered are 'safely fixable', but
>> didn't check.
>>
>>> Running the ported test case, I find that CrashMonkey encounters the
>>> following errors:
>>> 1. Incorrect inode size and incorrect free data block and inode counts
>>> (fixable)
>>> 2. incorrect free data block and inode counts (fixable)
>>> 3. `Superblock needs_recovery flag is clear, but journal has data` notice
>>> along with errors present in case 1
>>> 4. `Superblock needs_recovery flag is clear, but journal has data` notice
>>> with no other errors
>>>
>>> For the incorrect i_size errors, I get the output `Inode 12, i_size is
>>> 147456, should be 163840.` which I can also reproduce with your 501 xfstests
>>> test case.
>>>
>>> When free data blocks and inode errors occur, the message is `Free blocks
>>> count wrong (8795, counted=8714).` and `Free inodes count wrong (2549,
>>> counted=2546).`
>>>
>>> I have not had a chance to look into the above errors to find their root
>>> causes.
>>>
>>
>> I believe this is what you get when you fsck -yf before trying to mount when
>> the orphan list is not empty. You should avoid doing that.

Do you know what the kernel does when a file system like that is mounted? The
link you posted below has a comment about the kernel checking the file system
faster than fsck can. Do you know how it accomplishes this? Should the
mount -o errors=remount-ro; unmount sequence be done for all file systems or
just a select few for which the kernel knows how to handle? Does following the
mount/unmount, check sequence just mean that the kernel will silently (from the
perspective of normal users) fix some issues with the file system much like
fsck does when it is run (though perhaps less thoroughly than fsck)?

>>
>> See what the greatest ext4 crash test experiment of them all is doing
>> and read the comment to understand why:
>> https://android.googlesource.com/platform/system/core/+/marshmallow-mr1-dev/fs_mgr/fs_mgr.c#96
>> 1. mount -o  errors=remount-ro; umount
>> 2. e2fsck -y
>>
>> So upstream Android never runs e2fsck -f. It will only check fs if kernel marked
>> that fs has errors.
>> Although Cyanogenmod did add -f and I imagine that many vendors do as well.
>>
>> As one who hacked and crashed a lot of Android devices, I can attest that I have
>> observed both data loss and corrupted (non booting) fs, but the rest
>> of the 2 billion
>> crash test monkeys don't seem to be bothered ;-)
>>
>>> In total, CrashMonkey ran 1000 different tests. Of those, 344 passed without
>>> fsck complaining. The remaining 656 tests saw fsck complain about something.
>>> All of these tests consisted of unique sequences of bios, but may contain
>>> equivalent crash states.
>>>
>>> The larger range of test results is due to the fact that CrashMonkey runs
>>> many tests from just the single workload you made. These tests consist of
>>> replaying some number of bio write operations, so it tests states different
>>> than you 500 xfstest which I believe only replays to sync operations (i.e.
>>> it never stops replay before a recorded fsync).
>>
>> That is correct. test 500 (temporary name) is mostly focused on checking
>> data consistency of files after fsync. detecting metadata consistency errors
>> is a by product. I do intend to add more tests focused on metadata consistency.
>> Josef already wrote an fsstress script that should be converted to an xfstest
>> which replays the log to every FUA and fsck.
>>
>>>
>>> If you're interested, you can find the CrashMonkey code (and branch) at
>>> https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug. If you
>>> would like to run it, you should clone and build you xfstest in your home
>>> directory so that the jerry-rigged CrashMonkey test case can find it.
>>> Directions for running this test case in CrashMonkey should be at the top of
>>> the README.
>>
>> You seem to have misspelled 'fsx' in README and in the code as 'xfs'.
>> Funny, I always mistype it as 'sfx' :)

:) thanks for pointing that out, it's hard enough to spell it properly
half the time since
the test suite it's in is xfstests. Thank you for all the other
feedback and links too.

>>
>> Cheers,
>> Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-08-31  4:05 ` Amir Goldstein
@ 2017-08-31  4:06   ` Amir Goldstein
  2017-09-01 12:21     ` Ashlie Martinez
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-08-31  4:06 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: Eryu Guan, Josef Bacik, Vijay Chidambaram, fstests, Ext4, Theodore Tso

[now really CC Ted]

On Thu, Aug 31, 2017 at 7:05 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 4:28 AM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
>> Amir,
>>
>> I have been working on CrashMonkey more and I have jerry-rigged together a
>> test in CrashMonkey that calls into `fsx` with the minimal test case you
>> made. I am able to reproduce the ext4 error that you found along with a few
>> other potential errors.
>>
>> A quick point, I run fsck with `-yf` instead of `-nf` that xfstests runs
>> with. The reason for this is that CrashMonkey would like to report on
>> fixable and unfixable errors in the future.
>>
>
> That makes sense, but keep in mind that 'fixable' error may still loose data
> when fixing them with -y. Perhaps you should consider running fsck is auto
> fixing mode (i.e. e2fsck -p) when available, to classify errors as
> 'safely fixable'
> I believe the error these test encountered are 'safely fixable', but
> didn't check.
>
>> Running the ported test case, I find that CrashMonkey encounters the
>> following errors:
>> 1. Incorrect inode size and incorrect free data block and inode counts
>> (fixable)
>> 2. incorrect free data block and inode counts (fixable)
>> 3. `Superblock needs_recovery flag is clear, but journal has data` notice
>> along with errors present in case 1
>> 4. `Superblock needs_recovery flag is clear, but journal has data` notice
>> with no other errors
>>
>> For the incorrect i_size errors, I get the output `Inode 12, i_size is
>> 147456, should be 163840.` which I can also reproduce with your 501 xfstests
>> test case.
>>
>> When free data blocks and inode errors occur, the message is `Free blocks
>> count wrong (8795, counted=8714).` and `Free inodes count wrong (2549,
>> counted=2546).`
>>
>> I have not had a chance to look into the above errors to find their root
>> causes.
>>
>
> I believe this is what you get when you fsck -yf before trying to mount when
> the orphan list is not empty. You should avoid doing that.
>
> See what the greatest ext4 crash test experiment of them all is doing
> and read the comment to understand why:
> https://android.googlesource.com/platform/system/core/+/marshmallow-mr1-dev/fs_mgr/fs_mgr.c#96
> 1. mount -o  errors=remount-ro; umount
> 2. e2fsck -y
>
> So upstream Android never runs e2fsck -f. It will only check fs if kernel marked
> that fs has errors.
> Although Cyanogenmod did add -f and I imagine that many vendors do as well.
>
> As one who hacked and crashed a lot of Android devices, I can attest that I have
> observed both data loss and corrupted (non booting) fs, but the rest
> of the 2 billion
> crash test monkeys don't seem to be bothered ;-)
>
>> In total, CrashMonkey ran 1000 different tests. Of those, 344 passed without
>> fsck complaining. The remaining 656 tests saw fsck complain about something.
>> All of these tests consisted of unique sequences of bios, but may contain
>> equivalent crash states.
>>
>> The larger range of test results is due to the fact that CrashMonkey runs
>> many tests from just the single workload you made. These tests consist of
>> replaying some number of bio write operations, so it tests states different
>> than you 500 xfstest which I believe only replays to sync operations (i.e.
>> it never stops replay before a recorded fsync).
>
> That is correct. test 500 (temporary name) is mostly focused on checking
> data consistency of files after fsync. detecting metadata consistency errors
> is a by product. I do intend to add more tests focused on metadata consistency.
> Josef already wrote an fsstress script that should be converted to an xfstest
> which replays the log to every FUA and fsck.
>
>>
>> If you're interested, you can find the CrashMonkey code (and branch) at
>> https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug. If you
>> would like to run it, you should clone and build you xfstest in your home
>> directory so that the jerry-rigged CrashMonkey test case can find it.
>> Directions for running this test case in CrashMonkey should be at the top of
>> the README.
>
> You seem to have misspelled 'fsx' in README and in the code as 'xfs'.
> Funny, I always mistype it as 'sfx' :)
>
> Cheers,
> Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
  2017-08-31  1:28 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Ashlie Martinez
@ 2017-08-31  4:05 ` Amir Goldstein
  2017-08-31  4:06   ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2017-08-31  4:05 UTC (permalink / raw)
  To: Ashlie Martinez
  Cc: tytso, Eryu Guan, Josef Bacik, Vijay Chidambaram, fstests, Ext4

On Thu, Aug 31, 2017 at 4:28 AM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> Amir,
>
> I have been working on CrashMonkey more and I have jerry-rigged together a
> test in CrashMonkey that calls into `fsx` with the minimal test case you
> made. I am able to reproduce the ext4 error that you found along with a few
> other potential errors.
>
> A quick point, I run fsck with `-yf` instead of `-nf` that xfstests runs
> with. The reason for this is that CrashMonkey would like to report on
> fixable and unfixable errors in the future.
>

That makes sense, but keep in mind that 'fixable' error may still loose data
when fixing them with -y. Perhaps you should consider running fsck is auto
fixing mode (i.e. e2fsck -p) when available, to classify errors as
'safely fixable'
I believe the error these test encountered are 'safely fixable', but
didn't check.

> Running the ported test case, I find that CrashMonkey encounters the
> following errors:
> 1. Incorrect inode size and incorrect free data block and inode counts
> (fixable)
> 2. incorrect free data block and inode counts (fixable)
> 3. `Superblock needs_recovery flag is clear, but journal has data` notice
> along with errors present in case 1
> 4. `Superblock needs_recovery flag is clear, but journal has data` notice
> with no other errors
>
> For the incorrect i_size errors, I get the output `Inode 12, i_size is
> 147456, should be 163840.` which I can also reproduce with your 501 xfstests
> test case.
>
> When free data blocks and inode errors occur, the message is `Free blocks
> count wrong (8795, counted=8714).` and `Free inodes count wrong (2549,
> counted=2546).`
>
> I have not had a chance to look into the above errors to find their root
> causes.
>

I believe this is what you get when you fsck -yf before trying to mount when
the orphan list is not empty. You should avoid doing that.

See what the greatest ext4 crash test experiment of them all is doing
and read the comment to understand why:
https://android.googlesource.com/platform/system/core/+/marshmallow-mr1-dev/fs_mgr/fs_mgr.c#96
1. mount -o  errors=remount-ro; umount
2. e2fsck -y

So upstream Android never runs e2fsck -f. It will only check fs if kernel marked
that fs has errors.
Although Cyanogenmod did add -f and I imagine that many vendors do as well.

As one who hacked and crashed a lot of Android devices, I can attest that I have
observed both data loss and corrupted (non booting) fs, but the rest
of the 2 billion
crash test monkeys don't seem to be bothered ;-)

> In total, CrashMonkey ran 1000 different tests. Of those, 344 passed without
> fsck complaining. The remaining 656 tests saw fsck complain about something.
> All of these tests consisted of unique sequences of bios, but may contain
> equivalent crash states.
>
> The larger range of test results is due to the fact that CrashMonkey runs
> many tests from just the single workload you made. These tests consist of
> replaying some number of bio write operations, so it tests states different
> than you 500 xfstest which I believe only replays to sync operations (i.e.
> it never stops replay before a recorded fsync).

That is correct. test 500 (temporary name) is mostly focused on checking
data consistency of files after fsync. detecting metadata consistency errors
is a by product. I do intend to add more tests focused on metadata consistency.
Josef already wrote an fsstress script that should be converted to an xfstest
which replays the log to every FUA and fsck.

>
> If you're interested, you can find the CrashMonkey code (and branch) at
> https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug. If you
> would like to run it, you should clone and build you xfstest in your home
> directory so that the jerry-rigged CrashMonkey test case can find it.
> Directions for running this test case in CrashMonkey should be at the top of
> the README.

You seem to have misspelled 'fsx' in README and in the code as 'xfs'.
Funny, I always mistype it as 'sfx' :)

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
@ 2017-08-31  1:28 Ashlie Martinez
  2017-08-31  4:05 ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Ashlie Martinez @ 2017-08-31  1:28 UTC (permalink / raw)
  To: amir73il; +Cc: tytso, eguan, jbacik, vvijay03, fstests, linux-ext4

Amir,

I have been working on CrashMonkey more and I have jerry-rigged together 
a test in CrashMonkey that calls into `fsx` with the minimal test case 
you made. I am able to reproduce the ext4 error that you found along 
with a few other potential errors.

A quick point, I run fsck with `-yf` instead of `-nf` that xfstests runs 
with. The reason for this is that CrashMonkey would like to report on 
fixable and unfixable errors in the future.

Running the ported test case, I find that CrashMonkey encounters the 
following errors:
1. Incorrect inode size and incorrect free data block and inode counts 
(fixable)
2. incorrect free data block and inode counts (fixable)
3. `Superblock needs_recovery flag is clear, but journal has data` 
notice along with errors present in case 1
4. `Superblock needs_recovery flag is clear, but journal has data` 
notice with no other errors

For the incorrect i_size errors, I get the output `Inode 12, i_size is 
147456, should be 163840.` which I can also reproduce with your 501 
xfstests test case.

When free data blocks and inode errors occur, the message is `Free 
blocks count wrong (8795, counted=8714).` and `Free inodes count wrong 
(2549, counted=2546).`

I have not had a chance to look into the above errors to find their root 
causes.

In total, CrashMonkey ran 1000 different tests. Of those, 344 passed 
without fsck complaining. The remaining 656 tests saw fsck complain 
about something. All of these tests consisted of unique sequences of 
bios, but may contain equivalent crash states.

The larger range of test results is due to the fact that CrashMonkey 
runs many tests from just the single workload you made. These tests 
consist of replaying some number of bio write operations, so it tests 
states different than you 500 xfstest which I believe only replays to 
sync operations (i.e. it never stops replay before a recorded fsync).

If you're interested, you can find the CrashMonkey code (and branch) at 
https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug. If you 
would like to run it, you should clone and build you xfstest in your 
home directory so that the jerry-rigged CrashMonkey test case can find 
it. Directions for running this test case in CrashMonkey should be at 
the top of the README.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-10-17 23:17 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-27 10:44 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Amir Goldstein
2017-09-25  9:49 ` Xiao Yang
2017-09-25  9:49   ` Xiao Yang
2017-09-25 10:53   ` Amir Goldstein
2017-09-26 10:45     ` Xiao Yang
2017-09-26 11:48       ` Amir Goldstein
2017-09-30 14:15     ` Ashlie Martinez
2017-10-05  7:27       ` Xiao Yang
2017-10-05 15:04         ` Ashlie Martinez
2017-10-05 19:10           ` Amir Goldstein
2017-10-06  0:34             ` Ashlie Martinez
2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
2017-10-07  5:54                 ` Amir Goldstein
2017-10-07 18:32                   ` Theodore Ts'o
2017-10-09  0:37                 ` Ashlie Martinez
2017-10-11 11:11                 ` Xiao Yang
2017-10-11 13:17                   ` Ashlie Martinez
2017-10-11 13:34                     ` Amir Goldstein
2017-10-16 19:32                       ` Ashlie Martinez
2017-10-16 21:11                         ` Amir Goldstein
2017-10-17  0:09                           ` Theodore Ts'o
2017-10-17  1:02                             ` Vijay Chidambaram
     [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
2017-10-17  7:15                               ` Amir Goldstein
2017-10-17 14:41                               ` Theodore Ts'o
2017-10-17 23:16                                 ` Vijay Chidambaram
2017-10-12 14:38                 ` Jan Kara
2017-08-31  1:28 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Ashlie Martinez
2017-08-31  4:05 ` Amir Goldstein
2017-08-31  4:06   ` Amir Goldstein
2017-09-01 12:21     ` Ashlie Martinez
2017-09-01 14:59       ` Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.