All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ashlie Martinez <ashmrtn@utexas.edu>
To: Xiao Yang <yangx.jy@cn.fujitsu.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
	Theodore Ts'o <tytso@mit.edu>, Eryu Guan <eguan@redhat.com>,
	Josef Bacik <jbacik@fb.com>, fstests <fstests@vger.kernel.org>,
	Ext4 <linux-ext4@vger.kernel.org>,
	Vijay Chidambaram <vvijay03@gmail.com>
Subject: Re: [RFC][PATCH] fstest: regression test for ext4 crash consistency bug
Date: Thu, 5 Oct 2017 10:04:23 -0500	[thread overview]
Message-ID: <CAFk8rvbi3MHbRBtaZVTJKDcORaDTtMpwy2X2ZdXTZt=LE5smZw@mail.gmail.com> (raw)
In-Reply-To: <59D5DEE0.6080506@cn.fujitsu.com>

On Thu, Oct 5, 2017 at 2:27 AM, Xiao Yang <yangx.jy@cn.fujitsu.com> wrote:
> On 2017/09/30 22:15, Ashlie Martinez wrote:
>>
>> Hi Xiao,
>>
>> I am a student at the University of Texas at Austin. Some researchers
>> in the computer science department at UT, myself included, have
>> recently been working to develop a file system crash consistency test
>> harness called CrashMonkey [1][2]. I have been working on the
>> CrashMonkey project since it was started late last year. With
>> CrashMonkey we have also been able to reproduce the incorrect i_size
>> error you noted but we have not been able to reproduce the other
>> output that Amir found. CrashMonkey works by logging and replaying
>> operations for a workload, so it should not be sensitive to
>> differences in timing that could be caused by things like KVM+virtio.
>> I also did a few experiments with Amir's new xfstests test 456 (both
>> with and without KVM and virtio) and I was unable to reproduce the
>> output noted in the xfstest. I have not spent a lot of time looking
>> into the cause of the bug that Amir found and it is rather unfortunate
>> that I was unable to reproduce it with either xfstests or CrashMonkey.
>
> Hi Ashlie,
>
> Thanks for your detailed comments.
>
> 1) Do you think the output that Amir noted in xfstests is a false positive?

It almost seems that way, though to be honest, I don't think I know
enough about 1. the setup used by Amir and 2. all the internal working
of KVM+virtio to say for sure. One thing I have identified as being a
potential source of false positives is code in the kernel to remap the
block number sent to device drives to something relative to the start
of a block device, not the start of a partition. I'm thinking this
could cause trouble if 1. a partition higher than the first partition
is monitored by dm-log-writes, 2. the block numbers are recorded
verbatim in dm-log-writes, and 3. the dm-log-writes log is replayed on
a different device with different partitions (or possibly the same
device, but a different partition). I know some other undergrad
students at UT on the CrashMonkey team are looking into this right
now, but I have not had time to test this myself. The offending code
in the kernel is in the completely misnamed
`generic_make_request_checks` function which has given the CrashMonkey
team trouble in the past. Links to that function and the remapping
function it calls are below.

http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L2041
http://elixir.free-electrons.com/linux/latest/source/block/blk-core.c#L1902

>
> 2) About the output that both i and you reproduced,  did you look into it
> and find its root cause?

I have not looked to find the root cause of the issue that we both
reliably reproduce.

>
> Thanks,
> Xiao Yang
>>
>> At any rate, CrashMonkey is still under development, so it does have
>> some caveats. First, we are running with a fixed random seed in our
>> default RandomPermuter (used to generate crash states) to aid
>> development. Second, the branch with the reproduction of this ext4
>> regression bug in CrashMonkey [3] will yield a few false positives due
>> to the way CrashMonkey works and how fsx runs. These false positives
>> are due to CrashMonkey generating crash states where the directories
>> for files used for the test have not be fsync-ed in the file system.
>> The top of the README in the CrashMonkey branch with this bug
>> reproduction outlines how we determined these were false positives
>>
>> [1] https://github.com/utsaslab/crashmonkey
>> [2]
>> https://www.usenix.org/conference/hotstorage17/program/presentation/martinez
>> [3] https://github.com/utsaslab/crashmonkey/tree/ext4_regression_bug
>>
>>
>> On Mon, Sep 25, 2017 at 5:53 AM, Amir Goldstein<amir73il@gmail.com>
>> wrote:
>>>
>>> On Mon, Sep 25, 2017 at 12:49 PM, Xiao Yang<yangx.jy@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> On 2017/08/27 18:44, Amir Goldstein wrote:
>>>>>
>>>>> This test is motivated by a bug found in ext4 during random crash
>>>>> consistency tests.
>>>>>
>>>>> This test uses device mapper flakey target to demonstrate the bug
>>>>> found using device mapper log-writes target.
>>>>>
>>>>> Signed-off-by: Amir Goldstein<amir73il@gmail.com>
>>>>> ---
>>>>>
>>>>> Ted,
>>>>>
>>>>> While working on crash consistency xfstests [1], I stubmled on what
>>>>> appeared to be an ext4 crash consistency bug.
>>>>>
>>>>> The tests I used rely on the log-writes dm target code written
>>>>> by Josef Bacik, which had little exposure to the wide community
>>>>> as far as I know.  I wanted to prove to myself that the found
>>>>> inconsistency was not due to a test bug, so I bisected the failed
>>>>> test to the minimal operations that trigger the failure and wrote
>>>>> a small independent test to reproduce the issue using dm flakey target.
>>>>>
>>>>> The following fsck error is reliably reproduced by replaying some fsx
>>>>> ops
>>>>> on overlapping file regions, then emulating a crash, followed by mount,
>>>>> umount and fsck -nf:
>>>>>
>>>>>    ./ltp/fsx -d --replay-ops /tmp/8995.fsxops /mnt/scratch/testfile
>>>>>    1 write 0x137dd thru    0x21445 (0xdc69 bytes)
>>>>>    2 falloc        from 0xb531 to 0x16ade (0xb5ad bytes)
>>>>>    3 collapse      from 0x1c000 to 0x20000, (0x4000 bytes)
>>>>>    4 write 0x3e5ec thru    0x3ffff (0x1a14 bytes)
>>>>>    5 zero  from 0x20fac to 0x27d48, (0x6d9c bytes)
>>>>>    6 mapwrite      0x216ad thru    0x23dfb (0x274f bytes)
>>>>>    All 7 operations completed A-OK!
>>>>>    _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is
>>>>> inconsistent
>>>>>    *** fsck.ext4 output ***
>>>>>    fsck from util-linux 2.27.1
>>>>>    e2fsck 1.42.13 (17-May-2015)
>>>>>    Pass 1: Checking inodes, blocks, and sizes
>>>>>    Inode 12, end of extent exceeds allowed value
>>>>>            (logical block 33, physical block 33441, len 7)
>>>>>    Clear? no
>>>>>    Inode 12, i_blocks is 184, should be 128.  Fix? no
>>>>
>>>> Hi Amir,
>>>>
>>>> I always get the following output when running your xfstests test case
>>>> 501.
>>>
>>> Now merged as test generic/456
>>>
>>>>
>>>> ---------------------------------------------------------------------------
>>>> e2fsck 1.42.9 (28-Dec-2013)
>>>> Pass 1: Checking inodes, blocks, and sizes
>>>> Inode 12, i_size is 147456, should be 163840. Fix? no
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>> Could you tell me how to get the expected output as you reported?
>>>
>>> I can't say I am doing anything special, but I can say that I get the
>>> same output as you did when running the test inside kvm-xfstests.
>>> Actually, I could not reproduce ANY of the the crash consistency bugs
>>> inside kvm-xfstests. Must be something to do with different timing of
>>> IO with KVM+virtio disks??
>>>
>>> When running on my laptop (Ubuntu 16.04 with latest kernel)
>>> on a 10G SSD volume, I always get the error reported above.
>>> I just re-verified with latest stable e2fsprogs (1.43.6).
>>>
>>> Amir.
>>
>>
>> .
>>
>
>
>

  reply	other threads:[~2017-10-05 15:04 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-27 10:44 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Amir Goldstein
2017-09-25  9:49 ` Xiao Yang
2017-09-25  9:49   ` Xiao Yang
2017-09-25 10:53   ` Amir Goldstein
2017-09-26 10:45     ` Xiao Yang
2017-09-26 11:48       ` Amir Goldstein
2017-09-30 14:15     ` Ashlie Martinez
2017-10-05  7:27       ` Xiao Yang
2017-10-05 15:04         ` Ashlie Martinez [this message]
2017-10-05 19:10           ` Amir Goldstein
2017-10-06  0:34             ` Ashlie Martinez
2017-10-07  3:29               ` [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash Theodore Ts'o
2017-10-07  5:54                 ` Amir Goldstein
2017-10-07 18:32                   ` Theodore Ts'o
2017-10-09  0:37                 ` Ashlie Martinez
2017-10-11 11:11                 ` Xiao Yang
2017-10-11 13:17                   ` Ashlie Martinez
2017-10-11 13:34                     ` Amir Goldstein
2017-10-16 19:32                       ` Ashlie Martinez
2017-10-16 21:11                         ` Amir Goldstein
2017-10-17  0:09                           ` Theodore Ts'o
2017-10-17  1:02                             ` Vijay Chidambaram
     [not found]                             ` <CAPaz=E+jFuOmRk8+EmVhNawwogNzW3VkciFrCc0Fk23OfGbwuA@mail.gmail.com>
2017-10-17  7:15                               ` Amir Goldstein
2017-10-17 14:41                               ` Theodore Ts'o
2017-10-17 23:16                                 ` Vijay Chidambaram
2017-10-12 14:38                 ` Jan Kara
2017-08-31  1:28 [RFC][PATCH] fstest: regression test for ext4 crash consistency bug Ashlie Martinez
2017-08-31  4:05 ` Amir Goldstein
2017-08-31  4:06   ` Amir Goldstein
2017-09-01 12:21     ` Ashlie Martinez
2017-09-01 14:59       ` Amir Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAFk8rvbi3MHbRBtaZVTJKDcORaDTtMpwy2X2ZdXTZt=LE5smZw@mail.gmail.com' \
    --to=ashmrtn@utexas.edu \
    --cc=amir73il@gmail.com \
    --cc=eguan@redhat.com \
    --cc=fstests@vger.kernel.org \
    --cc=jbacik@fb.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=vvijay03@gmail.com \
    --cc=yangx.jy@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.