* Re: Crash Consistency xfstests
@ 2017-08-21 15:35 Amir Goldstein
  2017-08-21 16:48 ` Josef Bacik
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2017-08-21 15:35 UTC
  To: Josef Bacik; +Cc: linux-fsdevel, fstests, Eryu Guan, Christoph Hellwig

On Wed, Aug 16, 2017 at 3:06 PM, Josef Bacik <josef@toxicpanda.com> wrote:
...
>
> Sorry I was travelling yesterday so I couldn't give this my full attention.
> Everything you guys do is already accomplished with dm-log-writes.  If you look
> at the example scripts I've provided
>
> https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
> https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
>
> The first initiates the replay, and points at the second script to run after
> each entry is replayed.  The whole point of this stuff was to make it as
> flexible as possible.  The way we use it is to replay, create a snapshot of the
> replay, mount, unmount, fsck, delete the snapshot and carry on to the next
> position in the log.
>
> There is nothing keeping us from generating random crash points, this has been
> something on my list of things to do forever.  All that would be required would
> be to hold the entries between flush/fua events in memory, and then replay them
> in whatever order you deemed fit.  That's the only functionality missing from my
> replay-log stuff that CrashMonkey has.
>
> The other part of this is getting user space applications to do more thorough
> checking of consistency that it expects, which I implemented here
>
> https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07
>
> fsx will randomly do operations to a file, and every time it fsync()'s it saves
> its state and marks the log.  Then we can go back and replay the log to the
> mark and md5sum the file to make sure it matches the saved state.  This
> infrastructure was meant to be as simple as possible so the possibilities for
> crash consistency testing were endless.  One of the next areas we plan to use
> this in Facebook is just for application consistency, so we can replay the fs
> and verify the application works in whatever state the fs is at any given point.
>
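The random crash point idea quoted above (hold the entries between flush events in memory, then replay them in any order) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Entry` tuple, the `crash_state` function and the in-memory "disk" dictionary are hypothetical names invented for illustration, FUA writes are not modelled, and nothing here reflects the actual replay-log or CrashMonkey code.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical log entry: (kind, sector, data).
# kind is "write" (may be reordered or lost) or "flush" (a barrier).
Entry = Tuple[str, Optional[int], Optional[bytes]]


def crash_state(log: List[Entry], seed: int,
                crash_at: Optional[int] = None) -> Dict[int, bytes]:
    """Return one possible on-disk state {sector: data} after a crash.

    The crash happens after `crash_at` log entries have been issued.
    If crash_at is None, a random crash point is derived from the seed.
    """
    rng = random.Random(seed)
    if crash_at is None:
        crash_at = rng.randint(0, len(log))

    disk: Dict[int, bytes] = {}
    pending: List[Tuple[int, bytes]] = []  # writes issued since the last flush

    for kind, sector, data in log[:crash_at]:
        if kind == "flush":
            # Writes issued before a flush are treated as durable,
            # applied in issue order.
            for s, d in pending:
                disk[s] = d
            pending = []
        else:
            pending.append((sector, data))

    # Writes after the last flush were still in flight at crash time:
    # an arbitrary subset reaches the disk, in an arbitrary order.
    survivors = rng.sample(pending, rng.randint(0, len(pending)))
    for s, d in survivors:
        disk[s] = d
    return disk
```

Because the generator is seeded, a state that exposes a problem can be regenerated from the same seed and crash point, which fits the "print the seed and reproduce" practice discussed later in the thread.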

Josef,

FYI, while testing your patches I found that on my system (Ubuntu 16.04)
fsx was always generating the same pseudo random sequence, even
though the printed seed was different.

Replacing initstate()/setstate() with srandom() in fsx fixed the problem for me.
When I further mixed the pid into the randomized seed, thus generating a
different sequence of events in each of the 4 parallel fsx invocations, I
started getting checksum failures on replay. I will continue to investigate
this phenomenon.
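
For reference, the check that produces these checksum failures (save a checksum at every fsync, mark the log, later replay to each mark and compare) can be sketched roughly as follows. This is a toy model under stated assumptions: `MarkedLog`, `md5_hex` and `verify` are hypothetical names, the "log" is an in-memory list rather than the dm-log-writes on-disk format, and the "file" is a plain byte buffer.

```python
import hashlib
from typing import Dict, List, Tuple


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, standing in for running md5sum on the file."""
    return hashlib.md5(data).hexdigest()


class MarkedLog:
    """Toy write log with named marks (not the dm-log-writes format)."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, bytes]] = []  # (offset, data) in issue order
        self.marks: Dict[str, int] = {}             # mark name -> entries logged so far

    def write(self, offset: int, data: bytes) -> None:
        self.entries.append((offset, data))

    def mark(self, name: str) -> None:
        self.marks[name] = len(self.entries)

    def replay_to(self, name: str) -> bytes:
        """Rebuild the file content from the entries logged before the mark."""
        image = bytearray()
        for offset, data in self.entries[: self.marks[name]]:
            end = offset + len(data)
            if len(image) < end:
                image.extend(b"\0" * (end - len(image)))  # zero-fill holes
            image[offset:end] = data
        return bytes(image)


def verify(log: MarkedLog, saved: Dict[str, str]) -> List[str]:
    """Return the marks whose replayed content does not match the saved digest."""
    return [name for name, digest in saved.items()
            if md5_hex(log.replay_to(name)) != digest]
```

A writer would call `log.write(...)` for each operation and, at each fsync, store `md5_hex(current_content)` under a new mark name and call `log.mark(name)`; an empty list from `verify` means every mark replayed to the expected content.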

BTW, I am not sure if it is best to use a randomized or constant random seed
for an xfstest. What is the common practice if any?

> 3) My patches need to actually be pushed into upstream fstests.  This would be
> the largest win because then all the fs developers would be running the tests
> by default.
>

FYI, I rebased your patch, added some minor cleanups and tested over xfs:
https://github.com/amir73il/xfstests/commits/dm-log-writes

replay-log is still an external dependency, but I intend to import
it as an xfstests src test program.

I also intend to split your patch into several smaller patches
- infrastructure
- fsx fixes
- generic test

When done with this, I will try to import the fsstress/replay test to
xfstests.

For now, I will leave the btrfs specific tests out from my work.
It should be trivial to add them once the basic infra has been merged.

I noticed that if SCRATCH_DEV is a dm target itself (linear), then
log-writes target creation fails. Is that by design? Can it be fixed?
If not, the test would have to require_scratch_not_dm_target or so.

Please let me know if you have any other tips or pointers for me.

Thanks,
Amir.

* Re: Crash Consistency xfstests
  2017-08-21 15:35 Crash Consistency xfstests Amir Goldstein
@ 2017-08-21 16:48 ` Josef Bacik
  2017-08-21 18:49   ` Amir Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Josef Bacik @ 2017-08-21 16:48 UTC
  To: Amir Goldstein
  Cc: Josef Bacik, linux-fsdevel, fstests, Eryu Guan, Christoph Hellwig

On Mon, Aug 21, 2017 at 05:35:02PM +0200, Amir Goldstein wrote:
> On Wed, Aug 16, 2017 at 3:06 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> ...
> >
> > Sorry I was travelling yesterday so I couldn't give this my full attention.
> > Everything you guys do is already accomplished with dm-log-writes.  If you look
> > at the example scripts I've provided
> >
> > https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
> > https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
> >
> > The first initiates the replay, and points at the second script to run after
> > each entry is replayed.  The whole point of this stuff was to make it as
> > flexible as possible.  The way we use it is to replay, create a snapshot of the
> > replay, mount, unmount, fsck, delete the snapshot and carry on to the next
> > position in the log.
> >
> > There is nothing keeping us from generating random crash points, this has been
> > something on my list of things to do forever.  All that would be required would
> > be to hold the entries between flush/fua events in memory, and then replay them
> > in whatever order you deemed fit.  That's the only functionality missing from my
> > replay-log stuff that CrashMonkey has.
> >
> > The other part of this is getting user space applications to do more thorough
> > checking of consistency that it expects, which I implemented here
> >
> > https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07
> >
> > fsx will randomly do operations to a file, and every time it fsync()'s it saves
> > it's state and marks the log.  Then we can go back and replay the log to the
> > mark and md5sum the file to make sure it matches the saved state.  This
> > infrastructure was meant to be as simple as possible so the possiblities for
> > crash consistency testing were endless.  One of the next areas we plan to use
> > this in Facebook is just for application consistency, so we can replay the fs
> > and verify the application works in whatever state the fs is at any given point.
> >
> 
> Joseph,
> 
> FYI, while testing your patches I found that on my system (Ubuntu 16.04)
> fsx was always generating the same pseudo random sequence, even
> though the printed seed was different.
> 
> Replacing initstate()/setstate() with srandom() in fsx fixed the problem for me.
> When I further mixed pid into the randomized seed, thus, generating
> different sequence of events in the 4 parallel fsx invocations, I
> started getting
> checksum failures on replay. I will continue to investigate this phenomena.
> 
> BTW, I am not sure if it is best to use a randomized or constant random seed
> for an xfstest. What is the common practice if any?
> 

Oops, I thought fsx was generating a different sequence each time.  My preference
is that we be as random as possible and we just print out the seed at the start
so that if we hit a problem we can go back and reproduce with the same seed for
debugging.  Fsstress prints out the seed it's using, we should do the same for
fsx.

> > 3) My patches need to actually be pushed into upstream fstests.  This would be
> > the largest win because then all the fs developers would be running the tests
> > by default.
> >
> 
> FYI, I rebased your patch, added some minor cleanups and tested over xfs:
> https://github.com/amir73il/xfstests/commits/dm-log-writes
> 
> replay-log is still an external dependency, but I intend to import
> it as xfstests src test program.
> 

Yeah I think this is a good idea and what I had planned to do the next time I
submitted stuff.

> I also intend to split your patch into several smaller patches
> - infrastructure
> - fsx fixes
> - generic test
> 
> When done with this, I will try to import the fsstress/replay test to
> xfstests.
> 
> For now, I will leave the btrfs specific tests out from my work.
> It should be trivial to add them once the basic infra has been merged.
> 

Agreed.

> I noticed that if SCRATCH_DEV is a dm target itself (linear), then
> log-writes target creation fails. Is that by design? Can be fixed?
> If not, the test would have to require_scratch_not_dm_target or so.
> 
> Please let me know if have any other tip or pointers for me.

Huh that's weird, I was using it with dm-snapshot and it worked fine.  Maybe I
was doing something else and it's never worked, but it's definitely not by
design.  I'll look into this when I get some time.  Thanks,

Josef

* Re: Crash Consistency xfstests
  2017-08-21 16:48 ` Josef Bacik
@ 2017-08-21 18:49   ` Amir Goldstein
  2017-08-22 13:47     ` Amir Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2017-08-21 18:49 UTC
  To: Josef Bacik; +Cc: linux-fsdevel, fstests, Eryu Guan, Christoph Hellwig

On Mon, Aug 21, 2017 at 6:48 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Mon, Aug 21, 2017 at 05:35:02PM +0200, Amir Goldstein wrote:
...
>> Joseph,
>>
>> FYI, while testing your patches I found that on my system (Ubuntu 16.04)
>> fsx was always generating the same pseudo random sequence, even
>> though the printed seed was different.
>>
>> Replacing initstate()/setstate() with srandom() in fsx fixed the problem for me.
>> When I further mixed pid into the randomized seed, thus, generating
>> different sequence of events in the 4 parallel fsx invocations, I
>> started getting
>> checksum failures on replay. I will continue to investigate this phenomena.
>>
>> BTW, I am not sure if it is best to use a randomized or constant random seed
>> for an xfstest. What is the common practice if any?
>>
>
> Oops I thought fsx was generating different sequence each time.  My preference
> is that we be as random as possible and we just print out the seed at the start
> so that if we hit a problem we can go back and reproduce with the same seed for
> debugging.  Fsstress prints out the seed it's using, we should do the same for
> fsx.
>

fsx does print the seed it is using, but there was a bug (at least on my system)
where that seed had no effect on the resulting random sequence.
Also, fsx was initializing the seed with the current seconds only, whereas
fsstress initializes the seed with seconds ^ nanoseconds + procid.
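
A minimal sketch of that style of seeding, assuming the grouping (seconds ^ nanoseconds) + procid, which is a guess at the intended precedence. The helper names `make_seed` and `op_sequence` are hypothetical, and Python's `random.Random` only stands in for the C `srandom()`/`random()` pair used by fsx and fsstress.

```python
import os
import random
import time
from typing import List, Optional


def make_seed(now_ns: Optional[int] = None, pid: Optional[int] = None) -> int:
    """Mix seconds, nanoseconds and the process id into a 32-bit seed."""
    if now_ns is None:
        now_ns = time.time_ns()
    if pid is None:
        pid = os.getpid()
    sec, nsec = divmod(now_ns, 1_000_000_000)
    # Assumed grouping: (sec ^ nsec) + pid, truncated to 32 bits.
    return ((sec ^ nsec) + pid) & 0xFFFFFFFF


def op_sequence(seed: int, n: int = 8) -> List[int]:
    """Derive a pseudo-random sequence of n operation codes from the seed."""
    rng = random.Random(seed)
    return [rng.randrange(100) for _ in range(n)]


if __name__ == "__main__":
    seed = make_seed()
    print(f"seed: {seed}")  # print it so a failing run can be reproduced
    print(op_sequence(seed))
```

Mixing in the pid means that parallel invocations started within the same second still get different seeds, while printing the seed keeps a failing run reproducible as long as the seed really drives the sequence.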

...
>
>> I noticed that if SCRATCH_DEV is a dm target itself (linear), then
>> log-writes target creation fails. Is that by design? Can be fixed?
>> If not, the test would have to require_scratch_not_dm_target or so.
>>
>> Please let me know if have any other tip or pointers for me.
>
> Huh that's weird, I was using it with dm-snapshot and it worked fine.  Maybe I
> was doing something else and it's never worked, but it's definitely not by
> design.  I'll look into this when I get some time.  Thanks,
>

Well, if it's not by design, I'll try to figure out the root cause of the error
with my logical volume SCRATCH_DEV setup.

Thanks,
Amir.

* Re: Crash Consistency xfstests
  2017-08-21 18:49   ` Amir Goldstein
@ 2017-08-22 13:47     ` Amir Goldstein
  2017-08-22 14:52       ` Josef Bacik
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2017-08-22 13:47 UTC
  To: Josef Bacik; +Cc: linux-fsdevel, fstests, Eryu Guan, Christoph Hellwig

On Mon, Aug 21, 2017 at 9:49 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Aug 21, 2017 at 6:48 PM, Josef Bacik <josef@toxicpanda.com> wrote:
...
>>
>>> I noticed that if SCRATCH_DEV is a dm target itself (linear), then
>>> log-writes target creation fails. Is that by design? Can be fixed?
>>> If not, the test would have to require_scratch_not_dm_target or so.
>>>
>>> Please let me know if have any other tip or pointers for me.
>>
>> Huh that's weird, I was using it with dm-snapshot and it worked fine.  Maybe I
>> was doing something else and it's never worked, but it's definitely not by
>> design.  I'll look into this when I get some time.  Thanks,
>>
>
> Well, if it's not by design, I'll try to figure out the root cause of the error
> with my logical volume SCRATCH_DEV setup.
>

I could not reproduce the logical volume SCRATCH_DEV issue with an upstream
kernel, so I am chalking it up to a setup error or similar.

BTW, is there a reason I am missing why you needed to compose FSX_OPTS
on the fly for the test using the new helper _test_falloc_support?
I see that fsx already detects whether any of the falloc ops are unsupported
via test_fallocate() and auto-disables the unsupported ops.
I started fixing _test_falloc_support(), which was a bit broken, before I realized
it's not needed. Am I missing something?

I am still seeing test failures from time to time (checksum errors),
but they do not always reproduce on the same system with the same random seed,
and when they do reproduce it's not always the same checksum that fails,
so I'm still gathering test results.

Thanks,
Amir.

* Re: Crash Consistency xfstests
  2017-08-22 13:47     ` Amir Goldstein
@ 2017-08-22 14:52       ` Josef Bacik
  0 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2017-08-22 14:52 UTC
  To: Amir Goldstein
  Cc: Josef Bacik, linux-fsdevel, fstests, Eryu Guan, Christoph Hellwig

On Tue, Aug 22, 2017 at 04:47:32PM +0300, Amir Goldstein wrote:
> On Mon, Aug 21, 2017 at 9:49 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Mon, Aug 21, 2017 at 6:48 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> ...
> >>
> >>> I noticed that if SCRATCH_DEV is a dm target itself (linear), then
> >>> log-writes target creation fails. Is that by design? Can be fixed?
> >>> If not, the test would have to require_scratch_not_dm_target or so.
> >>>
> >>> Please let me know if have any other tip or pointers for me.
> >>
> >> Huh that's weird, I was using it with dm-snapshot and it worked fine.  Maybe I
> >> was doing something else and it's never worked, but it's definitely not by
> >> design.  I'll look into this when I get some time.  Thanks,
> >>
> >
> > Well, if it's not by design, I'll try to figure out the root cause of the error
> > with my logical volume SCRATCH_DEV setup.
> >
> 
> I could not reproduce the logical volume SCRATCH_DEV issue with upstream
> kernel, so chalking it up to setup error or whatever.
> 
> BTW, is there a reason I am missing why you needed to compose FSX_OPTS
> on the fly for the test using the new helper _test_falloc_support?
> I see that fsx already detects if any of the falloc ops are not supported
> via test_fallocate() and auto-disabled the unsupported ops.
> I started fixing _test_falloc_support() which was a bit broken before I realized
> it's not needed. Am I missing something?

Nope, I missed that fsx did that, so that part can be dropped.

> 
> I am still seeing test failures from time to time (checksum errors),
> but they do not always reproduce on same system with same random seed
> and when they reproduce its not always the same checksum that fails,
> so I'm still gathering test results.
> 

Yeah, timing is everything.  When I saw issues things weren't failing
consistently, so I spent a lot of time trying to figure out if it was log-writes
or the fs.  I have definitely found issues with replay-log and dm-snapshot.  Even
recently I added an fsync() before running the fsck from replay-log because
those changes weren't getting picked up by dm-snapshot, which resulted in false
positives.  I'm pretty sure I've gotten all the kinks worked out, but I wouldn't
be surprised if there was some other dark corner I missed.  Thanks,

Josef
