* XFS crash consistency bug : Loss of fsynced metadata operation
@ 2018-03-13  2:15 Jayashree Mohan
  2018-03-13  4:21 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread

From: Jayashree Mohan @ 2018-03-13  2:15 UTC (permalink / raw)
To: linux-xfs; +Cc: Vijaychidambaram Velayudhan Pillai, Ashlie Martinez

Hi,

We've encountered what seems to be a crash consistency bug in xfs
(kernel 4.15) caused by the interaction between a delayed-allocation
write and an unaligned fallocate (zero range): the zero-range
operation is not persisted even when followed by an fsync, thereby
losing an fsynced metadata operation on a file.

Say we create a disk image with known data and quick-format it.
1. Write 65K of data to a new file.
2. Zero out part of the above file using falloc_zero_range over
   (60K+128) - (60K+128+4096) - an unaligned block.
3. fsync the above file.
4. Crash.

If we crash after the fsync, and allow reordering of the block IOs
between two flush/fua commands using CrashMonkey[1], then we can end
up not persisting the zero_range command in step 2 above, even though
we crashed after an fsync.

This workload was inspired by xfstest generic_042, which tests for
stale data exposure using aligned fallocate commands. It's worth
noting that f2fs and btrfs pass our test cleanly - irrespective of the
order of bios, user data is intact and the fzero operation is
correctly persisted.

To reproduce this bug using CrashMonkey, simply run:

./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
tests/generic_042/generic_042_fzero_unaligned.so

and take a look at the <timestamp>-generic_042_fzero_unaligned.log
created in the build directory. This file has the list of block IOs
issued during the workload and the permutation of bios that leads to
this bug. You can also verify using blktrace that CrashMonkey only
reorders bios between two barrier operations (so such a crash state
could be encountered due to reordering blocks in the storage stack).
Note that tools like dm-log-writes cannot capture this bug, because it
arises from reordering blocks between barrier operations.

Possible reason for this bug:
On closely inspecting this failure, we discovered that the problem
lies in a data block being updated twice with no barrier operation
between the two updates. The blktrace for the above workload shows
that sector 280 is updated twice - probably first due to the write and
later due to the fzero operation. However, notice that there is no
barrier between the two updates, and using CrashMonkey we see the
above bug when these two updates to the same block are reordered.
Essentially, a reordering here means the fzero write reaches the media
first and is then overwritten by the delayed-allocation write.

0.000179069 7104 Q WS 280 + 16 [c_harness]
0.000594994 7104 Q  R 280 + 8  [c_harness]
0.000598216 7104 Q  R 288 + 8  [c_harness]
0.000620552 7104 Q WS 160 + 120 [c_harness]
0.000653742 7104 Q WS 280 + 16 [c_harness]
0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]

This seems to be a bug, as the metadata operation is not persisted
even in the presence of an fsync. Let me know if I am missing some
detail here.

[1] https://github.com/utsaslab/crashmonkey.git

Thanks,
Jayashree Mohan

^ permalink raw reply	[flat|nested] 16+ messages in thread
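The permutation model the report describes - writes between two barrier (flush/FUA) operations may be persisted in any order and any subset, but never reordered across a barrier - can be sketched as follows. This is a hypothetical simplification for illustration, not CrashMonkey's actual code; the `crash_states` helper and the op encoding are invented here.

```python
import itertools

# Hypothetical sketch of the epoch-based crash-state model described
# above -- NOT CrashMonkey's implementation. Writes between two barrier
# (flush/FUA) operations form an epoch; a crash may persist any
# permutation of any subset of the epoch that was in flight.

def crash_states(ops):
    """ops: list of ('write', sector) entries and ('barrier',) markers.
    Yields every on-media write sequence this model considers reachable."""
    # Split the IO stream into epochs delimited by barriers.
    epochs, cur = [], []
    for op in ops:
        if op == ('barrier',):
            epochs.append(cur)
            cur = []
        else:
            cur.append(op)
    epochs.append(cur)

    # A crash exposes all fully-persisted epochs plus any permutation
    # of any subset of the epoch in flight when the crash hit.
    for i in range(len(epochs)):
        persisted = [w for e in epochs[:i] for w in e]
        inflight = epochs[i]
        for k in range(len(inflight) + 1):
            for perm in itertools.permutations(inflight, k):
                yield persisted + list(perm)

# The report's workload reduced to sectors: two writes touch sector 280
# with no barrier between them, then the fsync's flush.
workload = [('write', 280), ('write', 160), ('write', 280), ('barrier',)]
states = list(crash_states(workload))
# Under this model, crash states where the second update to sector 280
# lands before the first are considered reachable.
```

Under this model the reordering described in the report is "legal", which is exactly the assumption Dave challenges later in the thread.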
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-13  2:15 XFS crash consistency bug : Loss of fsynced metadata operation Jayashree Mohan
@ 2018-03-13  4:21 ` Dave Chinner
  2018-03-13  6:36   ` Amir Goldstein
  2018-03-13 16:57   ` Jayashree Mohan
  0 siblings, 2 replies; 16+ messages in thread

From: Dave Chinner @ 2018-03-13  4:21 UTC (permalink / raw)
To: Jayashree Mohan
Cc: linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez

On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
> Hi,
>
> We've encountered what seems to be a crash consistency bug in
> xfs(kernel 4.15) due to the interaction between delayed allocated
> write and an unaligned fallocate(zero range) : Fallocate_zero_range is
> not persisted even when followed by a fsync - thereby leading to a
> loss of fsynced metadata operation on a file.
>
> Say we create a disk image with known data and quick format it.
> 1. Now write 65K of data to a new file
> 2. Zero out a part of the above file using falloc_zero_range (60K+128)
> - (60K+128+4096) - an unaligned block
> 3. fsync the above file
> 4. Crash

So, effectively:

$ xfs_io -xft -c "pwrite 0 65k" \
	-c "fzero $((60 * 1024 + 128)) 4k" \
	-c fsync \
	-c "shutdown" /mnt/scratch/testfile
$

>
> If we crash after the fsync, and allow reordering of the block IOs
> between two flush/fua commands using Crashmonkey[1], then we can end
> up not persisting the zero_range command in step 2 above, even if we
> crashed after a fsync.

Sorry, but what do you mean by "reordering of the block IOs"?

> This workload was inspired from xfstest/generic_042, which tests for
> stale data exposure using aligned fallocate commands. It's worth
> noting that f2fs and btrfs passes our test clean - irrespective of the
> order of bios, user data is intact and fzero operation is correctly persisted.
>
> To reproduce this bug using CrashMonkey, simply run :
> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
> tests/generic_042/generic_042_fzero_unaligned.so
>
> and take a look at the <timestamp>-generic_042_fzero_unaligned.log
> created in the build directory. This file has the list of block IOs
> issued during the workload and the permutation of bios that lead to
> this bug. You can also verify using blktrace that CrashMonkey only
> reorders bios between two barrier operations(thereby such a crash
> state could be encountered due to reordering blocks at the storage
> stack). Note that tools like dm-log-writes cannot capture this bug
> because this arises due to reordering blocks between barrier
> operations.
>
> Possible reason for this bug :
> On closely inspecting the reason for this bug, we discovered that the
> problem lies in updating a data block twice, without a barrier
> operation between the two updates.

There's nothing wrong with doing that if the filesystem has waited
for completion of the first IOs before doing subsequent operations
on the block.

> The blktrace for the above
> mentioned workload shows that sector 280 is updated twice - probably
> first due to the write and later due to the fzero operation.

Why is that happening? The trace I have from the above xfs_io
command is this:

8,96   2      613     0.081499295  3635  Q  WS 280 + 16 [xfs_io]
8,96   2      617     0.081982168     0  C  WS 280 + 16 [0]
8,96   2      618     0.082033206  3635  Q   R 280 + 8 [xfs_io]
8,96   2      622     0.082420539     0  C   R 280 + 8 [0]
8,96   2      623     0.082475920  3635  Q   R 288 + 8 [xfs_io]
8,96   2      627     0.082988642     0  C   R 288 + 8 [0]
8,96   2      628     0.083050032  3635  Q  WS 160 + 136 [xfs_io]
8,96   2      632     0.084414651     0  C  WS 160 + 136 [0]
8,96   2      633     0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
8,96   2      636     0.107483279     0  C  WFSM 20971632 + 8 [0]
8,96   2      637     0.107485345     0  C  WFSM 20971632 [0]

Which shows serialised IO dispatch and completion of the IO, and no
double write.
You've only shown queueing of writes, which doesn't tell me about
completion order, which is what really matters.

In more detail:

1. zero-range forces a data flush to occur and then waits for it
to complete.

....
xfs_file_fallocate
  xfs_zero_file_space(60-68k)
    xfs_free_file_space
      xfs_flush_unmap_range()
        filemap_write_and_wait_range()
          xfs_writepage: pgoff 0xf000 size 0x10400
          xfs_writepage: pgoff 0x10000 size 0x10400
8,96   2      613     0.081499295  3635  Q  WS 280 + 16 [xfs_io]
8,96   2      617     0.081982168     0  C  WS 280 + 16 [0]
      truncate_pagecache_range()
        xfs_invalidatepage: pgoff 0xf000 size 0x10400
        xfs_releasepage: pgoff 0xf000 size 0x10400
        xfs_invalidatepage: pgoff 0x10000 size 0x10400
        xfs_releasepage: pgoff 0x10000 size 0x10400
    xfs_zero_range()
8,96   2      618     0.082033206  3635  Q   R 280 + 8 [xfs_io]
8,96   2      622     0.082420539     0  C   R 280 + 8 [0]
8,96   2      623     0.082475920  3635  Q   R 288 + 8 [xfs_io]
8,96   2      627     0.082988642     0  C   R 288 + 8 [0]
    xfs_alloc_file_space()
      xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30
      <nothing to allocate!>
.....

After the fzero command, we then run fsync:

xfs_file_fsync()
  filemap_write_and_wait_range()
    xfs_writepage: pgoff 0x0 size 0x10400
    ....
    xfs_writepage: pgoff 0x10000 size 0x10400
8,96   2      628     0.083050032  3635  Q  WS 160 + 136 [xfs_io]
8,96   2      632     0.084414651     0  C  WS 160 + 136 [0]
<data writeback complete>
  xfs_log_force_lsn()
    xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES
8,96   2      633     0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
8,96   2      636     0.107483279     0  C  WFSM 20971632 + 8 [0]
8,96   2      637     0.107485345     0  C  WFSM 20971632 [0]

And that's the log force. That's all correct and expected behaviour
for pwrite, fzero and fsync.
And when I dump the data in the file after unmount/mount (because the
shutdown stops all IO from the filesystem), I get:

000000 146715 146715 146715 146715 146715 146715 146715 146715

00f080 000000 000000 000000 000000 000000 000000 000000 000000
*
010080 146715 146715 146715 146715 146715 146715 146715 146715
*
010400

exactly the right data in the file.

> However
> notice that there is no barrier between the two updates, and using
> Crashmonkey we see the above mentioned bug when these two updates to
> the same block are reordered. Essentially, a reordering here means
> fzero goes through first, and is later overwritten by the delayed
> allocated write.

> 0.000179069 7104 Q WS 280 + 16 [c_harness]
> 0.000594994 7104 Q R 280 + 8 [c_harness]
> 0.000598216 7104 Q R 288 + 8 [c_harness]
> 0.000620552 7104 Q WS 160 + 120 [c_harness]
> 0.000653742 7104 Q WS 280 + 16 [c_harness]

So this is where there's a difference. Note the write in my case:

0.083050032 3635 Q WS 160 + 136 [xfs_io]

It's a single IO - for some reason the kernel you are running has
split it into two IOs, but that doesn't make it wrong or invalid -
it's not indicative of a bug in any way.

What's wrong here is your assumption that you can reorder it around
the previous IO to those blocks. That's *simply not possible*
because the filesystem has waited for completion of each IO before
issuing the next. The IO serialisation is being done at the
filesystem level, not the IO level, and so CrashMonkey is
effectively breaking the completion ordering guarantees storage
hardware provides us with.

> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]

IOWs, once this pre-flush is completed, the data on disk is supposed
to be that of the latter write, because that one was received by the
storage well after the first one was completed. Reordering a later
IO in front of an already completed dependent IO (read or write!) is
simply not allowed.
> This seems to be a bug, as it is not persisting the metadata operation > even in the presence of fsync. > > Let me know if I am missing some detail here. Filesystem IO ordering in the linux stack is not based on block/request level flush barriers. i.e. there's no such thing as "IO barriers" anymore. Filesystem IO ordering is based on a completion-to-dispatch serialisation model at the filesystem level, and that cannot be observed by looking at block layer traces.... This really looks like a crashmonkey bug, not a filesystem problem... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
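The completion-to-dispatch serialisation Dave describes can be expressed as a constraint on which reorderings a crash simulator may legally explore. The sketch below is a hypothetical illustration with invented helper names and loosely transcribed timestamps, not kernel or CrashMonkey code: if IO B was dispatched only after IO A's completion was observed, B must never be persisted ahead of A, even inside the same flush epoch.

```python
# Hypothetical sketch of the completion-to-dispatch ordering rule: an IO
# dispatched only after another IO had already completed must never be
# persisted ahead of it, even within the same flush epoch.

def legal_order(ios, order):
    """ios: name -> (dispatch_time, completion_time).
    order: proposed on-media persistence order.
    Illegal if some IO persists before another IO that had already
    completed by the time the first was dispatched."""
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            # 'a' persists before 'b'. If 'a' was dispatched after 'b'
            # completed, the device never held both in flight at once,
            # so swapping them violates causality.
            if ios[a][0] >= ios[b][1]:
                return False
    return True

# Times taken loosely from the trace above: the pre-zeroing flush of
# sector 280 completed before the fsync writeback was dispatched.
ios = {
    'flush_280': (0.0814, 0.0819),  # Q WS 280 + 16 ... C WS 280 + 16
    'fsync_wb':  (0.0830, 0.0844),  # Q WS 160 + 136 ... C WS 160 + 136
}
print(legal_order(ios, ['flush_280', 'fsync_wb']))  # actual order: legal
print(legal_order(ios, ['fsync_wb', 'flush_280']))  # swapped: illegal
```

Note that this dependency is invisible in a queue-only blktrace: it exists in the dispatch/completion timing, which is why Dave says it cannot be observed from block-layer traces alone.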
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-13 4:21 ` Dave Chinner @ 2018-03-13 6:36 ` Amir Goldstein 2018-03-13 18:05 ` Jayashree Mohan 2018-03-13 16:57 ` Jayashree Mohan 1 sibling, 1 reply; 16+ messages in thread From: Amir Goldstein @ 2018-03-13 6:36 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, Theodore Tso, fstests Since the exact same report was sent to the ext4 list, CC'ing Dave's response to Ted and fstests, which seems an appropriate crowd for this discussion. I think that future CrashMonkey findings should be CC'ed to either fsdevel or fstests, especially if more than one filesystem is affected. Thanks, Amir. On Tue, Mar 13, 2018 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote: >> Hi, >> >> We've encountered what seems to be a crash consistency bug in >> xfs(kernel 4.15) due to the interaction between delayed allocated >> write and an unaligned fallocate(zero range) : Fallocate_zero_range is >> not persisted even when followed by a fsync - thereby leading to a >> loss of fsynced metadata operation on a file. >> >> Say we create a disk image with known data and quick format it. >> 1. Now write 65K of data to a new file >> 2. Zero out a part of the above file using falloc_zero_range (60K+128) >> - (60K+128+4096) - an unaligned block >> 3. fsync the above file >> 4. Crash > > So, effectively: > > $ xfs_io -xft -c "pwrite 0 65k" \ > -c "fzero $((60 * 1024 + 128)) 4k" \ > -c fsync \ > -c "shutdown" > /mnt/scratch/testfile > $ > >> >> If we crash after the fsync, and allow reordering of the block IOs >> between two flush/fua commands using Crashmonkey[1], then we can end >> up not persisting the zero_range command in step 2 above, even if we >> crashed after a fsync. > > Sorry, but what do you mean by "reordering of the block IOs"? 
> >> This workload was inspired from xfstest/generic_042, which tests for >> stale data exposure using aligned fallocate commands. It's worth >> noting that f2fs and btrfs passes our test clean - irrespective of the >> order of bios, user data is intact and fzero operation is correctly persisted. >> >> To reproduce this bug using CrashMonkey, simply run : >> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v >> tests/generic_042/generic_042_fzero_unaligned.so >> >> and take a look at the <timestamp>-generic_042_fzero_unaligned.log >> created in the build directory. This file has the list of block IOs >> issued during the workload and the permutation of bios that lead to >> this bug. You can also verify using blktrace that CrashMonkey only >> reorders bios between two barrier operations(thereby such a crash >> state could be encountered due to reordering blocks at the storage >> stack). Note that tools like dm-log-writes cannot capture this bug >> because this arises due to reordering blocks between barrier >> operations. >> >> Possible reason for this bug : >> On closely inspecting the reason for this bug, we discovered that the >> problem lies in updating a data block twice, without a barrier >> operation between the two updates. > > There's nothing wrong with doing that if the filesystem has waited > for completion of the first IOs before doing subsequent operations > on the block. > >> The blktrace for the above >> mentioned workload shows that sector 280 is updated twice - probably >> first due to the write and later due to the fzero operation. > > WHy is that happening? 
The trace I have from the above xfs_io > command is this: > > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] > > Which shows serialised IO dispatch and completion of the IO, and no > double write. You've only shown queueing of writes, which doesn't > tell me about completion order which is what really matters. > > In more detail. > > 1. zero-range forces a data flush to occur and then waits for it > to complete. > > .... > xfs_file_fallocate > xfs_zero_file_space(60-68k) > xfs_free_file_space > xfs_flush_unmap_range() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0xf000 size 0x10400 > xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > truncate_pagecache_range() > xfs_invalidatepage: pgoff 0xf000 size 0x10400 > xfs_releasepage: pgoff 0xf000 size 0x10400 > xfs_invalidatepage: pgoff 0x10000 size 0x10400 > xfs_releasepage: pgoff 0x10000 size 0x10400 > xfs_zero_range() > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > xfs_alloc_file_space() > xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30 > <nothing to allocate!> > ..... > > After the fzero command, we then run fsync: > > xfs_file_fsync() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0x0 size 0x10400 > .... 
> xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > <data writeback complete> > xfs_log_force_lsn() > xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] > > And that's the log force. That's all correct and expected behaviour > for pwrite, fzero and fsync. And when I dump the data in the file > after unmount/mount (because the shutdown stops all IO from the > filesytem), I get: > > 000000 146715 146715 146715 146715 146715 146715 146715 146715 > > 00f080 000000 000000 000000 000000 000000 000000 000000 000000 > * > 010080 146715 146715 146715 146715 146715 146715 146715 146715 > * > 010400 > > exactly the right data in the file. > >> However >> notice that there is no barrier between the two updates, and using >> Crashmonkey we see the above mentioned bug when these two updates to >> the same block are reordered. Essentially, a reordering here means >> fzero goes through first, and is later overwritten by the delayed >> allocated write. > >> 0.000179069 7104 Q WS 280 + 16 [c_harness] >> 0.000594994 7104 Q R 280 + 8 [c_harness] >> 0.000598216 7104 Q R 288 + 8 [c_harness] >> 0.000620552 7104 Q WS 160 + 120 [c_harness] >> 0.000653742 7104 Q WS 280 + 16 [c_harness] > > So this is where there's a difference. Note the write in my case: > > 0.083050032 3635 Q WS 160 + 136 [xfs_io] > > It's a single IO - for some reason the kernel you are running has > split it into two IOs, but that doesn't make it wrong or invalid - > it's not indicative of a bug in any way. > > What's wrong here is your assumption that you can reorder it around > the previous IO to those blocks. That's *simply not possible* > because the filesystem has waited for completion of each IO before > issuing the next. 
The IO serialisation is being done at the > filesystem level, not the IO level, and so CrashMonkey is > effectively breaking the completion ordering guarantees storage > hardware provides us with. > >> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness] > > IOWs, once this pre-flush is completed, the data on disk is supposed > to be that of the latter write because that one was received by the > storage well after the first one was completed. Reordering a later > IO in front of an already completed dependent IO (read or write!) is > simply not allowed. > >> This seems to be a bug, as it is not persisting the metadata operation >> even in the presence of fsync. >> >> Let me know if I am missing some detail here. > > Filesystem IO ordering in the linux stack is not based on > block/request level flush barriers. i.e. there's no such thing as > "IO barriers" anymore. Filesystem IO ordering is based on a > completion-to-dispatch serialisation model at the filesystem level, > and that cannot be observed by looking at block layer traces.... > > This really looks like a crashmonkey bug, not a filesystem problem... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-13 6:36 ` Amir Goldstein @ 2018-03-13 18:05 ` Jayashree Mohan 0 siblings, 0 replies; 16+ messages in thread From: Jayashree Mohan @ 2018-03-13 18:05 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, Theodore Tso, fstests Hi Amir, We reported these bugs separately on the two mailing lists because, though the workload was the same, the behavior/bug was different on the two filesystems. On ext4 there was data loss, while on xfs the fzero operation was not persisted even after a fsync. On Tue, Mar 13, 2018 at 1:36 AM, Amir Goldstein <amir73il@gmail.com> wrote: > Since the exact same report was sent to the ext4 list, CC'ing Dave's response to > Ted and fstests, which seems an appropriate crowd for this discussion. > I think that future CrashMonkey findings should be CC'ed to either fsdevel > or fstests, especially if more than one filesystem is affected. Thanks for the suggestion. We will CC our future reports to these lists. Thanks, Jayashree ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-13  4:21 ` Dave Chinner
  2018-03-13  6:36   ` Amir Goldstein
@ 2018-03-13 16:57   ` Jayashree Mohan
  2018-03-13 22:45     ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread

From: Jayashree Mohan @ 2018-03-13 16:57 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez,
	fstests, Theodore Ts'o

Hi Dave,

Thanks for the response. CrashMonkey assumes the following behavior of
the disk cache. Let me know if any of this sounds unreasonable.

Whenever the underlying storage device has an associated cache, an IO
is marked completed the moment it reaches the disk cache. This does
not guarantee that the disk cache will persist the IOs in the same
order, unless there is a Flush/FUA. The order of completed writes as
seen by the user could be A, B, C, *Flush*, D, E, yet the disk cache
could write these back to persistent storage in the order, say, B, A,
C, E, D. The only invariant it ensures is that an order like
A, C, E, B, D is not possible, because writes A, B, C have to strictly
persist before D and E. However, you cannot ensure that (A, B, C) is
written to persistent storage in that same order.

CrashMonkey reorders bios in conformance with the guarantees provided
by the disk cache; we do not make any extra assumptions and we respect
the barrier operations.

On Mon, Mar 12, 2018 at 11:21 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
>> Hi,
>>
>> We've encountered what seems to be a crash consistency bug in
>> xfs(kernel 4.15) due to the interaction between delayed allocated
>> write and an unaligned fallocate(zero range) : Fallocate_zero_range is
>> not persisted even when followed by a fsync - thereby leading to a
>> loss of fsynced metadata operation on a file.
>>
>> Say we create a disk image with known data and quick format it.
>> 1. Now write 65K of data to a new file
>> 2.
Zero out a part of the above file using falloc_zero_range (60K+128)
>> - (60K+128+4096) - an unaligned block
>> 3. fsync the above file
>> 4. Crash
>
> So, effectively:
>
> $ xfs_io -xft -c "pwrite 0 65k" \
>	-c "fzero $((60 * 1024 + 128)) 4k" \
>	-c fsync \
>	-c "shutdown" /mnt/scratch/testfile
> $

Yes, this is the right translation of our workload.

>>
>> If we crash after the fsync, and allow reordering of the block IOs
>> between two flush/fua commands using Crashmonkey[1], then we can end
>> up not persisting the zero_range command in step 2 above, even if we
>> crashed after a fsync.
>
> Sorry, but what do you mean by "reordering of the block IOs"?

CrashMonkey reorders IOs between two barrier operations just like the
disk cache could, while respecting the guarantees provided by the disk
cache; CrashMonkey will never reorder bios across barriers.

>> This workload was inspired from xfstest/generic_042, which tests for
>> stale data exposure using aligned fallocate commands. It's worth
>> noting that f2fs and btrfs passes our test clean - irrespective of the
>> order of bios, user data is intact and fzero operation is correctly persisted.
>>
>> To reproduce this bug using CrashMonkey, simply run :
>> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
>> tests/generic_042/generic_042_fzero_unaligned.so
>>
>> and take a look at the <timestamp>-generic_042_fzero_unaligned.log
>> created in the build directory. This file has the list of block IOs
>> issued during the workload and the permutation of bios that lead to
>> this bug. You can also verify using blktrace that CrashMonkey only
>> reorders bios between two barrier operations(thereby such a crash
>> state could be encountered due to reordering blocks at the storage
>> stack). Note that tools like dm-log-writes cannot capture this bug
>> because this arises due to reordering blocks between barrier
>> operations.
>> >> Possible reason for this bug : >> On closely inspecting the reason for this bug, we discovered that the >> problem lies in updating a data block twice, without a barrier >> operation between the two updates. > > There's nothing wrong with doing that if the filesystem has waited > for completion of the first IOs before doing subsequent operations > on the block. > >> The blktrace for the above >> mentioned workload shows that sector 280 is updated twice - probably >> first due to the write and later due to the fzero operation. > > WHy is that happening? The trace I have from the above xfs_io > command is this: > > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] 1. 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] 2. 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] 3. 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] Again, though (2) is a single bio in your case, we notice that this sequential IO updates sector 280 too. By the previous discussion about disk cache, it is very well possible that (2) is ordered before (1), which gives rise to the bug we mentioned. > Which shows serialised IO dispatch and completion of the IO, and no > double write. You've only shown queueing of writes, which doesn't > tell me about completion order which is what really matters. > > In more detail. > > 1. zero-range forces a data flush to occur and then waits for it > to complete. > > .... 
> xfs_file_fallocate > xfs_zero_file_space(60-68k) > xfs_free_file_space > xfs_flush_unmap_range() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0xf000 size 0x10400 > xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > truncate_pagecache_range() > xfs_invalidatepage: pgoff 0xf000 size 0x10400 > xfs_releasepage: pgoff 0xf000 size 0x10400 > xfs_invalidatepage: pgoff 0x10000 size 0x10400 > xfs_releasepage: pgoff 0x10000 size 0x10400 > xfs_zero_range() > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > xfs_alloc_file_space() > xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30 > <nothing to allocate!> > ..... > > After the fzero command, we then run fsync: > > xfs_file_fsync() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0x0 size 0x10400 > .... > xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > <data writeback complete> > xfs_log_force_lsn() > xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] > > And that's the log force. That's all correct and expected behaviour > for pwrite, fzero and fsync. And when I dump the data in the file > after unmount/mount (because the shutdown stops all IO from the > filesytem), I get: > > 000000 146715 146715 146715 146715 146715 146715 146715 146715 > > 00f080 000000 000000 000000 000000 000000 000000 000000 000000 > * > 010080 146715 146715 146715 146715 146715 146715 146715 146715 > * > 010400 > > exactly the right data in the file. 
>
>> However
>> notice that there is no barrier between the two updates, and using
>> Crashmonkey we see the above mentioned bug when these two updates to
>> the same block are reordered. Essentially, a reordering here means
>> fzero goes through first, and is later overwritten by the delayed
>> allocated write.
>
>> 0.000179069 7104 Q WS 280 + 16 [c_harness]
>> 0.000594994 7104 Q R 280 + 8 [c_harness]
>> 0.000598216 7104 Q R 288 + 8 [c_harness]
>> 0.000620552 7104 Q WS 160 + 120 [c_harness]
>> 0.000653742 7104 Q WS 280 + 16 [c_harness]
>
> So this is where there's a difference. Note the write in my case:
>
> 0.083050032 3635 Q WS 160 + 136 [xfs_io]
>
> It's a single IO - for some reason the kernel you are running has
> split it into two IOs, but that doesn't make it wrong or invalid -
> it's not indicative of a bug in any way.
>
> What's wrong here is your assumption that you can reorder it around
> the previous IO to those blocks. That's *simply not possible*
> because the filesystem has waited for completion of each IO before
> issuing the next. The IO serialisation is being done at the
> filesystem level, not the IO level, and so CrashMonkey is
> effectively breaking the completion ordering guarantees storage
> hardware provides us with.
>
>> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]
>
> IOWs, once this pre-flush is completed, the data on disk is supposed
> to be that of the latter write because that one was received by the
> storage well after the first one was completed. Reordering a later
> IO in front of an already completed dependent IO (read or write!) is
> simply not allowed.

*Received by the storage* means received at the disk cache. I agree
that the fzero write went to the disk cache after the pwrite was
marked completed. But this does not ensure that the pwrite data moves
from the disk cache to persistent storage before the fzero data does.

>> This seems to be a bug, as it is not persisting the metadata operation
>> even in the presence of fsync.
>> >> Let me know if I am missing some detail here. > > Filesystem IO ordering in the linux stack is not based on > block/request level flush barriers. i.e. there's no such thing as > "IO barriers" anymore. Filesystem IO ordering is based on a > completion-to-dispatch serialisation model at the filesystem level, > and that cannot be observed by looking at block layer traces.... > > This really looks like a crashmonkey bug, not a filesystem problem... CrashMonkey therefore respects the guarantees provided by the disk cache, and assumes nothing more than that. I hope this provides more clarity on what CrashMonkey is trying to do, and why we think it is reasonable to do so. Thanks, Jayashree ^ permalink raw reply [flat|nested] 16+ messages in thread
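The invariant Jayashree states in this message - writes may be permuted freely within a flush window, but must never cross a Flush/FUA - can be written down as a small check. This is a hypothetical illustration of the stated model; `allowed` is an invented helper, not CrashMonkey code.

```python
# Hypothetical check of the cache model stated above: a persistence
# order is allowed iff no write crosses a flush barrier, while writes
# within the same flush window may land in any order.

def allowed(submitted, persisted):
    """submitted: write names interleaved with 'FLUSH' markers, in
    submission order. persisted: proposed order the writes hit media."""
    # epoch[w] = number of flushes issued before w was submitted.
    epoch, n = {}, 0
    for op in submitted:
        if op == 'FLUSH':
            n += 1
        else:
            epoch[op] = n
    # Legal iff epoch numbers never decrease along the persisted order.
    nums = [epoch[w] for w in persisted]
    return nums == sorted(nums)

# The exact example from the message: A, B, C, *Flush*, D, E.
submitted = ['A', 'B', 'C', 'FLUSH', 'D', 'E']
print(allowed(submitted, ['B', 'A', 'C', 'E', 'D']))  # within-window shuffle
print(allowed(submitted, ['A', 'C', 'E', 'B', 'D']))  # B crossed the flush
```

This captures the barrier invariant but, as Dave argues in the next message, it has no notion of completion ordering between writes inside the same window.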
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-13 16:57 ` Jayashree Mohan
@ 2018-03-13 22:45   ` Dave Chinner
  2018-03-14 13:16     ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread

From: Dave Chinner @ 2018-03-13 22:45 UTC (permalink / raw)
To: Jayashree Mohan
Cc: linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez,
	fstests, Theodore Ts'o

On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote:
> Hi Dave,
>
> Thanks for the response. CrashMonkey assumes the following behavior of
> disk cache. Let me know if any of this sounds unreasonable.
>
> Whenever the underlying storage device has an associated cache, the IO
> is marked completed the moment it reaches the disk cache. This does
> not guarantee that the disk cache would persist them in the same
> order, unless there is a Flush/FUA. The order of completed writes as
> seen by the user could be A, B, C, *Flush* D, E. However the disk
> cache could write these back to the persistent storage in the order
> say B, A, C, E, D. The only invariant it ensures is that writing in an
> order like A, C, E, B, D is
> not possible because, writes A,B,C have to strictly happen before D
> and E. However you cannot ensure that (A, B, C) is written to the
> persistent storage in the same order.
>
> CrashMonkey reorders bios in conformance to the guarantees provided by
> disk cache; we do not make any extra assumptions and we respect the
> barrier operations.

I think your model is wrong. Caches do not randomly re-order
completed IO operations to the *same LBA*. When a block is
overwritten, the cache contains the overwrite data and the previous
data is discarded. The previous data may be on disk, but it's no
longer in the cache.

e.g.
take a dependent filesystem read-modify-write cycle (I'm choosing this because that's the problem this fzero/fsync "bug" is apparently demonstrating) where we write data to disk, invalidate the kernel cache, read the data back off disk, zero it in memory, then write it back to disk, all in the one LBA:

<flush>
write A to disk, invalidate kernel cache
......
read A from disk into kernel cache
A' = <modify A>
write A' to disk
......
<flush>

The disk cache model you are describing allows writes to be reordered anywhere in the flush window regardless of their inter-IO completion dependencies. Hence you're allowing temporally ordered filesystem IO to the same LBA to be reordered like so:

<flush>
......
write A'
......
read A
A' = <modify A>
......
write A
......
<flush>

This violates causality. It's simply *not possible* for the disk cache to contain A' before either "write A", "read A" or the in-memory modification of A has been completed by the OS. Hence there is no way for a crash situation to leave the disk cache or the physical storage medium containing corruption that indicates A' was stored on disk before A.

> CrashMonkey therefore respects the guarantees provided by the disk
> cache, and assumes nothing more than that. I hope this provides more
> clarity on what
> CrashMonkey is trying to do, and why we think it is reasonable to do so.

It clearly demonstrates to me where CrashMonkey is broken and needs fixing - it needs to respect the ordering of temporally separate IO to the same LBA and not violate causality. Simulators that assume time travel is possible are not useful to us.

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
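[Editorial note: the causality constraint argued here can be expressed as an extra check layered on any reordering model: writes to the same LBA must appear in issue order, since the cache holds a single copy per block and an overwrite replaces the older data. A minimal sketch, for illustration only:]

```python
def respects_causality(issued, candidate):
    """For every LBA, writes to that LBA must appear in the candidate
    crash ordering in the same relative order they were issued: the
    cache keeps one copy per block, so an overwrite replaces the older
    data rather than racing with it."""
    lbas = {lba for lba, _ in issued}
    return all(
        [op for op in issued if op[0] == lba] ==
        [op for op in candidate if op[0] == lba]
        for lba in lbas
    )

# Sector 280 written twice within one flush window (as in the blktrace):
ops = [(280, "A"), (160, "M"), (280, "A'")]
# Reordering the write to sector 160 around them is physically possible:
ok = respects_causality(ops, [(160, "M"), (280, "A"), (280, "A'")])
# Swapping the two writes to sector 280 is not:
bad = respects_causality(ops, [(280, "A'"), (160, "M"), (280, "A")])
```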
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-13 22:45 ` Dave Chinner @ 2018-03-14 13:16 ` Lukas Czerner 2018-03-14 13:32 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-14 13:16 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Wed, Mar 14, 2018 at 09:45:22AM +1100, Dave Chinner wrote: > On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote: > > Hi Dave, > > > > Thanks for the response. CrashMonkey assumes the following behavior of > > disk cache. Let me know if any of this sounds unreasonable. > > > > Whenever the underlying storage device has an associated cache, the IO > > is marked completed the moment it reaches the disk cache. This does > > not guarantee that the disk cache would persist them in the same > > order, unless there is a Flush/FUA. The order of completed writes as > > seen by the user could be A, B, C, *Flush* D, E. However the disk > > cache could write these back to the persistent storage in the order > > say B, A, C, E, D. The only invariant it ensures is that writing in an > > order like A, C, E, B, D is > > not possible because, writes A,B,C have to strictly happen before D > > and E. However you cannot ensure that (A, B, C) is written to the > > persistent storage in the same order. > > > > CrashMonkey reorders bios in conformance to the guarantees provided by > > disk cache; we do not make any extra assumptions and we respect the > > barrier operations. > > I think your model is wrong. caches do not randomly re-order > completed IO operations to the *same LBA*. When a block is overwritten > the cache contains the overwrite data and the previous data is > discarded. THe previous data may be on disk, but it's no longer in > the cache. > > e.g. 
take a dependent filesystem read-modify-write cycle (I'm > choosing this because that's the problem this fzero/fsync > "bug" is apparently demonstrating) where we write data to disk, > invalidate the kernel cache, read the data back off disk, zero it > in memory, then write it back to disk, all in the one LBA: > > <flush> > write A to disk, invalidate kernel cache > ...... > read A from disk into kernel cache > A' = <modify A> > write A' to disk > ...... > <flush> > > The disk cache model you are describing allows writes > to be reordered anywhere in the flush window regardless of their > inter-IO completion dependencies. Hence you're allowing temporally > ordered filesystem IO to the same LBA be reorded like so: > > > <flush> > ...... > write A' > ...... > read A > A' = <modify A> > ...... > write A > ...... > <flush> > > This violates causality. it's simply *not possible for the disk > cache to contain A' before either "write A", "read A" or the > in-memory modification of A has been completed by the OS. Hence > there is no way for a crash situation to have the disk cache or the > physical storage medium to contain corruption that indicates it > stored A' on disk before stored A. > > > CrashMonkey therefore respects the guarantees provided by the disk > > cache, and assumes nothing more than that. I hope this provides more > > clarity on what > > CrashMonkey is trying to do, and why we think it is reasonable to do so. > > It clearly demonstrates to me where CrashMonkey is broken and needs > fixing - it needs to respect the ordering of temporally separate IO > to the same LBA and not violate causality. Simulators that assume > time travel is possible are not useful to us. > > Cheers, Hi Dave, just FYI the 042 xfstest does fail on xfs with what I think is stale data exposure. It might not be related at all to what crashmonkey is reporting but there is something wrong nevertheless. 
SECTION       -- ext4
RECREATING    -- xfs on /dev/vdc1
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 rhel7 4.16.0-rc5+
MKFS_OPTIONS  -- -f -f /dev/vdb1
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1

generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad)
    --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400
    +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-14 09:14:54.825969765 -0400
    @@ -5,6 +5,16 @@
     fpunch
     wrote 65536/65536 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
    +*
    +000f000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/generic/042.out /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad' to see the entire diff)
Ran: generic/042
Failures: generic/042
Failed 1 of 1 tests

ext4 fails as well; however, that looks to be a problem with the test itself, which expects the file to remain zero-length when that may not be the case.

-Lukas

> > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 13:16 ` Lukas Czerner @ 2018-03-14 13:32 ` Dave Chinner 2018-03-14 13:57 ` Lukas Czerner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2018-03-14 13:32 UTC (permalink / raw) To: Lukas Czerner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > On Wed, Mar 14, 2018 at 09:45:22AM +1100, Dave Chinner wrote: > > On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote: > > > Hi Dave, > > > > > > Thanks for the response. CrashMonkey assumes the following behavior of > > > disk cache. Let me know if any of this sounds unreasonable. > > > > > > Whenever the underlying storage device has an associated cache, the IO > > > is marked completed the moment it reaches the disk cache. This does > > > not guarantee that the disk cache would persist them in the same > > > order, unless there is a Flush/FUA. The order of completed writes as > > > seen by the user could be A, B, C, *Flush* D, E. However the disk > > > cache could write these back to the persistent storage in the order > > > say B, A, C, E, D. The only invariant it ensures is that writing in an > > > order like A, C, E, B, D is > > > not possible because, writes A,B,C have to strictly happen before D > > > and E. However you cannot ensure that (A, B, C) is written to the > > > persistent storage in the same order. > > > > > > CrashMonkey reorders bios in conformance to the guarantees provided by > > > disk cache; we do not make any extra assumptions and we respect the > > > barrier operations. > > > > I think your model is wrong. caches do not randomly re-order > > completed IO operations to the *same LBA*. When a block is overwritten > > the cache contains the overwrite data and the previous data is > > discarded. THe previous data may be on disk, but it's no longer in > > the cache. 
> > > > e.g. take a dependent filesystem read-modify-write cycle (I'm > > choosing this because that's the problem this fzero/fsync > > "bug" is apparently demonstrating) where we write data to disk, > > invalidate the kernel cache, read the data back off disk, zero it > > in memory, then write it back to disk, all in the one LBA: > > > > <flush> > > write A to disk, invalidate kernel cache > > ...... > > read A from disk into kernel cache > > A' = <modify A> > > write A' to disk > > ...... > > <flush> > > > > The disk cache model you are describing allows writes > > to be reordered anywhere in the flush window regardless of their > > inter-IO completion dependencies. Hence you're allowing temporally > > ordered filesystem IO to the same LBA be reorded like so: > > > > > > <flush> > > ...... > > write A' > > ...... > > read A > > A' = <modify A> > > ...... > > write A > > ...... > > <flush> > > > > This violates causality. it's simply *not possible for the disk > > cache to contain A' before either "write A", "read A" or the > > in-memory modification of A has been completed by the OS. Hence > > there is no way for a crash situation to have the disk cache or the > > physical storage medium to contain corruption that indicates it > > stored A' on disk before stored A. > > > > > CrashMonkey therefore respects the guarantees provided by the disk > > > cache, and assumes nothing more than that. I hope this provides more > > > clarity on what > > > CrashMonkey is trying to do, and why we think it is reasonable to do so. > > > > It clearly demonstrates to me where CrashMonkey is broken and needs > > fixing - it needs to respect the ordering of temporally separate IO > > to the same LBA and not violate causality. Simulators that assume > > time travel is possible are not useful to us. > > > > Cheers, > > Hi Dave, > > just FYI the 042 xfstest does fail on xfs with what I think is stale > data exposure. 
It might not be related at all to what crashmonkey is > reporting but there is something wrong nevertheless.

generic/042 is unreliable and certain operations result in a non-zero length file because of metadata commits/writeback that occur as a result of the fallocate operations. It got removed from the auto group about 3 years ago because it isn't a reliable test:

commit 7721b85016081cf6651d139e8af26267eceb23d3
Author: Jeff Moyer <jmoyer@redhat.com>
Date:   Mon Dec 21 18:01:47 2015 +1100

    generic/042: remove from the 'auto' group

    This test fails 100% of the time for me on xfs and current git head, and
    is not run for ext4 since ext4 does not support shutdown. After talking
    with bfoster, it isn't expected to succeed right now. Since the auto
    group is for tests that *are* expected to succeed, let's move this one
    out.

    Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

This is what it used to be testing - it was an XFS-only test:

commit cf1438248c0f62f3b64013d23e9e6d6bc23ca24b
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue Oct 14 22:59:39 2014 +1100

    xfs/053: test for stale data exposure via falloc/writeback interaction

    XFS buffered I/O writeback has a subtle race condition that leads to
    stale data exposure if the filesystem happens to crash after delayed
    allocation blocks are converted on disk and before data is written back
    to said blocks.

    Use file allocation commands to attempt to reproduce a related, but
    slightly different variant of this problem. The associated falloc
    commands can lead to partial writeback that converts an extent larger
    than the range affected by falloc. If the filesystem crashes after the
    extent conversion but before all other cached data is written to the
    extent, stale data can be exposed.
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

So, generic/042 was written to expose unwritten extent conversion issues in the XFS writeback path and has nothing to do with multiple writes being reordered. Its failure detection is unreliable and not indicative of an actual failure or data corruption, and that's why it was removed from the auto group....

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 13:32 ` Dave Chinner @ 2018-03-14 13:57 ` Lukas Czerner 2018-03-14 21:24 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-14 13:57 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > On Wed, Mar 14, 2018 at 09:45:22AM +1100, Dave Chinner wrote: > > > On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote: > > > > Hi Dave, > > > > > > > > Thanks for the response. CrashMonkey assumes the following behavior of > > > > disk cache. Let me know if any of this sounds unreasonable. > > > > > > > > Whenever the underlying storage device has an associated cache, the IO > > > > is marked completed the moment it reaches the disk cache. This does > > > > not guarantee that the disk cache would persist them in the same > > > > order, unless there is a Flush/FUA. The order of completed writes as > > > > seen by the user could be A, B, C, *Flush* D, E. However the disk > > > > cache could write these back to the persistent storage in the order > > > > say B, A, C, E, D. The only invariant it ensures is that writing in an > > > > order like A, C, E, B, D is > > > > not possible because, writes A,B,C have to strictly happen before D > > > > and E. However you cannot ensure that (A, B, C) is written to the > > > > persistent storage in the same order. > > > > > > > > CrashMonkey reorders bios in conformance to the guarantees provided by > > > > disk cache; we do not make any extra assumptions and we respect the > > > > barrier operations. > > > > > > I think your model is wrong. caches do not randomly re-order > > > completed IO operations to the *same LBA*. 
When a block is overwritten > > > the cache contains the overwrite data and the previous data is > > > discarded. THe previous data may be on disk, but it's no longer in > > > the cache. > > > > > > e.g. take a dependent filesystem read-modify-write cycle (I'm > > > choosing this because that's the problem this fzero/fsync > > > "bug" is apparently demonstrating) where we write data to disk, > > > invalidate the kernel cache, read the data back off disk, zero it > > > in memory, then write it back to disk, all in the one LBA: > > > > > > <flush> > > > write A to disk, invalidate kernel cache > > > ...... > > > read A from disk into kernel cache > > > A' = <modify A> > > > write A' to disk > > > ...... > > > <flush> > > > > > > The disk cache model you are describing allows writes > > > to be reordered anywhere in the flush window regardless of their > > > inter-IO completion dependencies. Hence you're allowing temporally > > > ordered filesystem IO to the same LBA be reorded like so: > > > > > > > > > <flush> > > > ...... > > > write A' > > > ...... > > > read A > > > A' = <modify A> > > > ...... > > > write A > > > ...... > > > <flush> > > > > > > This violates causality. it's simply *not possible for the disk > > > cache to contain A' before either "write A", "read A" or the > > > in-memory modification of A has been completed by the OS. Hence > > > there is no way for a crash situation to have the disk cache or the > > > physical storage medium to contain corruption that indicates it > > > stored A' on disk before stored A. > > > > > > > CrashMonkey therefore respects the guarantees provided by the disk > > > > cache, and assumes nothing more than that. I hope this provides more > > > > clarity on what > > > > CrashMonkey is trying to do, and why we think it is reasonable to do so. 
> > > > > > It clearly demonstrates to me where CrashMonkey is broken and needs > > > fixing - it needs to respect the ordering of temporally separate IO > > > to the same LBA and not violate causality. Simulators that assume > > > time travel is possible are not useful to us. > > > > > > Cheers, > > > > Hi Dave, > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > data exposure. It might not be related at all to what crashmonkey is > > reporting but there is something wrong nevertheless. > > generic/042 is unreliable and certain operations result in a > non-zero length file because of metadata commits/writeback that > occur as a result of the fallocate operations. It got removed from > the auto group because it isn't a reliable test about 3 years ago: Sure, it's just that it clearly exposes stale data on xfs. That is, the resulting file contains data that was previously written to the underlying image file to catch the exposure. I am aware of the non-zero length file problem; that's not what I am pointing out, though. -Lukas > > commit 7721b85016081cf6651d139e8af26267eceb23d3 > Author: Jeff Moyer <jmoyer@redhat.com> > Date: Mon Dec 21 18:01:47 2015 +1100 > > generic/042: remove from the 'auto' group > > This test fails 100% of the time for me on xfs and current git head, and > is not run for ext4 since ext4 does not support shutdown. After talking > with bfoster, it isn't expected to succeed right now. Since the auto > group is for tests that *are* expected to succeed, let's move this one > out.
> > Signed-off-by: Jeff Moyer <jmoyer@redhat.com> > Reviewed-by: Brian Foster <bfoster@redhat.com> > Signed-off-by: Dave Chinner <david@fromorbit.com> > > This is what it used to be testing - it was an XFS only test: > > commit cf1438248c0f62f3b64013d23e9e6d6bc23ca24b > Author: Brian Foster <bfoster@redhat.com> > Date: Tue Oct 14 22:59:39 2014 +1100 > > xfs/053: test for stale data exposure via falloc/writeback interaction > > XFS buffered I/O writeback has a subtle race condition that leads to > stale data exposure if the filesystem happens to crash after delayed > allocation blocks are converted on disk and before data is written back > to said blocks. > > Use file allocation commands to attempt to reproduce a related, but > slightly different variant of this problem. The associated falloc > commands can lead to partial writeback that converts an extent larger > than the range affected by falloc. If the filesystem crashes after the > extent conversion but before all other cached data is written to the > extent, stale data can be exposed. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > Reviewed-by: Dave Chinner <dchinner@redhat.com> > Signed-off-by: Dave Chinner <david@fromorbit.com> > > So, generic/042 was written to expose unwritten extent conversion > issues in the XFS writeback path and has nothing to do with mutliple > writes being reordered. It's failure detection is unreliable and not > indicative of an actual failure or data corruption, and that's why > it was removed from the auto group.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 13:57 ` Lukas Czerner @ 2018-03-14 21:24 ` Dave Chinner 2018-03-15 6:15 ` Lukas Czerner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2018-03-14 21:24 UTC (permalink / raw) To: Lukas Czerner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > data exposure. It might not be related at all to what crashmonkey is > > > reporting but there is something wrong nevertheless. > > > > generic/042 is unreliable and certain operations result in a > > non-zero length file because of metadata commits/writeback that > > occur as a result of the fallocate operations. It got removed from > > the auto group because it isn't a reliable test about 3 years ago: > > Sure, I just that it clearly exposes stale data on xfs. That is, the > resulting file contains data that was previously written to the > underlying image file to catch the exposure. I am aware of the non-zero > length file problem, that's not what I am pointing out though.

What setup are you testing on? I haven't seen it fail in some time. Here, on emulated pmem:

SECTION       -- xfs
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test4 4.16.0-rc5-dgc
MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1
MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch

xfs/042 10s ... 14s
Passed all 1 tests

SECTION       -- xfs
=========================
Passed all 1 tests

And on a different machine, with iSCSI devices and a slightly older kernel:

SECTION       -- xfs
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test2 4.16.0-rc2-dgc
MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/sdg
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sdg /mnt/scratch

xfs/042 14s ... 14s
Passed all 1 tests

SECTION       -- xfs
=========================
Passed all 1 tests

It passes just fine.

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 21:24 ` Dave Chinner @ 2018-03-15 6:15 ` Lukas Czerner 2018-03-15 10:06 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-15 6:15 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 08:24:41AM +1100, Dave Chinner wrote: > On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > > data exposure. It might not be related at all to what crashmonkey is > > > > reporting but there is something wrong nevertheless. > > > > > > generic/042 is unreliable and certain operations result in a > > > non-zero length file because of metadata commits/writeback that > > > occur as a result of the fallocate operations. It got removed from > > > the auto group because it isn't a reliable test about 3 years ago: > > > > Sure, I just that it clearly exposes stale data on xfs. That is, the > > resulting file contains data that was previously written to the > > underlying image file to catch the exposure. I am aware of the non-zero > > length file problem, that's not what I am pointing out though. > > What setup are you testing on? I haven't seen it fail in some time. > Here, on emulated pmem: Virtual machine with Virtio devices backed by a linear lvs consisting of SCSI drives, all local. > > SECTION -- xfs > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 test4 4.16.0-rc5-dgc > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1 > MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch > > xfs/042 10s ... 14s We are talking about generic/042. xfs/042 is very much a different test. 
SECTION       -- xfs
RECREATING    -- xfs on /dev/vdc1
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 rhel7 4.16.0-rc5+
MKFS_OPTIONS  -- -f -f /dev/vdb1
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1

generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad)
    --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400
    +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-15 02:15:02.872113819 -0400
    @@ -5,6 +5,16 @@
     fpunch
     wrote 65536/65536 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
    +*
    +000f000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/generic/042.out /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad' to see the entire diff)
xfs/042 15s ... 15s
Ran: generic/042 xfs/042
Failures: generic/042
Failed 1 of 2 tests

-Lukas

> Passed all 1 tests > > SECTION -- xfs > ========================= > Passed all 1 tests > > And on a different machine, with iSCSI devices and a slightly older > kernel: > > SECTION -- xfs > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 test2 4.16.0-rc2-dgc > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/sdg > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sdg /mnt/scratch > > xfs/042 14s ... 14s > Passed all 1 tests > > SECTION -- xfs > ========================= > Passed all 1 tests > > It passes just fine. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-15 6:15 ` Lukas Czerner @ 2018-03-15 10:06 ` Dave Chinner 2018-03-15 10:32 ` Lukas Czerner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2018-03-15 10:06 UTC (permalink / raw) To: Lukas Czerner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 07:15:57AM +0100, Lukas Czerner wrote: > On Thu, Mar 15, 2018 at 08:24:41AM +1100, Dave Chinner wrote: > > On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > > > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > > > data exposure. It might not be related at all to what crashmonkey is > > > > > reporting but there is something wrong nevertheless. > > > > > > > > generic/042 is unreliable and certain operations result in a > > > > non-zero length file because of metadata commits/writeback that > > > > occur as a result of the fallocate operations. It got removed from > > > > the auto group because it isn't a reliable test about 3 years ago: > > > > > > Sure, I just that it clearly exposes stale data on xfs. That is, the > > > resulting file contains data that was previously written to the > > > underlying image file to catch the exposure. I am aware of the non-zero > > > length file problem, that's not what I am pointing out though. > > > > What setup are you testing on? I haven't seen it fail in some time. > > Here, on emulated pmem: > > Virtual machine with Virtio devices backed by a linear lvs consisting of > SCSI drives, all local. > > > > > SECTION -- xfs > > FSTYP -- xfs (debug) > > PLATFORM -- Linux/x86_64 test4 4.16.0-rc5-dgc > > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1 > > MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch > > > > xfs/042 10s ... 
14s > > We are talking about generic/042. xfs/042 is very much a different > test.

Ugh copy-n-paste fail. Sorry. I was looking at the right test, just running the wrong one.

Anyway, what makes you think this:

> > > > SECTION -- xfs > RECREATING -- xfs on /dev/vdc1 > FSTYP -- xfs (non-debug) > PLATFORM -- Linux/x86_64 rhel7 4.16.0-rc5+ > MKFS_OPTIONS -- -f -f /dev/vdb1 > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1 > > generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad) > --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400 > +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-15 02:15:02.872113819 -0400 > @@ -5,6 +5,16 @@ > fpunch > wrote 65536/65536 bytes at offset 0 > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > +* > +000f000 0000 0000 0000 0000 0000 0000 0000 0000 > +*

exposes stale data? The command is:

$XFS_IO_PROG -f -c "pwrite -S 1 0 64k" -c "$cmd 60k 4k" $file

i.e. we wrote bytes from 0 to 64k, then punched from 60k to 64k. If the file is 64k in length, then it should contain either all "cdcd" pattern, or there should be "cdcd" data except for the range from 60k to 64k where there should be zeros. The latter is exactly what the diff output says - "cdcd" data from 0-60k, zeros from 60k to 64k. So there's no stale data exposure occurring here (those bugs got fixed!), it's just that the test output is unreliable and does not match the golden output.

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-15 10:06 ` Dave Chinner @ 2018-03-15 10:32 ` Lukas Czerner 2018-03-16 0:19 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-15 10:32 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 09:06:46PM +1100, Dave Chinner wrote: > On Thu, Mar 15, 2018 at 07:15:57AM +0100, Lukas Czerner wrote: > > On Thu, Mar 15, 2018 at 08:24:41AM +1100, Dave Chinner wrote: > > > On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > > > > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > > > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > > > > data exposure. It might not be related at all to what crashmonkey is > > > > > > reporting but there is something wrong nevertheless. > > > > > > > > > > generic/042 is unreliable and certain operations result in a > > > > > non-zero length file because of metadata commits/writeback that > > > > > occur as a result of the fallocate operations. It got removed from > > > > > the auto group because it isn't a reliable test about 3 years ago: > > > > > > > > Sure, I just that it clearly exposes stale data on xfs. That is, the > > > > resulting file contains data that was previously written to the > > > > underlying image file to catch the exposure. I am aware of the non-zero > > > > length file problem, that's not what I am pointing out though. > > > > > > What setup are you testing on? I haven't seen it fail in some time. > > > Here, on emulated pmem: > > > > Virtual machine with Virtio devices backed by a linear lvs consisting of > > SCSI drives, all local. 
> > > > > > SECTION -- xfs > > > FSTYP -- xfs (debug) > > > PLATFORM -- Linux/x86_64 test4 4.16.0-rc5-dgc > > > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1 > > > MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch > > > > > > xfs/042 10s ... 14s > > > > We are talking about generic/042. xfs/042 is very much a different > > test. > > Ugh copy-n-paste fail. Sorry. > > I was looking at the right test, just running the wrong one. > > Anyway, what makes you think this: > > > > > > SECTION -- xfs > > RECREATING -- xfs on /dev/vdc1 > > FSTYP -- xfs (non-debug) > > PLATFORM -- Linux/x86_64 rhel7 4.16.0-rc5+ > > MKFS_OPTIONS -- -f -f /dev/vdb1 > > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1 > > > > generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad) > > --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400 > > +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-15 02:15:02.872113819 -0400 > > @@ -5,6 +5,16 @@ > > fpunch > > wrote 65536/65536 bytes at offset 0 > > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > +* > > +000f000 0000 0000 0000 0000 0000 0000 0000 0000 > > +* > > exposes stale data? The command is: > > $XFS_IO_PROG -f -c "pwrite -S 1 0 64k" -c "$cmd 60k 4k" $file > > i.e. We wrote bytes from 0 to 64k, then punched from 60k to 64k. if > the file is 64k in length, then it should contain either all "cdcd" > pattern, or there should be "cdcd" data except for the range from > 60k to 64k where there should be zeros.

And where is the "cdcd" data coming from? Not from "pwrite -S 1 0 64k", which produces "0101". From man xfs_io:

-S   used to set the (repeated) fill pattern which is used when the data
     to write is not coming from a file. The default buffer fill pattern
     value is 0xcdcdcdcd.
It's coming from

$XFS_IO_PROG -f -c "truncate 0" -c "pwrite 0 25M" $img
_mkfs_dev $img

so it does expose data written to the device prior to mkfs.

-Lukas

> The latter is exactly what the diff output is saying - "cdcd" data from
> 0-60k, zeros from 60k to 64k. So there's no stale data exposure
> occurring here (those bugs got fixed!), it's just that the test output
> is unreliable and does not match the golden output.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
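The fill-pattern point above can be illustrated with a short Python sketch (not part of the thread; the od-style word rendering is my own construction): "pwrite -S 1" fills its buffer with 0x01 bytes, which od shows as "0101" words, while xfs_io's default fill of 0xcdcdcdcd shows up as "cdcd" words. So "cdcd" in the post-crash dump cannot have come from the test's own pwrite.

```python
# Sketch: why "cdcd" words in od output cannot come from "pwrite -S 1".
# "pwrite -S 1" fills the write buffer with 0x01 bytes; xfs_io's default
# fill pattern is 0xcdcdcdcd (per the man-page excerpt quoted above).

def od_words(data: bytes) -> str:
    """Render bytes the way od -x shows them: little-endian 16-bit words."""
    words = [data[i:i + 2] for i in range(0, len(data), 2)]
    return " ".join(f"{w[1]:02x}{w[0]:02x}" for w in words)

pwrite_s1 = bytes([0x01]) * 8      # what "pwrite -S 1" writes
default_fill = bytes([0xCD]) * 8   # xfs_io's default buffer fill

assert od_words(pwrite_s1) == "0101 0101 0101 0101"
assert od_words(default_fill) == "cdcd cdcd cdcd cdcd"
```

So the "cdcd" lines in the 042.out.bad diff point at the pre-mkfs image fill, exactly as Lukas says.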
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-15 10:32 ` Lukas Czerner
@ 2018-03-16  0:19 ` Dave Chinner
  2018-03-16  5:45 ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread

From: Dave Chinner @ 2018-03-16 0:19 UTC (permalink / raw)
To: Lukas Czerner
Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez, fstests, Theodore Ts'o

On Thu, Mar 15, 2018 at 11:32:07AM +0100, Lukas Czerner wrote:
> > And where is the "cdcd" data coming from? Not from the
> >
> > "pwrite -S 1 0 64k"
> >
> > that's producing "0101".

Ah, too much context switching. Makes me look like the village
idiot...

Ok, I just spent some time looking at this in detail (event traces,
etc.) and it is indeed as the test describes - a whole delalloc
extent allocation on partial overwrite. The partial overwrite is
triggered by the fpunch/fzero code, and it appears the delalloc
conversion is not paying attention to the start offset of the
writeback:

xfs_writepage: dev 7:0 ino 0x7603 pgoff 0xf000 size 0x10000 delalloc 1
.....
xfs_alloc_near_greater: dev 7:0 agno 0 agbno 3784 minlen 16 maxlen 16
.....
xfs_write_extent: dev 7:0 ino 0x7603 offset 0 block 3784 count 16 flag 0

IOWs, as I've said from the start (based on the test description),
this has nothing to do with the corruption issue CrashMonkey is
creating.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
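The trace above shows writeback starting at pgoff 0xf000 but the delalloc conversion mapping the extent from offset 0 for 16 blocks. A toy Python model of that behaviour (my own construction, not XFS code; block counts and names are illustrative) shows why it exposes stale data if the machine crashes after the mapping transaction commits but before the earlier blocks are written:

```python
# Toy model (mine, not XFS internals): a delalloc extent covering file
# blocks 0..15 gets converted to a real allocation when writeback starts
# at block 15 (pgoff 0xf000 in the trace above). If we crash after the
# mapping transaction commits but before blocks 0..14 are written, reads
# of those blocks return whatever was on disk before.

DISK = {blk: b"stale" for blk in range(16)}   # pre-mkfs device contents
mapping = {}                                   # file block -> disk block

def writeback(start_blk: int, dirty_pages: set) -> None:
    # The behaviour under discussion: conversion maps the WHOLE extent
    # (0..15) instead of just the range from start_blk being written.
    for blk in range(16):
        mapping[blk] = blk
    for blk in dirty_pages:                    # only this data hits disk
        DISK[blk] = b"data"

writeback(start_blk=15, dirty_pages={15})      # crash right after this

# After remount: block 15 has real data; blocks 0..14 expose stale data.
assert DISK[mapping[15]] == b"data"
assert all(DISK[mapping[b]] == b"stale" for b in range(15))
```

Mapping only from `start_blk` onwards (or using unwritten extents, as discussed later in the thread) would close this window.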
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-16  0:19 ` Dave Chinner
@ 2018-03-16  5:45 ` Lukas Czerner
  2018-03-17  3:16 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread

From: Lukas Czerner @ 2018-03-16 5:45 UTC (permalink / raw)
To: Dave Chinner
Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez, fstests, Theodore Ts'o

On Fri, Mar 16, 2018 at 11:19:55AM +1100, Dave Chinner wrote:
> On Thu, Mar 15, 2018 at 11:32:07AM +0100, Lukas Czerner wrote:
> > > > And where is the "cdcd" data coming from? Not from the
> > > >
> > > > "pwrite -S 1 0 64k"
> > > >
> > > > that's producing "0101".
>
> Ah, too much context switching. Makes me look like the village
> idiot...

that's how I've been looking for the past 4 emails ;)

Anyway, good that we finally understand each other. I think it'd be
worth fixing generic/042 to check for the actual stale data, not just
expecting the file to be empty, as it still can do some good.

> Ok, I just spent some time looking at this in detail (event traces,
> etc.) and it is indeed as the test describes - a whole delalloc
> extent allocation on partial overwrite. The partial overwrite is
> triggered by the fpunch/fzero code, and it appears the delalloc
> conversion is not paying attention to the start offset of the
> writeback:
>
> xfs_writepage: dev 7:0 ino 0x7603 pgoff 0xf000 size 0x10000 delalloc 1
> .....
> xfs_alloc_near_greater: dev 7:0 agno 0 agbno 3784 minlen 16 maxlen 16
> .....
> xfs_write_extent: dev 7:0 ino 0x7603 offset 0 block 3784 count 16 flag 0
>
> IOWs, as I've said from the start (based on the test description)
> this has nothing to do with the corruption issue CrashMonkey is
> creating.

Yeah, they're probably not seeing this because of the added fsync.

-Lukas

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
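Lukas's suggested test improvement - checking for the actual stale pattern rather than just expecting an empty file - could look roughly like the following sketch (my construction, not the actual xfstests change, which would be shell; the chunk size and fill pattern here are illustrative, the pattern itself being xfs_io's default fill from earlier in the thread):

```python
# Rough sketch of the suggested check: instead of requiring the
# post-crash file to be empty, scan whatever it contains for the fill
# pattern that was written to the device image before mkfs. Any hit is
# a real stale data exposure; a non-empty file of zeros or the test's
# own data is merely a golden-output mismatch.

STALE_FILL = b"\xcd\xcd\xcd\xcd"    # xfs_io default buffer fill

def stale_offsets(data: bytes, chunk: int = 512) -> list:
    """Return offsets of chunks consisting entirely of the stale fill."""
    hits = []
    for off in range(0, len(data), chunk):
        piece = data[off:off + chunk]
        if piece and piece == STALE_FILL * (len(piece) // 4):
            hits.append(off)
    return hits

clean = b"\x01" * 1024 + b"\x00" * 1024       # test data then zeros: fine
exposed = b"\x00" * 512 + STALE_FILL * 128    # 512 bytes of stale data
assert stale_offsets(clean) == []
assert stale_offsets(exposed) == [512]
```

This separates the two failure modes Dave and Lukas were talking past each other about: unreliable output versus genuine exposure of pre-mkfs data.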
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-16  5:45 ` Lukas Czerner
@ 2018-03-17  3:16 ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread

From: Dave Chinner @ 2018-03-17 3:16 UTC (permalink / raw)
To: Lukas Czerner
Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez, fstests, Theodore Ts'o

On Fri, Mar 16, 2018 at 06:45:16AM +0100, Lukas Czerner wrote:
> On Fri, Mar 16, 2018 at 11:19:55AM +1100, Dave Chinner wrote:
> > On Thu, Mar 15, 2018 at 11:32:07AM +0100, Lukas Czerner wrote:
> > > > > > And where is the "cdcd" data coming from? Not from the
> > > > > >
> > > > > > "pwrite -S 1 0 64k"
> > > > > >
> > > > > > that's producing "0101".
> >
> > Ah, too much context switching. Makes me look like the village
> > idiot...
>
> that's how I've been looking for the past 4 emails ;)
>
> Anyway, good that we finally understand each other. I think it'd be
> worth fixing generic/042 to check for the actual stale data, not just
> expecting the file to be empty, as it still can do some good.

I've got a fixed version of it. It starts by setting the file size to
128k and then fsyncing the file, so after the shutdown we should
always see a file full of zeros. Of course, the extent layout can be
anything, as long as the data returned after remount is all zeros.

> > Ok, I just spent some time looking at this in detail (event traces,
> > etc.) and it is indeed as the test describes - a whole delalloc
> > extent allocation on partial overwrite. The partial overwrite is
> > triggered by the fpunch/fzero code, and it appears the delalloc
> > conversion is not paying attention to the start offset of the
> > writeback:
> >
> > xfs_writepage: dev 7:0 ino 0x7603 pgoff 0xf000 size 0x10000 delalloc 1
> > .....
> > xfs_alloc_near_greater: dev 7:0 agno 0 agbno 3784 minlen 16 maxlen 16
> > .....
> > xfs_write_extent: dev 7:0 ino 0x7603 offset 0 block 3784 count 16 flag 0
> >
> > IOWs, as I've said from the start (based on the test description)
> > this has nothing to do with the corruption issue CrashMonkey is
> > creating.
>
> Yeah, they're probably not seeing this because of the added fsync.

Yes - the fsync ensures data vs metadata ordering by writing and
waiting for the data before flushing the log. In this case, there is
no data flush, and so when the log is flushed in the shutdown, the
metadata hits the disk before the data, and that's where the stale
data exposure comes from.

This is a well known issue - if allocation transactions are committed
before the data writes have been submitted, then a crash can expose
stale data - but it's very rarely seen with delalloc. Historically
it's always been seen with direct IO - that's why we use unwritten
extent allocation for direct IO at all times.

However, in XFS we can't do that with delayed allocation because the
information used by writeback is held on the bufferhead and is not
coherent with extent state. Hence we can't change a delalloc extent
to unwritten deep in the allocator, because then page writeback does
the wrong thing, as it expects to see a delalloc extent, not an
unwritten extent.

Fixing this problem is one of the reasons we want to get rid of
bufferheads from XFS. I've got patches that have gone one round of
review that remove bufferhead state from the XFS writepage engine,
but that will need to be fixed up and finished off before this
problem can be fixed. IIRC, when I wrote this test I'd just uncovered
the problem, realised it couldn't be fixed without bufferhead
removal, and so the test was added to remind us what needed
fixing....

I actually have fixes that make XFS pass this test, but there's a
fair chunk of work still to be done before they are ready for review.
I'm thinking of closing this hole entirely by always using unwritten
extent allocation for delalloc writes inside EOF, which will mean we
could probably go back to using delalloc with extent size hints, too
(yay!)...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
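The unwritten-extent scheme described above can be sketched with another toy model (my own construction, not XFS code): a read of an unwritten extent returns zeros regardless of the underlying disk contents, so committing the allocation before the data write can no longer expose stale data.

```python
# Toy model (mine): unwritten extents make the allocate-before-data-write
# window safe. An unwritten block reads as zeros even if the underlying
# disk block still holds stale contents; only the completed data write
# flips the extent to the written state.

DISK = {0: b"stale"}              # old contents of the allocated block
state = {0: "unwritten"}          # extent state after the alloc transaction

def read_block(blk: int) -> bytes:
    # Reads of unwritten extents must return zeros, not disk contents.
    return b"\x00" * 5 if state[blk] == "unwritten" else DISK[blk]

# Crash between allocation commit and data write: no stale data visible.
assert read_block(0) == b"\x00" * 5

# The data write stores the bytes and converts the extent to written.
DISK[0] = b"data!"
state[0] = "written"
assert read_block(0) == b"data!"
```

This is the same mechanism Dave mentions XFS already using for direct IO allocations.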
end of thread, other threads:[~2018-03-17  3:16 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-13  2:15 XFS crash consistency bug : Loss of fsynced metadata operation Jayashree Mohan
2018-03-13  4:21 ` Dave Chinner
2018-03-13  6:36 ` Amir Goldstein
2018-03-13 18:05 ` Jayashree Mohan
2018-03-13 16:57 ` Jayashree Mohan
2018-03-13 22:45 ` Dave Chinner
2018-03-14 13:16 ` Lukas Czerner
2018-03-14 13:32 ` Dave Chinner
2018-03-14 13:57 ` Lukas Czerner
2018-03-14 21:24 ` Dave Chinner
2018-03-15  6:15 ` Lukas Czerner
2018-03-15 10:06 ` Dave Chinner
2018-03-15 10:32 ` Lukas Czerner
2018-03-16  0:19 ` Dave Chinner
2018-03-16  5:45 ` Lukas Czerner
2018-03-17  3:16 ` Dave Chinner