* XFS crash consistency bug : Loss of fsynced metadata operation
@ 2018-03-13  2:15 Jayashree Mohan
  2018-03-13  4:21 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread

From: Jayashree Mohan @ 2018-03-13  2:15 UTC (permalink / raw)
To: linux-xfs; +Cc: Vijaychidambaram Velayudhan Pillai, Ashlie Martinez

Hi,

We've encountered what seems to be a crash consistency bug in xfs
(kernel 4.15) caused by the interaction between a delayed-allocation
write and an unaligned fallocate (zero range): the zero-range
operation is not persisted even when followed by an fsync, thereby
losing an fsynced metadata operation on a file.

Say we create a disk image with known data and quick-format it.
1. Write 65K of data to a new file.
2. Zero out part of the above file using falloc_zero_range over
   (60K+128) - (60K+128+4096) - an unaligned block.
3. fsync the above file.
4. Crash.

If we crash after the fsync, and allow reordering of the block IOs
between two flush/fua commands using CrashMonkey[1], then we can end
up not persisting the zero_range command in step 2 above, even though
we crashed after an fsync.

This workload was inspired by xfstest generic_042, which tests for
stale data exposure using aligned fallocate commands. It's worth
noting that f2fs and btrfs pass our test cleanly - irrespective of the
order of bios, user data is intact and the fzero operation is
correctly persisted.

To reproduce this bug using CrashMonkey, simply run:

./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
tests/generic_042/generic_042_fzero_unaligned.so

and take a look at the <timestamp>-generic_042_fzero_unaligned.log
created in the build directory. This file has the list of block IOs
issued during the workload and the permutation of bios that leads to
this bug. You can also verify using blktrace that CrashMonkey only
reorders bios between two barrier operations (so such a crash state
could be encountered due to reordering blocks in the storage stack).
Note that tools like dm-log-writes cannot capture this bug, because it
arises from reordering blocks between barrier operations.

Possible reason for this bug:
On closely inspecting this failure, we discovered that the problem
lies in a data block being updated twice with no barrier operation
between the two updates. The blktrace for the above workload shows
that sector 280 is updated twice - probably first due to the write and
later due to the fzero operation. However, notice that there is no
barrier between the two updates, and using CrashMonkey we see the
above bug when these two updates to the same block are reordered.
Essentially, a reordering here means the fzero write reaches the media
first and is then overwritten by the delayed-allocation write.

0.000179069 7104 Q WS 280 + 16 [c_harness]
0.000594994 7104 Q  R 280 + 8  [c_harness]
0.000598216 7104 Q  R 288 + 8  [c_harness]
0.000620552 7104 Q WS 160 + 120 [c_harness]
0.000653742 7104 Q WS 280 + 16 [c_harness]
0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]

This seems to be a bug, as the metadata operation is not persisted
even in the presence of an fsync. Let me know if I am missing some
detail here.

[1] https://github.com/utsaslab/crashmonkey.git

Thanks,
Jayashree Mohan

^ permalink raw reply	[flat|nested] 16+ messages in thread
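The permutation model the report describes - writes between two barrier (flush/FUA) operations may be persisted in any order and any subset, but never reordered across a barrier - can be sketched as follows. This is a hypothetical simplification for illustration, not CrashMonkey's actual code; the `crash_states` helper and the op encoding are invented here.

```python
import itertools

# Hypothetical sketch of the epoch-based crash-state model described
# above -- NOT CrashMonkey's implementation. Writes between two barrier
# (flush/FUA) operations form an epoch; a crash may persist any
# permutation of any subset of the epoch that was in flight.

def crash_states(ops):
    """ops: list of ('write', sector) entries and ('barrier',) markers.
    Yields every on-media write sequence this model considers reachable."""
    # Split the IO stream into epochs delimited by barriers.
    epochs, cur = [], []
    for op in ops:
        if op == ('barrier',):
            epochs.append(cur)
            cur = []
        else:
            cur.append(op)
    epochs.append(cur)

    # A crash exposes all fully-persisted epochs plus any permutation
    # of any subset of the epoch in flight when the crash hit.
    for i in range(len(epochs)):
        persisted = [w for e in epochs[:i] for w in e]
        inflight = epochs[i]
        for k in range(len(inflight) + 1):
            for perm in itertools.permutations(inflight, k):
                yield persisted + list(perm)

# The report's workload reduced to sectors: two writes touch sector 280
# with no barrier between them, then the fsync's flush.
workload = [('write', 280), ('write', 160), ('write', 280), ('barrier',)]
states = list(crash_states(workload))
# Under this model, crash states where the second update to sector 280
# lands before the first are considered reachable.
```

Under this model the reordering described in the report is "legal", which is exactly the assumption Dave challenges later in the thread.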
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-13  2:15 XFS crash consistency bug : Loss of fsynced metadata operation Jayashree Mohan
@ 2018-03-13  4:21 ` Dave Chinner
  2018-03-13  6:36   ` Amir Goldstein
  2018-03-13 16:57   ` Jayashree Mohan
  0 siblings, 2 replies; 16+ messages in thread

From: Dave Chinner @ 2018-03-13  4:21 UTC (permalink / raw)
To: Jayashree Mohan
Cc: linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez

On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
> Hi,
>
> We've encountered what seems to be a crash consistency bug in
> xfs(kernel 4.15) due to the interaction between delayed allocated
> write and an unaligned fallocate(zero range) : Fallocate_zero_range is
> not persisted even when followed by a fsync - thereby leading to a
> loss of fsynced metadata operation on a file.
>
> Say we create a disk image with known data and quick format it.
> 1. Now write 65K of data to a new file
> 2. Zero out a part of the above file using falloc_zero_range (60K+128)
> - (60K+128+4096) - an unaligned block
> 3. fsync the above file
> 4. Crash

So, effectively:

$ xfs_io -xft -c "pwrite 0 65k" \
	-c "fzero $((60 * 1024 + 128)) 4k" \
	-c fsync \
	-c "shutdown" /mnt/scratch/testfile
$

>
> If we crash after the fsync, and allow reordering of the block IOs
> between two flush/fua commands using Crashmonkey[1], then we can end
> up not persisting the zero_range command in step 2 above, even if we
> crashed after a fsync.

Sorry, but what do you mean by "reordering of the block IOs"?

> This workload was inspired from xfstest/generic_042, which tests for
> stale data exposure using aligned fallocate commands. It's worth
> noting that f2fs and btrfs passes our test clean - irrespective of the
> order of bios, user data is intact and fzero operation is correctly persisted.
>
> To reproduce this bug using CrashMonkey, simply run :
> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
> tests/generic_042/generic_042_fzero_unaligned.so
>
> and take a look at the <timestamp>-generic_042_fzero_unaligned.log
> created in the build directory. This file has the list of block IOs
> issued during the workload and the permutation of bios that lead to
> this bug. You can also verify using blktrace that CrashMonkey only
> reorders bios between two barrier operations(thereby such a crash
> state could be encountered due to reordering blocks at the storage
> stack). Note that tools like dm-log-writes cannot capture this bug
> because this arises due to reordering blocks between barrier
> operations.
>
> Possible reason for this bug :
> On closely inspecting the reason for this bug, we discovered that the
> problem lies in updating a data block twice, without a barrier
> operation between the two updates.

There's nothing wrong with doing that if the filesystem has waited
for completion of the first IOs before doing subsequent operations
on the block.

> The blktrace for the above
> mentioned workload shows that sector 280 is updated twice - probably
> first due to the write and later due to the fzero operation.

Why is that happening? The trace I have from the above xfs_io
command is this:

8,96   2      613     0.081499295  3635  Q  WS 280 + 16 [xfs_io]
8,96   2      617     0.081982168     0  C  WS 280 + 16 [0]
8,96   2      618     0.082033206  3635  Q   R 280 + 8 [xfs_io]
8,96   2      622     0.082420539     0  C   R 280 + 8 [0]
8,96   2      623     0.082475920  3635  Q   R 288 + 8 [xfs_io]
8,96   2      627     0.082988642     0  C   R 288 + 8 [0]
8,96   2      628     0.083050032  3635  Q  WS 160 + 136 [xfs_io]
8,96   2      632     0.084414651     0  C  WS 160 + 136 [0]
8,96   2      633     0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
8,96   2      636     0.107483279     0  C  WFSM 20971632 + 8 [0]
8,96   2      637     0.107485345     0  C  WFSM 20971632 [0]

Which shows serialised IO dispatch and completion of the IO, and no
double write.
You've only shown queueing of writes, which doesn't tell me about
completion order, which is what really matters.

In more detail:

1. zero-range forces a data flush to occur and then waits for it
to complete.

....
xfs_file_fallocate
  xfs_zero_file_space(60-68k)
    xfs_free_file_space
      xfs_flush_unmap_range()
        filemap_write_and_wait_range()
          xfs_writepage: pgoff 0xf000 size 0x10400
          xfs_writepage: pgoff 0x10000 size 0x10400
8,96   2      613     0.081499295  3635  Q  WS 280 + 16 [xfs_io]
8,96   2      617     0.081982168     0  C  WS 280 + 16 [0]
      truncate_pagecache_range()
        xfs_invalidatepage: pgoff 0xf000 size 0x10400
        xfs_releasepage: pgoff 0xf000 size 0x10400
        xfs_invalidatepage: pgoff 0x10000 size 0x10400
        xfs_releasepage: pgoff 0x10000 size 0x10400
    xfs_zero_range()
8,96   2      618     0.082033206  3635  Q   R 280 + 8 [xfs_io]
8,96   2      622     0.082420539     0  C   R 280 + 8 [0]
8,96   2      623     0.082475920  3635  Q   R 288 + 8 [xfs_io]
8,96   2      627     0.082988642     0  C   R 288 + 8 [0]
    xfs_alloc_file_space()
      xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30
      <nothing to allocate!>
.....

After the fzero command, we then run fsync:

xfs_file_fsync()
  filemap_write_and_wait_range()
    xfs_writepage: pgoff 0x0 size 0x10400
    ....
    xfs_writepage: pgoff 0x10000 size 0x10400
8,96   2      628     0.083050032  3635  Q  WS 160 + 136 [xfs_io]
8,96   2      632     0.084414651     0  C  WS 160 + 136 [0]
<data writeback complete>
  xfs_log_force_lsn()
    xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES
8,96   2      633     0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
8,96   2      636     0.107483279     0  C  WFSM 20971632 + 8 [0]
8,96   2      637     0.107485345     0  C  WFSM 20971632 [0]

And that's the log force. That's all correct and expected behaviour
for pwrite, fzero and fsync.
And when I dump the data in the file after unmount/mount (because the
shutdown stops all IO from the filesystem), I get:

000000 146715 146715 146715 146715 146715 146715 146715 146715

00f080 000000 000000 000000 000000 000000 000000 000000 000000
*
010080 146715 146715 146715 146715 146715 146715 146715 146715
*
010400

exactly the right data in the file.

> However
> notice that there is no barrier between the two updates, and using
> Crashmonkey we see the above mentioned bug when these two updates to
> the same block are reordered. Essentially, a reordering here means
> fzero goes through first, and is later overwritten by the delayed
> allocated write.

> 0.000179069 7104 Q WS 280 + 16 [c_harness]
> 0.000594994 7104 Q R 280 + 8 [c_harness]
> 0.000598216 7104 Q R 288 + 8 [c_harness]
> 0.000620552 7104 Q WS 160 + 120 [c_harness]
> 0.000653742 7104 Q WS 280 + 16 [c_harness]

So this is where there's a difference. Note the write in my case:

0.083050032 3635 Q WS 160 + 136 [xfs_io]

It's a single IO - for some reason the kernel you are running has
split it into two IOs, but that doesn't make it wrong or invalid -
it's not indicative of a bug in any way.

What's wrong here is your assumption that you can reorder it around
the previous IO to those blocks. That's *simply not possible*
because the filesystem has waited for completion of each IO before
issuing the next. The IO serialisation is being done at the
filesystem level, not the IO level, and so CrashMonkey is
effectively breaking the completion ordering guarantees storage
hardware provides us with.

> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]

IOWs, once this pre-flush is completed, the data on disk is supposed
to be that of the latter write, because that one was received by the
storage well after the first one was completed. Reordering a later
IO in front of an already completed dependent IO (read or write!) is
simply not allowed.
> This seems to be a bug, as it is not persisting the metadata operation > even in the presence of fsync. > > Let me know if I am missing some detail here. Filesystem IO ordering in the linux stack is not based on block/request level flush barriers. i.e. there's no such thing as "IO barriers" anymore. Filesystem IO ordering is based on a completion-to-dispatch serialisation model at the filesystem level, and that cannot be observed by looking at block layer traces.... This really looks like a crashmonkey bug, not a filesystem problem... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
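The completion-to-dispatch serialisation Dave describes can be expressed as a constraint on which reorderings a crash simulator may legally explore. The sketch below is a hypothetical illustration with invented helper names and loosely transcribed timestamps, not kernel or CrashMonkey code: if IO B was dispatched only after IO A's completion was observed, B must never be persisted ahead of A, even inside the same flush epoch.

```python
# Hypothetical sketch of the completion-to-dispatch ordering rule: an IO
# dispatched only after another IO had already completed must never be
# persisted ahead of it, even within the same flush epoch.

def legal_order(ios, order):
    """ios: name -> (dispatch_time, completion_time).
    order: proposed on-media persistence order.
    Illegal if some IO persists before another IO that had already
    completed by the time the first was dispatched."""
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            # 'a' persists before 'b'. If 'a' was dispatched after 'b'
            # completed, the device never held both in flight at once,
            # so swapping them violates causality.
            if ios[a][0] >= ios[b][1]:
                return False
    return True

# Times taken loosely from the trace above: the pre-zeroing flush of
# sector 280 completed before the fsync writeback was dispatched.
ios = {
    'flush_280': (0.0814, 0.0819),  # Q WS 280 + 16 ... C WS 280 + 16
    'fsync_wb':  (0.0830, 0.0844),  # Q WS 160 + 136 ... C WS 160 + 136
}
print(legal_order(ios, ['flush_280', 'fsync_wb']))  # actual order: legal
print(legal_order(ios, ['fsync_wb', 'flush_280']))  # swapped: illegal
```

Note that this dependency is invisible in a queue-only blktrace: it exists in the dispatch/completion timing, which is why Dave says it cannot be observed from block-layer traces alone.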
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-13 4:21 ` Dave Chinner @ 2018-03-13 6:36 ` Amir Goldstein 2018-03-13 18:05 ` Jayashree Mohan 2018-03-13 16:57 ` Jayashree Mohan 1 sibling, 1 reply; 16+ messages in thread From: Amir Goldstein @ 2018-03-13 6:36 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, Theodore Tso, fstests Since the exact same report was sent to the ext4 list, CC'ing Dave's response to Ted and fstests, which seems an appropriate crowd for this discussion. I think that future CrashMonkey findings should be CC'ed to either fsdevel or fstests, especially if more than one filesystem is affected. Thanks, Amir. On Tue, Mar 13, 2018 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote: >> Hi, >> >> We've encountered what seems to be a crash consistency bug in >> xfs(kernel 4.15) due to the interaction between delayed allocated >> write and an unaligned fallocate(zero range) : Fallocate_zero_range is >> not persisted even when followed by a fsync - thereby leading to a >> loss of fsynced metadata operation on a file. >> >> Say we create a disk image with known data and quick format it. >> 1. Now write 65K of data to a new file >> 2. Zero out a part of the above file using falloc_zero_range (60K+128) >> - (60K+128+4096) - an unaligned block >> 3. fsync the above file >> 4. Crash > > So, effectively: > > $ xfs_io -xft -c "pwrite 0 65k" \ > -c "fzero $((60 * 1024 + 128)) 4k" \ > -c fsync \ > -c "shutdown" > /mnt/scratch/testfile > $ > >> >> If we crash after the fsync, and allow reordering of the block IOs >> between two flush/fua commands using Crashmonkey[1], then we can end >> up not persisting the zero_range command in step 2 above, even if we >> crashed after a fsync. > > Sorry, but what do you mean by "reordering of the block IOs"? 
> >> This workload was inspired from xfstest/generic_042, which tests for >> stale data exposure using aligned fallocate commands. It's worth >> noting that f2fs and btrfs passes our test clean - irrespective of the >> order of bios, user data is intact and fzero operation is correctly persisted. >> >> To reproduce this bug using CrashMonkey, simply run : >> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v >> tests/generic_042/generic_042_fzero_unaligned.so >> >> and take a look at the <timestamp>-generic_042_fzero_unaligned.log >> created in the build directory. This file has the list of block IOs >> issued during the workload and the permutation of bios that lead to >> this bug. You can also verify using blktrace that CrashMonkey only >> reorders bios between two barrier operations(thereby such a crash >> state could be encountered due to reordering blocks at the storage >> stack). Note that tools like dm-log-writes cannot capture this bug >> because this arises due to reordering blocks between barrier >> operations. >> >> Possible reason for this bug : >> On closely inspecting the reason for this bug, we discovered that the >> problem lies in updating a data block twice, without a barrier >> operation between the two updates. > > There's nothing wrong with doing that if the filesystem has waited > for completion of the first IOs before doing subsequent operations > on the block. > >> The blktrace for the above >> mentioned workload shows that sector 280 is updated twice - probably >> first due to the write and later due to the fzero operation. > > WHy is that happening? 
The trace I have from the above xfs_io > command is this: > > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] > > Which shows serialised IO dispatch and completion of the IO, and no > double write. You've only shown queueing of writes, which doesn't > tell me about completion order which is what really matters. > > In more detail. > > 1. zero-range forces a data flush to occur and then waits for it > to complete. > > .... > xfs_file_fallocate > xfs_zero_file_space(60-68k) > xfs_free_file_space > xfs_flush_unmap_range() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0xf000 size 0x10400 > xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > truncate_pagecache_range() > xfs_invalidatepage: pgoff 0xf000 size 0x10400 > xfs_releasepage: pgoff 0xf000 size 0x10400 > xfs_invalidatepage: pgoff 0x10000 size 0x10400 > xfs_releasepage: pgoff 0x10000 size 0x10400 > xfs_zero_range() > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > xfs_alloc_file_space() > xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30 > <nothing to allocate!> > ..... > > After the fzero command, we then run fsync: > > xfs_file_fsync() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0x0 size 0x10400 > .... 
> xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > <data writeback complete> > xfs_log_force_lsn() > xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] > > And that's the log force. That's all correct and expected behaviour > for pwrite, fzero and fsync. And when I dump the data in the file > after unmount/mount (because the shutdown stops all IO from the > filesytem), I get: > > 000000 146715 146715 146715 146715 146715 146715 146715 146715 > > 00f080 000000 000000 000000 000000 000000 000000 000000 000000 > * > 010080 146715 146715 146715 146715 146715 146715 146715 146715 > * > 010400 > > exactly the right data in the file. > >> However >> notice that there is no barrier between the two updates, and using >> Crashmonkey we see the above mentioned bug when these two updates to >> the same block are reordered. Essentially, a reordering here means >> fzero goes through first, and is later overwritten by the delayed >> allocated write. > >> 0.000179069 7104 Q WS 280 + 16 [c_harness] >> 0.000594994 7104 Q R 280 + 8 [c_harness] >> 0.000598216 7104 Q R 288 + 8 [c_harness] >> 0.000620552 7104 Q WS 160 + 120 [c_harness] >> 0.000653742 7104 Q WS 280 + 16 [c_harness] > > So this is where there's a difference. Note the write in my case: > > 0.083050032 3635 Q WS 160 + 136 [xfs_io] > > It's a single IO - for some reason the kernel you are running has > split it into two IOs, but that doesn't make it wrong or invalid - > it's not indicative of a bug in any way. > > What's wrong here is your assumption that you can reorder it around > the previous IO to those blocks. That's *simply not possible* > because the filesystem has waited for completion of each IO before > issuing the next. 
The IO serialisation is being done at the > filesystem level, not the IO level, and so CrashMonkey is > effectively breaking the completion ordering guarantees storage > hardware provides us with. > >> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness] > > IOWs, once this pre-flush is completed, the data on disk is supposed > to be that of the latter write because that one was received by the > storage well after the first one was completed. Reordering a later > IO in front of an already completed dependent IO (read or write!) is > simply not allowed. > >> This seems to be a bug, as it is not persisting the metadata operation >> even in the presence of fsync. >> >> Let me know if I am missing some detail here. > > Filesystem IO ordering in the linux stack is not based on > block/request level flush barriers. i.e. there's no such thing as > "IO barriers" anymore. Filesystem IO ordering is based on a > completion-to-dispatch serialisation model at the filesystem level, > and that cannot be observed by looking at block layer traces.... > > This really looks like a crashmonkey bug, not a filesystem problem... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-13 6:36 ` Amir Goldstein @ 2018-03-13 18:05 ` Jayashree Mohan 0 siblings, 0 replies; 16+ messages in thread From: Jayashree Mohan @ 2018-03-13 18:05 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, Theodore Tso, fstests Hi Amir, We reported these bugs separately on the two mailing lists because, though the workload was the same, the behavior/bug was different on the two filesystems. On ext4 there was data loss, while on xfs the fzero operation was not persisted even after a fsync. On Tue, Mar 13, 2018 at 1:36 AM, Amir Goldstein <amir73il@gmail.com> wrote: > Since the exact same report was sent to the ext4 list, CC'ing Dave's response to > Ted and fstests, which seems an appropriate crowd for this discussion. > I think that future CrashMonkey findings should be CC'ed to either fsdevel > or fstests, especially if more than one filesystem is affected. Thanks for the suggestion. We will CC our future reports to these lists. Thanks, Jayashree ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-13  4:21 ` Dave Chinner
  2018-03-13  6:36   ` Amir Goldstein
@ 2018-03-13 16:57   ` Jayashree Mohan
  2018-03-13 22:45     ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread

From: Jayashree Mohan @ 2018-03-13 16:57 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez,
	fstests, Theodore Ts'o

Hi Dave,

Thanks for the response. CrashMonkey assumes the following behavior of
the disk cache. Let me know if any of this sounds unreasonable.

Whenever the underlying storage device has an associated cache, an IO
is marked completed the moment it reaches the disk cache. This does
not guarantee that the disk cache will persist the IOs in the same
order, unless there is a Flush/FUA. The order of completed writes as
seen by the user could be A, B, C, *Flush*, D, E, yet the disk cache
could write these back to persistent storage in the order, say, B, A,
C, E, D. The only invariant it ensures is that an order like
A, C, E, B, D is not possible, because writes A, B, C have to strictly
persist before D and E. However, you cannot ensure that (A, B, C) is
written to persistent storage in that same order.

CrashMonkey reorders bios in conformance with the guarantees provided
by the disk cache; we do not make any extra assumptions and we respect
the barrier operations.

On Mon, Mar 12, 2018 at 11:21 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
>> Hi,
>>
>> We've encountered what seems to be a crash consistency bug in
>> xfs(kernel 4.15) due to the interaction between delayed allocated
>> write and an unaligned fallocate(zero range) : Fallocate_zero_range is
>> not persisted even when followed by a fsync - thereby leading to a
>> loss of fsynced metadata operation on a file.
>>
>> Say we create a disk image with known data and quick format it.
>> 1. Now write 65K of data to a new file
>> 2.
Zero out a part of the above file using falloc_zero_range (60K+128)
>> - (60K+128+4096) - an unaligned block
>> 3. fsync the above file
>> 4. Crash
>
> So, effectively:
>
> $ xfs_io -xft -c "pwrite 0 65k" \
>	-c "fzero $((60 * 1024 + 128)) 4k" \
>	-c fsync \
>	-c "shutdown" /mnt/scratch/testfile
> $

Yes, this is the right translation of our workload.

>>
>> If we crash after the fsync, and allow reordering of the block IOs
>> between two flush/fua commands using Crashmonkey[1], then we can end
>> up not persisting the zero_range command in step 2 above, even if we
>> crashed after a fsync.
>
> Sorry, but what do you mean by "reordering of the block IOs"?

CrashMonkey reorders IOs between two barrier operations just like the
disk cache could, while respecting the guarantees provided by the disk
cache; CrashMonkey will never reorder bios across barriers.

>> This workload was inspired from xfstest/generic_042, which tests for
>> stale data exposure using aligned fallocate commands. It's worth
>> noting that f2fs and btrfs passes our test clean - irrespective of the
>> order of bios, user data is intact and fzero operation is correctly persisted.
>>
>> To reproduce this bug using CrashMonkey, simply run :
>> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
>> tests/generic_042/generic_042_fzero_unaligned.so
>>
>> and take a look at the <timestamp>-generic_042_fzero_unaligned.log
>> created in the build directory. This file has the list of block IOs
>> issued during the workload and the permutation of bios that lead to
>> this bug. You can also verify using blktrace that CrashMonkey only
>> reorders bios between two barrier operations(thereby such a crash
>> state could be encountered due to reordering blocks at the storage
>> stack). Note that tools like dm-log-writes cannot capture this bug
>> because this arises due to reordering blocks between barrier
>> operations.
>> >> Possible reason for this bug : >> On closely inspecting the reason for this bug, we discovered that the >> problem lies in updating a data block twice, without a barrier >> operation between the two updates. > > There's nothing wrong with doing that if the filesystem has waited > for completion of the first IOs before doing subsequent operations > on the block. > >> The blktrace for the above >> mentioned workload shows that sector 280 is updated twice - probably >> first due to the write and later due to the fzero operation. > > WHy is that happening? The trace I have from the above xfs_io > command is this: > > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] 1. 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] 2. 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] 3. 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] Again, though (2) is a single bio in your case, we notice that this sequential IO updates sector 280 too. By the previous discussion about disk cache, it is very well possible that (2) is ordered before (1), which gives rise to the bug we mentioned. > Which shows serialised IO dispatch and completion of the IO, and no > double write. You've only shown queueing of writes, which doesn't > tell me about completion order which is what really matters. > > In more detail. > > 1. zero-range forces a data flush to occur and then waits for it > to complete. > > .... 
> xfs_file_fallocate > xfs_zero_file_space(60-68k) > xfs_free_file_space > xfs_flush_unmap_range() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0xf000 size 0x10400 > xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 613 0.081499295 3635 Q WS 280 + 16 [xfs_io] > 8,96 2 617 0.081982168 0 C WS 280 + 16 [0] > truncate_pagecache_range() > xfs_invalidatepage: pgoff 0xf000 size 0x10400 > xfs_releasepage: pgoff 0xf000 size 0x10400 > xfs_invalidatepage: pgoff 0x10000 size 0x10400 > xfs_releasepage: pgoff 0x10000 size 0x10400 > xfs_zero_range() > 8,96 2 618 0.082033206 3635 Q R 280 + 8 [xfs_io] > 8,96 2 622 0.082420539 0 C R 280 + 8 [0] > 8,96 2 623 0.082475920 3635 Q R 288 + 8 [xfs_io] > 8,96 2 627 0.082988642 0 C R 288 + 8 [0] > xfs_alloc_file_space() > xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30 > <nothing to allocate!> > ..... > > After the fzero command, we then run fsync: > > xfs_file_fsync() > filemap_write_and_wait_range() > xfs_writepage: pgoff 0x0 size 0x10400 > .... > xfs_writepage: pgoff 0x10000 size 0x10400 > 8,96 2 628 0.083050032 3635 Q WS 160 + 136 [xfs_io] > 8,96 2 632 0.084414651 0 C WS 160 + 136 [0] > <data writeback complete> > xfs_log_force_lsn() > xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES > 8,96 2 633 0.084469733 3635 Q FWFSM 20971632 + 8 [xfs_io] > 8,96 2 636 0.107483279 0 C WFSM 20971632 + 8 [0] > 8,96 2 637 0.107485345 0 C WFSM 20971632 [0] > > And that's the log force. That's all correct and expected behaviour > for pwrite, fzero and fsync. And when I dump the data in the file > after unmount/mount (because the shutdown stops all IO from the > filesytem), I get: > > 000000 146715 146715 146715 146715 146715 146715 146715 146715 > > 00f080 000000 000000 000000 000000 000000 000000 000000 000000 > * > 010080 146715 146715 146715 146715 146715 146715 146715 146715 > * > 010400 > > exactly the right data in the file. 
>
>> However
>> notice that there is no barrier between the two updates, and using
>> Crashmonkey we see the above mentioned bug when these two updates to
>> the same block are reordered. Essentially, a reordering here means
>> fzero goes through first, and is later overwritten by the delayed
>> allocated write.
>
>> 0.000179069 7104 Q WS 280 + 16 [c_harness]
>> 0.000594994 7104 Q R 280 + 8 [c_harness]
>> 0.000598216 7104 Q R 288 + 8 [c_harness]
>> 0.000620552 7104 Q WS 160 + 120 [c_harness]
>> 0.000653742 7104 Q WS 280 + 16 [c_harness]
>
> So this is where there's a difference. Note the write in my case:
>
> 0.083050032 3635 Q WS 160 + 136 [xfs_io]
>
> It's a single IO - for some reason the kernel you are running has
> split it into two IOs, but that doesn't make it wrong or invalid -
> it's not indicative of a bug in any way.
>
> What's wrong here is your assumption that you can reorder it around
> the previous IO to those blocks. That's *simply not possible*
> because the filesystem has waited for completion of each IO before
> issuing the next. The IO serialisation is being done at the
> filesystem level, not the IO level, and so CrashMonkey is
> effectively breaking the completion ordering guarantees storage
> hardware provides us with.
>
>> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]
>
> IOWs, once this pre-flush is completed, the data on disk is supposed
> to be that of the latter write because that one was received by the
> storage well after the first one was completed. Reordering a later
> IO in front of an already completed dependent IO (read or write!) is
> simply not allowed.

*Received by the storage* means received at the disk cache. I agree
that the fzero write went to the disk cache after the pwrite was
marked completed. But this does not ensure that the pwrite data moves
from the disk cache to persistent storage before the fzero data does.

>> This seems to be a bug, as it is not persisting the metadata operation
>> even in the presence of fsync.
>> >> Let me know if I am missing some detail here. > > Filesystem IO ordering in the linux stack is not based on > block/request level flush barriers. i.e. there's no such thing as > "IO barriers" anymore. Filesystem IO ordering is based on a > completion-to-dispatch serialisation model at the filesystem level, > and that cannot be observed by looking at block layer traces.... > > This really looks like a crashmonkey bug, not a filesystem problem... CrashMonkey therefore respects the guarantees provided by the disk cache, and assumes nothing more than that. I hope this provides more clarity on what CrashMonkey is trying to do, and why we think it is reasonable to do so. Thanks, Jayashree ^ permalink raw reply [flat|nested] 16+ messages in thread
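The invariant Jayashree states in this message - writes may be permuted freely within a flush window, but must never cross a Flush/FUA - can be written down as a small check. This is a hypothetical illustration of the stated model; `allowed` is an invented helper, not CrashMonkey code.

```python
# Hypothetical check of the cache model stated above: a persistence
# order is allowed iff no write crosses a flush barrier, while writes
# within the same flush window may land in any order.

def allowed(submitted, persisted):
    """submitted: write names interleaved with 'FLUSH' markers, in
    submission order. persisted: proposed order the writes hit media."""
    # epoch[w] = number of flushes issued before w was submitted.
    epoch, n = {}, 0
    for op in submitted:
        if op == 'FLUSH':
            n += 1
        else:
            epoch[op] = n
    # Legal iff epoch numbers never decrease along the persisted order.
    nums = [epoch[w] for w in persisted]
    return nums == sorted(nums)

# The exact example from the message: A, B, C, *Flush*, D, E.
submitted = ['A', 'B', 'C', 'FLUSH', 'D', 'E']
print(allowed(submitted, ['B', 'A', 'C', 'E', 'D']))  # within-window shuffle
print(allowed(submitted, ['A', 'C', 'E', 'B', 'D']))  # B crossed the flush
```

This captures the barrier invariant but, as Dave argues in the next message, it has no notion of completion ordering between writes inside the same window.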
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-13 16:57 ` Jayashree Mohan
@ 2018-03-13 22:45   ` Dave Chinner
  2018-03-14 13:16     ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread

From: Dave Chinner @ 2018-03-13 22:45 UTC (permalink / raw)
To: Jayashree Mohan
Cc: linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez,
	fstests, Theodore Ts'o

On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote:
> Hi Dave,
>
> Thanks for the response. CrashMonkey assumes the following behavior of
> disk cache. Let me know if any of this sounds unreasonable.
>
> Whenever the underlying storage device has an associated cache, the IO
> is marked completed the moment it reaches the disk cache. This does
> not guarantee that the disk cache would persist them in the same
> order, unless there is a Flush/FUA. The order of completed writes as
> seen by the user could be A, B, C, *Flush* D, E. However the disk
> cache could write these back to the persistent storage in the order
> say B, A, C, E, D. The only invariant it ensures is that writing in an
> order like A, C, E, B, D is
> not possible because, writes A,B,C have to strictly happen before D
> and E. However you cannot ensure that (A, B, C) is written to the
> persistent storage in the same order.
>
> CrashMonkey reorders bios in conformance to the guarantees provided by
> disk cache; we do not make any extra assumptions and we respect the
> barrier operations.

I think your model is wrong. Caches do not randomly re-order
completed IO operations to the *same LBA*. When a block is
overwritten, the cache contains the overwrite data and the previous
data is discarded. The previous data may be on disk, but it's no
longer in the cache.

e.g.
take a dependent filesystem read-modify-write cycle (I'm choosing this because that's the problem this fzero/fsync "bug" is apparently demonstrating) where we write data to disk, invalidate the kernel cache, read the data back off disk, zero it in memory, then write it back to disk, all in the one LBA:

<flush>
write A to disk, invalidate kernel cache
......
read A from disk into kernel cache
A' = <modify A>
write A' to disk
......
<flush>

The disk cache model you are describing allows writes to be reordered anywhere in the flush window regardless of their inter-IO completion dependencies. Hence you're allowing temporally ordered filesystem IO to the same LBA to be reordered like so:

<flush>
......
write A'
......
read A
A' = <modify A>
......
write A
......
<flush>

This violates causality. It's simply *not possible* for the disk cache to contain A' before either "write A", "read A" or the in-memory modification of A has been completed by the OS. Hence there is no way for a crash situation to leave the disk cache or the physical storage medium containing corruption that indicates A' was stored on disk before A.

> CrashMonkey therefore respects the guarantees provided by the disk
> cache, and assumes nothing more than that. I hope this provides more
> clarity on what
> CrashMonkey is trying to do, and why we think it is reasonable to do so.

It clearly demonstrates to me where CrashMonkey is broken and needs fixing - it needs to respect the ordering of temporally separate IO to the same LBA and not violate causality. Simulators that assume time travel is possible are not useful to us.

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
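[Editorial note: the causality constraint argued here can be expressed as an extra check layered on any reordering model: writes to the same LBA must appear in issue order, since the cache holds a single copy per block and an overwrite replaces the older data. A minimal sketch, for illustration only:]

```python
def respects_causality(issued, candidate):
    """For every LBA, writes to that LBA must appear in the candidate
    crash ordering in the same relative order they were issued: the
    cache keeps one copy per block, so an overwrite replaces the older
    data rather than racing with it."""
    lbas = {lba for lba, _ in issued}
    return all(
        [op for op in issued if op[0] == lba] ==
        [op for op in candidate if op[0] == lba]
        for lba in lbas
    )

# Sector 280 written twice within one flush window (as in the blktrace):
ops = [(280, "A"), (160, "M"), (280, "A'")]
# Reordering the write to sector 160 around them is physically possible:
ok = respects_causality(ops, [(160, "M"), (280, "A"), (280, "A'")])
# Swapping the two writes to sector 280 is not:
bad = respects_causality(ops, [(280, "A'"), (160, "M"), (280, "A")])
```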
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-13 22:45 ` Dave Chinner @ 2018-03-14 13:16 ` Lukas Czerner 2018-03-14 13:32 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-14 13:16 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Wed, Mar 14, 2018 at 09:45:22AM +1100, Dave Chinner wrote: > On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote: > > Hi Dave, > > > > Thanks for the response. CrashMonkey assumes the following behavior of > > disk cache. Let me know if any of this sounds unreasonable. > > > > Whenever the underlying storage device has an associated cache, the IO > > is marked completed the moment it reaches the disk cache. This does > > not guarantee that the disk cache would persist them in the same > > order, unless there is a Flush/FUA. The order of completed writes as > > seen by the user could be A, B, C, *Flush* D, E. However the disk > > cache could write these back to the persistent storage in the order > > say B, A, C, E, D. The only invariant it ensures is that writing in an > > order like A, C, E, B, D is > > not possible because, writes A,B,C have to strictly happen before D > > and E. However you cannot ensure that (A, B, C) is written to the > > persistent storage in the same order. > > > > CrashMonkey reorders bios in conformance to the guarantees provided by > > disk cache; we do not make any extra assumptions and we respect the > > barrier operations. > > I think your model is wrong. caches do not randomly re-order > completed IO operations to the *same LBA*. When a block is overwritten > the cache contains the overwrite data and the previous data is > discarded. THe previous data may be on disk, but it's no longer in > the cache. > > e.g. 
take a dependent filesystem read-modify-write cycle (I'm > choosing this because that's the problem this fzero/fsync > "bug" is apparently demonstrating) where we write data to disk, > invalidate the kernel cache, read the data back off disk, zero it > in memory, then write it back to disk, all in the one LBA: > > <flush> > write A to disk, invalidate kernel cache > ...... > read A from disk into kernel cache > A' = <modify A> > write A' to disk > ...... > <flush> > > The disk cache model you are describing allows writes > to be reordered anywhere in the flush window regardless of their > inter-IO completion dependencies. Hence you're allowing temporally > ordered filesystem IO to the same LBA be reorded like so: > > > <flush> > ...... > write A' > ...... > read A > A' = <modify A> > ...... > write A > ...... > <flush> > > This violates causality. it's simply *not possible for the disk > cache to contain A' before either "write A", "read A" or the > in-memory modification of A has been completed by the OS. Hence > there is no way for a crash situation to have the disk cache or the > physical storage medium to contain corruption that indicates it > stored A' on disk before stored A. > > > CrashMonkey therefore respects the guarantees provided by the disk > > cache, and assumes nothing more than that. I hope this provides more > > clarity on what > > CrashMonkey is trying to do, and why we think it is reasonable to do so. > > It clearly demonstrates to me where CrashMonkey is broken and needs > fixing - it needs to respect the ordering of temporally separate IO > to the same LBA and not violate causality. Simulators that assume > time travel is possible are not useful to us. > > Cheers, Hi Dave, just FYI the 042 xfstest does fail on xfs with what I think is stale data exposure. It might not be related at all to what crashmonkey is reporting but there is something wrong nevertheless. 
SECTION       -- ext4
RECREATING    -- xfs on /dev/vdc1
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 rhel7 4.16.0-rc5+
MKFS_OPTIONS  -- -f -f /dev/vdb1
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1

generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad)
    --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400
    +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-14 09:14:54.825969765 -0400
    @@ -5,6 +5,16 @@
     fpunch
     wrote 65536/65536 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
    +*
    +000f000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/generic/042.out /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad' to see the entire diff)
Ran: generic/042
Failures: generic/042
Failed 1 of 1 tests

ext4 fails as well; however, that looks to be a problem with the test itself, which expects the file to remain zero-length when that may not be the case.

-Lukas

> > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe fstests" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 13:16 ` Lukas Czerner @ 2018-03-14 13:32 ` Dave Chinner 2018-03-14 13:57 ` Lukas Czerner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2018-03-14 13:32 UTC (permalink / raw) To: Lukas Czerner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > On Wed, Mar 14, 2018 at 09:45:22AM +1100, Dave Chinner wrote: > > On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote: > > > Hi Dave, > > > > > > Thanks for the response. CrashMonkey assumes the following behavior of > > > disk cache. Let me know if any of this sounds unreasonable. > > > > > > Whenever the underlying storage device has an associated cache, the IO > > > is marked completed the moment it reaches the disk cache. This does > > > not guarantee that the disk cache would persist them in the same > > > order, unless there is a Flush/FUA. The order of completed writes as > > > seen by the user could be A, B, C, *Flush* D, E. However the disk > > > cache could write these back to the persistent storage in the order > > > say B, A, C, E, D. The only invariant it ensures is that writing in an > > > order like A, C, E, B, D is > > > not possible because, writes A,B,C have to strictly happen before D > > > and E. However you cannot ensure that (A, B, C) is written to the > > > persistent storage in the same order. > > > > > > CrashMonkey reorders bios in conformance to the guarantees provided by > > > disk cache; we do not make any extra assumptions and we respect the > > > barrier operations. > > > > I think your model is wrong. caches do not randomly re-order > > completed IO operations to the *same LBA*. When a block is overwritten > > the cache contains the overwrite data and the previous data is > > discarded. THe previous data may be on disk, but it's no longer in > > the cache. 
> > > > e.g. take a dependent filesystem read-modify-write cycle (I'm > > choosing this because that's the problem this fzero/fsync > > "bug" is apparently demonstrating) where we write data to disk, > > invalidate the kernel cache, read the data back off disk, zero it > > in memory, then write it back to disk, all in the one LBA: > > > > <flush> > > write A to disk, invalidate kernel cache > > ...... > > read A from disk into kernel cache > > A' = <modify A> > > write A' to disk > > ...... > > <flush> > > > > The disk cache model you are describing allows writes > > to be reordered anywhere in the flush window regardless of their > > inter-IO completion dependencies. Hence you're allowing temporally > > ordered filesystem IO to the same LBA be reorded like so: > > > > > > <flush> > > ...... > > write A' > > ...... > > read A > > A' = <modify A> > > ...... > > write A > > ...... > > <flush> > > > > This violates causality. it's simply *not possible for the disk > > cache to contain A' before either "write A", "read A" or the > > in-memory modification of A has been completed by the OS. Hence > > there is no way for a crash situation to have the disk cache or the > > physical storage medium to contain corruption that indicates it > > stored A' on disk before stored A. > > > > > CrashMonkey therefore respects the guarantees provided by the disk > > > cache, and assumes nothing more than that. I hope this provides more > > > clarity on what > > > CrashMonkey is trying to do, and why we think it is reasonable to do so. > > > > It clearly demonstrates to me where CrashMonkey is broken and needs > > fixing - it needs to respect the ordering of temporally separate IO > > to the same LBA and not violate causality. Simulators that assume > > time travel is possible are not useful to us. > > > > Cheers, > > Hi Dave, > > just FYI the 042 xfstest does fail on xfs with what I think is stale > data exposure. 
It might not be related at all to what crashmonkey is > reporting but there is something wrong nevertheless.

generic/042 is unreliable and certain operations result in a non-zero length file because of metadata commits/writeback that occur as a result of the fallocate operations. It got removed from the auto group about 3 years ago because it isn't a reliable test:

commit 7721b85016081cf6651d139e8af26267eceb23d3
Author: Jeff Moyer <jmoyer@redhat.com>
Date:   Mon Dec 21 18:01:47 2015 +1100

    generic/042: remove from the 'auto' group

    This test fails 100% of the time for me on xfs and current git head, and
    is not run for ext4 since ext4 does not support shutdown. After talking
    with bfoster, it isn't expected to succeed right now. Since the auto
    group is for tests that *are* expected to succeed, let's move this one
    out.

    Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

This is what it used to be testing - it was an XFS-only test:

commit cf1438248c0f62f3b64013d23e9e6d6bc23ca24b
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue Oct 14 22:59:39 2014 +1100

    xfs/053: test for stale data exposure via falloc/writeback interaction

    XFS buffered I/O writeback has a subtle race condition that leads to
    stale data exposure if the filesystem happens to crash after delayed
    allocation blocks are converted on disk and before data is written back
    to said blocks.

    Use file allocation commands to attempt to reproduce a related, but
    slightly different variant of this problem. The associated falloc
    commands can lead to partial writeback that converts an extent larger
    than the range affected by falloc. If the filesystem crashes after the
    extent conversion but before all other cached data is written to the
    extent, stale data can be exposed.
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

So, generic/042 was written to expose unwritten extent conversion issues in the XFS writeback path and has nothing to do with multiple writes being reordered. Its failure detection is unreliable and not indicative of an actual failure or data corruption, and that's why it was removed from the auto group....

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 13:32 ` Dave Chinner @ 2018-03-14 13:57 ` Lukas Czerner 2018-03-14 21:24 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-14 13:57 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > On Wed, Mar 14, 2018 at 09:45:22AM +1100, Dave Chinner wrote: > > > On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote: > > > > Hi Dave, > > > > > > > > Thanks for the response. CrashMonkey assumes the following behavior of > > > > disk cache. Let me know if any of this sounds unreasonable. > > > > > > > > Whenever the underlying storage device has an associated cache, the IO > > > > is marked completed the moment it reaches the disk cache. This does > > > > not guarantee that the disk cache would persist them in the same > > > > order, unless there is a Flush/FUA. The order of completed writes as > > > > seen by the user could be A, B, C, *Flush* D, E. However the disk > > > > cache could write these back to the persistent storage in the order > > > > say B, A, C, E, D. The only invariant it ensures is that writing in an > > > > order like A, C, E, B, D is > > > > not possible because, writes A,B,C have to strictly happen before D > > > > and E. However you cannot ensure that (A, B, C) is written to the > > > > persistent storage in the same order. > > > > > > > > CrashMonkey reorders bios in conformance to the guarantees provided by > > > > disk cache; we do not make any extra assumptions and we respect the > > > > barrier operations. > > > > > > I think your model is wrong. caches do not randomly re-order > > > completed IO operations to the *same LBA*. 
When a block is overwritten > > > the cache contains the overwrite data and the previous data is > > > discarded. THe previous data may be on disk, but it's no longer in > > > the cache. > > > > > > e.g. take a dependent filesystem read-modify-write cycle (I'm > > > choosing this because that's the problem this fzero/fsync > > > "bug" is apparently demonstrating) where we write data to disk, > > > invalidate the kernel cache, read the data back off disk, zero it > > > in memory, then write it back to disk, all in the one LBA: > > > > > > <flush> > > > write A to disk, invalidate kernel cache > > > ...... > > > read A from disk into kernel cache > > > A' = <modify A> > > > write A' to disk > > > ...... > > > <flush> > > > > > > The disk cache model you are describing allows writes > > > to be reordered anywhere in the flush window regardless of their > > > inter-IO completion dependencies. Hence you're allowing temporally > > > ordered filesystem IO to the same LBA be reorded like so: > > > > > > > > > <flush> > > > ...... > > > write A' > > > ...... > > > read A > > > A' = <modify A> > > > ...... > > > write A > > > ...... > > > <flush> > > > > > > This violates causality. it's simply *not possible for the disk > > > cache to contain A' before either "write A", "read A" or the > > > in-memory modification of A has been completed by the OS. Hence > > > there is no way for a crash situation to have the disk cache or the > > > physical storage medium to contain corruption that indicates it > > > stored A' on disk before stored A. > > > > > > > CrashMonkey therefore respects the guarantees provided by the disk > > > > cache, and assumes nothing more than that. I hope this provides more > > > > clarity on what > > > > CrashMonkey is trying to do, and why we think it is reasonable to do so. 
> > > > > > It clearly demonstrates to me where CrashMonkey is broken and needs > > > fixing - it needs to respect the ordering of temporally separate IO > > > to the same LBA and not violate causality. Simulators that assume > > > time travel is possible are not useful to us. > > > > > > Cheers, > > > > Hi Dave, > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > data exposure. It might not be related at all to what crashmonkey is > > reporting but there is something wrong nevertheless. > > generic/042 is unreliable and certain operations result in a > non-zero length file because of metadata commits/writeback that > occur as a result of the fallocate operations. It got removed from > the auto group because it isn't a reliable test about 3 years ago: Sure, it's just that it clearly exposes stale data on xfs. That is, the resulting file contains data that was previously written to the underlying image file to catch the exposure. I am aware of the non-zero length file problem; that's not what I am pointing out, though. -Lukas > > commit 7721b85016081cf6651d139e8af26267eceb23d3 > Author: Jeff Moyer <jmoyer@redhat.com> > Date: Mon Dec 21 18:01:47 2015 +1100 > > generic/042: remove from the 'auto' group > > This test fails 100% of the time for me on xfs and current git head, and > is not run for ext4 since ext4 does not support shutdown. After talking > with bfoster, it isn't expected to succeed right now. Since the auto > group is for tests that *are* expected to succeed, let's move this one > out.
> > Signed-off-by: Jeff Moyer <jmoyer@redhat.com> > Reviewed-by: Brian Foster <bfoster@redhat.com> > Signed-off-by: Dave Chinner <david@fromorbit.com> > > This is what it used to be testing - it was an XFS only test: > > commit cf1438248c0f62f3b64013d23e9e6d6bc23ca24b > Author: Brian Foster <bfoster@redhat.com> > Date: Tue Oct 14 22:59:39 2014 +1100 > > xfs/053: test for stale data exposure via falloc/writeback interaction > > XFS buffered I/O writeback has a subtle race condition that leads to > stale data exposure if the filesystem happens to crash after delayed > allocation blocks are converted on disk and before data is written back > to said blocks. > > Use file allocation commands to attempt to reproduce a related, but > slightly different variant of this problem. The associated falloc > commands can lead to partial writeback that converts an extent larger > than the range affected by falloc. If the filesystem crashes after the > extent conversion but before all other cached data is written to the > extent, stale data can be exposed. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > Reviewed-by: Dave Chinner <dchinner@redhat.com> > Signed-off-by: Dave Chinner <david@fromorbit.com> > > So, generic/042 was written to expose unwritten extent conversion > issues in the XFS writeback path and has nothing to do with mutliple > writes being reordered. It's failure detection is unreliable and not > indicative of an actual failure or data corruption, and that's why > it was removed from the auto group.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 13:57 ` Lukas Czerner @ 2018-03-14 21:24 ` Dave Chinner 2018-03-15 6:15 ` Lukas Czerner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2018-03-14 21:24 UTC (permalink / raw) To: Lukas Czerner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > data exposure. It might not be related at all to what crashmonkey is > > > reporting but there is something wrong nevertheless. > > > > generic/042 is unreliable and certain operations result in a > > non-zero length file because of metadata commits/writeback that > > occur as a result of the fallocate operations. It got removed from > > the auto group because it isn't a reliable test about 3 years ago: > > Sure, I just that it clearly exposes stale data on xfs. That is, the > resulting file contains data that was previously written to the > underlying image file to catch the exposure. I am aware of the non-zero > length file problem, that's not what I am pointing out though.

What setup are you testing on? I haven't seen it fail in some time. Here, on emulated pmem:

SECTION       -- xfs
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test4 4.16.0-rc5-dgc
MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1
MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch

xfs/042 10s ... 14s
Passed all 1 tests

SECTION       -- xfs
=========================
Passed all 1 tests

And on a different machine, with iSCSI devices and a slightly older kernel:

SECTION       -- xfs
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test2 4.16.0-rc2-dgc
MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/sdg
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sdg /mnt/scratch

xfs/042 14s ... 14s
Passed all 1 tests

SECTION       -- xfs
=========================
Passed all 1 tests

It passes just fine.

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-14 21:24 ` Dave Chinner @ 2018-03-15 6:15 ` Lukas Czerner 2018-03-15 10:06 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-15 6:15 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 08:24:41AM +1100, Dave Chinner wrote: > On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > > data exposure. It might not be related at all to what crashmonkey is > > > > reporting but there is something wrong nevertheless. > > > > > > generic/042 is unreliable and certain operations result in a > > > non-zero length file because of metadata commits/writeback that > > > occur as a result of the fallocate operations. It got removed from > > > the auto group because it isn't a reliable test about 3 years ago: > > > > Sure, I just that it clearly exposes stale data on xfs. That is, the > > resulting file contains data that was previously written to the > > underlying image file to catch the exposure. I am aware of the non-zero > > length file problem, that's not what I am pointing out though. > > What setup are you testing on? I haven't seen it fail in some time. > Here, on emulated pmem: Virtual machine with Virtio devices backed by a linear lvs consisting of SCSI drives, all local. > > SECTION -- xfs > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 test4 4.16.0-rc5-dgc > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1 > MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch > > xfs/042 10s ... 14s We are talking about generic/042. xfs/042 is very much a different test. 
SECTION       -- xfs
RECREATING    -- xfs on /dev/vdc1
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 rhel7 4.16.0-rc5+
MKFS_OPTIONS  -- -f -f /dev/vdb1
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1

generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad)
    --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400
    +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-15 02:15:02.872113819 -0400
    @@ -5,6 +5,16 @@
     fpunch
     wrote 65536/65536 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
    +*
    +000f000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/generic/042.out /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad' to see the entire diff)
xfs/042 15s ... 15s
Ran: generic/042 xfs/042
Failures: generic/042
Failed 1 of 2 tests

-Lukas

> Passed all 1 tests > > SECTION -- xfs > ========================= > Passed all 1 tests > > And on a different machine, with iSCSI devices and a slightly older > kernel: > > SECTION -- xfs > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 test2 4.16.0-rc2-dgc > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/sdg > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sdg /mnt/scratch > > xfs/042 14s ... 14s > Passed all 1 tests > > SECTION -- xfs > ========================= > Passed all 1 tests > > It passes just fine. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-15 6:15 ` Lukas Czerner @ 2018-03-15 10:06 ` Dave Chinner 2018-03-15 10:32 ` Lukas Czerner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2018-03-15 10:06 UTC (permalink / raw) To: Lukas Czerner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 07:15:57AM +0100, Lukas Czerner wrote: > On Thu, Mar 15, 2018 at 08:24:41AM +1100, Dave Chinner wrote: > > On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > > > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > > > data exposure. It might not be related at all to what crashmonkey is > > > > > reporting but there is something wrong nevertheless. > > > > > > > > generic/042 is unreliable and certain operations result in a > > > > non-zero length file because of metadata commits/writeback that > > > > occur as a result of the fallocate operations. It got removed from > > > > the auto group because it isn't a reliable test about 3 years ago: > > > > > > Sure, I just that it clearly exposes stale data on xfs. That is, the > > > resulting file contains data that was previously written to the > > > underlying image file to catch the exposure. I am aware of the non-zero > > > length file problem, that's not what I am pointing out though. > > > > What setup are you testing on? I haven't seen it fail in some time. > > Here, on emulated pmem: > > Virtual machine with Virtio devices backed by a linear lvs consisting of > SCSI drives, all local. > > > > > SECTION -- xfs > > FSTYP -- xfs (debug) > > PLATFORM -- Linux/x86_64 test4 4.16.0-rc5-dgc > > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1 > > MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch > > > > xfs/042 10s ... 
14s > > We are talking about generic/042. xfs/042 is very much a different > test.

Ugh copy-n-paste fail. Sorry. I was looking at the right test, just running the wrong one.

Anyway, what makes you think this:

> > > > SECTION -- xfs > RECREATING -- xfs on /dev/vdc1 > FSTYP -- xfs (non-debug) > PLATFORM -- Linux/x86_64 rhel7 4.16.0-rc5+ > MKFS_OPTIONS -- -f -f /dev/vdb1 > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1 > > generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad) > --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400 > +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-15 02:15:02.872113819 -0400 > @@ -5,6 +5,16 @@ > fpunch > wrote 65536/65536 bytes at offset 0 > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > +* > +000f000 0000 0000 0000 0000 0000 0000 0000 0000 > +*

exposes stale data? The command is:

$XFS_IO_PROG -f -c "pwrite -S 1 0 64k" -c "$cmd 60k 4k" $file

i.e. we wrote bytes from 0 to 64k, then punched from 60k to 64k. If the file is 64k in length, then it should contain either all "cdcd" pattern, or there should be "cdcd" data except for the range from 60k to 64k where there should be zeros. The latter is exactly what the diff output says - "cdcd" data from 0-60k, zeros from 60k to 64k. So there's no stale data exposure occurring here (those bugs got fixed!), it's just that the test output is unreliable and does not match the golden output.

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: XFS crash consistency bug : Loss of fsynced metadata operation 2018-03-15 10:06 ` Dave Chinner @ 2018-03-15 10:32 ` Lukas Czerner 2018-03-16 0:19 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Lukas Czerner @ 2018-03-15 10:32 UTC (permalink / raw) To: Dave Chinner Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai, Ashlie Martinez, fstests, Theodore Ts'o On Thu, Mar 15, 2018 at 09:06:46PM +1100, Dave Chinner wrote: > On Thu, Mar 15, 2018 at 07:15:57AM +0100, Lukas Czerner wrote: > > On Thu, Mar 15, 2018 at 08:24:41AM +1100, Dave Chinner wrote: > > > On Wed, Mar 14, 2018 at 02:57:52PM +0100, Lukas Czerner wrote: > > > > On Thu, Mar 15, 2018 at 12:32:58AM +1100, Dave Chinner wrote: > > > > > On Wed, Mar 14, 2018 at 02:16:59PM +0100, Lukas Czerner wrote: > > > > > > just FYI the 042 xfstest does fail on xfs with what I think is stale > > > > > > data exposure. It might not be related at all to what crashmonkey is > > > > > > reporting but there is something wrong nevertheless. > > > > > > > > > > generic/042 is unreliable and certain operations result in a > > > > > non-zero length file because of metadata commits/writeback that > > > > > occur as a result of the fallocate operations. It got removed from > > > > > the auto group because it isn't a reliable test about 3 years ago: > > > > > > > > Sure, I just that it clearly exposes stale data on xfs. That is, the > > > > resulting file contains data that was previously written to the > > > > underlying image file to catch the exposure. I am aware of the non-zero > > > > length file problem, that's not what I am pointing out though. > > > > > > What setup are you testing on? I haven't seen it fail in some time. > > > Here, on emulated pmem: > > > > Virtual machine with Virtio devices backed by a linear lvs consisting of > > SCSI drives, all local. 
> > > > > > SECTION -- xfs > > > FSTYP -- xfs (debug) > > > PLATFORM -- Linux/x86_64 test4 4.16.0-rc5-dgc > > > MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 /dev/pmem1 > > > MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch > > > > > > xfs/042 10s ... 14s > > > > We are talking about generic/042. xfs/042 is very much a different > > test. > > Ugh copy-n-paste fail. Sorry. > > I was looking at the right test, just running the wrong one. > > Anyway, what makes you think this: > > > > > > SECTION -- xfs > > RECREATING -- xfs on /dev/vdc1 > > FSTYP -- xfs (non-debug) > > PLATFORM -- Linux/x86_64 rhel7 4.16.0-rc5+ > > MKFS_OPTIONS -- -f -f /dev/vdb1 > > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb1 /mnt/test1 > > > > generic/042 - output mismatch (see /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad) > > --- tests/generic/042.out 2018-03-14 05:56:38.619124060 -0400 > > +++ /root/Projects/xfstests-dev/results//ext4/generic/042.out.bad 2018-03-15 02:15:02.872113819 -0400 > > @@ -5,6 +5,16 @@ > > fpunch > > wrote 65536/65536 bytes at offset 0 > > XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > +0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd > > +* > > +000f000 0000 0000 0000 0000 0000 0000 0000 0000 > > +* > > exposes stale data? The command is: > > $XFS_IO_PROG -f -c "pwrite -S 1 0 64k" -c "$cmd 60k 4k" $file > > i.e. We wrote bytes from 0 to 64k, then punched from 60k to 64k. if > the file is 64k in length, then it should contain either all "cdcd" > pattern, or there should be "cdcd" data except for the range from > 60k to 64k where there should be zeros.

And where is the "cdcd" data coming from? Not from "pwrite -S 1 0 64k", which produces "0101". From man xfs_io:

-S   used to set the (repeated) fill pattern which is used when the data
     to write is not coming from a file. The default buffer fill pattern
     value is 0xcdcdcdcd.
It's coming from

$XFS_IO_PROG -f -c "truncate 0" -c "pwrite 0 25M" $img
_mkfs_dev $img

so it does expose data written to the device prior to mkfs.

-Lukas

> The latter is exactly what the diff output is saying - "cdcd" data from
> 0-60k, zeros from 60k to 64k. So there's no stale data exposure
> occurring here (those bugs got fixed!), it's just that the test output
> is unreliable and does not match the golden output.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
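The fill-pattern point above can be illustrated with a short Python sketch (not part of the thread; the od-style word rendering is my own construction): "pwrite -S 1" fills its buffer with 0x01 bytes, which od shows as "0101" words, while xfs_io's default fill of 0xcdcdcdcd shows up as "cdcd" words. So "cdcd" in the post-crash dump cannot have come from the test's own pwrite.

```python
# Sketch: why "cdcd" words in od output cannot come from "pwrite -S 1".
# "pwrite -S 1" fills the write buffer with 0x01 bytes; xfs_io's default
# fill pattern is 0xcdcdcdcd (per the man-page excerpt quoted above).

def od_words(data: bytes) -> str:
    """Render bytes the way od -x shows them: little-endian 16-bit words."""
    words = [data[i:i + 2] for i in range(0, len(data), 2)]
    return " ".join(f"{w[1]:02x}{w[0]:02x}" for w in words)

pwrite_s1 = bytes([0x01]) * 8      # what "pwrite -S 1" writes
default_fill = bytes([0xCD]) * 8   # xfs_io's default buffer fill

assert od_words(pwrite_s1) == "0101 0101 0101 0101"
assert od_words(default_fill) == "cdcd cdcd cdcd cdcd"
```

So the "cdcd" lines in the 042.out.bad diff point at the pre-mkfs image fill, exactly as Lukas says.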
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-15 10:32 ` Lukas Czerner
@ 2018-03-16  0:19 ` Dave Chinner
  2018-03-16  5:45 ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread

From: Dave Chinner @ 2018-03-16 0:19 UTC (permalink / raw)
To: Lukas Czerner
Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez, fstests, Theodore Ts'o

On Thu, Mar 15, 2018 at 11:32:07AM +0100, Lukas Czerner wrote:
> > And where is the "cdcd" data coming from? Not from the
> >
> > "pwrite -S 1 0 64k"
> >
> > that's producing "0101".

Ah, too much context switching. Makes me look like the village
idiot...

Ok, I just spent some time looking at this in detail (event traces,
etc.) and it is indeed as the test describes - a whole delalloc
extent allocation on partial overwrite. The partial overwrite is
triggered by the fpunch/fzero code, and it appears the delalloc
conversion is not paying attention to the start offset of the
writeback:

xfs_writepage: dev 7:0 ino 0x7603 pgoff 0xf000 size 0x10000 delalloc 1
.....
xfs_alloc_near_greater: dev 7:0 agno 0 agbno 3784 minlen 16 maxlen 16
.....
xfs_write_extent: dev 7:0 ino 0x7603 offset 0 block 3784 count 16 flag 0

IOWs, as I've said from the start (based on the test description),
this has nothing to do with the corruption issue CrashMonkey is
creating.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
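The trace above shows writeback starting at pgoff 0xf000 but the delalloc conversion mapping the extent from offset 0 for 16 blocks. A toy Python model of that behaviour (my own construction, not XFS code; block counts and names are illustrative) shows why it exposes stale data if the machine crashes after the mapping transaction commits but before the earlier blocks are written:

```python
# Toy model (mine, not XFS internals): a delalloc extent covering file
# blocks 0..15 gets converted to a real allocation when writeback starts
# at block 15 (pgoff 0xf000 in the trace above). If we crash after the
# mapping transaction commits but before blocks 0..14 are written, reads
# of those blocks return whatever was on disk before.

DISK = {blk: b"stale" for blk in range(16)}   # pre-mkfs device contents
mapping = {}                                   # file block -> disk block

def writeback(start_blk: int, dirty_pages: set) -> None:
    # The behaviour under discussion: conversion maps the WHOLE extent
    # (0..15) instead of just the range from start_blk being written.
    for blk in range(16):
        mapping[blk] = blk
    for blk in dirty_pages:                    # only this data hits disk
        DISK[blk] = b"data"

writeback(start_blk=15, dirty_pages={15})      # crash right after this

# After remount: block 15 has real data; blocks 0..14 expose stale data.
assert DISK[mapping[15]] == b"data"
assert all(DISK[mapping[b]] == b"stale" for b in range(15))
```

Mapping only from `start_blk` onwards (or using unwritten extents, as discussed later in the thread) would close this window.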
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-16  0:19 ` Dave Chinner
@ 2018-03-16  5:45 ` Lukas Czerner
  2018-03-17  3:16 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread

From: Lukas Czerner @ 2018-03-16 5:45 UTC (permalink / raw)
To: Dave Chinner
Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez, fstests, Theodore Ts'o

On Fri, Mar 16, 2018 at 11:19:55AM +1100, Dave Chinner wrote:
> On Thu, Mar 15, 2018 at 11:32:07AM +0100, Lukas Czerner wrote:
> > > > And where is the "cdcd" data coming from? Not from the
> > > >
> > > > "pwrite -S 1 0 64k"
> > > >
> > > > that's producing "0101".
>
> Ah, too much context switching. Makes me look like the village
> idiot...

that's how I've been looking for the past 4 emails ;)

Anyway, good that we finally understand each other. I think it'd be
worth fixing generic/042 to check for the actual stale data, not just
expecting the file to be empty, as it still can do some good.

> Ok, I just spent some time looking at this in detail (event traces,
> etc.) and it is indeed as the test describes - a whole delalloc
> extent allocation on partial overwrite. The partial overwrite is
> triggered by the fpunch/fzero code, and it appears the delalloc
> conversion is not paying attention to the start offset of the
> writeback:
>
> xfs_writepage: dev 7:0 ino 0x7603 pgoff 0xf000 size 0x10000 delalloc 1
> .....
> xfs_alloc_near_greater: dev 7:0 agno 0 agbno 3784 minlen 16 maxlen 16
> .....
> xfs_write_extent: dev 7:0 ino 0x7603 offset 0 block 3784 count 16 flag 0
>
> IOWs, as I've said from the start (based on the test description)
> this has nothing to do with the corruption issue CrashMonkey is
> creating.

Yeah, they're probably not seeing this because of the added fsync.

-Lukas

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
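Lukas's suggested test improvement - checking for the actual stale pattern rather than just expecting an empty file - could look roughly like the following sketch (my construction, not the actual xfstests change, which would be shell; the chunk size and fill pattern here are illustrative, the pattern itself being xfs_io's default fill from earlier in the thread):

```python
# Rough sketch of the suggested check: instead of requiring the
# post-crash file to be empty, scan whatever it contains for the fill
# pattern that was written to the device image before mkfs. Any hit is
# a real stale data exposure; a non-empty file of zeros or the test's
# own data is merely a golden-output mismatch.

STALE_FILL = b"\xcd\xcd\xcd\xcd"    # xfs_io default buffer fill

def stale_offsets(data: bytes, chunk: int = 512) -> list:
    """Return offsets of chunks consisting entirely of the stale fill."""
    hits = []
    for off in range(0, len(data), chunk):
        piece = data[off:off + chunk]
        if piece and piece == STALE_FILL * (len(piece) // 4):
            hits.append(off)
    return hits

clean = b"\x01" * 1024 + b"\x00" * 1024       # test data then zeros: fine
exposed = b"\x00" * 512 + STALE_FILL * 128    # 512 bytes of stale data
assert stale_offsets(clean) == []
assert stale_offsets(exposed) == [512]
```

This separates the two failure modes Dave and Lukas were talking past each other about: unreliable output versus genuine exposure of pre-mkfs data.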
* Re: XFS crash consistency bug : Loss of fsynced metadata operation
  2018-03-16  5:45 ` Lukas Czerner
@ 2018-03-17  3:16 ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread

From: Dave Chinner @ 2018-03-17 3:16 UTC (permalink / raw)
To: Lukas Czerner
Cc: Jayashree Mohan, linux-xfs, Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez, fstests, Theodore Ts'o

On Fri, Mar 16, 2018 at 06:45:16AM +0100, Lukas Czerner wrote:
> On Fri, Mar 16, 2018 at 11:19:55AM +1100, Dave Chinner wrote:
> > On Thu, Mar 15, 2018 at 11:32:07AM +0100, Lukas Czerner wrote:
> > > > > > And where is the "cdcd" data coming from? Not from the
> > > > > >
> > > > > > "pwrite -S 1 0 64k"
> > > > > >
> > > > > > that's producing "0101".
> >
> > Ah, too much context switching. Makes me look like the village
> > idiot...
>
> that's how I've been looking for the past 4 emails ;)
>
> Anyway, good that we finally understand each other. I think it'd be
> worth fixing generic/042 to check for the actual stale data, not just
> expecting the file to be empty, as it still can do some good.

I've got a fixed version of it. It starts by setting the file size to
128k and then fsyncing the file, so after the shutdown we should
always see a file full of zeros. Of course, the extent layout can be
anything, as long as the data returned after remount is all zeros.

> > Ok, I just spent some time looking at this in detail (event traces,
> > etc.) and it is indeed as the test describes - a whole delalloc
> > extent allocation on partial overwrite. The partial overwrite is
> > triggered by the fpunch/fzero code, and it appears the delalloc
> > conversion is not paying attention to the start offset of the
> > writeback:
> >
> > xfs_writepage: dev 7:0 ino 0x7603 pgoff 0xf000 size 0x10000 delalloc 1
> > .....
> > xfs_alloc_near_greater: dev 7:0 agno 0 agbno 3784 minlen 16 maxlen 16
> > .....
> > xfs_write_extent: dev 7:0 ino 0x7603 offset 0 block 3784 count 16 flag 0
> >
> > IOWs, as I've said from the start (based on the test description)
> > this has nothing to do with the corruption issue CrashMonkey is
> > creating.
>
> Yeah, they're probably not seeing this because of the added fsync.

Yes - the fsync ensures data vs metadata ordering by writing and
waiting for the data before flushing the log. In this case, there is
no data flush, and so when the log is flushed in the shutdown, the
metadata hits the disk before the data, and that's where the stale
data exposure comes from.

This is a well known issue - if allocation transactions are committed
before the data writes have been submitted, then a crash can expose
stale data - but it's very rarely seen with delalloc. Historically
it's always been seen with direct IO - that's why we use unwritten
extent allocation for direct IO at all times.

However, in XFS we can't do that with delayed allocation because the
information used by writeback is held on the bufferhead and is not
coherent with extent state. Hence we can't change a delalloc extent
to unwritten deep in the allocator, because then page writeback does
the wrong thing, as it expects to see a delalloc extent, not an
unwritten extent.

Fixing this problem is one of the reasons we want to get rid of
bufferheads from XFS. I've got patches that have gone one round of
review that remove bufferhead state from the XFS writepage engine,
but that will need to be fixed up and finished off before this
problem can be fixed. IIRC, when I wrote this test I'd just uncovered
the problem, realised it couldn't be fixed without bufferhead
removal, and so the test was added to remind us what needed
fixing....

I actually have fixes that make XFS pass this test, but there's a
fair chunk of work still to be done before they are ready for review.
I'm thinking of closing this hole entirely by always using unwritten
extent allocation for delalloc writes inside EOF, which will mean we
could probably go back to using delalloc with extent size hints, too
(yay!)...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
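The unwritten-extent scheme described above can be sketched with another toy model (my own construction, not XFS code): a read of an unwritten extent returns zeros regardless of the underlying disk contents, so committing the allocation before the data write can no longer expose stale data.

```python
# Toy model (mine): unwritten extents make the allocate-before-data-write
# window safe. An unwritten block reads as zeros even if the underlying
# disk block still holds stale contents; only the completed data write
# flips the extent to the written state.

DISK = {0: b"stale"}              # old contents of the allocated block
state = {0: "unwritten"}          # extent state after the alloc transaction

def read_block(blk: int) -> bytes:
    # Reads of unwritten extents must return zeros, not disk contents.
    return b"\x00" * 5 if state[blk] == "unwritten" else DISK[blk]

# Crash between allocation commit and data write: no stale data visible.
assert read_block(0) == b"\x00" * 5

# The data write stores the bytes and converts the extent to written.
DISK[0] = b"data!"
state[0] = "written"
assert read_block(0) == b"data!"
```

This is the same mechanism Dave mentions XFS already using for direct IO allocations.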
end of thread, other threads:[~2018-03-17  3:16 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-13  2:15 XFS crash consistency bug : Loss of fsynced metadata operation Jayashree Mohan
2018-03-13  4:21 ` Dave Chinner
2018-03-13  6:36 ` Amir Goldstein
2018-03-13 18:05 ` Jayashree Mohan
2018-03-13 16:57 ` Jayashree Mohan
2018-03-13 22:45 ` Dave Chinner
2018-03-14 13:16 ` Lukas Czerner
2018-03-14 13:32 ` Dave Chinner
2018-03-14 13:57 ` Lukas Czerner
2018-03-14 21:24 ` Dave Chinner
2018-03-15  6:15 ` Lukas Czerner
2018-03-15 10:06 ` Dave Chinner
2018-03-15 10:32 ` Lukas Czerner
2018-03-16  0:19 ` Dave Chinner
2018-03-16  5:45 ` Lukas Czerner
2018-03-17  3:16 ` Dave Chinner