Date: Tue, 13 Mar 2018 15:21:27 +1100
From: Dave Chinner <david@fromorbit.com>
To: Jayashree Mohan
Cc: linux-xfs@vger.kernel.org,
    Vijaychidambaram Velayudhan Pillai,
    Ashlie Martinez
Subject: Re: XFS crash consistency bug : Loss of fsynced metadata operation
Message-ID: <20180313042126.GB18129@dastard>
List-Id: xfs

On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
> Hi,
>
> We've encountered what seems to be a crash consistency bug in
> xfs (kernel 4.15) due to the interaction between a delayed allocation
> write and an unaligned fallocate (zero range): the zero-range
> operation is not persisted even when followed by an fsync - thereby
> leading to the loss of an fsynced metadata operation on a file.
>
> Say we create a disk image with known data and quick format it.
> 1. Write 65K of data to a new file
> 2. Zero out part of the above file using falloc_zero_range over
>    (60K+128) - (60K+128+4096), an unaligned block
> 3. fsync the above file
> 4. Crash

So, effectively:

$ xfs_io -xft -c "pwrite 0 65k" \
	-c "fzero $((60 * 1024 + 128)) 4k" \
	-c fsync \
	-c "shutdown" /mnt/scratch/testfile
$
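For anyone following along at the syscall level, the workload the test
encodes is roughly the following. This is a minimal sketch, not
CrashMonkey's actual test code - the path, fill pattern and the
trimmed error handling are illustrative assumptions:

	/* sketch of the reported workload via raw syscalls */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[65 * 1024];
		/* assumed scratch file location */
		int fd = open("/mnt/scratch/testfile",
			      O_CREAT | O_TRUNC | O_RDWR, 0644);

		memset(buf, 0xaa, sizeof(buf));	 /* arbitrary pattern */
		pwrite(fd, buf, sizeof(buf), 0); /* 1. 65k delalloc write */

		/* 2. zero an unaligned 4k range starting at 60k + 128 */
		fallocate(fd, FALLOC_FL_ZERO_RANGE, 60 * 1024 + 128, 4096);

		fsync(fd);	/* 3. persist data and metadata */

		/* 4. the crash is simulated externally by the harness */
		close(fd);
		return 0;
	}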
> If we crash after the fsync, and allow reordering of the block IOs
> between two flush/fua commands using CrashMonkey[1], then we can end
> up not persisting the zero-range command in step 2 above, even if we
> crashed after an fsync.

Sorry, but what do you mean by "reordering of the block IOs"?

> This workload was inspired by xfstests generic/042, which tests for
> stale data exposure using aligned fallocate commands. It's worth
> noting that f2fs and btrfs pass our test cleanly - irrespective of
> the order of bios, user data is intact and the fzero operation is
> correctly persisted.
>
> To reproduce this bug using CrashMonkey, simply run:
>
> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v
> tests/generic_042/generic_042_fzero_unaligned.so
>
> and take a look at the -generic_042_fzero_unaligned.log
> created in the build directory. This file has the list of block IOs
> issued during the workload and the permutation of bios that leads to
> this bug. You can also verify using blktrace that CrashMonkey only
> reorders bios between two barrier operations (so such a crash state
> could be encountered due to blocks being reordered in the storage
> stack). Note that tools like dm-log-writes cannot capture this bug,
> because it arises from reordering blocks between barrier operations.
>
> Possible reason for this bug:
> On closely inspecting the reason for this bug, we discovered that the
> problem lies in updating a data block twice, without a barrier
> operation between the two updates.

There's nothing wrong with doing that if the filesystem has waited for
completion of the first IO before doing subsequent operations on the
block.

> The blktrace for the above mentioned workload shows that sector 280
> is updated twice - probably first due to the write and later due to
> the fzero operation.

Why is that happening? The trace I have from the above xfs_io command
is this:

  8,96   2      613     0.081499295  3635  Q  WS 280 + 16 [xfs_io]
  8,96   2      617     0.081982168     0  C  WS 280 + 16 [0]
  8,96   2      618     0.082033206  3635  Q   R 280 + 8 [xfs_io]
  8,96   2      622     0.082420539     0  C   R 280 + 8 [0]
  8,96   2      623     0.082475920  3635  Q   R 288 + 8 [xfs_io]
  8,96   2      627     0.082988642     0  C   R 288 + 8 [0]
  8,96   2      628     0.083050032  3635  Q  WS 160 + 136 [xfs_io]
  8,96   2      632     0.084414651     0  C  WS 160 + 136 [0]
  8,96   2      633     0.084469733  3635  Q FWFSM 20971632 + 8 [xfs_io]
  8,96   2      636     0.107483279     0  C WFSM 20971632 + 8 [0]
  8,96   2      637     0.107485345     0  C WFSM 20971632 [0]

This shows serialised dispatch and completion of the IO, and no double
write. You've only shown the queueing of writes, which doesn't tell me
about completion order - and completion order is what really matters.

In more detail:

1. zero-range forces a data flush to occur and then waits for it to
complete:

xfs_file_fallocate
  xfs_zero_file_space(60-68k)
    xfs_free_file_space
      xfs_flush_unmap_range()
        filemap_write_and_wait_range()
          xfs_writepage: pgoff 0xf000 size 0x10400
          xfs_writepage: pgoff 0x10000 size 0x10400

  8,96   2      613     0.081499295  3635  Q  WS 280 + 16 [xfs_io]
  8,96   2      617     0.081982168     0  C  WS 280 + 16 [0]

      truncate_pagecache_range()
        xfs_invalidatepage: pgoff 0xf000 size 0x10400
        xfs_releasepage: pgoff 0xf000 size 0x10400
        xfs_invalidatepage: pgoff 0x10000 size 0x10400
        xfs_releasepage: pgoff 0x10000 size 0x10400
    xfs_zero_range()

  8,96   2      618     0.082033206  3635  Q   R 280 + 8 [xfs_io]
  8,96   2      622     0.082420539     0  C   R 280 + 8 [0]
  8,96   2      623     0.082475920  3635  Q   R 288 + 8 [xfs_io]
  8,96   2      627     0.082988642     0  C   R 288 + 8 [0]

      xfs_alloc_file_space()
        xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30
.....

2. After the fzero command, we then run fsync:

xfs_file_fsync()
  filemap_write_and_wait_range()
    xfs_writepage: pgoff 0x0 size 0x10400
    ....
    xfs_writepage: pgoff 0x10000 size 0x10400

  8,96   2      628     0.083050032  3635  Q  WS 160 + 136 [xfs_io]
  8,96   2      632     0.084414651     0  C  WS 160 + 136 [0]

  xfs_log_force_lsn()
    xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES

  8,96   2      633     0.084469733  3635  Q FWFSM 20971632 + 8 [xfs_io]
  8,96   2      636     0.107483279     0  C WFSM 20971632 + 8 [0]
  8,96   2      637     0.107485345     0  C WFSM 20971632 [0]

And that's the log force. That's all correct and expected behaviour
for pwrite, fzero and fsync. And when I dump the data in the file
after unmount/mount (because the shutdown stops all IO from the
filesystem), I get:

000000 146715 146715 146715 146715 146715 146715 146715 146715
*
00f080 000000 000000 000000 000000 000000 000000 000000 000000
*
010080 146715 146715 146715 146715 146715 146715 146715 146715
*
010400

i.e. exactly the right data in the file.

> However, notice that there is no barrier between the two updates, and
> using CrashMonkey we see the above mentioned bug when these two
> updates to the same block are reordered. Essentially, a reordering
> here means the fzero goes through first, and is later overwritten by
> the delayed allocation write.
>
> 0.000179069 7104 Q WS 280 + 16 [c_harness]
> 0.000594994 7104 Q R 280 + 8 [c_harness]
> 0.000598216 7104 Q R 288 + 8 [c_harness]
> 0.000620552 7104 Q WS 160 + 120 [c_harness]
> 0.000653742 7104 Q WS 280 + 16 [c_harness]

So this is where there's a difference. Note the write in my case:

  0.083050032  3635  Q  WS 160 + 136 [xfs_io]

It's a single IO - for some reason the kernel you are running has
split it into two IOs, but that doesn't make it wrong or invalid -
it's not indicative of a bug in any way.

What's wrong here is your assumption that you can reorder it around
the previous IO to those blocks. That's *simply not possible*, because
the filesystem has waited for completion of each IO before issuing the
next. The IO serialisation is being done at the filesystem level, not
the IO level, and so CrashMonkey is effectively breaking the
completion ordering guarantees that storage hardware provides us with.

> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]

IOWs, once this pre-flush has completed, the data on disk is supposed
to be that of the latter write, because that one was received by the
storage well after the first one was completed. Reordering a later IO
in front of an already completed dependent IO (read or write!) is
simply not allowed.

> This seems to be a bug, as it is not persisting the metadata
> operation even in the presence of fsync.
>
> Let me know if I am missing some detail here.

Filesystem IO ordering in the Linux stack is not based on
block/request level flush barriers, i.e. there's no such thing as "IO
barriers" any more. Filesystem IO ordering is based on a
completion-to-dispatch serialisation model at the filesystem level,
and that cannot be observed by looking at block layer traces....
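To make that rule concrete: a crash-replay tool may only permute IOs
whose in-flight windows overlap. Once an IO has completed, every
later-dispatched IO is ordered strictly after it, barrier or no
barrier. Here's a minimal, self-contained sketch of that invariant
(my own illustration, not CrashMonkey or kernel code), using the
dispatch/completion times from the trace above:

	#include <stdbool.h>
	#include <stdio.h>

	struct io {
		double q_time;		/* dispatch   (blktrace Q event) */
		double c_time;		/* completion (blktrace C event) */
		const char *desc;
	};

	/*
	 * Two IOs may legally be swapped in a simulated crash only if
	 * neither completed before the other was dispatched, i.e.
	 * their in-flight windows overlap.
	 */
	static bool may_reorder(const struct io *a, const struct io *b)
	{
		return a->c_time > b->q_time && b->c_time > a->q_time;
	}

	int main(void)
	{
		/* the two writes touching sector 280 in the trace:
		 * the pre-fzero flush and the fsync data writeback */
		struct io fzero_flush = { 0.081499295, 0.081982168,
					  "WS 280 + 16" };
		struct io fsync_write = { 0.083050032, 0.084414651,
					  "WS 160 + 136" };

		printf("reordering \"%s\" and \"%s\" is %s\n",
		       fzero_flush.desc, fsync_write.desc,
		       may_reorder(&fzero_flush, &fsync_write) ?
				"legal" : "ILLEGAL");
		return 0;
	}

The fsync writeback was dispatched well after the fzero flush
completed, so swapping them - which is exactly the permutation that
produces the reported "bug" - is illegal.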
This really looks like a CrashMonkey bug, not a filesystem problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com