All of lore.kernel.org
 help / color / mirror / Atom feed
* ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
@ 2017-11-17 15:43 Ashlie Martinez
  2017-11-21 16:17 ` Ashlie Martinez
  2017-11-22 18:03 ` Theodore Ts'o
  0 siblings, 2 replies; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-17 15:43 UTC (permalink / raw)
  To: Theodore Tso, Vijay Chidambaram, Ext4

Hi Ted,

I've been looking into the bug Amir Goldstein found (way back in
August) as well as the patch you created for it. After examining both
the patch and some the code for delayed allocation and fallocate, I
find that I don't understand what is actually going on to cause the
bug. I'll outline what I've found so far below. Maybe you can help me
fill in some of the gaps or point me in the right direction if I've
made a mistake.

Delayed allocation:
Delayed allocation is enabled if the `da` set of aops is used in the
ext4 code (set sometime close to file system mount time). When a
delayed allocation write occurs, a function higher up the storage
stack calls into ext4_da_write_begin. This function will go through
the rb tree representing the extents for a file and either 1. find
blocks already allocated that can be used for the write or 2. reserve
space in the extent tree. The bh returned from this function informs
later code about whether preexisting blocks were found or if space was
reserved. ext4_da_write_begin also appears to create a journal handle
(with a handle on the stack), but never closes the handle unless an
error occurs.

After ext4_da_write_begin, control returns to some functions higher in
the storage stack before going into ext4_da_write_end. This function
conditionally updates the i_disksize value of the file if non-delay
allocated extents were used and the file size was extended. It also
closes the journal handle opened in ext4_da_begin write. Furthermore,
it calls generic_write_end.

generic_write_end updates the i_size value if needed, and also marks
the inode as dirty. By marking the inode as dirty, a bdi_writeback is
scheduled later that will call ext4_writepages.

ext4_writepages resolves the delayed allocation block mappings and
updates i_disksize as needed. Each update to the extent tree on disk
and i_disksize is done using a single journal handle.

fallocate (just fallocate to change the file size, not to collapse
ranges or anything else):
fallocate will check if offset + len > the current i_size value and,
if it is, set new_size. Later, in a call to ext4_alloc_file_blocks
(which modifies the extent tree, using a journal handle it makes, for
the fallocate call), new_size is checked to determine if i_disksize
needs to be updated, and, if it does, it writes that update to the
current journal handle. Otherwise, no update to i_disksize is made.

zero_range:
The main portion of zero_range appears to be copy/paste from fallocate
code. However, one notable difference is that is has the possibility
of opening three different journal handles (2 from calls to
ext4_alloc_file_blocks, and one later in zero_range for writing out
blocks and the updated inode size. In this function, new_size is used
to update i_disksize in both the calls to ext4_alloc_file_blocks and
at the bottom of the function (so they will all either update
i_disksize or not update it).


What I don't really understand from the above is how the patch for
Amir's bug fixes the bug. It appears that what is happening in the bug
is an old value of i_disksize is being persisted while a new extent
tree is being persisted. I am confused how adding an extra clause to
the if statement in fallocate and zero_range (which causes new_size to
be set on more conditions) later translates into correct behavior. It
seems like in order to get incorrect behavior in the first place, you
would have to have the file size updated in a different journal
transaction than the extent tree so that, on replay, one is updated
but the other is not. Another piece of the puzzle I am missing is when
the extent tree is journaled in fallocate and zero_range. I can
clearly see the i_disksize updates being journaled in the call to
ext4_mark_inode_dirty, but it doesn't appear that that function also
persists the extent tree.

Could you give me any clarifications on when the extent tree is
journaled and some more of the logic behind the patch for Amir's bug?

Thanks,
Ashlie

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-17 15:43 ext4 fix for interaction between i_size, fallocate, and delalloc after a crash Ashlie Martinez
@ 2017-11-21 16:17 ` Ashlie Martinez
  2017-11-22 18:03 ` Theodore Ts'o
  1 sibling, 0 replies; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-21 16:17 UTC (permalink / raw)
  To: Theodore Tso, Vijay Chidambaram, Ext4

Hi Ted,

I wasn't sure if this got buried under other things in your inbox, so
I just wanted to ping you about it again (also apologies that my
original email was so long) :)

On Fri, Nov 17, 2017 at 9:43 AM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> Hi Ted,
>
> I've been looking into the bug Amir Goldstein found (way back in
> August) as well as the patch you created for it. After examining both
> the patch and some the code for delayed allocation and fallocate, I
> find that I don't understand what is actually going on to cause the
> bug. I'll outline what I've found so far below. Maybe you can help me
> fill in some of the gaps or point me in the right direction if I've
> made a mistake.
>
> Delayed allocation:
> Delayed allocation is enabled if the `da` set of aops is used in the
> ext4 code (set sometime close to file system mount time). When a
> delayed allocation write occurs, a function higher up the storage
> stack calls into ext4_da_write_begin. This function will go through
> the rb tree representing the extents for a file and either 1. find
> blocks already allocated that can be used for the write or 2. reserve
> space in the extent tree. The bh returned from this function informs
> later code about whether preexisting blocks were found or if space was
> reserved. ext4_da_write_begin also appears to create a journal handle
> (with a handle on the stack), but never closes the handle unless an
> error occurs.
>
> After ext4_da_write_begin, control returns to some functions higher in
> the storage stack before going into ext4_da_write_end. This function
> conditionally updates the i_disksize value of the file if non-delay
> allocated extents were used and the file size was extended. It also
> closes the journal handle opened in ext4_da_begin write. Furthermore,
> it calls generic_write_end.
>
> generic_write_end updates the i_size value if needed, and also marks
> the inode as dirty. By marking the inode as dirty, a bdi_writeback is
> scheduled later that will call ext4_writepages.
>
> ext4_writepages resolves the delayed allocation block mappings and
> updates i_disksize as needed. Each update to the extent tree on disk
> and i_disksize is done using a single journal handle.
>
> fallocate (just fallocate to change the file size, not to collapse
> ranges or anything else):
> fallocate will check if offset + len > the current i_size value and,
> if it is, set new_size. Later, in a call to ext4_alloc_file_blocks
> (which modifies the extent tree, using a journal handle it makes, for
> the fallocate call), new_size is checked to determine if i_disksize
> needs to be updated, and, if it does, it writes that update to the
> current journal handle. Otherwise, no update to i_disksize is made.
>
> zero_range:
> The main portion of zero_range appears to be copy/paste from fallocate
> code. However, one notable difference is that is has the possibility
> of opening three different journal handles (2 from calls to
> ext4_alloc_file_blocks, and one later in zero_range for writing out
> blocks and the updated inode size. In this function, new_size is used
> to update i_disksize in both the calls to ext4_alloc_file_blocks and
> at the bottom of the function (so they will all either update
> i_disksize or not update it).
>
>
> What I don't really understand from the above is how the patch for
> Amir's bug fixes the bug. It appears that what is happening in the bug
> is an old value of i_disksize is being persisted while a new extent
> tree is being persisted. I am confused how adding an extra clause to
> the if statement in fallocate and zero_range (which causes new_size to
> be set on more conditions) later translates into correct behavior. It
> seems like in order to get incorrect behavior in the first place, you
> would have to have the file size updated in a different journal
> transaction than the extent tree so that, on replay, one is updated
> but the other is not. Another piece of the puzzle I am missing is when
> the extent tree is journaled in fallocate and zero_range. I can
> clearly see the i_disksize updates being journaled in the call to
> ext4_mark_inode_dirty, but it doesn't appear that that function also
> persists the extent tree.
>
> Could you give me any clarifications on when the extent tree is
> journaled and some more of the logic behind the patch for Amir's bug?
>
> Thanks,
> Ashlie

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-17 15:43 ext4 fix for interaction between i_size, fallocate, and delalloc after a crash Ashlie Martinez
  2017-11-21 16:17 ` Ashlie Martinez
@ 2017-11-22 18:03 ` Theodore Ts'o
  2017-11-23  5:39   ` Amir Goldstein
  2017-11-27 14:31   ` Ashlie Martinez
  1 sibling, 2 replies; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-22 18:03 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Vijay Chidambaram, Ext4

[ Sorry for not getting back to you earlier.  I've been dealing with a
  family medical emergency and hanging out at an intensive care unit
  in Chicago for the past four days.  ]

OK, let's start with the high-level explanation of how Delayed
Allocation works.  When a userspace application does a buffered write,
and delayed allocation is not enabled, then ext4 acts like ext3
historically did.  That is, when a buffered write first touches a
logical block which is not yet mapped to a physical block, the file
system performs block allocation at the time of the write(2) system
call.  At the next journal commit, the inode's extent tree (or inode
structure or indirect blocks) are updated, along with the inode's
i_size field (if necessary).  In order to avoid stale data writes,
ext3 and ext4 in data=ordered mode, will write the data from the page
cache to disk before the journal commit is allowed to be finalized.

What are the implications of this?  If the application does a series
of 4k writes, ext4 will allocate each 4k block one at a time.  And it
will do that with no knowledge of how big the file will end up being.

When delayed allocation is enabled[1], we do *not* make block
allocation decisions at the time of the buffered write(2) call.
Instead, we wait until the writeback timer expires, which by default
is 30 seconds after the inode is first dirtied.  In many cases, the
application is finished writing the file within 30 seconds, and now
instead of doing N individual 4k block allocations, with delayed
allocation we make a single block decision based on how many pages
have been newly modified.  This makes it much easier to make much more
intelligent allocation decisions.

[1] It can get disabled via a mount option, or when the file system is
almost full; in the latter case, it is temporary and delalloc will be
automatically re-enabled when files are deleted and the file system
has more free space.


But what happens if we crash before the delayed allocation has been
resolved?  As James Earl Jones might say, "I was never here[2]".  That
is, it's as if never happened.

[2] https://www.youtube.com/watch?v=-fPtmRZQVHo&feature=youtu.be&t=1m5s

But immediately after the buffered write returns, if the program calls
stat(2), the i_size returned by the stat(2) system call must reflect
the results of the buffered write.  This is why we have i_size versus
i_disksize.  The i_size value reflects the size of the file as it
exists in the page cache.  The i_disksize value reflects the size of
the file as it exists on disk.

So i_disksize will lag i_size by up to 30 seconds.  But, by default,
the journal commits happen every five seconds.  The other key thing to
understand is that fallocate(2)'s ZERO_RANGE operation modifies both
metadata *and* data blocks.

Let's review xfstests generic/256.  The calls fsx --replay-ops with
the following:

1.	write 0x137dd 0xdc69 0x0
2.	fallocate 0xb531 0xb5ad 0x21446
3.	collapse_range 0x1c000 0x4000 0x21446
4.	write 0x3e5ec 0x1a14 0x21446
5.	zero_range 0x20fac 0x6d9c 0x40000 keep_size
6.	mapwrite 0x216ad 0x274f 0x40000

(I added the line numbers to make it clearer what I'm referring to.

In the FSX replay log, the first argument is the starting offset, the
second argument is the length, and the third is the i_size before the
operation in question.  (The 3rd argument does not actually make any
difference to the system call operation executed by fsx; it does make
a difference to some of the explanatory messages printed by fsx, though.)

Lines 1-3, just set up the inode's extent tree, and aren't really
important.  Linux 6 isn't terribly important, either.  The important
operations for the repro is lines 4 and 5.

Line 4 does a buffered write, which done subject to the delayed
allocation rules.  So it modifies the values in the page cache, but it
not do any block allocation.  After the buffered write in line 4,
i_size is 0x40000 but and i_disksize is 0x21446.

Line 5 performs a FALLOC_FL_ZERO_RANGE from offset 0x20FAC to 0x27D48.
What the zero_range operation will do is to change the file so that
specified byte range will return zeroes if read.  If it is optimal to
do so, the system MAY do this by not actually writing to the data
blocks, but instead, by manipulating the extent tree.  So these two
low-level actions are both valid ways of implementing the zero_range:

1.  (a) Read blocks 0x20 and 0x27 into page cache if necessary, (b)
    zero offsets 0xFAC to 0xFFF of block 0x20, (c) zero offsets 0x000
    to 0xD48 of block 0x27, (d) allocate blocks 0x21 to 0x26 if
    necessary, marking blocks 0x21 to 0x26 as unitialized.

2.  (a) Read blocks 0x20 and 0x27 into page cache if necessary, (b)
    zero offsets 0xFAC to 0xFFF of block 0x20, (c) zero offsets 0x000
    to 0xD48 of block 0x27, (d) allocate blocks 0x21 to 0x26 if
    necessary, marking blocks 0x21 to 0x26 as initialized, and write
    zeros to blocks 0x21 to 0x26.

Both 1 and 2 are perfectly valid, and ext4 will do both.  In #1, the
system needs to do two 4k writes.  In #2, the system needs to do a 32k
sequential write (blocks 20 through 27).  By default, if the size of
the write needed to initialize the blocks is 32k or less, ext4 will
choose #2, since eliminates the need to manipulate the extent tree to
mark blocks as initialized later, and it tends to leave the extent
tree in a less fragmented state.  On the other hand, if the userspace
program requests a zero_range operation on several megabytes, and
better yet, the starting and ending offsets are 4k aligned, then there
is no need to do the two 4k random writes at the beginning and the end
of the range, then just manipulating the extent tree (as in #1) is the
better way to go.

In the case of xfstests generic/456, we are doing #2, and within five
seconds of zero_range operation being issued, the system will journal
commit, so all of the actions in #2 needs to be committed after a
crash.  The last offset of the zero_range, 0x27D48 is beyond the
current on-disk i_size of 0x21446.  So if we crash at this point, it's
important that after journal replay, i_size is updated to be 0x27D48.

However, since the delayed allocation write has not been resolved.
i_size is 0x40000, which is beyond 0x27D46.  Due to the missing check,
the buggy code would see that i_size is beyond the last byte of the
last offset of the zero_range, and skip updating the on_disk size.
And this is what causes the bug.

The fix is to change the code to update the on-disk size if the end
offset of zero_range is beyond i_size *or* beyond i_disksize.  And
this is what commit 51e3ae81ec58: "ext4: fix interaction between
i_size, fallocate, and delalloc after a crash" does.

Hope this helps,

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-22 18:03 ` Theodore Ts'o
@ 2017-11-23  5:39   ` Amir Goldstein
  2017-11-27 14:31   ` Ashlie Martinez
  1 sibling, 0 replies; 23+ messages in thread
From: Amir Goldstein @ 2017-11-23  5:39 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ashlie Martinez, Vijay Chidambaram, Ext4, Eryu Guan

On Wed, Nov 22, 2017 at 8:03 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> [ Sorry for not getting back to you earlier.  I've been dealing with a
>   family medical emergency and hanging out at an intensive care unit
>   in Chicago for the past four days.  ]
>
> OK, let's start with the high-level explanation of how Delayed
> Allocation works.  When a userspace application does a buffered write,
> and delayed allocation is not enabled, then ext4 acts like ext3
> historically did.  That is, when a buffered write first touches a
> logical block which is not yet mapped to a physical block, the file
> system performs block allocation at the time of the write(2) system
> call.  At the next journal commit, the inode's extent tree (or inode
> structure or indirect blocks) are updated, along with the inode's
> i_size field (if necessary).  In order to avoid stale data writes,
> ext3 and ext4 in data=ordered mode, will write the data from the page
> cache to disk before the journal commit is allowed to be finalized.
>
> What are the implications of this?  If the application does a series
> of 4k writes, ext4 will allocate each 4k block one at a time.  And it
> will do that with no knowledge of how big the file will end up being.
>
> When delayed allocation is enabled[1], we do *not* make block
> allocation decisions at the time of the buffered write(2) call.
> Instead, we wait until the writeback timer expires, which by default
> is 30 seconds after the inode is first dirtied.  In many cases, the
> application is finished writing the file within 30 seconds, and now
> instead of doing N individual 4k block allocations, with delayed
> allocation we make a single block decision based on how many pages
> have been newly modified.  This makes it much easier to make much more
> intelligent allocation decisions.
>
> [1] It can get disabled via a mount option, or when the file system is
> almost full; in the latter case, it is temporary and delalloc will be
> automatically re-enabled when files are deleted and the file system
> has more free space.
>
>
> But what happens if we crash before the delayed allocation has been
> resolved?  As James Earl Jones might say, "I was never here[2]".  That
> is, it's as if never happened.
>
> [2] https://www.youtube.com/watch?v=-fPtmRZQVHo&feature=youtu.be&t=1m5s
>
> But immediately after the buffered write returns, if the program calls
> stat(2), the i_size returned by the stat(2) system call must reflect
> the results of the buffered write.  This is why we have i_size versus
> i_disksize.  The i_size value reflects the size of the file as it
> exists in the page cache.  The i_disksize value reflects the size of
> the file as it exists on disk.
>
> So i_disksize will lag i_size by up to 30 seconds.  But, by default,
> the journal commits happen every five seconds.  The other key thing to
> understand is that fallocate(2)'s ZERO_RANGE operation modifies both
> metadata *and* data blocks.
>
> Let's review xfstests generic/256.  The calls fsx --replay-ops with
> the following:
>
> 1.      write 0x137dd 0xdc69 0x0
> 2.      fallocate 0xb531 0xb5ad 0x21446
> 3.      collapse_range 0x1c000 0x4000 0x21446
> 4.      write 0x3e5ec 0x1a14 0x21446
> 5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size
> 6.      mapwrite 0x216ad 0x274f 0x40000
>
> (I added the line numbers to make it clearer what I'm referring to.
>
> In the FSX replay log, the first argument is the starting offset, the
> second argument is the length, and the third is the i_size before the
> operation in question.  (The 3rd argument does not actually make any
> difference to the system call operation executed by fsx; it does make
> a difference to some of the explanatory messages printed by fsx, though.)
>
> Lines 1-3, just set up the inode's extent tree, and aren't really
> important.  Linux 6 isn't terribly important, either.  The important
> operations for the repro is lines 4 and 5.
>
> Line 4 does a buffered write, which done subject to the delayed
> allocation rules.  So it modifies the values in the page cache, but it
> not do any block allocation.  After the buffered write in line 4,
> i_size is 0x40000 but and i_disksize is 0x21446.
>
> Line 5 performs a FALLOC_FL_ZERO_RANGE from offset 0x20FAC to 0x27D48.
> What the zero_range operation will do is to change the file so that
> specified byte range will return zeroes if read.  If it is optimal to
> do so, the system MAY do this by not actually writing to the data
> blocks, but instead, by manipulating the extent tree.  So these two
> low-level actions are both valid ways of implementing the zero_range:
>
> 1.  (a) Read blocks 0x20 and 0x27 into page cache if necessary, (b)
>     zero offsets 0xFAC to 0xFFF of block 0x20, (c) zero offsets 0x000
>     to 0xD48 of block 0x27, (d) allocate blocks 0x21 to 0x26 if
>     necessary, marking blocks 0x21 to 0x26 as unitialized.
>
> 2.  (a) Read blocks 0x20 and 0x27 into page cache if necessary, (b)
>     zero offsets 0xFAC to 0xFFF of block 0x20, (c) zero offsets 0x000
>     to 0xD48 of block 0x27, (d) allocate blocks 0x21 to 0x26 if
>     necessary, marking blocks 0x21 to 0x26 as initialized, and write
>     zeros to blocks 0x21 to 0x26.
>
> Both 1 and 2 are perfectly valid, and ext4 will do both.  In #1, the
> system needs to do two 4k writes.  In #2, the system needs to do a 32k
> sequential write (blocks 20 through 27).  By default, if the size of
> the write needed to initialize the blocks is 32k or less, ext4 will
> choose #2, since eliminates the need to manipulate the extent tree to
> mark blocks as initialized later, and it tends to leave the extent
> tree in a less fragmented state.  On the other hand, if the userspace
> program requests a zero_range operation on several megabytes, and
> better yet, the starting and ending offsets are 4k aligned, then there
> is no need to do the two 4k random writes at the beginning and the end
> of the range, then just manipulating the extent tree (as in #1) is the
> better way to go.
>
> In the case of xfstests generic/456, we are doing #2, and within five
> seconds of zero_range operation being issued, the system will journal
> commit, so all of the actions in #2 needs to be committed after a
> crash.  The last offset of the zero_range, 0x27D48 is beyond the
> current on-disk i_size of 0x21446.  So if we crash at this point, it's
> important that after journal replay, i_size is updated to be 0x27D48.
>
> However, since the delayed allocation write has not been resolved.
> i_size is 0x40000, which is beyond 0x27D46.  Due to the missing check,
> the buggy code would see that i_size is beyond the last byte of the
> last offset of the zero_range, and skip updating the on_disk size.
> And this is what causes the bug.
>
> The fix is to change the code to update the on-disk size if the end
> offset of zero_range is beyond i_size *or* beyond i_disksize.  And
> this is what commit 51e3ae81ec58: "ext4: fix interaction between
> i_size, fallocate, and delalloc after a crash" does.
>

Ted,

Thanks for the detailed explanation.
I am wondering out loud:
The regression test generic/456 was merged without any insight about
the root cause of the problem or any insight about the operations,
offsets and sizes used by the test.

Should we add to the test description a reference to the fix commit?
Should we add parts of this explanation in a comment near the fsxops
definition? All of it????

Perhaps you could edit it down to the context of a comment for the
test?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-22 18:03 ` Theodore Ts'o
  2017-11-23  5:39   ` Amir Goldstein
@ 2017-11-27 14:31   ` Ashlie Martinez
  2017-11-27 16:11     ` Theodore Ts'o
  1 sibling, 1 reply; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-27 14:31 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Vijay Chidambaram, Ext4

Ted,

Thank you very much for taking the time to lay all of this out for me
(and throwing some humor and youtube links to boot), despite how busy
you were (I hope everything is alright!). I see now why the fix works
and what was going wrong. It appears I was confused about the order of
operations being performed in the test based on what I read in another
email. I believe in another email somewhere I read that the fallocate
was before a delayed write so I was thinking something like fallocate
then write. I see now that it is write with delayed allocation
(resolved after fallocate) and then fallocate. With that piece of
information everything else about the test, delayed allocation, and
the fix make sense.

On Wed, Nov 22, 2017 at 12:03 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> [ Sorry for not getting back to you earlier.  I've been dealing with a
>   family medical emergency and hanging out at an intensive care unit
>   in Chicago for the past four days.  ]
>
> OK, let's start with the high-level explanation of how Delayed
> Allocation works.  When a userspace application does a buffered write,
> and delayed allocation is not enabled, then ext4 acts like ext3
> historically did.  That is, when a buffered write first touches a
> logical block which is not yet mapped to a physical block, the file
> system performs block allocation at the time of the write(2) system
> call.  At the next journal commit, the inode's extent tree (or inode
> structure or indirect blocks) are updated, along with the inode's
> i_size field (if necessary).  In order to avoid stale data writes,
> ext3 and ext4 in data=ordered mode, will write the data from the page
> cache to disk before the journal commit is allowed to be finalized.
>
> What are the implications of this?  If the application does a series
> of 4k writes, ext4 will allocate each 4k block one at a time.  And it
> will do that with no knowledge of how big the file will end up being.
>
> When delayed allocation is enabled[1], we do *not* make block
> allocation decisions at the time of the buffered write(2) call.
> Instead, we wait until the writeback timer expires, which by default
> is 30 seconds after the inode is first dirtied.  In many cases, the
> application is finished writing the file within 30 seconds, and now
> instead of doing N individual 4k block allocations, with delayed
> allocation we make a single block decision based on how many pages
> have been newly modified.  This makes it much easier to make much more
> intelligent allocation decisions.
>
> [1] It can get disabled via a mount option, or when the file system is
> almost full; in the latter case, it is temporary and delalloc will be
> automatically re-enabled when files are deleted and the file system
> has more free space.
>
>
> But what happens if we crash before the delayed allocation has been
> resolved?  As James Earl Jones might say, "I was never here[2]".  That
> is, it's as if never happened.
>
> [2] https://www.youtube.com/watch?v=-fPtmRZQVHo&feature=youtu.be&t=1m5s
>
> But immediately after the buffered write returns, if the program calls
> stat(2), the i_size returned by the stat(2) system call must reflect
> the results of the buffered write.  This is why we have i_size versus
> i_disksize.  The i_size value reflects the size of the file as it
> exists in the page cache.  The i_disksize value reflects the size of
> the file as it exists on disk.
>
> So i_disksize will lag i_size by up to 30 seconds.  But, by default,
> the journal commits happen every five seconds.  The other key thing to
> understand is that fallocate(2)'s ZERO_RANGE operation modifies both
> metadata *and* data blocks.
>
> Let's review xfstests generic/256.  The calls fsx --replay-ops with
> the following:
>
> 1.      write 0x137dd 0xdc69 0x0
> 2.      fallocate 0xb531 0xb5ad 0x21446
> 3.      collapse_range 0x1c000 0x4000 0x21446
> 4.      write 0x3e5ec 0x1a14 0x21446
> 5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size
> 6.      mapwrite 0x216ad 0x274f 0x40000
>
> (I added the line numbers to make it clearer what I'm referring to.
>
> In the FSX replay log, the first argument is the starting offset, the
> second argument is the length, and the third is the i_size before the
> operation in question.  (The 3rd argument does not actually make any
> difference to the system call operation executed by fsx; it does make
> a difference to some of the explanatory messages printed by fsx, though.)
>
> Lines 1-3, just set up the inode's extent tree, and aren't really
> important.  Linux 6 isn't terribly important, either.  The important
> operations for the repro is lines 4 and 5.
>
> Line 4 does a buffered write, which done subject to the delayed
> allocation rules.  So it modifies the values in the page cache, but it
> not do any block allocation.  After the buffered write in line 4,
> i_size is 0x40000 but and i_disksize is 0x21446.
>
> Line 5 performs a FALLOC_FL_ZERO_RANGE from offset 0x20FAC to 0x27D48.
> What the zero_range operation will do is to change the file so that
> specified byte range will return zeroes if read.  If it is optimal to
> do so, the system MAY do this by not actually writing to the data
> blocks, but instead, by manipulating the extent tree.  So these two
> low-level actions are both valid ways of implementing the zero_range:
>
> 1.  (a) Read blocks 0x20 and 0x27 into page cache if necessary, (b)
>     zero offsets 0xFAC to 0xFFF of block 0x20, (c) zero offsets 0x000
>     to 0xD48 of block 0x27, (d) allocate blocks 0x21 to 0x26 if
>     necessary, marking blocks 0x21 to 0x26 as unitialized.
>
> 2.  (a) Read blocks 0x20 and 0x27 into page cache if necessary, (b)
>     zero offsets 0xFAC to 0xFFF of block 0x20, (c) zero offsets 0x000
>     to 0xD48 of block 0x27, (d) allocate blocks 0x21 to 0x26 if
>     necessary, marking blocks 0x21 to 0x26 as initialized, and write
>     zeros to blocks 0x21 to 0x26.
>
> Both 1 and 2 are perfectly valid, and ext4 will do both.  In #1, the
> system needs to do two 4k writes.  In #2, the system needs to do a 32k
> sequential write (blocks 20 through 27).  By default, if the size of
> the write needed to initialize the blocks is 32k or less, ext4 will
> choose #2, since eliminates the need to manipulate the extent tree to
> mark blocks as initialized later, and it tends to leave the extent
> tree in a less fragmented state.  On the other hand, if the userspace
> program requests a zero_range operation on several megabytes, and
> better yet, the starting and ending offsets are 4k aligned, then there
> is no need to do the two 4k random writes at the beginning and the end
> of the range, then just manipulating the extent tree (as in #1) is the
> better way to go.
>
> In the case of xfstests generic/456, we are doing #2, and within five
> seconds of zero_range operation being issued, the system will journal
> commit, so all of the actions in #2 needs to be committed after a
> crash.  The last offset of the zero_range, 0x27D48 is beyond the
> current on-disk i_size of 0x21446.  So if we crash at this point, it's
> important that after journal replay, i_size is updated to be 0x27D48.
>
> However, since the delayed allocation write has not been resolved.
> i_size is 0x40000, which is beyond 0x27D46.  Due to the missing check,
> the buggy code would see that i_size is beyond the last byte of the
> last offset of the zero_range, and skip updating the on_disk size.
> And this is what causes the bug.
>
> The fix is to change the code to update the on-disk size if the end
> offset of zero_range is beyond i_size *or* beyond i_disksize.  And
> this is what commit 51e3ae81ec58: "ext4: fix interaction between
> i_size, fallocate, and delalloc after a crash" does.
>
> Hope this helps,
>
>                                                 - Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-27 14:31   ` Ashlie Martinez
@ 2017-11-27 16:11     ` Theodore Ts'o
  2017-11-28 13:04       ` Ashlie Martinez
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-27 16:11 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Vijay Chidambaram, Ext4

On Mon, Nov 27, 2017 at 08:31:07AM -0600, Ashlie Martinez wrote:
> Ted,
> 
> Thank you very much for taking the time to lay all of this out for me
> (and throwing some humor and youtube links to boot), despite how busy
> you were (I hope everything is alright!). I see now why the fix works
> and what was going wrong. It appears I was confused about the order of
> operations being performed in the test based on what I read in another
> email. I believe in another email somewhere I read that the fallocate
> was before a delayed write so I was thinking something like fallocate
> then write. I see now that it is write with delayed allocation
> (resolved after fallocate) and then fallocate. With that piece of
> information everything else about the test, delayed allocation, and
> the fix make sense.

Sorry, "before" was misleading.  When I used the word "before", I was
speaking of the order that the operations hit the disk.  The confusion
comes from the fact that the delayed allocation write was *issued*
before the fallocate, but in terms of when they are committed to disk,
the fallocate commits *first*, and then 25-30 seconds later, the
delayed allocation write is resolved and then committed to disk.

It's the difference between the order that the operations are issued
and when they are committed to disk which is what caused the bug; and
the problem reproduction relies on crashing/aborting the file system
between the time that the two operations would have been committed.

Hopefully this will be helpful in terms of finding a way to create
automated file system testing systems that can detect bugs similar to
this one.  I can imagine that if you ever want to extend this to
database testing, a similar technique might be used to detect
transactions which close in a different order than how they were
issued, or dealing transactions which end up getting rolled back.

						- Ted

P.S.  I see you have some Google internships under your belt, so I'm
sure you know the drill, but I hope you'll consider us for another
future internship experience.   :-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-27 16:11     ` Theodore Ts'o
@ 2017-11-28 13:04       ` Ashlie Martinez
  2017-11-28 20:45         ` Theodore Ts'o
  0 siblings, 1 reply; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-28 13:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Vijay Chidambaram, Ext4

On Mon, Nov 27, 2017 at 10:11 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Nov 27, 2017 at 08:31:07AM -0600, Ashlie Martinez wrote:
>> Ted,
>>
>> Thank you very much for taking the time to lay all of this out for me
>> (and throwing some humor and youtube links to boot), despite how busy
>> you were (I hope everything is alright!). I see now why the fix works
>> and what was going wrong. It appears I was confused about the order of
>> operations being performed in the test based on what I read in another
>> email. I believe in another email somewhere I read that the fallocate
>> was before a delayed write so I was thinking something like fallocate
>> then write. I see now that it is write with delayed allocation
>> (resolved after fallocate) and then fallocate. With that piece of
>> information everything else about the test, delayed allocation, and
>> the fix make sense.
>
> Sorry, "before" was misleading.  When I used the word "before", I was
> speaking of the order that the operations hit the disk.  The confusion
> comes from the fact that the delayed allocation write was *issued*
> before the fallocate, but in terms of when they are committed to disk,
> the fallocate commits *first*, and then 25-30 seconds later, the
> delayed allocation write is resolved and then committed to disk.

No biggie, part of the reason this was so hard for me to wrap my head
around is I don't have a physical machine that I can reproduce this on
(and I never got around to getting a GCE instance to test on). Not
being able to poke around a reproducing system makes it a little bit
harder for me to reason about :)

>
> It's the difference between the order that the operations are issued
> and when they are committed to disk which is what caused the bug; and
> the problem reproduction relies on crashing/aborting the file system
> between the time that the two operations would have been committed.
>
> Hopefully this will be helpful in terms of finding a way to create
> automated file system testing systems that can detect bugs similar to
> this one.  I can imagine that if you ever want to extend this to
> database testing, a similar technique might be used to detect
> transactions which close in a different order than how they were
> issued, or dealing transactions which end up getting rolled back.
>

Vijay and I are hopeful that we can find some reliable way to
reproduce this in CrashMonkey. It has also showed us a class of timing
bugs that we can't find with the current iteration of CrashMonkey, but
we hope we can expand what we have to find them in the future.

>                                                 - Ted
>
> P.S.  I see you have some Google internships under your belt, so I'm
> sure you know the drill, but I hope you'll consider us for another
> future internship experience.   :-)

Haha it's always been nice to be a little bit spoiled while interning
there for a summer. I hope I can make way back there for another
internship etc. eventually :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-28 13:04       ` Ashlie Martinez
@ 2017-11-28 20:45         ` Theodore Ts'o
  2017-11-28 21:27           ` Ashlie Martinez
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-28 20:45 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Vijay Chidambaram, Ext4

On Tue, Nov 28, 2017 at 07:04:54AM -0600, Ashlie Martinez wrote:
> No biggie, part of the reason this was so hard for me to wrap my head
> around is I don't have a physical machine that I can reproduce this on
> (and I never got around to getting a GCE instance to test on). Not
> being able to poke around a reproducing system makes it a little bit
> harder for me to reason about :)

This does reproduce easily using kvm-xfstests[1]; using gce-xfstests
was not necessary.  That's actually how I debugged it, since kvm
starts up in under 5 seconds, while starting up a cloud VM takes a bit
longer.  So if you want a quick edit/compile/debug cycle, or if you
attach a debugger to the running kernel, using kvm-xfstests is the
right tool to use.  99% of the command syntax and test appliance
implementation is the same between kvm-xfstests and gce-xfstests.

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

I've been trying to promote the use of kvm-xfstests for researchers
who are interested in doing file system work.  So if you can help
promote {kvm,gce}-xfstests amongst your fellow students and
professors, that would be great!


You can run the reproducer automatically via "kvm-xfstests -c 4k
generic/456".  But you can also run "kvm-xfstests shell", and then run
the following commands;

kvm-xfstests# export FSTESTSET=generic/456
kvm-xfstests# ./runtests.sh

You can then edit the test script to add debugging commands; it can be
found in /root/xfstests/tests/generic/456 and then rerun the tests
using the "./runtests.sh" script.

Sorry, the only editor available is /bin/ed.  If you want to use some
other editor, and are willing to build your own test-appliance VM
image instead of just downloading the rebuilt test applinace image,
you can add it to the xfstests-packages file in the
kvm-xfstests/test-appliance directory, and generate your own test
appliance.  See [2] for more details.

[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/building-rootfs.md

This is actually how I figured out what was happening; I added
commands such as "debugfs -R 'stat <11>'" so I could see was going on
with the file system before the _flakey_drop_and_remount statement,
and then varied the number of operations in the fsx operations to
replay list.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-28 20:45         ` Theodore Ts'o
@ 2017-11-28 21:27           ` Ashlie Martinez
  2017-11-29  3:37             ` Amir Goldstein
  2017-11-29  6:13             ` Theodore Ts'o
  0 siblings, 2 replies; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-28 21:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Vijay Chidambaram, Ext4

On Tue, Nov 28, 2017 at 2:45 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 28, 2017 at 07:04:54AM -0600, Ashlie Martinez wrote:
>> No biggie, part of the reason this was so hard for me to wrap my head
>> around is I don't have a physical machine that I can reproduce this on
>> (and I never got around to getting a GCE instance to test on). Not
>> being able to poke around a reproducing system makes it a little bit
>> harder for me to reason about :)
>
> This does reproduce easily using kvm-xfstests[1]; using gce-xfstests
> was not necessary.  That's actually how I debugged it, since kvm
> starts up in under 5 seconds, while starting up a cloud VM takes a bit
> longer.  So if you want a quick edit/compile/debug cycle, or if you
> attach a debugger to the running kernel, using kvm-xfstests is the
> right tool to use.  99% of the command syntax and test appliance
> implementation is the same between kvm-xfstests and gce-xfstests.

Unfortunately this timing bug only reproduces on some machines. Xiao
and I have been unable to reproduce this bug (I've tried kvm-xfstests,
my own kvm VMs, VMs without kvm, VMs with/without virtio drivers, and
another bare metal system). generic/456 basically sets up a race
condition between a kernel flusher thread and triggering dm-flakey, so
I think things like system load, core count, etc. might cause
different test results.

>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
>
> I've been trying to promote the use of kvm-xfstests for researchers
> who are interested in doing file system work.  So if you can help
> promote {kvm,gce}-xfstests amongst your fellow students and
> professors, that would be great!
>
>
> You can run the reproducer automatically via "kvm-xfstests -c 4k
> generic/456".  But you can also run "kvm-xfstests shell", and then run
> the following commands;
>
> kvm-xfstests# export FSTESTSET=generic/456
> kvm-xfstests# ./runtests.sh
>
> You can then edit the test script to add debugging commands; it can be
> found in /root/xfstests/tests/generic/456 and then rerun the tests
> using the "./runtests.sh" script.
>
> Sorry, the only editor available is /bin/ed.  If you want to use some
> other editor, and are willing to build your own test-appliance VM
> image instead of just downloading the rebuilt test applinace image,
> you can add it to the xfstests-packages file in the
> kvm-xfstests/test-appliance directory, and generate your own test
> appliance.  See [2] for more details.
>
> [2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/building-rootfs.md
>
> This is actually how I figured out what was happening; I added
> commands such as "debugfs -R 'stat <11>'" so I could see was going on
> with the file system before the _flakey_drop_and_remount statement,
> and then varied the number of operations in the fsx operations to
> replay list.
>
> Regards,
>
>                                                 - Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-28 21:27           ` Ashlie Martinez
@ 2017-11-29  3:37             ` Amir Goldstein
  2017-11-29  6:13             ` Theodore Ts'o
  1 sibling, 0 replies; 23+ messages in thread
From: Amir Goldstein @ 2017-11-29  3:37 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Theodore Ts'o, Vijay Chidambaram, Ext4

On Tue, Nov 28, 2017 at 11:27 PM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> On Tue, Nov 28, 2017 at 2:45 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>> On Tue, Nov 28, 2017 at 07:04:54AM -0600, Ashlie Martinez wrote:
>>> No biggie, part of the reason this was so hard for me to wrap my head
>>> around is I don't have a physical machine that I can reproduce this on
>>> (and I never got around to getting a GCE instance to test on). Not
>>> being able to poke around a reproducing system makes it a little bit
>>> harder for me to reason about :)
>>
>> This does reproduce easily using kvm-xfstests[1]; using gce-xfstests
>> was not necessary.  That's actually how I debugged it, since kvm
>> starts up in under 5 seconds, while starting up a cloud VM takes a bit
>> longer.  So if you want a quick edit/compile/debug cycle, or if you
>> attach a debugger to the running kernel, using kvm-xfstests is the
>> right tool to use.  99% of the command syntax and test appliance
>> implementation is the same between kvm-xfstests and gce-xfstests.
>
> Unfortunately this timing bug only reproduces on some machines. Xiao
> and I have been unable to reproduce this bug (I've tried kvm-xfstests,
> my own kvm VMs, VMs without kvm, VMs with/without virtio drivers, and
> another bare metal system). generic/456 basically sets up a race
> condition between a kernel flusher thread and triggering dm-flakey, so
> I think things like system load, core count, etc. might cause
> different test results.
>

For what it's worth, I wasn't able to reproduce on my kvm-xfstests machine
either. With 2 cores, virtio over LVM/SSD. Didn't try to play with parameters.

Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-28 21:27           ` Ashlie Martinez
  2017-11-29  3:37             ` Amir Goldstein
@ 2017-11-29  6:13             ` Theodore Ts'o
  2017-11-29  8:07               ` Amir Goldstein
  1 sibling, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-29  6:13 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Vijay Chidambaram, Ext4

On Tue, Nov 28, 2017 at 03:27:47PM -0600, Ashlie Martinez wrote:
> 
> Unfortunately this timing bug only reproduces on some machines. Xiao
> and I have been unable to reproduce this bug (I've tried kvm-xfstests,
> my own kvm VMs, VMs without kvm, VMs with/without virtio drivers, and
> another bare metal system). generic/456 basically sets up a race
> condition between a kernel flusher thread and triggering dm-flakey, so
> I think things like system load, core count, etc. might cause
> different test results.

Hmm, now I remember the details.  It reproduced reliably on
gce-xfstests, but I was able to use kvm-xfstests to debug the problem
(by invocations of debugfs to dump the file system state as I had
described).  That's because debugfs operates on the buffer cache, and
before the jbd2 commit, the changes to the inode structure are in the
buffer cache, but they aren't allowed to be persisted on disk until
after the journal commit.  And I was using debugfs to dump the inode's
extent tree (as it exists in the buffer cache) before triggering
dm-flakey.

Now that we understand what is happening, it should be simple to
adjust the test so it reliably reproduces, by adding a "sleep 6"
before _flakey_drop_and_remote.  Since the delayed allocation write
won't get resolved until 30 seconds after the inode was first dirtied,
and the default jbd2 timer value is 5 seconds, this should guarantee
that the jbd2 commit has taken place so that the inode changes made by
fallocate are persisted onto the journal, while still allowing the
delayed allocation write to be remain unresolved.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-29  6:13             ` Theodore Ts'o
@ 2017-11-29  8:07               ` Amir Goldstein
  2017-11-29 19:58                 ` Ashlie Martinez
  2017-11-30  0:24                 ` Theodore Ts'o
  0 siblings, 2 replies; 23+ messages in thread
From: Amir Goldstein @ 2017-11-29  8:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ashlie Martinez, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 8:13 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 28, 2017 at 03:27:47PM -0600, Ashlie Martinez wrote:
>>
>> Unfortunately this timing bug only reproduces on some machines. Xiao
>> and I have been unable to reproduce this bug (I've tried kvm-xfstests,
>> my own kvm VMs, VMs without kvm, VMs with/without virtio drivers, and
>> another bare metal system). generic/456 basically sets up a race
>> condition between a kernel flusher thread and triggering dm-flakey, so
>> I think things like system load, core count, etc. might cause
>> different test results.
>
> Hmm, now I remember the details.  It reproduced reliably on
> gce-xfstests, but I was able to use kvm-xfstests to debug the problem
> (by invocations of debugfs to dump the file system state as I had
> described).  That's because debugfs operates on the buffer cache, and
> before the jbd2 commit, the changes to the inode structure are in the
> buffer cache, but they aren't allowed to be persisted on disk until
> after the journal commit.  And I was using debugfs to dump the inode's
> extent tree (as it exists in the buffer cache) before triggering
> dm-flakey.
>
> Now that we understand what is happening, it should be simple to
> adjust the test so it reliably reproduces, by adding a "sleep 6"
> before _flakey_drop_and_remote.  Since the delayed allocation write
> won't get resolved until 30 seconds after the inode was first dirtied,
> and the default jbd2 timer value is 5 seconds, this should guarantee
> that the jbd2 commit has taken place so that the inode changes made by
> fallocate are persisted onto the journal, while still allowing the
> delayed allocation write to be remain unresolved.
>

Sorry, sleep 6 didn't work for me.
Must be some other subtle detail.
If you could work out how to fix the test to catch the bug in kvm-xfstests
that would be nice.

Better yet, if you can figure out how to configure kvm-xfstests differently
so it catches the bug without modifying g/456 that would be much better,
because I currently cannot use kvm-xfstest to debug ANY of the
dm-log-writes crash test dummies.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-29  8:07               ` Amir Goldstein
@ 2017-11-29 19:58                 ` Ashlie Martinez
  2017-11-30  0:48                   ` Theodore Ts'o
  2017-11-30  0:24                 ` Theodore Ts'o
  1 sibling, 1 reply; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-29 19:58 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Theodore Ts'o, Vijay Chidambaram, Ext4

Ted,

1.      write 0x137dd 0xdc69 0x0
2.      fallocate 0xb531 0xb5ad 0x21446
3.      collapse_range 0x1c000 0x4000 0x21446
4.      write 0x3e5ec 0x1a14 0x21446
5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size

I have made a CrashMonkey test that runs the same operations run by
xfstests generic/456 as I wanted a bit more control over the test. My
test runs operations 1-3 from the list above, and then runs sleep(30).
After that, it runs operations 4 and 5 (I skipped operation 6 as it
doesn't seem to be related to the underlying cause of the bug).
CrashMonkey then waits a further 120 seconds for IO to trickle down to
the block device. After that, CrashMonkey replays the operations and I
have obtained the following output from fsck (run with -yf flags):

e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inode 12 has an invalid extent node (blk 8476, lblk 45)
Clear? yes

Inode 12, i_blocks is 168, should be 0.  Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(8451--8476) -(8481--8512) -(8550--8575)
Fix? yes

Free blocks count wrong for group #1 (7850, counted=7934).
Fix? yes

Free blocks count wrong (93420, counted=93504).
Fix? yes


/dev/cow_ram_snapshot1_0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/cow_ram_snapshot1_0: 12/25688 files (0.0% non-contiguous),
8896/102400 blocks

I get that the errors fixed in later passes are likely due to the
i_blocks and invalid extent issue found earlier (this is not shown in
xfstests as it usually runs with -nf). I think this output may be
related to what Amir found in generic/456, but I can't seem to
reproduce his output exactly. It almost seems like the crash I'm
getting with CrashMonkey is happening before any of the operations
have a chance to update the inode on disk. This seems to make sense if
you assume the write in operation 1 is delay allocated and the
subsequent fallocate calls don't adjust i_disksize in any way. If the
first write is delay allocated, then I am confused how Amir got the
output he did (with an updated i_disksize) as it seems like he should
get something similar to what I got.

On Wed, Nov 29, 2017 at 2:07 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 8:13 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>> On Tue, Nov 28, 2017 at 03:27:47PM -0600, Ashlie Martinez wrote:
>>>
>>> Unfortunately this timing bug only reproduces on some machines. Xiao
>>> and I have been unable to reproduce this bug (I've tried kvm-xfstests,
>>> my own kvm VMs, VMs without kvm, VMs with/without virtio drivers, and
>>> another bare metal system). generic/456 basically sets up a race
>>> condition between a kernel flusher thread and triggering dm-flakey, so
>>> I think things like system load, core count, etc. might cause
>>> different test results.
>>
>> Hmm, now I remember the details.  It reproduced reliably on
>> gce-xfstests, but I was able to use kvm-xfstests to debug the problem
>> (by invocations of debugfs to dump the file system state as I had
>> described).  That's because debugfs operates on the buffer cache, and
>> before the jbd2 commit, the changes to the inode structure are in the
>> buffer cache, but they aren't allowed to be persisted on disk until
>> after the journal commit.  And I was using debugfs to dump the inode's
>> extent tree (as it exists in the buffer cache) before triggering
>> dm-flakey.
>>
>> Now that we understand what is happening, it should be simple to
>> adjust the test so it reliably reproduces, by adding a "sleep 6"
>> before _flakey_drop_and_remote.  Since the delayed allocation write
>> won't get resolved until 30 seconds after the inode was first dirtied,
>> and the default jbd2 timer value is 5 seconds, this should guarantee
>> that the jbd2 commit has taken place so that the inode changes made by
>> fallocate are persisted onto the journal, while still allowing the
>> delayed allocation write to be remain unresolved.
>>
>
> Sorry, sleep 6 didn't work for me.
> Must be some other subtle detail.
> If you could work out how to fix the test to catch the bug in kvm-xfstests
> that would be nice.
>
> Better yet, if you can figure out how to configure kvm-xfstests differently
> so it catches the bug without modifying g/456 that would be much better,
> because I currently cannot use kvm-xfstest to debug ANY of the
> dm-log-writes crash test dummies.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-29  8:07               ` Amir Goldstein
  2017-11-29 19:58                 ` Ashlie Martinez
@ 2017-11-30  0:24                 ` Theodore Ts'o
  2017-11-30  6:46                   ` Amir Goldstein
  1 sibling, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-30  0:24 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Ashlie Martinez, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 10:07:39AM +0200, Amir Goldstein wrote:
> If you could work out how to fix the test to catch the bug in kvm-xfstests
> that would be nice.

One question, how did you actually test it using kvm-xfstests?  The
kvm-xfstests image that I had up on www.kernel.org did not have
generic/456.  It was dated from September 2017, so it didn't have that
test.  Since the publically available image didn't have generic/456,
did you create your own image some how?

I just updated the test appliance yesterday, so it does now.  The
version stamp on it is:

e2fsprogs	v1.43.6-85-g7595699d0 (Wed, 6 Sep 2017 22:04:14 -0400)
fio		fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
quota		4d81e8b (Mon, 16 Oct 2017 09:42:44 +0200)
stress-ng	977ae35 (Wed, 6 Sep 2017 23:45:03 -0400)
xfsprogs	v4.14.0-rc2-1-g19ca9b0b (Mon, 27 Nov 2017 10:56:21 -0600)
xfstests-bld	0b27af9 (Tue, 28 Nov 2017 16:28:51 -0500)
xfstests	linux-v3.8-1797-g4f1eaa02 (Tue, 28 Nov 2017 15:49:06 -0500)

Can you try the latest version of the kvm-xfststs test appliance image
that is on www.kernel.org, with the latest xfstests-bld git repo to
drive it?

I use the default configuration (so 2 cpu's, 2 gigs of memory).  The
only difference I have from the default config is that I have the
following in my ~/.config/kvm-xfstests:

VG=callcc

VDB=/dev/$VG/test-4k
VDC=/dev/$VG/scratch
VDD=/dev/$VG/test-1k
VDE=/dev/$VG/scratch2
VDF=/dev/$VG/scratch3
VDG=/dev/$VG/results

Where callcc is an LVM setup using an SSD.  (A Samsung 850 PRO, to be
precise).  VDB, VDC, VDD, and VDG are 5 gig logical volumes, VDE and
VDF are 20 gig logical volumes (although you won't need those for this
repro.)

With this setup, generic/456 reproduces *reliably* for me.  I tested
using a v4.15-rc1 kernel with commit 51e3ae81ec58e95 reverted. and the
command: "kvm-xfstests -c 4k generic/456"

      		  	      	     		     - Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-29 19:58                 ` Ashlie Martinez
@ 2017-11-30  0:48                   ` Theodore Ts'o
  2017-11-30  1:46                     ` Ashlie Martinez
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-30  0:48 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 01:58:53PM -0600, Ashlie Martinez wrote:
> Ted,
> 
> 1.      write 0x137dd 0xdc69 0x0
> 2.      fallocate 0xb531 0xb5ad 0x21446
> 3.      collapse_range 0x1c000 0x4000 0x21446
> 4.      write 0x3e5ec 0x1a14 0x21446
> 5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size
> 
> I have made a CrashMonkey test that runs the same operations run by
> xfstests generic/456 as I wanted a bit more control over the test. My
> test runs operations 1-3 from the list above, and then runs sleep(30).
> After that, it runs operations 4 and 5 (I skipped operation 6 as it
> doesn't seem to be related to the underlying cause of the bug).
> CrashMonkey then waits a further 120 seconds for IO to trickle down to
> the block device.

So I'm not sure exactly what Crashmonkey is doing here.  Are you
forcing a crash, or not?

So here's my test which I think should replicate what you are doing.

1.  Start "kvm-xfstests shell"
2.  Create the fsxops file:

	cat > /tmp/fsxops
	write 0x137dd 0xdc69 0x0
	fallocate 0xb531 0xb5ad 0x21446
	collapse_range 0x1c000 0x4000 0x21446
	write 0x3e5ec 0x1a14 0x21446
	zero_range 0x20fac 0x6d9c 0x40000 keep_size
	<type control-d>

3.  Create a scratch file system and mount it:

	mke2fs -Fq -t ext4 /dev/vdc
	mount /vdc

4.  Run fsx:

	./xfstests/ltp/fsx -d --replay-ops /tmp/fsxops /vdc/testfile

5.  Since I'm too lazy to wait 120 seconds, just force everything to disk:

	sync

6a.  Unmount the file system and check it:

	umount /vdc
	e2fsck -fy /dev/vdc

6b.   Force a crash, and then restart kvm-xfstests shell, and then check the file system:

	<type control-A followed by 'x'>
	kvm-xfstests shell
	e2fsck -fy /dev/vdc

In both cases, e2fsck does not complain.  In the 6b variant, e2fsck
will replay the journal first, but other than that, no real differences.

So, tell me --- how is what I am doing any different from your Crashmonkey test?

    	    	       	      - Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30  0:48                   ` Theodore Ts'o
@ 2017-11-30  1:46                     ` Ashlie Martinez
  2017-11-30  4:46                       ` Theodore Ts'o
  0 siblings, 1 reply; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-30  1:46 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 6:48 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Nov 29, 2017 at 01:58:53PM -0600, Ashlie Martinez wrote:
>> Ted,
>>
>> 1.      write 0x137dd 0xdc69 0x0
>> 2.      fallocate 0xb531 0xb5ad 0x21446
>> 3.      collapse_range 0x1c000 0x4000 0x21446
>> 4.      write 0x3e5ec 0x1a14 0x21446
>> 5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size
>>
>> I have made a CrashMonkey test that runs the same operations run by
>> xfstests generic/456 as I wanted a bit more control over the test. My
>> test runs operations 1-3 from the list above, and then runs sleep(30).
>> After that, it runs operations 4 and 5 (I skipped operation 6 as it
>> doesn't seem to be related to the underlying cause of the bug).
>> CrashMonkey then waits a further 120 seconds for IO to trickle down to
>> the block device.
>
> So I'm not sure exactly what Crashmonkey is doing here.  Are you
> forcing a crash, or not?

My apologies, I should have given proper context on CrashMonkey in my
previous email. CrashMonkey is a record-and-replay framework much like
dm-log-writes, but with a few differences. First, it records bio
information when the bios are given to the block device driver, not
when they are completed. Second, CrashMonkey has an option to
re-arrange the recorded bios within a set of ordering rules. These
rules adhere to FUA and flush commands (i.e. they will not move bios
across those barriers) and they may cause some bios between the
previous barrier operation and the next barrier operation to be
dropped if the next barrier operation is not reached. These rules are
meant to help simulate a crash involving a disk with an onboard cache.
In that case, the programmer cannot know what writes were cached on
the device and what writes were persisted when the crash occurs
(unless the crash occurs right after a barrier operation, in which
case everything should be persisted).

>
> So here's my test which I think should replicate what you are doing.
>
> 1.  Start "kvm-xfstests shell"
> 2.  Create the fsxops file:
>
>         cat > /tmp/fsxops
>         write 0x137dd 0xdc69 0x0
>         fallocate 0xb531 0xb5ad 0x21446
>         collapse_range 0x1c000 0x4000 0x21446
>         write 0x3e5ec 0x1a14 0x21446
>         zero_range 0x20fac 0x6d9c 0x40000 keep_size
>         <type control-d>
>
> 3.  Create a scratch file system and mount it:
>
>         mke2fs -Fq -t ext4 /dev/vdc
>         mount /vdc
>
> 4.  Run fsx:
>
>         ./xfstests/ltp/fsx -d --replay-ops /tmp/fsxops /vdc/testfile
>
> 5.  Since I'm too lazy to wait 120 seconds, just force everything to disk:
>
>         sync

I believe you said in an earlier email that sync would erase any trace
of the bug Amir found as it resolves the delayed allocation.

>
> 6a.  Unmount the file system and check it:
>
>         umount /vdc
>         e2fsck -fy /dev/vdc
>
> 6b.   Force a crash, and then restart kvm-xfstests shell, and then check the file system:
>
>         <type control-A followed by 'x'>
>         kvm-xfstests shell
>         e2fsck -fy /dev/vdc
>
> In both cases, e2fsck does not complain.  In the 6b variant, e2fsck
> will replay the journal first, but other than that, no real differences.
>
> So, tell me --- how is what I am doing any different from your Crashmonkey test?

CrashMonkey first records the entire stream of bios, including the
resolution of the delayed allocation. Once the workload has finished,
it writes out subsets of the recorded bios to the disk (restoring the
disk to it's state prior to the workload each time), checking each
resulting "crash state" with fsck to ensure it's consistent. One of
the goals of CrashMonkey is to test many crash states from a single
workload by rearranging the bios logged during the workload.
CrashMonkey also hopes to catch bugs that are caused by reordering
bios between two barrier operations, assuming the crash occurred
before the second barrier operation (this would depend on timing in
systems like xfstests and dm-log-writes, but since CrashMonkey records
and replays, it does not face these timing issues). An overview of
CrashMonkey and slides about can be found at [1].

When I was testing a workload based off Amir's generic/456 test in
CrashMonkey, I noticed the output that I sent to you earlier.

[1] https://www.usenix.org/conference/hotstorage17/program/presentation/martinez

>
>                               - Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30  1:46                     ` Ashlie Martinez
@ 2017-11-30  4:46                       ` Theodore Ts'o
  2017-11-30 14:22                         ` Theodore Ts'o
  2017-11-30 14:51                         ` Ashlie Martinez
  0 siblings, 2 replies; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-30  4:46 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 07:46:08PM -0600, Ashlie Martinez wrote:
> > 5.  Since I'm too lazy to wait 120 seconds, just force everything to disk:
> >
> >         sync
> 
> I believe you said in an earlier email that sync would erase any trace
> of the bug Amir found as it resolves the delayed allocation.

Right, but you're waiting 120 seconds, which is enough time that it
would resolve the delayed allocation.  So that's why I was trying to
replicate your Crashmonkey experiment.

And since you stated that you waited 120 seconds as the last step,
there should have been a barrier, and no I/O operations for
Crashmonkey to rearrange.  This is why I believe what I listed should
be exactly the same as your Crashmonkey test, if I understood it
correctly.

What you said was that you ran the following operations on the test
file:

1.      write 0x137dd 0xdc69 0x0
2.      fallocate 0xb531 0xb5ad 0x21446
3.      collapse_range 0x1c000 0x4000 0x21446
	<sleep 30>
4.      write 0x3e5ec 0x1a14 0x21446
5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size
	<sleep 120>

So what was the block I/O trace?  What operations was crashmonkey
actually reordering?  There **really** shouldn't have been any....

Can you send the block I/O trace that was observed when you did the
following, a complete output of dumpe2fs on the file system, and a
debugfs stat output on the test file?

						- Ted

P.S.  I did read the Crashmonkey paper, so I'm aware of what
Crashmonkey does.  I'm just confused about your workload, since the
120 second sleep should have meant there was a barrier followed by
nothing else, making for a *very* boring crashmonkey replay.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30  0:24                 ` Theodore Ts'o
@ 2017-11-30  6:46                   ` Amir Goldstein
  0 siblings, 0 replies; 23+ messages in thread
From: Amir Goldstein @ 2017-11-30  6:46 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ashlie Martinez, Vijay Chidambaram, Ext4

On Thu, Nov 30, 2017 at 2:24 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Nov 29, 2017 at 10:07:39AM +0200, Amir Goldstein wrote:
>> If you could work out how to fix the test to catch the bug in kvm-xfstests
>> that would be nice.
>
> One question, how did you actually test it using kvm-xfstests?  The
> kvm-xfstests image that I had up on www.kernel.org did not have
> generic/456.  It was dated from September 2017, so it didn't have that
> test.  Since the publically available image didn't have generic/456,
> did you create your own image some how?

Of course. I am using my own image and update xfstests frequently.

>
> I just updated the test appliance yesterday, so it does now.  The
> version stamp on it is:
>
> e2fsprogs       v1.43.6-85-g7595699d0 (Wed, 6 Sep 2017 22:04:14 -0400)
> fio             fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
> quota           4d81e8b (Mon, 16 Oct 2017 09:42:44 +0200)
> stress-ng       977ae35 (Wed, 6 Sep 2017 23:45:03 -0400)
> xfsprogs        v4.14.0-rc2-1-g19ca9b0b (Mon, 27 Nov 2017 10:56:21 -0600)
> xfstests-bld    0b27af9 (Tue, 28 Nov 2017 16:28:51 -0500)
> xfstests        linux-v3.8-1797-g4f1eaa02 (Tue, 28 Nov 2017 15:49:06 -0500)
>
> Can you try the latest version of the kvm-xfststs test appliance image
> that is on www.kernel.org, with the latest xfstests-bld git repo to
> drive it?

It works with latest image.
456 fails on old kernel and pass on v4.15-rc1.

When I get to it, I will also check dm-log-write tests with latest image
and rebuild my image from scratch to see if that changes anything.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30  4:46                       ` Theodore Ts'o
@ 2017-11-30 14:22                         ` Theodore Ts'o
  2017-11-30 14:51                         ` Ashlie Martinez
  1 sibling, 0 replies; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-30 14:22 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 11:46:49PM -0500, Theodore Ts'o wrote:
> 
> Can you send the block I/O trace that was observed when you did the
> following, a complete output of dumpe2fs on the file system, and a
> debugfs stat output on the test file?

Oh, and the block I/O trace after CrashMonkey reordered the I/O
operations after the last CACHE FLUSH / barrier operation?  I'm
***really*** curious what I/O operations happened *after* the last
barrier operation if you had waited 120 secounds before triggering the
crash.

						- Ted

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30  4:46                       ` Theodore Ts'o
  2017-11-30 14:22                         ` Theodore Ts'o
@ 2017-11-30 14:51                         ` Ashlie Martinez
  2017-11-30 15:27                           ` Theodore Ts'o
  1 sibling, 1 reply; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-30 14:51 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Wed, Nov 29, 2017 at 10:46 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Nov 29, 2017 at 07:46:08PM -0600, Ashlie Martinez wrote:
>> > 5.  Since I'm too lazy to wait 120 seconds, just force everything to disk:
>> >
>> >         sync
>>
>> I believe you said in an earlier email that sync would erase any trace
>> of the bug Amir found as it resolves the delayed allocation.
>
> Right, but you're waiting 120 seconds, which is enough time that it
> would resolve the delayed allocation.  So that's why I was trying to
> replicate your Crashmonkey experiment.
>
> And since you stated that you waited 120 seconds as the last step,
> there should have been a barrier, and no I/O operations for
> Crashmonkey to rearrange.  This is why I believe what I listed should
> be exactly the same as your Crashmonkey test, if I understood it
> correctly.
>
> What you said was that you ran the following operations on the test
> file:
>
> 1.      write 0x137dd 0xdc69 0x0
> 2.      fallocate 0xb531 0xb5ad 0x21446
> 3.      collapse_range 0x1c000 0x4000 0x21446
>         <sleep 30>
> 4.      write 0x3e5ec 0x1a14 0x21446
> 5.      zero_range 0x20fac 0x6d9c 0x40000 keep_size
>         <sleep 120>
>
> So what was the block I/O trace?  What operations was crashmonkey
> actually reordering?  There **really** shouldn't have been any....
>
> Can you send the block I/O trace that was observed when you did the
> following, a complete output of dumpe2fs on the file system, and a
> debugfs stat output on the test file?

I'll work on getting you traces, or other information with which you
can recreate the crash with later today or tomorrow. Unfortunately I'm
a bit slammed with work right now (undergrad honors thesis, end of
semester projects, etc.), but I'll try to get it out soon :)

>
>                                                 - Ted
>
> P.S.  I did read the Crashmonkey paper, so I'm aware of what
> Crashmonkey does.  I'm just confused about your workload, since the
> 120 second sleep should have meant there was a barrier followed by
> nothing else, making for a *very* boring crashmonkey replay.
>

Even though CrashMonkey *records* all the disk operations, it doesn't
have to replay all of them when generating crash states. For example,
it could choose to fully replay (and preserve the ordering of) all
operations before the 3rd barrier operation in a trace with 5
different barrier operations in it (we dub each set of operations from
just after the previous barrier operation up to and including the next
barrier operation a "disk write epoch" or "disk epoch"). If
CrashMonkey decides to write out a partial disk epoch at the end, it
can rearrange and/or drop operations from it according to the rules
laid out in my earlier message. This is more or less equivalent to
running dm-log-writes, placing a mark at each barrier operation it
sees, and then replaying up to each mark (and potentially a little
past the mark) to generate a disk image to check.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30 14:51                         ` Ashlie Martinez
@ 2017-11-30 15:27                           ` Theodore Ts'o
  2017-11-30 15:40                             ` Ashlie Martinez
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2017-11-30 15:27 UTC (permalink / raw)
  To: Ashlie Martinez; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote:
> 
> Even though CrashMonkey *records* all the disk operations, it doesn't
> have to replay all of them when generating crash states. For example,
> it could choose to fully replay (and preserve the ordering of) all
> operations before the 3rd barrier operation in a trace with 5
> different barrier operations in it (we dub each set of operations from
> just after the previous barrier operation up to and including the next
> barrier operation a "disk write epoch" or "disk epoch").

OK, but then it won't replay any operations after the 4th barrier
opration, correct?  And in the case where you are stopping somewhere
between the 3rd and the 4th replay, you will drop and reorder random
operations after the 3rd barrier op, but before the 4th, correct?

So what would be good is to understand where it stopped replaying
operations ---- and if you can get a strace -ttt of the workload, and
the fine-grained timestamps from block I/O trace, so we can understand
how far we had gotten in the workload.  And then, to also to include
the output of debugfs's "logdump -ac" command before the journal is
replayed.

>From a file system developer's perspective, what is most useful is
when we can see the minimal reproduction test case where Crashmonkey
is only rearranging block I/O's of the last full "disk write epock".
And getting the timestampped strace logs and block I/O logs will help
us do that.

Otherwise, we have no idea where to look for a potential problem, only
that it's one of the several I/O commands.


Anyway, looking back at your original question, is your question why
the first write isn't delay allocated?  That's because the
collapse_range operation needs to resolve any delayed allocation
writes on the portion of the extent tree which will be affected by the
collapse_range operation.  See the calls to
filemap_write_and_wait_range() in ext4_collapse_range().

Note that if you want to try to understand what is going on, there are
a large number of ext4 and jbd2 tracepoints.  Enabling these
tracepoints (you may need to omit some of the much more chatty jbd2
trace points from the ones that you enable) and dumping those
timestamps alongside the strace -ttt and block I/O timestamps should
be especially illuminating.

Cheers,

						- Ted

P.S.  Another set of tracepoints that might be useful for
understanding when delayed write allocations are getting resolved are
the writeback tracepoints --- although you can probably infer those
from the ext4_writepages traces, since when the writeback is triggered
this will trigger calls to ext4_writepages.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30 15:27                           ` Theodore Ts'o
@ 2017-11-30 15:40                             ` Ashlie Martinez
  2017-12-02 20:00                               ` Ashlie Martinez
  0 siblings, 1 reply; 23+ messages in thread
From: Ashlie Martinez @ 2017-11-30 15:40 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Thu, Nov 30, 2017 at 9:27 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote:
>>
>> Even though CrashMonkey *records* all the disk operations, it doesn't
>> have to replay all of them when generating crash states. For example,
>> it could choose to fully replay (and preserve the ordering of) all
>> operations before the 3rd barrier operation in a trace with 5
>> different barrier operations in it (we dub each set of operations from
>> just after the previous barrier operation up to and including the next
>> barrier operation a "disk write epoch" or "disk epoch").
>
> OK, but then it won't replay any operations after the 4th barrier
> opration, correct?  And in the case where you are stopping somewhere
> between the 3rd and the 4th replay, you will drop and reorder random
> operations after the 3rd barrier op, but before the 4th, correct?
>
> So what would be good is to understand where it stopped replaying
> operations ---- and if you can get a strace -ttt of the workload, and
> the fine-grained timestamps from block I/O trace, so we can understand
> how far we had gotten in the workload.  And then, to also to include
> the output of debugfs's "logdump -ac" command before the journal is
> replayed.
>
> From a file system developer's perspective, what is most useful is
> when we can see the minimal reproduction test case where Crashmonkey
> is only rearranging block I/O's of the last full "disk write epock".
> And getting the timestampped strace logs and block I/O logs will help
> us do that.
>
> Otherwise, we have no idea where to look for a potential problem, only
> that it's one of the several I/O commands.

These are good suggestions that I will work on adding into my current
testing flow. Right now, I have CrashMonkey dump the recorded bios to
a file in binary (so it can be hexdumped) and also record the bios
that were used to generate each crash state it tests.

>
>
> Anyway, looking back at your original question, is your question why
> the first write isn't delay allocated?  That's because the
> collapse_range operation needs to resolve any delayed allocation
> writes on the portion of the extent tree which will be affected by the
> collapse_range operation.  See the calls to
> filemap_write_and_wait_range() in ext4_collapse_range().

I also ran a few more tests on this, it appears that when fsx does
mapwrite operations it calls msync after mmap (for obvious reasons).
Adding the mmap/msync operations to the end of my CrashMonkey workload
produced output similar to Amir's reported output, but with slightly
different values for i_blocks. There may be another interaction there
that we hadn't thought of based on this information. The top fsck
output below is from CrashMonkey with mmap/msync at the end of the
workload. The fsck output below that is from Amir's original post
about the bug. Note that my test is running on a much smaller file
system than Amir's, so the logical/physical blocks will differ.

e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inode 12, end of extent exceeds allowed value
  (logical block 131, physical block 8465, len 11)
Clear? yes

Inode 12, i_blocks is 168, should be 146.  Fix? yes


e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inode 12, end of extent exceeds allowed value
       (logical block 33, physical block 33441, len 7)
Clear? no
Inode 12, i_blocks is 184, should be 128.  Fix? no


>
> Note that if you want to try to understand what is going on, there are
> a large number of ext4 and jbd2 tracepoints.  Enabling these
> tracepoints (you may need to omit some of the much more chatty jbd2
> trace points from the ones that you enable) and dumping those
> timestamps alongside the strace -ttt and block I/O timestamps should
> be especially illuminating.
>
> Cheers,
>
>                                                 - Ted
>
> P.S.  Another set of tracepoints that might be useful for
> understanding when delayed write allocations are getting resolved are
> the writeback tracepoints --- although you can probably infer those
> from the ext4_writepages traces, since when the writeback is triggered
> this will trigger calls to ext4_writepages.
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash
  2017-11-30 15:40                             ` Ashlie Martinez
@ 2017-12-02 20:00                               ` Ashlie Martinez
  0 siblings, 0 replies; 23+ messages in thread
From: Ashlie Martinez @ 2017-12-02 20:00 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Amir Goldstein, Vijay Chidambaram, Ext4

On Thu, Nov 30, 2017 at 9:40 AM, Ashlie Martinez <ashmrtn@utexas.edu> wrote:
> On Thu, Nov 30, 2017 at 9:27 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>> On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote:
>>>
>>> Even though CrashMonkey *records* all the disk operations, it doesn't
>>> have to replay all of them when generating crash states. For example,
>>> it could choose to fully replay (and preserve the ordering of) all
>>> operations before the 3rd barrier operation in a trace with 5
>>> different barrier operations in it (we dub each set of operations from
>>> just after the previous barrier operation up to and including the next
>>> barrier operation a "disk write epoch" or "disk epoch").
>>
>> OK, but then it won't replay any operations after the 4th barrier
>> opration, correct?  And in the case where you are stopping somewhere
>> between the 3rd and the 4th replay, you will drop and reorder random
>> operations after the 3rd barrier op, but before the 4th, correct?
>>
>> So what would be good is to understand where it stopped replaying
>> operations ---- and if you can get a strace -ttt of the workload, and
>> the fine-grained timestamps from block I/O trace, so we can understand
>> how far we had gotten in the workload.  And then, to also to include
>> the output of debugfs's "logdump -ac" command before the journal is
>> replayed.

It appears that the statements I added to provide extra debugging have
changed the timing on things and now I can't seem to get the same
output. I will keep poking at this later when I have more free time
and send you anything interesting that I find. Now that I know what
format to send material in, it should make things a little easier.

>>
>> From a file system developer's perspective, what is most useful is
>> when we can see the minimal reproduction test case where Crashmonkey
>> is only rearranging block I/O's of the last full "disk write epock".
>> And getting the timestampped strace logs and block I/O logs will help
>> us do that.
>>
>> Otherwise, we have no idea where to look for a potential problem, only
>> that it's one of the several I/O commands.
>
> These are good suggestions that I will work on adding into my current
> testing flow. Right now, I have CrashMonkey dump the recorded bios to
> a file in binary (so it can be hexdumped) and also record the bios
> that were used to generate each crash state it tests.
>
>>
>>
>> Anyway, looking back at your original question, is your question why
>> the first write isn't delay allocated?  That's because the
>> collapse_range operation needs to resolve any delayed allocation
>> writes on the portion of the extent tree which will be affected by the
>> collapse_range operation.  See the calls to
>> filemap_write_and_wait_range() in ext4_collapse_range().
>
> I also ran a few more tests on this, it appears that when fsx does
> mapwrite operations it calls msync after mmap (for obvious reasons).
> Adding the mmap/msync operations to the end of my CrashMonkey workload
> produced output similar to Amir's reported output, but with slightly
> different values for i_blocks. There may be another interaction there
> that we hadn't thought of based on this information. The top fsck
> output below is from CrashMonkey with mmap/msync at the end of the
> workload. The fsck output below that is from Amir's original post
> about the bug. Note that my test is running on a much smaller file
> system than Amir's, so the logical/physical blocks will differ.
>
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 12, end of extent exceeds allowed value
>   (logical block 131, physical block 8465, len 11)
> Clear? yes
>
> Inode 12, i_blocks is 168, should be 146.  Fix? yes
>
>
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 12, end of extent exceeds allowed value
>        (logical block 33, physical block 33441, len 7)
> Clear? no
> Inode 12, i_blocks is 184, should be 128.  Fix? no
>
>
>>
>> Note that if you want to try to understand what is going on, there are
>> a large number of ext4 and jbd2 tracepoints.  Enabling these
>> tracepoints (you may need to omit some of the much more chatty jbd2
>> trace points from the ones that you enable) and dumping those
>> timestamps alongside the strace -ttt and block I/O timestamps should
>> be especially illuminating.
>>
>> Cheers,
>>
>>                                                 - Ted
>>
>> P.S.  Another set of tracepoints that might be useful for
>> understanding when delayed write allocations are getting resolved are
>> the writeback tracepoints --- although you can probably infer those
>> from the ext4_writepages traces, since when the writeback is triggered
>> this will trigger calls to ext4_writepages.
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-12-02 20:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-17 15:43 ext4 fix for interaction between i_size, fallocate, and delalloc after a crash Ashlie Martinez
2017-11-21 16:17 ` Ashlie Martinez
2017-11-22 18:03 ` Theodore Ts'o
2017-11-23  5:39   ` Amir Goldstein
2017-11-27 14:31   ` Ashlie Martinez
2017-11-27 16:11     ` Theodore Ts'o
2017-11-28 13:04       ` Ashlie Martinez
2017-11-28 20:45         ` Theodore Ts'o
2017-11-28 21:27           ` Ashlie Martinez
2017-11-29  3:37             ` Amir Goldstein
2017-11-29  6:13             ` Theodore Ts'o
2017-11-29  8:07               ` Amir Goldstein
2017-11-29 19:58                 ` Ashlie Martinez
2017-11-30  0:48                   ` Theodore Ts'o
2017-11-30  1:46                     ` Ashlie Martinez
2017-11-30  4:46                       ` Theodore Ts'o
2017-11-30 14:22                         ` Theodore Ts'o
2017-11-30 14:51                         ` Ashlie Martinez
2017-11-30 15:27                           ` Theodore Ts'o
2017-11-30 15:40                             ` Ashlie Martinez
2017-12-02 20:00                               ` Ashlie Martinez
2017-11-30  0:24                 ` Theodore Ts'o
2017-11-30  6:46                   ` Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.