All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nikolay Borisov <nborisov@suse.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: fdmanana@gmail.com, Qu Wenruo <wqu@suse.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation
Date: Tue, 7 May 2019 10:49:27 +0300	[thread overview]
Message-ID: <a5b4c1a5-a76a-4412-4f42-677ef2676e62@suse.com> (raw)
In-Reply-To: <d3e1b3dd-60da-bd6f-24b7-7cda8fde6ac2@gmx.com>



On 6.05.19 г. 5:04 ч., Qu Wenruo wrote:
> [snip]
>>>
>>> For data writeback, it should only cause sync related failure.
>>
>> Ok, so please remove the transaction abort comments for next iteration.
>> By sync related failure, you mean running dealloc fails with ENOSPC,
>> since when it tries to reserve a data extent it fails as it can't find
>> any free extent.
> 
> Well, btrfs has some hidden way to fix such problem by itself already.
> 
> At buffered write time, we have the following call chain:
> btrfs_buffered_write()
> |- btrfs_check_data_free_space()
>    |- btrfs_alloc_data_chunk_ondemand()
>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
>       |- do_chunk_alloc() /* we have no data space available */
>       |- if (ret < 0 && ret == -ENOSPC)
>       |-     goto commit_trans;  /* try to free some space */
>       |- commit_trans:
>       |-     need_commit--;
>       |-     if (need_commit > 0) {
>       |-         btrfs_start_delalloc_roots();
>       |-         btrfs_wait_ordered_roots();
>       |-     }
> 
> This means, as long as we hit ENOSPC for data space, we will start write
> back, thus NODATACOW data will still hit disk as NODATACOW.

I'm lost for words at expressing how subtle and despicable that code is
... Is there a way to factor that out and make it more explicit ? I
don't think we should rely on such subtleties...

> 
> Such hidden behavior along with always-reserve-data-space solves the
> problem pretty well.
> We either:
> - reserve data space
>   Then no matter how it ends, we're OK, although it may end as CoW.
> 
> - Failed to reserve data space
>   Writeback will be triggered anyway, no way to screw things around.
> 
> Thus this workaround has nothing to fix, but only make certain NODATACOW
> reach disk as NODATACOW.
> 
> It makes some NODATACOW behaves more correctly but won't fix any obvious
> bug.
> 
> My personal take is to fix any strange behavior even it won't cause any
> problem, but the full inode writeback can be performance heavy.
> 
> So my question is, do we really need this anyway?
> 
> Thanks,
> Qu
> 
>>
>>>
>>>> I don't recall starting transactions when running dealloc, and failed
>>>> to see where after a quick glance to cow_file_range()
>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>>> when starting writeback.
>>>>
>>>>>
>>>>> [CAUSE]
>>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>>
>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>>> extent is shared or not.
>>>>>
>>>>> So for above script we have:
>>>>> - fallocate
>>>>> - buffered write
>>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>>   At this timming, the extent is not shared, we can skip data
>>>>>   reservation.
>>>>
>>>> But in the above example we don't fall to nocow mode when doing the
>>>> buffered write, as there's plenty of data space available (1Gb -
>>>> 24Kb).
>>>> You need to update the example.
>>> I have to admit that the core part is mostly based on the worst case
>>> *assumption*.
>>>
>>> I'll try to make the case convincing by making it fail directly.
>>
>> Great, thanks.
>>
>>>
>>>>
>>>>
>>>>> - reflink
>>>>>   Now part of the large preallocated extent is shared.
>>>>> - delalloc kicks in
>>>>
>>>> writeback kicks in
>>>>
>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>>   to fall back to COW.
>>>>>
>>>>> [WORKAROUND]
>>>>> The workaround is to ensure any buffered write in the related extents
>>>>> (not the reflink source range) get flushed before reflink.
>>>>
>>>> not the reflink source range -> not just the reflink source range
>>>>
>>>>>
>>>>> However it's pretty expensive to do a comprehensive check.
>>>>> In the reproducer, the reflink source is just a part of a larger
>>>>
>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>>> clear by looking at it that it doesn't trigger the nocow case).
>>>>
>>>>> preallocated extent, we need to flush all buffered write of that extent
>>>>> before reflink.
>>>>> Such backward search can be complex and we may not get much benefit from
>>>>> it.
>>>>>
>>>>> So this patch will just try to flush the whole inode before reflink.
>>>>
>>>>
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>> Reason for RFC:
>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>> overkilling, but I don't have any better way to handle it.
>>>>>
>>>>> Any comment on this is welcomed.
>>>>> ---
>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>>  1 file changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>>                         return ret;
>>>>>         }
>>>>>
>>>>> +       /*
>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>> +        *
>>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>>> +        * extent level share view. Any part of an extent is shared then the
>>>>
>>>> Any -> If any
>>>>
>>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>>
>>>> is -> is considered
>>>>
>>>>> +        * back to COW.
>>>>
>>>> I would add, something like:
>>>>
>>>> That is, btrfs' back references do not have a block level granularity,
>>>> they work at the whole extent level.
>>>>
>>>>> +        *
>>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>> +        * at delalloc time (transaction abort).
>>>>
>>>> I would omit the warning and transaction abort parts, that can change
>>>> any time. And we have that information in the changelog, so it's not
>>>> lost.
>>>>
>>>>> +        *
>>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>>> +        * by finding all extents in that range and flush the space referring
>>>>> +        * all those extents.
>>>>> +        * But that's too complex for such corner case.
>>>>> +        */
>>>>> +       filemap_flush(src->i_mapping);
>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>>> +               filemap_flush(src->i_mapping);
>>>>
>>>> So a few comments here:
>>>>
>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>>
>>> Right.
>>>
>>>>
>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>>> where we do the source and target range flush and wait.
>>>>
>>>> Can you turn the reproducer into an fstests case?
>>>
>>> Sure.
>>>
>>> Thanks for the info and all the comment,
>>> Qu
>>>
>>>>
>>>> Thanks.
>>>>
>>>>> +
>>>>>         /*
>>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>>          * source range to serialize with relocation.
>>>>> --
>>>>> 2.21.0
>>>>>
>>>>
>>>>
>>>
>>
>>
> 

  reply	other threads:[~2019-05-07  7:49 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-03  1:08 [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation Qu Wenruo
2019-05-03  9:21 ` Filipe Manana
2019-05-03 10:18   ` Qu Wenruo
2019-05-03 10:45     ` Filipe Manana
2019-05-04  8:29       ` Nikolay Borisov
2019-05-06  2:04       ` Qu Wenruo
2019-05-07  7:49         ` Nikolay Borisov [this message]
2019-05-07  8:56         ` Filipe Manana
2019-05-07 11:13           ` Qu Wenruo
2019-05-07 11:36             ` Filipe Manana
2019-05-03 21:56 ` Zygo Blaxell
2019-05-04  0:32   ` Qu Wenruo
2019-05-05 15:07     ` Zygo Blaxell
2019-05-05 16:24       ` Filipe Manana
2019-05-06  0:06         ` Qu Wenruo
2019-05-07 17:36 ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a5b4c1a5-a76a-4412-4f42-677ef2676e62@suse.com \
    --to=nborisov@suse.com \
    --cc=fdmanana@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.