Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: fdmanana@gmail.com
Cc: Qu Wenruo <wqu@suse.com>, linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation
Date: Tue, 7 May 2019 19:13:38 +0800	[thread overview]
Message-ID: <028e58f3-99a6-f4bc-51f0-12eafee92c76@gmx.com> (raw)
In-Reply-To: <CAL3q7H7jmCnObo4Mtnnm9_07pePySbsRqdOFX+348s_q4VnA2Q@mail.gmail.com>

[-- Attachment #1.1: Type: text/plain, Size: 10929 bytes --]

On 2019/5/7 下午4:56, Filipe Manana wrote:
> On Mon, May 6, 2019 at 3:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> [snip]
>>>>
>>>> For data writeback, it should only cause sync related failure.
>>>
>>> Ok, so please remove the transaction abort comments for next iteration.
>>> By sync related failure, you mean running dealloc fails with ENOSPC,
>>> since when it tries to reserve a data extent it fails as it can't find
>>> any free extent.
>>
>> Well, btrfs has some hidden way to fix such problem by itself already.
>>
>> At buffered write time, we have the following call chain:
>> btrfs_buffered_write()
>> |- btrfs_check_data_free_space()
>>    |- btrfs_alloc_data_chunk_ondemand()
>>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
>>       |- do_chunk_alloc() /* we have no data space available */
>>       |- if (ret < 0 && ret == -ENOSPC)
>>       |-     goto commit_trans;  /* try to free some space */
>>       |- commit_trans:
>>       |-     need_commit--;
>>       |-     if (need_commit > 0) {
>>       |-         btrfs_start_delalloc_roots();
>>       |-         btrfs_wait_ordered_roots();
>>       |-     }
>>
>> This means, as long as we hit ENOSPC for data space, we will start write
>> back, thus NODATACOW data will still hit disk as NODATACOW.
>>
>> Such hidden behavior along with always-reserve-data-space solves the
>> problem pretty well.
> 
> It doesn't solve the problem you reported in the rfc patch.

You're right, it doesn't solve the problem at all.
In fact, another bug caused my test script to pass even with some dirty
pages unable to be flushed back.

But it at least make sure all other pages reach disk as NODATACOW except
the last page.

> 
> What happens:
> 
> 1) We have a file with a prealloc extent, that isn't shared
> 
> 2) We have 0 bytes of available data space (or any amount less then
> that of the buffered write size)
> 
> 3) A buffered write happens that falls within a subrange of the prealloc extent.
>     We can't reserve space, we do all those things at
> btrfs_alloc_data_chunk_ondemand(), but we can't get any data space
> released, since it's all allocated.

At that time, we're already flushing all previously buffered write data.

E.g. if we're writing into one 1M preallocated extent.
The first 4K, we have no data space reserved, dirtied the page, prepare
all delalloc.

Then the 2nd 4K, we call btrfs_check_data_free_space(), as we're at low
data free space already, we flush all inodes, including the previous 4K
we just dirtied.
Then the first 4K get written to disk NODATACOW, as expected.

This loop happens until we reach the last page.

>     Therefore we fall back to nodatacow mode. We dirty the pages, mark
> the range as dealloc, etc.
> 
> 4) The reflink happens, for a subrange of the prealloc extent that
> does not overlap the range of the buffered write.

Just before the reflink, we only have 1 dirty page (the last page of
that buffered write) doesn't reach disk yet.

For the final page, we have no choice but do COW, and it fails with -ENOSPC.

However due to some other problem, the -ENOSPC doesn't reach user space
at all.

> 
> 5) Some time after the reflink, writeback starts for the inode.
>     During the writeback we fallback to COW mode, because the prealloc
> extent is shared, even if the subrange of the buffered write does not
> overlap the reflinked subrange.
>     Now the write silently fails with -ENOSPC, and a user doesn't know
> about it unless it does an fsync after that writeback, which will
> report the error via filemap_check_wb_err().
> 
>> We either:
>> - reserve data space
>>   Then no matter how it ends, we're OK, although it may end as CoW.
>>
>> - Failed to reserve data space
>>   Writeback will be triggered anyway, no way to screw things around.
>>
>> Thus this workaround has nothing to fix, but only make certain NODATACOW
>> reach disk as NODATACOW.
>>
>> It makes some NODATACOW behaves more correctly but won't fix any obvious
>> bug.
>>
>> My personal take is to fix any strange behavior even it won't cause any
>> problem, but the full inode writeback can be performance heavy.
>>
>> So my question is, do we really need this anyway?
> 
> Do we need what? Your patch, that logic at
> btrfs_alloc_data_chunk_ondemand(), something else?

I meant the patch, but the deeper I dig into the problem, more problem I
found.

The patch is still needed, but there is a more important bug, that
btrfs_run_delalloc_range() failure won't be reported in sync.

The script here I'm using is:
------
#!/bin/bash

dev=/dev/test/test
mnt=/mnt/btrfs

#mkfs.btrfs -f $dev -b 1G > /dev/null
#mount $dev $mnt -o nospace_cache

umount $mnt &> /dev/null
umount $dev &> /dev/null

dmesg -C
mkfs.btrfs -f $dev -b 512M > /dev/null

mount $dev $mnt -o nospace_cache

xfs_io -f -c "falloc 8k 64m" $mnt/file1
xfs_io -f -c "pwrite 0 -b 4k 370M" $mnt/padding

sync
btrfs fi df $mnt --raw

xfs_io -c "pwrite 1m 16m" $mnt/file1
echo "nodatacow write finished" >> /dev/kmsg
xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
echo "reflink finished" >> /dev/kmsg
sync
echo "sync finished ret=$?" >> /dev/kmsg
umount $dev
------

As describe, the last write at 17821696 (17M - 4K) will fail due to ENOSPC.
But the sync succeeded without reporting any problem.

Thanks,
Qu

> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>>> I don't recall starting transactions when running dealloc, and failed
>>>>> to see where after a quick glance to cow_file_range()
>>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>>>> when starting writeback.
>>>>>
>>>>>>
>>>>>> [CAUSE]
>>>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>>>
>>>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>>>> extent is shared or not.
>>>>>>
>>>>>> So for above script we have:
>>>>>> - fallocate
>>>>>> - buffered write
>>>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>>>   At this timming, the extent is not shared, we can skip data
>>>>>>   reservation.
>>>>>
>>>>> But in the above example we don't fall to nocow mode when doing the
>>>>> buffered write, as there's plenty of data space available (1Gb -
>>>>> 24Kb).
>>>>> You need to update the example.
>>>> I have to admit that the core part is mostly based on the worst case
>>>> *assumption*.
>>>>
>>>> I'll try to make the case convincing by making it fail directly.
>>>
>>> Great, thanks.
>>>
>>>>
>>>>>
>>>>>
>>>>>> - reflink
>>>>>>   Now part of the large preallocated extent is shared.
>>>>>> - delalloc kicks in
>>>>>
>>>>> writeback kicks in
>>>>>
>>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>>>   to fall back to COW.
>>>>>>
>>>>>> [WORKAROUND]
>>>>>> The workaround is to ensure any buffered write in the related extents
>>>>>> (not the reflink source range) get flushed before reflink.
>>>>>
>>>>> not the reflink source range -> not just the reflink source range
>>>>>
>>>>>>
>>>>>> However it's pretty expensive to do a comprehensive check.
>>>>>> In the reproducer, the reflink source is just a part of a larger
>>>>>
>>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>>>> clear by looking at it that it doesn't trigger the nocow case).
>>>>>
>>>>>> preallocated extent, we need to flush all buffered write of that extent
>>>>>> before reflink.
>>>>>> Such backward search can be complex and we may not get much benefit from
>>>>>> it.
>>>>>>
>>>>>> So this patch will just try to flush the whole inode before reflink.
>>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> Reason for RFC:
>>>>>> Flushing an inode just because it's a reflink source is definitely
>>>>>> overkilling, but I don't have any better way to handle it.
>>>>>>
>>>>>> Any comment on this is welcomed.
>>>>>> ---
>>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>>>  1 file changed, 22 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>>>> --- a/fs/btrfs/ioctl.c
>>>>>> +++ b/fs/btrfs/ioctl.c
>>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>>>                         return ret;
>>>>>>         }
>>>>>>
>>>>>> +       /*
>>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>>>> +        *
>>>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>>>> +        * extent level share view. Any part of an extent is shared then the
>>>>>
>>>>> Any -> If any
>>>>>
>>>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>>>
>>>>> is -> is considered
>>>>>
>>>>>> +        * back to COW.
>>>>>
>>>>> I would add, something like:
>>>>>
>>>>> That is, btrfs' back references do not have a block level granularity,
>>>>> they work at the whole extent level.
>>>>>
>>>>>> +        *
>>>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>>>> +        * at delalloc time (transaction abort).
>>>>>
>>>>> I would omit the warning and transaction abort parts, that can change
>>>>> any time. And we have that information in the changelog, so it's not
>>>>> lost.
>>>>>
>>>>>> +        *
>>>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>>>> +        * by finding all extents in that range and flush the space referring
>>>>>> +        * all those extents.
>>>>>> +        * But that's too complex for such corner case.
>>>>>> +        */
>>>>>> +       filemap_flush(src->i_mapping);
>>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>>>> +               filemap_flush(src->i_mapping);
>>>>>
>>>>> So a few comments here:
>>>>>
>>>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>>>
>>>> Right.
>>>>
>>>>>
>>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>>>> where we do the source and target range flush and wait.
>>>>>
>>>>> Can you turn the reproducer into an fstests case?
>>>>
>>>> Sure.
>>>>
>>>> Thanks for the info and all the comment,
>>>> Qu
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> +
>>>>>>         /*
>>>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>>>          * source range to serialize with relocation.
>>>>>> --
>>>>>> 2.21.0
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]