[snip]
>>
>> For data writeback, it should only cause sync related failure.
> 
> Ok, so please remove the transaction abort comments for next iteration.
> By sync related failure, you mean running dealloc fails with ENOSPC,
> since when it tries to reserve a data extent it fails as it can't find
> any free extent.

Well, btrfs has some hidden way to fix such problem by itself already.

At buffered write time, we have the following call chain:
btrfs_buffered_write()
|- btrfs_check_data_free_space()
   |- btrfs_alloc_data_chunk_ondemand()
      |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
      |- do_chunk_alloc() /* we have no data space available */
      |- if (ret < 0 && ret == -ENOSPC)
      |-     goto commit_trans;  /* try to free some space */
      |- commit_trans:
      |-     need_commit--;
      |-     if (need_commit > 0) {
      |-         btrfs_start_delalloc_roots();
      |-         btrfs_wait_ordered_roots();
      |-     }

This means, as long as we hit ENOSPC for data space, we will start write
back, thus NODATACOW data will still hit disk as NODATACOW.

Such hidden behavior along with always-reserve-data-space solves the
problem pretty well.
We either:
- reserve data space
  Then no matter how it ends, we're OK, although it may end as CoW.

- Failed to reserve data space
  Writeback will be triggered anyway, no way to screw things around.

Thus this workaround has nothing to fix, but only make certain NODATACOW
reach disk as NODATACOW.

It makes some NODATACOW behaves more correctly but won't fix any obvious
bug.

My personal take is to fix any strange behavior even it won't cause any
problem, but the full inode writeback can be performance heavy.

So my question is, do we really need this anyway?

Thanks,
Qu

> 
>>
>>> I don't recall starting transactions when running dealloc, and failed
>>> to see where after a quick glance to cow_file_range()
>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
>>> when starting writeback.
>>>
>>>>
>>>> [CAUSE]
>>>> This is due to the fact that btrfs can only do extent level share check.
>>>>
>>>> Btrfs can only tell if an extent is shared, no matter if only part of the
>>>> extent is shared or not.
>>>>
>>>> So for above script we have:
>>>> - fallocate
>>>> - buffered write
>>>>   If we don't have enough data space, we fall back to NOCOW check.
>>>>   At this timming, the extent is not shared, we can skip data
>>>>   reservation.
>>>
>>> But in the above example we don't fall to nocow mode when doing the
>>> buffered write, as there's plenty of data space available (1Gb -
>>> 24Kb).
>>> You need to update the example.
>> I have to admit that the core part is mostly based on the worst case
>> *assumption*.
>>
>> I'll try to make the case convincing by making it fail directly.
> 
> Great, thanks.
> 
>>
>>>
>>>
>>>> - reflink
>>>>   Now part of the large preallocated extent is shared.
>>>> - delalloc kicks in
>>>
>>> writeback kicks in
>>>
>>>>   For the NOCOW range, as the preallocated extent is shared, we need
>>>>   to fall back to COW.
>>>>
>>>> [WORKAROUND]
>>>> The workaround is to ensure any buffered write in the related extents
>>>> (not the reflink source range) get flushed before reflink.
>>>
>>> not the reflink source range -> not just the reflink source range
>>>
>>>>
>>>> However it's pretty expensive to do a comprehensive check.
>>>> In the reproducer, the reflink source is just a part of a larger
>>>
>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
>>> clear by looking at it that it doesn't trigger the nocow case).
>>>
>>>> preallocated extent, we need to flush all buffered write of that extent
>>>> before reflink.
>>>> Such backward search can be complex and we may not get much benefit from
>>>> it.
>>>>
>>>> So this patch will just try to flush the whole inode before reflink.
>>>
>>>
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> Reason for RFC:
>>>> Flushing an inode just because it's a reflink source is definitely
>>>> overkilling, but I don't have any better way to handle it.
>>>>
>>>> Any comment on this is welcomed.
>>>> ---
>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>>>  1 file changed, 22 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>> index 7755b503b348..8caa0edb6fbf 100644
>>>> --- a/fs/btrfs/ioctl.c
>>>> +++ b/fs/btrfs/ioctl.c
>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>>>                         return ret;
>>>>         }
>>>>
>>>> +       /*
>>>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>>>> +        *
>>>> +        * Due to the limit of btrfs extent tree design, we can only have
>>>> +        * extent level share view. Any part of an extent is shared then the
>>>
>>> Any -> If any
>>>
>>>> +        * whole extent is shared and any write into that extent needs to fall
>>>
>>> is -> is considered
>>>
>>>> +        * back to COW.
>>>
>>> I would add, something like:
>>>
>>> That is, btrfs' back references do not have a block level granularity,
>>> they work at the whole extent level.
>>>
>>>> +        *
>>>> +        * NOCOW buffered write without data space reserved could to lead to
>>>> +        * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>>>> +        * at delalloc time (transaction abort).
>>>
>>> I would omit the warning and transaction abort parts, that can change
>>> any time. And we have that information in the changelog, so it's not
>>> lost.
>>>
>>>> +        *
>>>> +        * Here we take a shortcut by flush the whole inode. We could do better
>>>> +        * by finding all extents in that range and flush the space referring
>>>> +        * all those extents.
>>>> +        * But that's too complex for such corner case.
>>>> +        */
>>>> +       filemap_flush(src->i_mapping);
>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>>>> +                    &BTRFS_I(src)->runtime_flags))
>>>> +               filemap_flush(src->i_mapping);
>>>
>>> So a few comments here:
>>>
>>> - why just in the clone part? The dedupe side has the same problem, doesn't it?
>>
>> Right.
>>
>>>
>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
>>> where we do the source and target range flush and wait.
>>>
>>> Can you turn the reproducer into an fstests case?
>>
>> Sure.
>>
>> Thanks for the info and all the comment,
>> Qu
>>
>>>
>>> Thanks.
>>>
>>>> +
>>>>         /*
>>>>          * Lock destination range to serialize with concurrent readpages() and
>>>>          * source range to serialize with relocation.
>>>> --
>>>> 2.21.0
>>>>
>>>
>>>
>>
> 
>