All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation
Date: Sat, 4 May 2019 08:32:11 +0800	[thread overview]
Message-ID: <448c9a50-98b0-15aa-cbde-81b294bd74e4@gmx.com> (raw)
In-Reply-To: <20190503215622.GC20359@hungrycats.org>


[-- Attachment #1.1: Type: text/plain, Size: 5206 bytes --]



On 2019/5/4 上午5:56, Zygo Blaxell wrote:
> On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
>> [BUG]
>> The following command can lead to unexpected data COW:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 1G > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>>   umount $dev
>>
>> The result extent will be
>>
>> 	item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>> 		generation 6 type 2 (prealloc)
>> 		prealloc data disk byte 13631488 nr 28672
>> 	item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>> 		generation 6 type 1 (regular)
>> 		extent data disk byte 13660160 nr 12288 <<< COW
>> 	item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>> 		generation 6 type 2 (prealloc)
>> 		prealloc data disk byte 13631488 nr 28672
>>
>> Currently we always reserve space even for NOCOW buffered write, thus
>> under most case it shouldn't cause anything wrong even we fall back to
>> COW.
>>
>> However when we're out of data space, we fall back to skip data space if
>> we can do NOCOW write.
>>
>> If such behavior happens under that case, we could hit the following
>> problems:
>> - data space bytes_may_use underflow
>>   This will cause kernel warning.
>>
>> - ENOSPC at delalloc time
>>   This will lead to transaction abort and fs forced to RO.
>>
>> [CAUSE]
>> This is due to the fact that btrfs can only do extent level share check.
>>
>> Btrfs can only tell if an extent is shared, no matter if only part of the
>> extent is shared or not.
>>
>> So for above script we have:
>> - fallocate
>> - buffered write
>>   If we don't have enough data space, we fall back to NOCOW check.
>>   At this timming, the extent is not shared, we can skip data
>>   reservation.
>> - reflink
>>   Now part of the large preallocated extent is shared.
>> - delalloc kicks in
>>   For the NOCOW range, as the preallocated extent is shared, we need
>>   to fall back to COW.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not the reflink source range) get flushed before reflink.
>>
>> However it's pretty expensive to do a comprehensive check.
>> In the reproducer, the reflink source is just a part of a larger
>> preallocated extent, we need to flush all buffered write of that extent
>> before reflink.
>> Such backward search can be complex and we may not get much benefit from
>> it.
>>
>> So this patch will just try to flush the whole inode before reflink.
> 
> Does that mean that if a large file is being written and deduped
> simultaneously, that the dedupes would now trigger flushes over the
> entire file?  That seems like it could be slow.

Yes, also my reason for RFC.

But it shouldn't be that heavy, as after the first dedupe/reflink, most
IO should be flushed back, later dedupe has much less work to do.


> 
> e.g. if the file is a big VM image file and it is used src and for dedupe
> (which is quite common in VM image files), we'd be hammering the disk
> with writes similar to hitting it with fsync() in a tight loop?

The original behavior also flush the target and source range, so we're
not completely creating some new overhead.

Thanks,
Qu

> 
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>> Flushing an inode just because it's a reflink source is definitely
>> overkilling, but I don't have any better way to handle it.
>>
>> Any comment on this is welcomed.
>> ---
>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7755b503b348..8caa0edb6fbf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>>  			return ret;
>>  	}
>>  
>> +	/*
>> +	 * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>> +	 *
>> +	 * Due to the limit of btrfs extent tree design, we can only have
>> +	 * extent level share view. Any part of an extent is shared then the
>> +	 * whole extent is shared and any write into that extent needs to fall
>> +	 * back to COW.
>> +	 *
>> +	 * NOCOW buffered write without data space reserved could to lead to
>> +	 * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>> +	 * at delalloc time (transaction abort).
>> +	 *
>> +	 * Here we take a shortcut by flush the whole inode. We could do better
>> +	 * by finding all extents in that range and flush the space referring
>> +	 * all those extents.
>> +	 * But that's too complex for such corner case.
>> +	 */
>> +	filemap_flush(src->i_mapping);
>> +	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>> +		     &BTRFS_I(src)->runtime_flags))
>> +		filemap_flush(src->i_mapping);
>> +
>>  	/*
>>  	 * Lock destination range to serialize with concurrent readpages() and
>>  	 * source range to serialize with relocation.
>> -- 
>> 2.21.0
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2019-05-04  0:32 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-03  1:08 [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation Qu Wenruo
2019-05-03  9:21 ` Filipe Manana
2019-05-03 10:18   ` Qu Wenruo
2019-05-03 10:45     ` Filipe Manana
2019-05-04  8:29       ` Nikolay Borisov
2019-05-06  2:04       ` Qu Wenruo
2019-05-07  7:49         ` Nikolay Borisov
2019-05-07  8:56         ` Filipe Manana
2019-05-07 11:13           ` Qu Wenruo
2019-05-07 11:36             ` Filipe Manana
2019-05-03 21:56 ` Zygo Blaxell
2019-05-04  0:32   ` Qu Wenruo [this message]
2019-05-05 15:07     ` Zygo Blaxell
2019-05-05 16:24       ` Filipe Manana
2019-05-06  0:06         ` Qu Wenruo
2019-05-07 17:36 ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=448c9a50-98b0-15aa-cbde-81b294bd74e4@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.