Re: [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check

From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 2/2] btrfs: defrag: don't use merged extent map for their generation check
Date: Thu, 10 Feb 2022 08:40:22 +0800	[thread overview]
Message-ID: <6b0f23dd-1808-609c-3f97-17f58a4f4688@suse.com> (raw)
In-Reply-To: <YgPhzFA3230j2mlf@debian9.Home>

On 2022/2/9 23:46, Filipe Manana wrote:
> On Tue, Feb 08, 2022 at 02:36:31PM +0800, Qu Wenruo wrote:
>> For extent maps, if they are not compressed extents and are adjacent by
>> logical addresses and file offsets, they can be merged into one larger
>> extent map.
>>
>> Such merged extent map will have the higher generation of all the
>> original ones.
>>
>> This behavior won't affect fsync code, as only extents read from disks
>> can be merged.
> 
> That is not true. An extent created by a write can be merged after a
> fsync runs and logs the extent. So yes, extent maps created by writes
> can be merged, but only after an fsync.

Or, if there is no fsync(), the em will never be mergeable, unless 
evicted and re-read.

In setup_extent_mapping() if @modified is true, we call 
"list_move(&em->list, &tree->modified_extents)".

So it means such em will not be merged, as try_merge_map() needs the 
em::list to be empty.

Then check when em::list is deleted, there are two paths to remove em::list:

- remove_extent_mapping()

- btrfs_log_inode()/btrfs_log_changed_extents()

This explains why under my test cases which only use sync(), those 
extent maps never get merged, unless removed and read from tree again.

I'm wondering now if this is expected behavior?

Thanks,
Qu

> 
>>
>> But this brings a problem for autodefrag, as it relies on accurate
>> extent_map::generation to determine if one extent should be defragged.
>>
>> For merged extent maps, their higher generation can mark some older
>> extents to be defragged while the original extent map doesn't meet the
>> minimal generation threshold.
>>
>> Thus this will cause extra IO.
>>
>> So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to
>> indicate if the extent map is merged from one or more ems.
>>
>> And for autodefrag, if we find a merged extent map, and its generation
>> meets the generation requirement, we just don't use this one, and go
>> back to defrag_get_extent() to read em from disk.
> 
> Instead of saying from disk, I would say from the subvolume btree. That
> may or may not require reading a leaf, and any nodes in the path from the
> root to the leaf, from disk.
> 
>>
>> This could cause more read IO, but should result less defrag data write,
>> so in the long run it should be a win for autodefrag.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_map.c |  2 ++
>>   fs/btrfs/extent_map.h |  8 ++++++++
>>   fs/btrfs/ioctl.c      | 14 ++++++++++++++
>>   3 files changed, 24 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
>> index 5a36add21305..c28ceddefae4 100644
>> --- a/fs/btrfs/extent_map.c
>> +++ b/fs/btrfs/extent_map.c
>> @@ -261,6 +261,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
>>   			em->mod_len = (em->mod_len + em->mod_start) - merge->mod_start;
>>   			em->mod_start = merge->mod_start;
>>   			em->generation = max(em->generation, merge->generation);
>> +			set_bit(EXTENT_FLAG_MERGED, &em->flags);
>>   
>>   			rb_erase_cached(&merge->rb_node, &tree->map);
>>   			RB_CLEAR_NODE(&merge->rb_node);
>> @@ -278,6 +279,7 @@ static void try_merge_map(struct extent_map_tree *tree, struct extent_map *em)
>>   		RB_CLEAR_NODE(&merge->rb_node);
>>   		em->mod_len = (merge->mod_start + merge->mod_len) - em->mod_start;
>>   		em->generation = max(em->generation, merge->generation);
>> +		set_bit(EXTENT_FLAG_MERGED, &em->flags);
>>   		free_extent_map(merge);
>>   	}
>>   }
>> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
>> index 8e217337dff9..d2fa32ffe304 100644
>> --- a/fs/btrfs/extent_map.h
>> +++ b/fs/btrfs/extent_map.h
>> @@ -25,6 +25,8 @@ enum {
>>   	EXTENT_FLAG_FILLING,
>>   	/* filesystem extent mapping type */
>>   	EXTENT_FLAG_FS_MAPPING,
>> +	/* This em is merged from two or more physically adjacent ems */
>> +	EXTENT_FLAG_MERGED,
>>   };
>>   
>>   struct extent_map {
>> @@ -40,6 +42,12 @@ struct extent_map {
>>   	u64 ram_bytes;
>>   	u64 block_start;
>>   	u64 block_len;
>> +
>> +	/*
>> +	 * Generation of the extent map, for merged em it's the highest
>> +	 * generation of all merged ems.
>> +	 * For non-merged extents, it's from btrfs_file_extent_item::generation.
> 
> Missing the (u64)-1 special case, a temporary value set when writeback starts
> and changed when the ordered extent completes.
> 
>> +	 */
>>   	u64 generation;
>>   	unsigned long flags;
>>   	/* Used for chunk mappings, flag EXTENT_FLAG_FS_MAPPING must be set */
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index fb4c25e5ff5f..3a5ada561298 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1169,6 +1169,20 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
>>   	em = lookup_extent_mapping(em_tree, start, sectorsize);
>>   	read_unlock(&em_tree->lock);
>>   
>> +	/*
>> +	 * We can get a merged extent, in that case, we need to re-search
>> +	 * tree to get the original em for defrag.
>> +	 *
>> +	 * If @newer_than is 0 or em::geneartion < newer_than, we can trust
> 
> em::geneartion -> em::generation
> 
>> +	 * this em, as either we don't care about the geneartion, or the
> 
> geneartion, -> generation
> 
> The rest looks fine.
> Thanks.
> 
>> +	 * merged extent map will be rejected anyway.
>> +	 */
>> +	if (em && test_bit(EXTENT_FLAG_MERGED, &em->flags) &&
>> +	    newer_than && em->generation >= newer_than) {
>> +		free_extent_map(em);
>> +		em = NULL;
>> +	}
>> +
>>   	if (!em) {
>>   		struct extent_state *cached = NULL;
>>   		u64 end = start + sectorsize - 1;
>> -- 
>> 2.35.0
>>
>