All of lore.kernel.org
 help / color / mirror / Atom feed
From: ethanlien <ethanlien@synology.com>
To: Nikolay Borisov <nborisov@suse.com>
Cc: linux-btrfs@vger.kernel.org, linux-btrfs-owner@vger.kernel.org
Subject: Re: [PATCH] btrfs: use tagged writepage to mitigate livelock of snapshot
Date: Fri, 02 Nov 2018 15:00:39 +0800	[thread overview]
Message-ID: <db084399bf020eae5d334535e1425d6e@synology.com> (raw)
In-Reply-To: <857dc4ea-245b-623e-a734-b247d5b54149@suse.com>

Nikolay Borisov 於 2018-11-01 19:57 寫到:
> On 1.11.18 г. 8:49 ч., Ethan Lien wrote:
>> Snapshot is expected to be fast. But if there are writers steadily
>> create dirty pages in our subvolume, the snapshot may take a very long
>> time to complete. To fix the problem, we use tagged writepage for 
>> snapshot
>> flusher as we do in the generic write_cache_pages(), so we can ommit 
>> pages
>> dirtied after the snapshot command.
>> 
>> We do a simple snapshot speed test on a Intel D-1531 box:
>> 
>> fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
>> --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
>> --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
>> time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
>> 
>> original: 1m58sec
>> patched:  6.54sec
>> 
>> This is the best case for this patch since for a sequential write 
>> case,
>> we omit nearly all pages dirtied after the snapshot command.
>> 
>> For a multi writers, random write test:
>> 
>> fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
>> --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
>> --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
>> time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
>> 
>> original: 15.83sec
>> patched:  10.35sec
>> 
>> The improvement is less compared with the sequential write case, since
>> we omit only half of the pages dirtied after snapshot command.
>> 
>> Signed-off-by: Ethan Lien <ethanlien@synology.com>
>> ---
>>  fs/btrfs/btrfs_inode.h |  1 +
>>  fs/btrfs/ctree.h       |  2 +-
>>  fs/btrfs/extent_io.c   | 16 ++++++++++++++--
>>  fs/btrfs/inode.c       | 10 ++++++----
>>  fs/btrfs/ioctl.c       |  2 +-
>>  5 files changed, 23 insertions(+), 8 deletions(-)
>> 
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index 1343ac57b438..4182bfbb56be 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -29,6 +29,7 @@ enum {
>>  	BTRFS_INODE_IN_DELALLOC_LIST,
>>  	BTRFS_INODE_READDIO_NEED_LOCK,
>>  	BTRFS_INODE_HAS_PROPS,
>> +	BTRFS_INODE_TAGGED_FLUSH,
>>  };
>> 
>>  /* in memory btrfs inode */
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 2cddfe7806a4..82682da5a40d 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3155,7 +3155,7 @@ int btrfs_truncate_inode_items(struct 
>> btrfs_trans_handle *trans,
>>  			       struct inode *inode, u64 new_size,
>>  			       u32 min_type);
>> 
>> -int btrfs_start_delalloc_inodes(struct btrfs_root *root);
>> +int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
>>  int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int 
>> nr);
>>  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 
>> end,
>>  			      unsigned int extra_bits,
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 4dd6faab02bb..c21d8a0e010a 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3928,12 +3928,24 @@ static int extent_write_cache_pages(struct 
>> address_space *mapping,
>>  			range_whole = 1;
>>  		scanned = 1;
>>  	}
>> -	if (wbc->sync_mode == WB_SYNC_ALL)
>> +
>> +	/*
>> +	 * We don't care if we are the one who set BTRFS_INODE_TAGGED_FLUSH 
>> in
>> +	 * start_delalloc_inodes(). We do the tagged writepage as long as we 
>> are
>> +	 * the first one who do the filemap_flush() on this inode.
>> +	 */
>> +	if (range_whole && wbc->nr_to_write == LONG_MAX &&
>> +			wbc->sync_mode == WB_SYNC_NONE &&
>> +			test_and_clear_bit(BTRFS_INODE_TAGGED_FLUSH,
>> +				&BTRFS_I(inode)->runtime_flags))
> Actually this check can be simplified to:
> 
> range_whole && test_and_clear_bit. filemap_flush triggers range_whole =
> 1 and then you care about TAGGED_FLUSH (or w/e it's going to be named)
> to be set. The nr_to_write && syncmode just make it a tad more 
> difficult
> to reason about the code.
> 
> 

Yes, the sync_mode check is not necessary. For nr_to_write, since 
pagevec
contain only limited amount of pages in one loop, nr_to_write 
essentially control
whether we scan all of the dirty pages or not. Since there is a window 
between
start_delalloc_inodes() and extent_write_cache_pages(), another flusher 
with
range_whole==1 but nr_to_write<LONG_MAX may come in and clear the flush
bit before the filemap_flush().

>> +		wbc->tagged_writepages = 1;
>> +
>> +	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>>  		tag = PAGECACHE_TAG_TOWRITE;
>>  	else
>>  		tag = PAGECACHE_TAG_DIRTY;
>>  retry:
>> -	if (wbc->sync_mode == WB_SYNC_ALL)
>> +	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>>  		tag_pages_for_writeback(mapping, index, end);
>>  	done_index = index;
>>  	while (!done && !nr_to_write_done && (index <= end) &&
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 3ea5339603cf..3df3cbbe91c5 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9975,7 +9975,7 @@ static struct btrfs_delalloc_work 
>> *btrfs_alloc_delalloc_work(struct inode *inode
>>   * some fairly slow code that needs optimization. This walks the list
>>   * of all the inodes with pending delalloc and forces them to disk.
>>   */
>> -static int start_delalloc_inodes(struct btrfs_root *root, int nr)
>> +static int start_delalloc_inodes(struct btrfs_root *root, int nr, int 
>> snapshot)
>>  {
>>  	struct btrfs_inode *binode;
>>  	struct inode *inode;
>> @@ -10003,6 +10003,8 @@ static int start_delalloc_inodes(struct 
>> btrfs_root *root, int nr)
>>  		}
>>  		spin_unlock(&root->delalloc_lock);
>> 
>> +		if (snapshot)
>> +			set_bit(BTRFS_INODE_TAGGED_FLUSH, &binode->runtime_flags);
>>  		work = btrfs_alloc_delalloc_work(inode);
>>  		if (!work) {
>>  			iput(inode);
>> @@ -10036,7 +10038,7 @@ static int start_delalloc_inodes(struct 
>> btrfs_root *root, int nr)
>>  	return ret;
>>  }
>> 
>> -int btrfs_start_delalloc_inodes(struct btrfs_root *root)
>> +int btrfs_start_delalloc_snapshot(struct btrfs_root *root)
>>  {
>>  	struct btrfs_fs_info *fs_info = root->fs_info;
>>  	int ret;
>> @@ -10044,7 +10046,7 @@ int btrfs_start_delalloc_inodes(struct 
>> btrfs_root *root)
>>  	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state))
>>  		return -EROFS;
>> 
>> -	ret = start_delalloc_inodes(root, -1);
>> +	ret = start_delalloc_inodes(root, -1, 1);
>>  	if (ret > 0)
>>  		ret = 0;
>>  	return ret;
>> @@ -10073,7 +10075,7 @@ int btrfs_start_delalloc_roots(struct 
>> btrfs_fs_info *fs_info, int nr)
>>  			       &fs_info->delalloc_roots);
>>  		spin_unlock(&fs_info->delalloc_root_lock);
>> 
>> -		ret = start_delalloc_inodes(root, nr);
>> +		ret = start_delalloc_inodes(root, nr, 0);
>>  		btrfs_put_fs_root(root);
>>  		if (ret < 0)
>>  			goto out;
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index d60b6caf09e8..d1293b6c31f6 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -775,7 +775,7 @@ static int create_snapshot(struct btrfs_root 
>> *root, struct inode *dir,
>>  	wait_event(root->subv_writers->wait,
>>  		   percpu_counter_sum(&root->subv_writers->counter) == 0);
>> 
>> -	ret = btrfs_start_delalloc_inodes(root);
>> +	ret = btrfs_start_delalloc_snapshot(root);
>>  	if (ret)
>>  		goto dec_and_free;
>> 
>> 


  reply	other threads:[~2018-11-02  7:00 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-01  6:49 [PATCH] btrfs: use tagged writepage to mitigate livelock of snapshot Ethan Lien
2018-11-01  8:59 ` Nikolay Borisov
2018-11-01  9:56   ` ethanlien
2018-11-01 10:01     ` Nikolay Borisov
2018-11-01 10:21       ` ethanlien
2018-11-01 13:24         ` Chris Mason
2018-11-01 11:57 ` Nikolay Borisov
2018-11-02  7:00   ` ethanlien [this message]
2018-11-01 18:02 ` David Sterba
2018-11-02  7:13   ` ethanlien
2018-11-02  8:43     ` Nikolay Borisov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=db084399bf020eae5d334535e1425d6e@synology.com \
    --to=ethanlien@synology.com \
    --cc=linux-btrfs-owner@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=nborisov@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.