From: ethanlien <ethanlien@synology.com>
To: Martin Raiber <martin@urbackup.org>
Cc: Chris Mason <clm@fb.com>,
	linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.cz>,
	linux-btrfs-owner@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
Date: Wed, 19 Dec 2018 18:33:11 +0800
Message-ID: <362642e5e618dbf17effc17e0cb7457f@synology.com>
In-Reply-To: <01020167bc7811f0-cb1970f5-3d51-49f9-a5bb-63ba1ea35eea-000000@eu-west-1.amazonses.com>

Martin Raiber wrote on 2018-12-17 22:00:

>>>> 
>>> I had lockups with this patch as well. If you put e.g. a loop device
>>> on top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a
>>> feedback loop causing delays. The task balancing dirty pages in
>>> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs.
>>> In my case it managed to cause a feedback loop where it queues more
>>> btrfs_finish_ordered_io work and gets stuck completely.
>>> 
>> 
>> The data writepage endio will queue a work item for
>> btrfs_finish_ordered_io() in a separate workqueue and clear the page's
>> writeback, so throttling in btrfs_finish_ordered_io() should not slow
>> down the flusher thread. One suspicious point is that while a caller
>> is waiting for a range of ordered_extents to complete, it will be
>> blocked until balance_dirty_pages_ratelimited() makes some progress,
>> since we finish ordered_extents in btrfs_finish_ordered_io().
>> Do you have call stack information for the stuck processes, or are
>> you using fsync/sync frequently? If that is the case, maybe we should
>> pull this out and balance dirty metadata pages somewhere else.
> 
> Yeah like,
> 
> [875317.071433] Call Trace:
> [875317.071438]  ? __schedule+0x306/0x7f0
> [875317.071442]  schedule+0x32/0x80
> [875317.071447]  btrfs_start_ordered_extent+0xed/0x120
> [875317.071450]  ? remove_wait_queue+0x60/0x60
> [875317.071454]  btrfs_wait_ordered_range+0xa0/0x100
> [875317.071457]  btrfs_sync_file+0x1d6/0x400
> [875317.071461]  ? do_fsync+0x38/0x60
> [875317.071463]  ? btrfs_fdatawrite_range+0x50/0x50
> [875317.071465]  do_fsync+0x38/0x60
> [875317.071468]  __x64_sys_fsync+0x10/0x20
> [875317.071470]  do_syscall_64+0x55/0x100
> [875317.071473]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> so I guess the problem is that calling balance_dirty_pages causes
> fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those
> fsyncs are deadlocked because they are called indirectly from
> btrfs_finish_ordered_io... It is an unusual setup, which is why I did
> not post it to the mailing list initially.

To me this does not look like a real deadlock. The fsync call involves
two steps: (1) flush the dirty data pages, and (2) update the
corresponding metadata to point to the flushed data. Since step 1
consumes dirty pages and step 2 produces more dirty pages, this patch
leaves step 1 unchanged and throttles step 2 in
btrfs_finish_ordered_io(), which seems reasonable for an OOM fix. The
problem is that if other processes keep writing new data, the fsync
call may have to wait a long time for its metadata update, even though
its own dirty data was flushed long ago.
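
For reference, the idea of the patch is roughly the following (a
simplified sketch, not the exact diff; the real function does the file
extent insertion, delayed ref work and cleanup before this point, and
the actual patch goes through a small helper):

/* fs/btrfs/inode.c -- simplified sketch of the idea in this patch */
static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
{
	struct inode *inode = ordered_extent->inode;
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);

	/*
	 * Step 2: insert file extent items, update the inode item, run
	 * delayed refs, ... -- all of this dirties btree pages.
	 */

	/*
	 * Throttle against the btree inode so dirty metadata pages
	 * cannot grow without bound (the OOM this patch addresses).
	 */
	balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);

	return 0;
}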

Back to the deadlock problem: what Chris found is a real deadlock, and
it can be fixed by adding a check for the free space inode, along the
lines of the sketch below.
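
Something like this (untested and only to illustrate the check, reusing
inode/fs_info from the sketch above; the exact condition may end up
looking different):

	/*
	 * Hypothetical sketch: don't throttle while finishing an ordered
	 * extent of the free space inode, to avoid the deadlock Chris hit.
	 */
	if (!btrfs_is_free_space_inode(BTRFS_I(inode)))
		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
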
For the fsync delay problem, it depends on whether we want to keep the
amount of dirty metadata pages within the dirty_bytes limit here. If we
don't, we can revert this patch and try flushing dirty metadata pages
somewhere else; that should be enough to fix the OOM. But if we do want
to keep dirty metadata pages within the dirty_bytes limit, we still
need some throttling here. Setting __GFP_WRITE on extent_buffer pages
is a bad fit, since most of the time we use them as a read cache to
speed up b-tree traversal. Adding PF_LESS_THROTTLE in
btrfs_finish_ordered_io() might help, since it creates a window where
new writes are blocked but the flushing and metadata updates keep
going, while still retaining some kind of throttling. But it needs more
testing to know the worst-case fsync delay.
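
If we go the PF_LESS_THROTTLE route, I'd expect it to look roughly like
this (untested sketch, borrowing the idiom the loop driver uses for its
worker thread):

static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
{
	struct inode *inode = ordered_extent->inode;
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
	/* Remember whether PF_LESS_THROTTLE was already set. */
	unsigned int pflags = current->flags & PF_LESS_THROTTLE;
	int ret = 0;

	current->flags |= PF_LESS_THROTTLE;

	/* ... insert file extent items, update the inode item, ... */

	/* Throttle dirty metadata, but with the less-throttled flag set. */
	balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);

	current_restore_flags(pflags, PF_LESS_THROTTLE);
	return ret;
}

current_restore_flags() only clears the flag if it was not already set,
so a caller that already runs with PF_LESS_THROTTLE keeps it.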

