linux-btrfs.vger.kernel.org archive mirror
From: Martin Raiber <martin@urbackup.org>
To: ethanlien <ethanlien@synology.com>
Cc: Chris Mason <clm@fb.com>,
	linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.cz>,
	linux-btrfs-owner@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
Date: Mon, 17 Dec 2018 14:00:44 +0000
Message-ID: <01020167bc7811f0-cb1970f5-3d51-49f9-a5bb-63ba1ea35eea-000000@eu-west-1.amazonses.com>
In-Reply-To: <d324045eb532b2fa525203c20c57bed1@synology.com>

On 14.12.2018 09:07 ethanlien wrote:
> Martin Raiber wrote on 2018-12-12 23:22:
>> On 12.12.2018 15:47 Chris Mason wrote:
>>> On 28 May 2018, at 1:48, Ethan Lien wrote:
>>>
>>> It took me a while to trigger, but this actually deadlocks ;)  More
>>> below.
>>>
>>>> [Problem description and how we fix it]
>>>> We should balance dirty metadata pages at the end of
>>>> btrfs_finish_ordered_io, since a small, unmergeable random write can
>>>> potentially produce dirty metadata which is multiple times larger than
>>>> the data itself. For example, a small, unmergeable 4KiB write may
>>>> produce:
>>>>
>>>>     16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
>>>>     16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
>>>>     16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>>>>
>>>> Although we do call balance dirty pages on the write side, in the
>>>> buffered write path most metadata is dirtied only after we reach the
>>>> dirty background limit (which so far counts only dirty data pages) and
>>>> wake up the flusher thread. If there are many small, unmergeable random
>>>> writes spread across a large btree, we'll see a burst of dirty pages
>>>> that exceeds the dirty_bytes limit after we wake up the flusher thread,
>>>> which is not what we expect. On our machine, this caused an
>>>> out-of-memory problem, since a page cannot be dropped while it is
>>>> marked dirty.
>>>>
>>>> Someone may worry that we may sleep in
>>>> btrfs_btree_balance_dirty_nodelay, but since we run
>>>> btrfs_finish_ordered_io in a separate worker, it will not stop the
>>>> flusher from consuming dirty pages. Also, since we use a different
>>>> worker for metadata writeback endio, sleeping in
>>>> btrfs_finish_ordered_io helps us throttle the size of dirty metadata
>>>> pages.
>>> In general, slowing down btrfs_finish_ordered_io isn't ideal because it
>>> adds latency to places we need to finish quickly.  Also,
>>> btrfs_finish_ordered_io is used by the free space cache.  Even though
>>> this happens from its own workqueue, it means completing free space
>>> cache writeback may end up waiting on balance_dirty_pages, something
>>> like this stack trace:
>>>
>>> [..]
>>>
>>> Eventually, we have every process in the system waiting on
>>> balance_dirty_pages(), and nobody is able to make progress on page
>>> writeback.
>>>
>> I had lockups with this patch as well. If you put e.g. a loop device on
>> top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
>> loop causing delays. The task balancing dirty pages in
>> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
>> my case it managed to cause a feedback loop where it queues other
>> btrfs_finish_ordered_io and gets stuck completely.
>>
>
> The data writepage endio will queue work for
> btrfs_finish_ordered_io() in a separate workqueue and clear the page's
> writeback, so throttling in btrfs_finish_ordered_io() should not slow
> down the flusher thread. One suspicious point is that while a caller is
> waiting for a range of ordered_extents to complete, it will be
> blocked until balance_dirty_pages_ratelimited() makes some
> progress, since we finish ordered_extents in
> btrfs_finish_ordered_io().
> Do you have call stack information for the stuck processes, or are you
> using fsync/sync frequently? If this is the case, maybe we should pull
> this out and try to balance dirty metadata pages somewhere else.

Yes, like this:

[875317.071433] Call Trace:
[875317.071438]  ? __schedule+0x306/0x7f0
[875317.071442]  schedule+0x32/0x80
[875317.071447]  btrfs_start_ordered_extent+0xed/0x120
[875317.071450]  ? remove_wait_queue+0x60/0x60
[875317.071454]  btrfs_wait_ordered_range+0xa0/0x100
[875317.071457]  btrfs_sync_file+0x1d6/0x400
[875317.071461]  ? do_fsync+0x38/0x60
[875317.071463]  ? btrfs_fdatawrite_range+0x50/0x50
[875317.071465]  do_fsync+0x38/0x60
[875317.071468]  __x64_sys_fsync+0x10/0x20
[875317.071470]  do_syscall_64+0x55/0x100
[875317.071473]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

so I guess the problem is that calling balance_dirty_pages causes
fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those
fsyncs are deadlocked because they are called indirectly from
btrfs_finish_ordered_io... It is an unusual setup, which is why I did not
post it to the mailing list initially.
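
As an illustration of the guess above, the chain of waits can be written
down as a small wait-for graph and checked for a cycle. This is only a
sketch in Python under stated assumptions: the node labels and the
find_cycle helper are hypothetical names made up for this example, and the
edges encode the hypothesis described above (fsync waits on ordered
extents, which are finished in btrfs_finish_ordered_io, which would sleep
in balance_dirty_pages, whose progress depends on writeback through
loop+fuse, which issues fsyncs again). It does not model real kernel
behavior.

```python
# Hypothetical wait-for graph for the suspected hang. An edge A -> B means
# "A cannot make progress until B does". All labels are illustrative.
WAITS_ON = {
    "fsync (btrfs_sync_file)": ["ordered extent completion"],
    "ordered extent completion": ["btrfs_finish_ordered_io worker"],
    "btrfs_finish_ordered_io worker": ["balance_dirty_pages"],
    "balance_dirty_pages": ["writeback via loop+fuse"],
    "writeback via loop+fuse": ["fsync (btrfs_sync_file)"],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None."""
    visiting = []   # current depth-first path
    done = set()    # nodes fully explored without finding a cycle

    def visit(node):
        if node in visiting:
            # Reached a node already on the current path: cycle found.
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    cycle = find_cycle(WAITS_ON)
    print(" -> ".join(cycle) if cycle else "no cycle")
```

If the edge from "balance_dirty_pages" to "writeback via loop+fuse" is
removed from the graph, the same check reports no cycle. That matches the
idea of not throttling inside btrfs_finish_ordered_io.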




Thread overview: 13+ messages
2018-05-28  5:48 [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io Ethan Lien
2018-05-29 15:33 ` David Sterba
2018-12-12 14:47 ` Chris Mason
2018-12-12 15:22   ` Martin Raiber
2018-12-12 15:36     ` David Sterba
2018-12-12 17:55       ` Chris Mason
2018-12-14  8:07     ` ethanlien
2018-12-17 14:00       ` Martin Raiber [this message]
2018-12-19 10:33         ` ethanlien
2018-12-19 14:22           ` Chris Mason
2018-12-13  8:38   ` ethanlien
2019-01-04 15:59     ` David Sterba
2019-01-09 10:07       ` ethanlien
