From: David Sterba <dsterba@suse.cz>
To: Ethan Lien <ethanlien@synology.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
Date: Tue, 29 May 2018 17:33:27 +0200
Message-ID: <20180529153326.GB4325@twin.jikos.cz>
In-Reply-To: <20180528054821.9092-1-ethanlien@synology.com>
On Mon, May 28, 2018 at 01:48:20PM +0800, Ethan Lien wrote:
> [Problem description and how we fix it]
> We should balance dirty metadata pages at the end of
> btrfs_finish_ordered_io, since a small, unmergeable random write can
> potentially produce dirty metadata which is multiple times larger than
> the data itself. For example, a small, unmergeable 4KiB write may
> produce:
>
> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>
> Although we do call balance dirty pages on the write side, in the
> buffered write path most metadata is dirtied only after we reach the
> dirty background limit (which so far counts only dirty data pages) and
> wake up the flusher thread. If there are many small, unmergeable random
> writes spread across a large btree, we'll see a burst of dirty pages
> that exceeds the dirty_bytes limit after we wake up the flusher thread,
> which is not what we expect. On our machine, this caused an
> out-of-memory problem, since a page cannot be dropped while it is
> marked dirty.
>
> One may worry that we may sleep in btrfs_btree_balance_dirty_nodelay,
> but since btrfs_finish_ordered_io runs in a separate worker, it will not
> stop the flusher from consuming dirty pages. Also, because we use a
> different worker for metadata writeback endio, sleeping in
> btrfs_finish_ordered_io helps us throttle the size of dirty metadata
> pages.
>
> [Reproduce steps]
> To reproduce the problem, we need to issue 4KiB writes randomly spread
> across a large btree. On our machine with 2GiB of RAM:
> 1) Create 4 subvolumes.
> 2) Run fio on each subvolume:
>
> [global]
> direct=0
> rw=randwrite
> ioengine=libaio
> bs=4k
> iodepth=16
> numjobs=1
> group_reporting
> size=128G
> runtime=1800
> norandommap
> time_based
> randrepeat=0
>
> 3) Take a snapshot of each subvolume and repeat fio on the existing files.
> 4) Repeat step (3) until we get large btrees.
> In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
> metadata in each subvolume tree and 12GiB of metadata in the extent tree.
> 5) Stop all fio jobs, take snapshots again, and wait until all delayed
> work is completed.
> 6) Start all fio jobs again. A few seconds later we hit OOM when the
> flusher starts to work.
>
> It can be reproduced even when using nocow writes.
>
> Signed-off-by: Ethan Lien <ethanlien@synology.com>
Added to misc-next, thanks.
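
For a rough sense of the amplification described in the quoted commit
message, the arithmetic can be sketched as below. The sizes (4KiB write,
16KiB leaves and nodes, three trees) come from the message; the function
and variable names are made up for illustration, and the figures are the
message's "may produce" example for one unmergeable write, not measurements.

```python
# Back-of-the-envelope sketch of the metadata amplification in the
# quoted example. All names here are hypothetical; only the sizes
# come from the commit message.
KIB = 1024
DATA_WRITE = 4 * KIB   # one small, unmergeable write
NODE_SIZE = 16 * KIB   # size of one btree leaf or node in the example
TREES = ("subvolume", "checksum", "extent")


def dirty_metadata_bytes(with_parent_nodes: bool) -> int:
    """Dirty metadata for one write: one leaf per tree, optionally
    plus one parent node per tree (the "possibly" case)."""
    blocks_per_tree = 2 if with_parent_nodes else 1
    return NODE_SIZE * blocks_per_tree * len(TREES)


leaves_only = dirty_metadata_bytes(False)
leaves_and_nodes = dirty_metadata_bytes(True)

print(f"leaves only:      {leaves_only // KIB} KiB "
      f"({leaves_only // DATA_WRITE}x the data)")
print(f"leaves and nodes: {leaves_and_nodes // KIB} KiB "
      f"({leaves_and_nodes // DATA_WRITE}x the data)")
```

Under these assumptions a single 4KiB write dirties between 48KiB and
96KiB of metadata, that is, 12 to 24 times the size of the data.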
Thread overview: 13+ messages
2018-05-28 5:48 [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io Ethan Lien
2018-05-29 15:33 ` David Sterba [this message]
2018-12-12 14:47 ` Chris Mason
2018-12-12 15:22 ` Martin Raiber
2018-12-12 15:36 ` David Sterba
2018-12-12 17:55 ` Chris Mason
2018-12-14 8:07 ` ethanlien
2018-12-17 14:00 ` Martin Raiber
2018-12-19 10:33 ` ethanlien
2018-12-19 14:22 ` Chris Mason
2018-12-13 8:38 ` ethanlien
2019-01-04 15:59 ` David Sterba
2019-01-09 10:07 ` ethanlien