* [PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
From: Ethan Lien @ 2018-04-27  7:05 UTC
  To: linux-btrfs; +Cc: Ethan Lien

We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata that is many times larger than the
data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
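
In the worst case of this example, a single 4KiB write may thus dirty
up to 3 x 16KiB = 48KiB of leaves plus another 48KiB of nodes, i.e. up
to 24 times the size of the data actually written.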

We do balance dirty pages on the write side, but in the buffered write
path most metadata is dirtied only after the accounted dirty pages -
almost all of them data pages at that point - reach the dirty
background limit and wake up the flusher thread. If many small,
unmergeable random writes are spread across a large btree, we will see
a burst of dirty pages exceeding the dirty_bytes limit after the
flusher thread wakes up - which is not what we expect. On our machine
this caused an out-of-memory problem, since a page cannot be dropped
while it is marked dirty.

One may worry that we could sleep in btrfs_btree_balance_dirty(), but
since btrfs_finish_ordered_io runs in a separate worker, it will not
stop the flusher from consuming dirty pages. Also, since metadata
writeback endio uses a different worker, sleeping in
btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
pages.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..d09433ab126e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3151,6 +3151,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	/* once for the tree */
 	btrfs_put_ordered_extent(ordered_extent);
 
+	btrfs_btree_balance_dirty(fs_info);
+
 	return ret;
 }
 
-- 
2.17.0



* Re: [PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
From: David Sterba @ 2018-05-22 16:28 UTC
  To: Ethan Lien; +Cc: linux-btrfs

On Fri, Apr 27, 2018 at 03:05:24PM +0800, Ethan Lien wrote:
> We should balance dirty metadata pages at the end of
> btrfs_finish_ordered_io, since a small, unmergeable random write can
> potentially produce dirty metadata that is many times larger than the
> data itself. For example, a small, unmergeable 4KiB write may
> produce:
> 
>     16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
>     16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
>     16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
> 
> We do balance dirty pages on the write side, but in the buffered write
> path most metadata is dirtied only after the accounted dirty pages -
> almost all of them data pages at that point - reach the dirty
> background limit and wake up the flusher thread. If many small,
> unmergeable random writes are spread across a large btree, we will see
> a burst of dirty pages exceeding the dirty_bytes limit after the
> flusher thread wakes up - which is not what we expect. On our machine
> this caused an out-of-memory problem, since a page cannot be dropped
> while it is marked dirty.
> 
> One may worry that we could sleep in btrfs_btree_balance_dirty(), but
> since btrfs_finish_ordered_io runs in a separate worker, it will not
> stop the flusher from consuming dirty pages. Also, since metadata
> writeback endio uses a different worker, sleeping in
> btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
> pages.

The described scenario sounds interesting. Do you have some scripted
steps to reproduce it?

btrfs_btree_balance_dirty detects congestion and can eventually skip
the balancing, so the case where it actually does something and waits
is the point where things can go bad. From the last paragraph it's
clear that you have considered that, which is good.

Have you also considered calling __btrfs_btree_balance_dirty with
flush_delayed=0? This would avoid the waiting, and I'm not sure the
wait is really required here to get out of the situation.
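
For reference, both entry points funnel into the same helper and differ
only in whether delayed items are flushed first - roughly like this
(paraphrased from fs/btrfs/disk-io.c of that time, not a verbatim copy,
details may differ between versions):

static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
					int flush_delayed)
{
	int ret;

	/* Don't recurse into reclaim. */
	if (current->flags & PF_MEMALLOC)
		return;

	/* This flush is what the nodelay variant skips. */
	if (flush_delayed)
		btrfs_balance_delayed_items(fs_info);

	/* Throttle only once dirty metadata crosses the threshold. */
	ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
				       BTRFS_DIRTY_METADATA_THRESH,
				       fs_info->dirty_metadata_batch);
	if (ret > 0)
		balance_dirty_pages_ratelimited(
				fs_info->btree_inode->i_mapping);
}

void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info)
{
	__btrfs_btree_balance_dirty(fs_info, 1);
}

void btrfs_btree_balance_dirty_nodelay(struct btrfs_fs_info *fs_info)
{
	__btrfs_btree_balance_dirty(fs_info, 0);
}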

Anyway, I'll add the patch to for-next for more testing.


* Re: [PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
From: ethanlien @ 2018-05-23  9:34 UTC
  To: dsterba, Ethan Lien, linux-btrfs; +Cc: linux-btrfs-owner

David Sterba wrote on 2018-05-23 00:28:
> On Fri, Apr 27, 2018 at 03:05:24PM +0800, Ethan Lien wrote:
>> We should balance dirty metadata pages at the end of
>> btrfs_finish_ordered_io, since a small, unmergeable random write can
>> potentially produce dirty metadata that is many times larger than the
>> data itself. For example, a small, unmergeable 4KiB write may
>> produce:
>> 
>>     16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
>>     16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
>>     16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>> 
>> We do balance dirty pages on the write side, but in the buffered write
>> path most metadata is dirtied only after the accounted dirty pages -
>> almost all of them data pages at that point - reach the dirty
>> background limit and wake up the flusher thread. If many small,
>> unmergeable random writes are spread across a large btree, we will see
>> a burst of dirty pages exceeding the dirty_bytes limit after the
>> flusher thread wakes up - which is not what we expect. On our machine
>> this caused an out-of-memory problem, since a page cannot be dropped
>> while it is marked dirty.
>> 
>> One may worry that we could sleep in btrfs_btree_balance_dirty(), but
>> since btrfs_finish_ordered_io runs in a separate worker, it will not
>> stop the flusher from consuming dirty pages. Also, since metadata
>> writeback endio uses a different worker, sleeping in
>> btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
>> pages.
> 
> The described scenario sounds interesting. Do you have some scripted
> steps to reproduce it?

It takes some time to reproduce the problem. In our case:
1. Create 4 subvolumes.
2. Run fio on each subvolume:

[global]
direct=0
rw=randwrite
ioengine=libaio
bs=4k
iodepth=16
numjobs=1
group_reporting
size=128G
runtime=1800
norandommap
time_based
randrepeat=0
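
As pasted, this file only has the [global] section; fio also needs at
least one job section naming a target, e.g. (the job name and directory
here are placeholders, not the exact values we used):

[randwrite]
directory=/mnt/btrfs/subvol1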

3. Take a snapshot of each subvolume and repeat fio on the existing
files.
4. Repeat steps 2 & 3 until we get really big btrees. In our case, by
observing btrfs_root_item->bytes_used, we have 2GiB of metadata in each
subvolume tree and 12GiB of metadata in the extent tree.
5. Stop all fio, take snapshots again, and wait until all delayed work
is completed.
6. Start all fio. A few seconds later we hit OOM when the flusher
starts to work.

It can be reproduced even when using nocow writes.

> btrfs_btree_balance_dirty detects congestion and can eventually skip
> the balancing, so the case where it actually does something and waits
> is the point where things can go bad. From the last paragraph it's
> clear that you have considered that, which is good.
> 
> Have you also considered calling __btrfs_btree_balance_dirty with
> flush_delayed=0? This would avoid the waiting, and I'm not sure the
> wait is really required here to get out of the situation.

Yes, btrfs_btree_balance_dirty_nodelay seems to be a better choice;
I'll use it in the v2 patch, thanks.
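
On top of v1 the change would then simply be (a sketch of what v2 will
do, not the final patch):

-	btrfs_btree_balance_dirty(fs_info);
+	btrfs_btree_balance_dirty_nodelay(fs_info);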

> Anyway, I'll add the patch to for-next for more testing.


