linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
@ 2018-05-28  5:48 Ethan Lien
  2018-05-29 15:33 ` David Sterba
  2018-12-12 14:47 ` Chris Mason
  0 siblings, 2 replies; 13+ messages in thread
From: Ethan Lien @ 2018-05-28  5:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Ethan Lien

[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do balance dirty pages on the write side, in the buffered
write path most metadata is dirtied only after we reach the dirty
background limit (which so far counts only dirty data pages) and wake
up the flusher thread. If there are many small, unmergeable random
writes spread across a large btree, we'll see a burst of dirty pages
exceed the dirty_bytes limit right after the flusher thread wakes up -
which is not what we expect. On our machine it caused an out-of-memory
problem, since a page cannot be dropped while it is marked dirty.

One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay,
but since we run btrfs_finish_ordered_io in a separate worker, it will
not stop the flusher from consuming dirty pages. Also, because we use a
different worker for metadata writeback endio, sleeping in
btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
pages.
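
For reference, btrfs_btree_balance_dirty_nodelay roughly does the
following (a condensed sketch of __btrfs_btree_balance_dirty() in
fs/btrfs/disk-io.c; condensed here, so details may differ):

  void btrfs_btree_balance_dirty_nodelay(struct btrfs_fs_info *fs_info)
  {
  	if (current->flags & PF_MEMALLOC)
  		return;

  	/* Throttle only once dirty metadata crosses the btrfs threshold. */
  	if (__percpu_counter_compare(&fs_info->dirty_metadata_bytes,
  				     BTRFS_DIRTY_METADATA_THRESH,
  				     fs_info->dirty_metadata_batch) > 0)
  		balance_dirty_pages_ratelimited(
  				fs_info->btree_inode->i_mapping);
  }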

[Reproduce steps]
To reproduce the problem, we need to do 4KiB writes randomly spread
across a large btree. On our 2GiB RAM machine:
1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio again. A few seconds later we hit OOM when the flusher
   starts to work.

It can be reproduced even when using nocow writes.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
---

V2:
	Replace btrfs_btree_balance_dirty with btrfs_btree_balance_dirty_nodelay.
	Add reproduce steps.

 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e604e7071f1..e54547df24ee 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3158,6 +3158,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	/* once for the tree */
 	btrfs_put_ordered_extent(ordered_extent);
 
+	btrfs_btree_balance_dirty_nodelay(fs_info);
+
 	return ret;
 }
 
-- 
2.17.0



* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-05-28  5:48 [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io Ethan Lien
@ 2018-05-29 15:33 ` David Sterba
  2018-12-12 14:47 ` Chris Mason
  1 sibling, 0 replies; 13+ messages in thread
From: David Sterba @ 2018-05-29 15:33 UTC (permalink / raw)
  To: Ethan Lien; +Cc: linux-btrfs

On Mon, May 28, 2018 at 01:48:20PM +0800, Ethan Lien wrote:
> [Problem description and how we fix it]
> [...]
> 
> Signed-off-by: Ethan Lien <ethanlien@synology.com>

Added to misc-next, thanks.


* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-05-28  5:48 [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io Ethan Lien
  2018-05-29 15:33 ` David Sterba
@ 2018-12-12 14:47 ` Chris Mason
  2018-12-12 15:22   ` Martin Raiber
  2018-12-13  8:38   ` ethanlien
  1 sibling, 2 replies; 13+ messages in thread
From: Chris Mason @ 2018-12-12 14:47 UTC (permalink / raw)
  To: Ethan Lien; +Cc: linux-btrfs, David Sterba

On 28 May 2018, at 1:48, Ethan Lien wrote:

It took me a while to trigger, but this actually deadlocks ;)  More 
below.

> [Problem description and how we fix it]
> [...]
>
> One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay,
> but since we run btrfs_finish_ordered_io in a separate worker, it will
> not stop the flusher from consuming dirty pages. Also, because we use a
> different worker for metadata writeback endio, sleeping in
> btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
> pages.

In general, slowing down btrfs_finish_ordered_io isn't ideal because it 
adds latency to places we need to finish quickly.  Also, 
btrfs_finish_ordered_io is used by the free space cache.  Even though 
this happens from its own workqueue, it means completing free space 
cache writeback may end up waiting on balance_dirty_pages, something 
like this stack trace:

12260 kworker/u96:16+btrfs-freespace-write D
[<0>] balance_dirty_pages+0x6e6/0x7ad
[<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
[<0>] btrfs_finish_ordered_io+0x3da/0x770
[<0>] normal_work_helper+0x1c5/0x5a0
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

Transaction commit will wait on the freespace cache:

838 btrfs-transacti D
[<0>] btrfs_start_ordered_extent+0x154/0x1e0
[<0>] btrfs_wait_ordered_range+0xbd/0x110
[<0>] __btrfs_wait_cache_io+0x49/0x1a0
[<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
[<0>] commit_cowonly_roots+0x215/0x2b0
[<0>] btrfs_commit_transaction+0x37e/0x910
[<0>] transaction_kthread+0x14d/0x180
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

And then writepages ends up waiting on transaction commit:

9520 kworker/u96:13+flush-btrfs-1 D
[<0>] wait_current_trans+0xac/0xe0
[<0>] start_transaction+0x21b/0x4b0
[<0>] cow_file_range_inline+0x10b/0x6b0
[<0>] cow_file_range.isra.69+0x329/0x4a0
[<0>] run_delalloc_range+0x105/0x3c0
[<0>] writepage_delalloc+0x119/0x180
[<0>] __extent_writepage+0x10c/0x390
[<0>] extent_write_cache_pages+0x26f/0x3d0
[<0>] extent_writepages+0x4f/0x80
[<0>] do_writepages+0x17/0x60
[<0>] __writeback_single_inode+0x59/0x690
[<0>] writeback_sb_inodes+0x291/0x4e0
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x3bb/0x500
[<0>] wb_workfn+0x40d/0x610
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x1e0/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

Eventually, we have every process in the system waiting on 
balance_dirty_pages(), and nobody is able to make progress on page 
writeback.

>
> [Reproduce steps]

[ ... ]

>
> V2:
> 	Replace btrfs_btree_balance_dirty with btrfs_btree_balance_dirty_nodelay.
> 	Add reproduce steps.
>
>  fs/btrfs/inode.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8e604e7071f1..e54547df24ee 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3158,6 +3158,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  	/* once for the tree */
>  	btrfs_put_ordered_extent(ordered_extent);
>
> +	btrfs_btree_balance_dirty_nodelay(fs_info);
> +
>  	return ret;
>  }


The original OOM you describe feels like an MM bug to me, but I'm going 
to try the repro steps here.  Since the freespace cache has its own 
workqueue, we could fix the deadlock just by wrapping the 
balance_dirty_pages call in a check for the freespace inode.  But, I 
think we'll get better performance by nudging the fix outside of 
btrfs_finish_ordered_io.  I'll see if I can reproduce.
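
A minimal sketch of that guard, assuming the btrfs_is_free_space_inode()
helper from fs/btrfs/inode.c and the inode pointer already local to
btrfs_finish_ordered_io() (illustrative placement, not a tested fix):

  	/* Hypothetical guard: skip throttling for the free space inode,
  	 * whose writeback is waited on during transaction commit. */
  	if (!btrfs_is_free_space_inode(BTRFS_I(inode)))
  		btrfs_btree_balance_dirty_nodelay(fs_info);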

-chris


* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-12 14:47 ` Chris Mason
@ 2018-12-12 15:22   ` Martin Raiber
  2018-12-12 15:36     ` David Sterba
  2018-12-14  8:07     ` ethanlien
  2018-12-13  8:38   ` ethanlien
  1 sibling, 2 replies; 13+ messages in thread
From: Martin Raiber @ 2018-12-12 15:22 UTC (permalink / raw)
  To: Chris Mason, Ethan Lien; +Cc: linux-btrfs, David Sterba

On 12.12.2018 15:47 Chris Mason wrote:
> On 28 May 2018, at 1:48, Ethan Lien wrote:
>
> It took me a while to trigger, but this actually deadlocks ;)  More 
> below.
>
>> [Problem description and how we fix it]
>> [...]
> In general, slowing down btrfs_finish_ordered_io isn't ideal because it 
> adds latency to places we need to finish quickly.  Also, 
> btrfs_finish_ordered_io is used by the free space cache.  Even though 
> this happens from its own workqueue, it means completing free space 
> cache writeback may end up waiting on balance_dirty_pages, something 
> like this stack trace:
>
> [..]
>
> Eventually, we have every process in the system waiting on 
> balance_dirty_pages(), and nobody is able to make progress on page 
> writeback.
>
I had lockups with this patch as well. If you put e.g. a loop device on
top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
loop causing delays. The task balancing dirty pages in
btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
my case it managed to cause a feedback loop where it queued further
btrfs_finish_ordered_io work and got stuck completely.

Regards,
Martin Raiber



* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-12 15:22   ` Martin Raiber
@ 2018-12-12 15:36     ` David Sterba
  2018-12-12 17:55       ` Chris Mason
  2018-12-14  8:07     ` ethanlien
  1 sibling, 1 reply; 13+ messages in thread
From: David Sterba @ 2018-12-12 15:36 UTC (permalink / raw)
  To: Martin Raiber; +Cc: Chris Mason, Ethan Lien, linux-btrfs, David Sterba

On Wed, Dec 12, 2018 at 03:22:40PM +0000, Martin Raiber wrote:
> On 12.12.2018 15:47 Chris Mason wrote:
> > On 28 May 2018, at 1:48, Ethan Lien wrote:
> >
> > It took me a while to trigger, but this actually deadlocks ;)  More 
> > below.
> >
> >> [Problem description and how we fix it]
> >> [...]
> > In general, slowing down btrfs_finish_ordered_io isn't ideal because it 
> > adds latency to places we need to finish quickly.  Also, 
> > btrfs_finish_ordered_io is used by the free space cache.  Even though 
> > this happens from its own workqueue, it means completing free space 
> > cache writeback may end up waiting on balance_dirty_pages, something 
> > like this stack trace:
> >
> > [..]
> >
> > Eventually, we have every process in the system waiting on 
> > balance_dirty_pages(), and nobody is able to make progress on page 
> > writeback.
> >
> I had lockups with this patch as well. If you put e.g. a loop device on
> top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
> loop causing delays. The task balancing dirty pages in
> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
> my case it managed to cause a feedback loop where it queued further
> btrfs_finish_ordered_io work and got stuck completely.

This does not look like the artificial and hard-to-hit case that's in
the original patch. I'm thinking about sending a revert for 4.20-rc6;
the deadlock is IMO worse than the OOM.


* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-12 15:36     ` David Sterba
@ 2018-12-12 17:55       ` Chris Mason
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Mason @ 2018-12-12 17:55 UTC (permalink / raw)
  To: David Sterba; +Cc: Martin Raiber, Ethan Lien, linux-btrfs

On 12 Dec 2018, at 10:36, David Sterba wrote:

> On Wed, Dec 12, 2018 at 03:22:40PM +0000, Martin Raiber wrote:
>> On 12.12.2018 15:47 Chris Mason wrote:
>>> On 28 May 2018, at 1:48, Ethan Lien wrote:
>>>
>>> Eventually, we have every process in the system waiting on
>>> balance_dirty_pages(), and nobody is able to make progress on page
>>> writeback.
>>>
>> I had lockups with this patch as well. If you put e.g. a loop device
>> on top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a
>> feedback loop causing delays. The task balancing dirty pages in
>> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs.
>> In my case it managed to cause a feedback loop where it queued
>> further btrfs_finish_ordered_io work and got stuck completely.
>
> This does not look like the artificial and hard-to-hit case that's in
> the original patch. I'm thinking about sending a revert for 4.20-rc6;
> the deadlock is IMO worse than the OOM.

I haven't been able to trigger the OOM this morning.  Ethan, is this 
something you can still hit on upstream kernels with the 
balance_dirty_pages() removed?

-chris


* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-12 14:47 ` Chris Mason
  2018-12-12 15:22   ` Martin Raiber
@ 2018-12-13  8:38   ` ethanlien
  2019-01-04 15:59     ` David Sterba
  1 sibling, 1 reply; 13+ messages in thread
From: ethanlien @ 2018-12-13  8:38 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, David Sterba

Chris Mason wrote on 2018-12-12 22:47:
> On 28 May 2018, at 1:48, Ethan Lien wrote:
> 
> It took me a while to trigger, but this actually deadlocks ;)  More
> below.
> 
>> [Problem description and how we fix it]
>> [...]
> 
> [...]
> 
> The original OOM you describe feels like an MM bug to me, but I'm going
> to try the repro steps here.  Since the freespace cache has its own
> workqueue, we could fix the deadlock just by wrapping the
> balance_dirty_pages call in a check for the freespace inode.  But, I
> think we'll get better performance by nudging the fix outside of
> btrfs_finish_ordered_io.  I'll see if I can reproduce.

Before this patch, I tried adding a new workqueue for metadata
writeback and kicking off async flush work on fs_info->btree_inode in
btrfs_finish_ordered_io(). It works, but it can't guarantee we keep
dirty pages under MM's dirty_bytes limit if btrfs_finish_ordered_io()
is still running at full speed.
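
Roughly, that attempt looked like the sketch below. The worker and the
metadata_flush_work member are hypothetical names for code that was
never merged; only filemap_flush() and btree_inode are real:

  /* Hypothetical: start async writeback of the btree inode instead of
   * sleeping in btrfs_finish_ordered_io() itself. */
  static void btrfs_async_metadata_flush(struct work_struct *work)
  {
  	struct btrfs_fs_info *fs_info =
  		container_of(work, struct btrfs_fs_info,
  			     metadata_flush_work);

  	/* Kick off writeback of dirty btree pages without waiting. */
  	filemap_flush(fs_info->btree_inode->i_mapping);
  }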

> I haven't been able to trigger the OOM this morning.  Ethan, is this
> something you can still hit on upstream kernels with the
> balance_dirty_pages() removed?

I hit the OOM problem on a 4.4 kernel. I'll see if I can trigger it on
an up-to-date kernel.

> -chris



* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-12 15:22   ` Martin Raiber
  2018-12-12 15:36     ` David Sterba
@ 2018-12-14  8:07     ` ethanlien
  2018-12-17 14:00       ` Martin Raiber
  1 sibling, 1 reply; 13+ messages in thread
From: ethanlien @ 2018-12-14  8:07 UTC (permalink / raw)
  To: Martin Raiber; +Cc: Chris Mason, linux-btrfs, David Sterba, linux-btrfs-owner

Martin Raiber wrote on 2018-12-12 23:22:
> On 12.12.2018 15:47 Chris Mason wrote:
>> On 28 May 2018, at 1:48, Ethan Lien wrote:
>> 
>> It took me a while to trigger, but this actually deadlocks ;)  More
>> below.
>> 
>>> [Problem description and how we fix it]
>>> [...]
>> In general, slowing down btrfs_finish_ordered_io isn't ideal because
>> it adds latency to places we need to finish quickly.  Also,
>> btrfs_finish_ordered_io is used by the free space cache.  Even though
>> this happens from its own workqueue, it means completing free space
>> cache writeback may end up waiting on balance_dirty_pages, something
>> like this stack trace:
>> 
>> [..]
>> 
>> Eventually, we have every process in the system waiting on
>> balance_dirty_pages(), and nobody is able to make progress on page
>> writeback.
>> 
> I had lockups with this patch as well. If you put e.g. a loop device on
> top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
> loop causing delays. The task balancing dirty pages in
> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
> my case it managed to cause a feedback loop where it queued further
> btrfs_finish_ordered_io work and got stuck completely.
> 

The data writepage endio queues a work item for
btrfs_finish_ordered_io() in a separate workqueue and clears the page's
writeback bit, so throttling in btrfs_finish_ordered_io() should not
slow down the flusher thread. One suspicious point: while a caller is
waiting for a range of ordered_extents to complete, it will be blocked
until balance_dirty_pages_ratelimited() makes some progress, since we
finish ordered_extents in btrfs_finish_ordered_io().
Do you have call stack information for the stuck processes, or are you
using fsync/sync frequently? If that is the case, maybe we should pull
this out and balance dirty metadata pages somewhere else.

> Regards,
> Martin Raiber



* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-14  8:07     ` ethanlien
@ 2018-12-17 14:00       ` Martin Raiber
  2018-12-19 10:33         ` ethanlien
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Raiber @ 2018-12-17 14:00 UTC (permalink / raw)
  To: ethanlien; +Cc: Chris Mason, linux-btrfs, David Sterba, linux-btrfs-owner

On 14.12.2018 09:07 ethanlien wrote:
> Martin Raiber wrote on 2018-12-12 23:22:
>> On 12.12.2018 15:47 Chris Mason wrote:
>>> On 28 May 2018, at 1:48, Ethan Lien wrote:
>>>
>>> It took me a while to trigger, but this actually deadlocks ;)  More
>>> below.
>>>
>>>> [Problem description and how we fix it]
>>>> [...]
>>>
>>> [...]
>>>
>> I had lockups with this patch as well. If you put e.g. a loop device on
>> top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
>> loop causing delays. The task balancing dirty pages in
>> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
>> my case it managed to cause a feedback loop where it queued further
>> btrfs_finish_ordered_io work and got stuck completely.
>>
>
> The data writepage endio queues a work item for
> btrfs_finish_ordered_io() in a separate workqueue and clears the page's
> writeback bit, so throttling in btrfs_finish_ordered_io() should not
> slow down the flusher thread. One suspicious point: while a caller is
> waiting for a range of ordered_extents to complete, it will be blocked
> until balance_dirty_pages_ratelimited() makes some progress, since we
> finish ordered_extents in btrfs_finish_ordered_io().
> Do you have call stack information for the stuck processes, or are you
> using fsync/sync frequently? If that is the case, maybe we should pull
> this out and balance dirty metadata pages somewhere else.

Yeah like,

[875317.071433] Call Trace:
[875317.071438]  ? __schedule+0x306/0x7f0
[875317.071442]  schedule+0x32/0x80
[875317.071447]  btrfs_start_ordered_extent+0xed/0x120
[875317.071450]  ? remove_wait_queue+0x60/0x60
[875317.071454]  btrfs_wait_ordered_range+0xa0/0x100
[875317.071457]  btrfs_sync_file+0x1d6/0x400
[875317.071461]  ? do_fsync+0x38/0x60
[875317.071463]  ? btrfs_fdatawrite_range+0x50/0x50
[875317.071465]  do_fsync+0x38/0x60
[875317.071468]  __x64_sys_fsync+0x10/0x20
[875317.071470]  do_syscall_64+0x55/0x100
[875317.071473]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

so I guess the problem is that calling balance_dirty_pages causes
fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those
fsyncs are deadlocked because they are called indirectly from
btrfs_finish_ordered_io... It is an unusual setup, which is why I did
not post it to the mailing list initially.




* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-17 14:00       ` Martin Raiber
@ 2018-12-19 10:33         ` ethanlien
  2018-12-19 14:22           ` Chris Mason
  0 siblings, 1 reply; 13+ messages in thread
From: ethanlien @ 2018-12-19 10:33 UTC (permalink / raw)
  To: Martin Raiber; +Cc: Chris Mason, linux-btrfs, David Sterba, linux-btrfs-owner

Martin Raiber wrote on 2018-12-17 22:00:

>>> [...]
>> 
>> [...]
> 
> Yeah like,
> 
> [875317.071433] Call Trace:
> [875317.071438]  ? __schedule+0x306/0x7f0
> [875317.071442]  schedule+0x32/0x80
> [875317.071447]  btrfs_start_ordered_extent+0xed/0x120
> [875317.071450]  ? remove_wait_queue+0x60/0x60
> [875317.071454]  btrfs_wait_ordered_range+0xa0/0x100
> [875317.071457]  btrfs_sync_file+0x1d6/0x400
> [875317.071461]  ? do_fsync+0x38/0x60
> [875317.071463]  ? btrfs_fdatawrite_range+0x50/0x50
> [875317.071465]  do_fsync+0x38/0x60
> [875317.071468]  __x64_sys_fsync+0x10/0x20
> [875317.071470]  do_syscall_64+0x55/0x100
> [875317.071473]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> so I guess the problem is that calling balance_dirty_pages causes
> fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those
> fsyncs are deadlocked because they are called indirectly from
> btrfs_finish_ordered_io... It is an unusual setup, which is why I did
> not post it to the mailing list initially.

To me this does not look like a real deadlock. The fsync call involves
two steps: (1) flush dirty data pages, (2) update the corresponding
metadata to point to the flushed data. Since step 1 consumes dirty
pages and step 2 produces more dirty pages, this patch leaves step 1
unchanged and blocks step 2 in btrfs_finish_ordered_io(), which seems
reasonable for an OOM fix. The problem is that if other processes keep
writing new data, the fsync call will have to wait a long time for the
metadata update, even though its own dirty data was flushed long ago.

Back to the deadlock problem: what Chris found is a real deadlock, and
it can be fixed by adding a check for the free space inode.
For the fsync delay problem, it depends on whether we want to keep the
amount of dirty metadata pages within the dirty_bytes limit here.
If we don't, we revert this patch and try flushing dirty metadata
pages somewhere else; that should be enough to fix the OOM. But if we
do want to keep dirty metadata pages within the dirty_bytes limit, we
still need some throttling here. Setting __GFP_WRITE on extent_buffer
pages is bad, since most of the time we use them as a read cache to
speed up b-tree traversal. Adding PF_LESS_THROTTLE in
btrfs_finish_ordered_io() might help, since it creates a window where
new writes are throttled but all the flushing and metadata updates
stay unblocked, so we still retain some kind of throttling. But it
needs more testing to know the fsync delay in the worst case. A rough
sketch of that idea follows.
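
As a sketch (untested; it assumes current_restore_flags() from
linux/sched.h and wraps the call this patch added):

  	unsigned int pflags = current->flags;

  	/* Mark this worker less throttled while it balances dirty pages,
  	 * the way the loop driver marks its worker thread. */
  	current->flags |= PF_LESS_THROTTLE;
  	btrfs_btree_balance_dirty_nodelay(fs_info);
  	current_restore_flags(pflags, PF_LESS_THROTTLE);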



* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-19 10:33         ` ethanlien
@ 2018-12-19 14:22           ` Chris Mason
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Mason @ 2018-12-19 14:22 UTC (permalink / raw)
  To: ethanlien; +Cc: Martin Raiber, linux-btrfs, David Sterba, linux-btrfs-owner

On 19 Dec 2018, at 5:33, ethanlien wrote:

> Martin Raiber wrote on 2018-12-17 22:00:
>
>> [...]
>
> To me this does not look like a real deadlock. The fsync call involves
> two steps: (1) flush dirty data pages, (2) update the corresponding
> metadata to point to the flushed data. Since step 1 consumes dirty
> pages and step 2 produces more dirty pages, this patch leaves step 1
> unchanged and blocks step 2 in btrfs_finish_ordered_io(), which seems
> reasonable for an OOM fix. The problem is that if other processes keep
> writing new data, the fsync call will have to wait a long time for the
> metadata update, even though its own dirty data was flushed long ago.
>
> Back to the deadlock problem: what Chris found is a real deadlock, and
> it can be fixed by adding a check for the free space inode.

I think we should have a better understanding of your original OOM
problem before we keep the balance_dirty_pages().  This isn't a great
place to throttle, and while it's also not a great place to make a huge
burst of dirty pages, I'd like to make sure we're really fixing the
right problem against today's kernel.

-chris


* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2018-12-13  8:38   ` ethanlien
@ 2019-01-04 15:59     ` David Sterba
  2019-01-09 10:07       ` ethanlien
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2019-01-04 15:59 UTC (permalink / raw)
  To: ethanlien; +Cc: Chris Mason, linux-btrfs, David Sterba

On Thu, Dec 13, 2018 at 04:38:48PM +0800, ethanlien wrote:
> Chris Mason wrote on 2018-12-12 22:47:
> > [...]
> >
> > I haven't been able to trigger the OOM this morning.  Ethan, is this
> > something you can still hit on upstream kernels with the
> > balance_dirty_pages() removed?
> 
> I hit the OOM problem on a 4.4 kernel. I'll see if I can trigger it on
> an up-to-date kernel.

Any followup to your testing? Otherwise I'm going to send a revert of
the patch.


* Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  2019-01-04 15:59     ` David Sterba
@ 2019-01-09 10:07       ` ethanlien
  0 siblings, 0 replies; 13+ messages in thread
From: ethanlien @ 2019-01-09 10:07 UTC (permalink / raw)
  To: dsterba, ethanlien, Chris Mason, linux-btrfs; +Cc: linux-btrfs-owner

David Sterba wrote on 2019-01-04 23:59:
> [...]
> 
> Any followup to your testing? Otherwise I'm going to send a revert of
> the patch.

Sorry for the late update. I didn't hit the OOM on the new 4.20 kernel,
so please revert it, thanks.


end of thread

Thread overview: 13+ messages
2018-05-28  5:48 [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io Ethan Lien
2018-05-29 15:33 ` David Sterba
2018-12-12 14:47 ` Chris Mason
2018-12-12 15:22   ` Martin Raiber
2018-12-12 15:36     ` David Sterba
2018-12-12 17:55       ` Chris Mason
2018-12-14  8:07     ` ethanlien
2018-12-17 14:00       ` Martin Raiber
2018-12-19 10:33         ` ethanlien
2018-12-19 14:22           ` Chris Mason
2018-12-13  8:38   ` ethanlien
2019-01-04 15:59     ` David Sterba
2019-01-09 10:07       ` ethanlien
