From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mx2.suse.de ([195.135.220.15]:32811 "EHLO mx2.suse.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S934940AbeE2PgL (ORCPT );
        Tue, 29 May 2018 11:36:11 -0400
Date: Tue, 29 May 2018 17:33:27 +0200
From: David Sterba 
To: Ethan Lien 
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
Message-ID: <20180529153326.GB4325@twin.jikos.cz>
Reply-To: dsterba@suse.cz
References: <20180528054821.9092-1-ethanlien@synology.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20180528054821.9092-1-ethanlien@synology.com>
Sender: linux-btrfs-owner@vger.kernel.org
List-ID: 

On Mon, May 28, 2018 at 01:48:20PM +0800, Ethan Lien wrote:
> [Problem description and how we fix it]
> We should balance dirty metadata pages at the end of
> btrfs_finish_ordered_io, since a small, unmergeable random write can
> potentially produce dirty metadata which is multiple times larger than
> the data itself. For example, a small, unmergeable 4KiB write may
> produce:
>
> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>
> Although we do balance dirty pages on the write side, in the buffered
> write path most metadata is dirtied only after we reach the dirty
> background limit (which so far only counts dirty data pages) and wake
> up the flusher thread. If there are many small, unmergeable random
> writes spread across a large btree, a burst of dirty pages can exceed
> the dirty_bytes limit after the flusher thread wakes up - which is not
> what we expect. On our machine, it caused an out-of-memory problem,
> since a page cannot be dropped while it is marked dirty.
>
> One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay,
> but since btrfs_finish_ordered_io runs in a separate worker, it will
> not stop the flusher from consuming dirty pages. Also, because we use
> a different worker for metadata writeback endio, sleeping in
> btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
> pages.
>
> [Reproduce steps]
> To reproduce the problem, we need to do 4KiB writes randomly spread
> across a large btree. On our 2GiB RAM machine:
> 1) Create 4 subvolumes.
> 2) Run fio on each subvolume:
>
> [global]
> direct=0
> rw=randwrite
> ioengine=libaio
> bs=4k
> iodepth=16
> numjobs=1
> group_reporting
> size=128G
> runtime=1800
> norandommap
> time_based
> randrepeat=0
>
> 3) Take a snapshot of each subvolume and repeat fio on the existing
> files.
> 4) Repeat step (3) until we get large btrees.
> In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
> metadata in each subvolume tree and 12GiB of metadata in the extent
> tree.
> 5) Stop all fio, take snapshots again, and wait until all delayed work
> is completed.
> 6) Start all fio. A few seconds later we hit OOM when the flusher
> starts to work.
>
> It can be reproduced even when using nocow writes.
>
> Signed-off-by: Ethan Lien 

Added to misc-next, thanks.
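
To put a rough number on the amplification described in the commit
message (back-of-the-envelope only, using the example's own figures): a
single 4KiB write that COWs one 16KiB leaf in each of the subvolume,
checksum and extent trees dirties 3 x 16KiB = 48KiB of metadata, and up
to 96KiB if a node in each tree is COWed as well - roughly 12x to 24x
the size of the data actually written.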
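
For anyone reading the archive without the diff at hand, the fix boils
down to a throttling call at the tail of btrfs_finish_ordered_io() in
fs/btrfs/inode.c. The snippet below is only a sketch of the shape of
the change (it assumes the usual fs_info local derived from the ordered
extent's inode; the exact hunk, context lines and comment wording are
in the patch itself):

    static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
    {
            struct inode *inode = ordered_extent->inode;
            struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
            ...
            /* once for the tree */
            btrfs_put_ordered_extent(ordered_extent);

            /*
             * Throttle the dirty metadata generated above.  This runs
             * in a worker, so sleeping here does not stall the flusher
             * thread (sketch, not the actual hunk).
             */
            btrfs_btree_balance_dirty_nodelay(fs_info);

            return ret;
    }

The _nodelay variant matters here: if I read fs/btrfs/disk-io.c
correctly, it skips flushing the delayed items and only calls
balance_dirty_pages_ratelimited() on the btree inode once the dirty
metadata counter crosses its threshold, so the sleep in the endio
worker stays bounded.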