Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload

From: Hillf Danton <hdanton@sina.com>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org,
	Michael Stapelberg <stapelberg+linux@google.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
Date: Thu,  8 Jul 2021 20:17:51 +0800	[thread overview]
Message-ID: <20210708121751.327-1-hdanton@sina.com> (raw)
In-Reply-To: <20210707095138.GC5335@quack2.suse.cz>

On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >
>> >Michael Stapelberg has reported that for workload with short big spikes
>> >of writes (GCC linker seem to trigger this frequently) the write
>> >throughput is heavily underestimated and tends to steadily sink until it
>> >reaches zero. This has rather bad impact on writeback throttling
>> >(causing stalls). The problem is that writeback throughput estimate gets
>> >updated at most once per 200 ms. One update happens early after we
>> >submit pages for writeback (at that point writeout of only small
>> >fraction of pages is completed and thus observed throughput is tiny).
>> >Next update happens only during the next write spike (updates happen
>> >only from inode writeback and dirty throttling code) and if that is
>> >more than 1s after previous spike, we decide system was idle and just
>> >ignore whatever was written until this moment.
>> >
>> >Fix the problem by making sure writeback throughput estimate is also
>> >updated shortly after writeback completes to get reasonable estimate of
>> >throughput for spiky workloads.
>> >
>> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+li>nux@google.com
>> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >Signed-off-by: Jan Kara <jack@suse.cz>
>...
>> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >--- a/mm/page-writeback.c
>> >+++ b/mm/page-writeback.c
>> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_thr>ottle_control *gdtc,
>> > 	unsigned long dirtied;
>> > 	unsigned long written;
>> >
>> >-	lockdep_assert_held(&wb->list_lock);
>> >-
>> >-	/*
>> >-	 * rate-limit, only update once every 200ms.
>> >-	 */
>> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >-		return;
>> 
>> Please leave it as it is if you are not dumping the 200ms rule.
>
>Well, that could break the delayed updated scheduled after the end of
>writeback and for no good reason. The problematic ordering is like:

After another look at 2/5, you are cutting the rule, which is worth a
seperate patch.
>
>end writeback on inode1
>  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>
>__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>  wb->bw_time_stamp = now;
>
>end writeback on inode2
>  queue_delayed_work() - does nothing since work is already queued
>
>delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>our estimates.

Your example says the estimate based on inode2 is torpedoed by a random
update, and you are looking to make that estimate meaningful at the cost
of breaking the rule - how differet is it to the current one if the
estimate is derived from 20ms-elapsed interval at inode2? Is it likely to
see another palpablely different result at inode3 from 50ms-elapsed interval?
>
>> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_wr>iteback *wb)
>> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
>> > {
>> > 	atomic_dec(&wb->writeback_inodes);
>> >+	/*
>> >+	 * Make sure estimate of writeback throughput gets
>> >+	 * updated after writeback completed.
>> >+	 */
>> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
>> > }
>> 
>> This is a bogus estimate - it does not break the 200ms rule but walks
>> around it without specifying why 300ms is not good.
>
>Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
>do want some batching of bandwidth updates after writeback completes for
>the cases where lots of inodes end their writeback in a quick succession.
>I've picked BANDWIDTH_INTERVAL here as that's the batching of other
>bandwidth updates as well so it kind of makes sense. I'll add a comment why
>BANDWIDTH_INTERVAL is picked here.
>
>								Honza
>-- 
>Jan Kara <jack@suse.com>
>SUSE Labs, CR
>
>