All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hillf Danton <hdanton@sina.com>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org,
	Michael Stapelberg <stapelberg+linux@google.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
Date: Thu,  8 Jul 2021 20:17:51 +0800	[thread overview]
Message-ID: <20210708121751.327-1-hdanton@sina.com> (raw)
In-Reply-To: <20210707095138.GC5335@quack2.suse.cz>

On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> On Mon,  5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >
>> >Michael Stapelberg has reported that for workload with short big spikes
>> >of writes (GCC linker seem to trigger this frequently) the write
>> >throughput is heavily underestimated and tends to steadily sink until it
>> >reaches zero. This has rather bad impact on writeback throttling
>> >(causing stalls). The problem is that writeback throughput estimate gets
>> >updated at most once per 200 ms. One update happens early after we
>> >submit pages for writeback (at that point writeout of only small
>> >fraction of pages is completed and thus observed throughput is tiny).
>> >Next update happens only during the next write spike (updates happen
>> >only from inode writeback and dirty throttling code) and if that is
>> >more than 1s after previous spike, we decide system was idle and just
>> >ignore whatever was written until this moment.
>> >
>> >Fix the problem by making sure writeback throughput estimate is also
>> >updated shortly after writeback completes to get reasonable estimate of
>> >throughput for spiky workloads.
>> >
>> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+li>nux@google.com
>> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >Signed-off-by: Jan Kara <jack@suse.cz>
>...
>> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >--- a/mm/page-writeback.c
>> >+++ b/mm/page-writeback.c
>> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_thr>ottle_control *gdtc,
>> > 	unsigned long dirtied;
>> > 	unsigned long written;
>> >
>> >-	lockdep_assert_held(&wb->list_lock);
>> >-
>> >-	/*
>> >-	 * rate-limit, only update once every 200ms.
>> >-	 */
>> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >-		return;
>> 
>> Please leave it as it is if you are not dumping the 200ms rule.
>
>Well, that could break the delayed updated scheduled after the end of
>writeback and for no good reason. The problematic ordering is like:

After another look at 2/5, you are cutting the rule, which is worth a
seperate patch.
>
>end writeback on inode1
>  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>
>__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>  wb->bw_time_stamp = now;
>
>end writeback on inode2
>  queue_delayed_work() - does nothing since work is already queued
>
>delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>our estimates.

Your example says the estimate based on inode2 is torpedoed by a random
update, and you are looking to make that estimate meaningful at the cost
of breaking the rule - how differet is it to the current one if the
estimate is derived from 20ms-elapsed interval at inode2? Is it likely to
see another palpablely different result at inode3 from 50ms-elapsed interval?
>
>> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_wr>iteback *wb)
>> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
>> > {
>> > 	atomic_dec(&wb->writeback_inodes);
>> >+	/*
>> >+	 * Make sure estimate of writeback throughput gets
>> >+	 * updated after writeback completed.
>> >+	 */
>> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
>> > }
>> 
>> This is a bogus estimate - it does not break the 200ms rule but walks
>> around it without specifying why 300ms is not good.
>
>Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
>do want some batching of bandwidth updates after writeback completes for
>the cases where lots of inodes end their writeback in a quick succession.
>I've picked BANDWIDTH_INTERVAL here as that's the batching of other
>bandwidth updates as well so it kind of makes sense. I'll add a comment why
>BANDWIDTH_INTERVAL is picked here.
>
>								Honza
>-- 
>Jan Kara <jack@suse.com>
>SUSE Labs, CR
>
>


  reply	other threads:[~2021-07-08 12:38 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-05 16:23 [PATCH 0/5] writeback: Fix bandwidth estimates Jan Kara
2021-07-05 16:23 ` [PATCH 1/5] writeback: Track number of inodes under writeback Jan Kara
2021-07-05 16:23 ` [PATCH 2/5] writeback: Reliably update bandwidth estimation Jan Kara
2021-07-05 16:23 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
2021-07-07  7:40   ` Hillf Danton
2021-07-07  9:51     ` Jan Kara
2021-07-08 12:17       ` Hillf Danton [this message]
2021-07-08 16:43         ` Jan Kara
2021-07-09  8:01           ` Hillf Danton
2021-07-05 16:23 ` [PATCH 4/5] writeback: Rename domain_update_bandwidth() Jan Kara
2021-07-05 16:23 ` [PATCH 5/5] writeback: Use READ_ONCE for unlocked reads of writeback stats Jan Kara
2021-07-09 13:19 ` [PATCH 0/5] writeback: Fix bandwidth estimates Michael Stapelberg
2021-07-09 13:19   ` Michael Stapelberg
2021-07-12 16:27   ` Jan Kara
2021-07-13  8:15     ` Michael Stapelberg
2021-07-13 10:24 [PATCH 0/5 v2] " Jan Kara
2021-07-13 10:24 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
2021-07-13 10:36 [PATCH 0/5 v3] writeback: Fix bandwidth estimates Jan Kara
2021-07-13 10:36 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara
2021-07-13 10:47 [PATCH 0/5 v4] writeback: Fix bandwidth estimates Jan Kara
2021-07-13 10:47 ` [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210708121751.327-1-hdanton@sina.com \
    --to=hdanton@sina.com \
    --cc=akpm@linux-foundation.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=stapelberg+linux@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.