From: Hillf Danton
To: Jan Kara
Cc: Andrew Morton, linux-fsdevel@vger.kernel.org, Michael Stapelberg, linux-mm@kvack.org
Subject: Re: [PATCH 3/5] writeback: Fix bandwidth estimate for spiky workload
Date: Fri, 9 Jul 2021 16:01:34 +0800
Message-Id: <20210709080134.2366-1-hdanton@sina.com>
In-Reply-To: <20210708164301.GA11179@quack2.suse.cz>
References: <20210705161610.19406-1-jack@suse.cz> <20210707074017.2195-1-hdanton@sina.com> <20210708121751.327-1-hdanton@sina.com>

On Thu, 8 Jul 2021 18:43:01 +0200 Jan Kara wrote:
>On Thu 08-07-21 20:17:51, Hillf Danton wrote:
>> On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>> >On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> >> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >> >
>> >> >Michael Stapelberg has reported that for workload with short big spikes
>> >> >of writes (GCC linker seem to trigger this frequently) the write
>> >> >throughput is heavily underestimated and tends to steadily sink until it
>> >> >reaches zero. This has rather bad impact on writeback throttling
>> >> >(causing stalls). The problem is that writeback throughput estimate gets
>> >> >updated at most once per 200 ms.
>> >> >One update happens early after we
>> >> >submit pages for writeback (at that point writeout of only small
>> >> >fraction of pages is completed and thus observed throughput is tiny).
>> >> >Next update happens only during the next write spike (updates happen
>> >> >only from inode writeback and dirty throttling code) and if that is
>> >> >more than 1s after previous spike, we decide system was idle and just
>> >> >ignore whatever was written until this moment.
>> >> >
>> >> >Fix the problem by making sure writeback throughput estimate is also
>> >> >updated shortly after writeback completes to get reasonable estimate of
>> >> >throughput for spiky workloads.
>> >> >
>> >> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >> >Reported-by: Michael Stapelberg
>> >> >Signed-off-by: Jan Kara
>> >...
>> >> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >> >--- a/mm/page-writeback.c
>> >> >+++ b/mm/page-writeback.c
>> >> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> >> > 	unsigned long dirtied;
>> >> > 	unsigned long written;
>> >> >
>> >> >-	lockdep_assert_held(&wb->list_lock);
>> >> >-
>> >> >-	/*
>> >> >-	 * rate-limit, only update once every 200ms.
>> >> >-	 */
>> >> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >> >-		return;
>> >>
>> >> Please leave it as it is if you are not dumping the 200ms rule.
>> >
>> >Well, that could break the delayed update scheduled after the end of
>> >writeback and for no good reason. The problematic ordering is like:
>>
>> After another look at 2/5, you are cutting the rule, which is worth a
>> separate patch.
>
>The only updates that can break the 200ms rule are the updates added in this
>patch. I don't think separating the removal of 200ms check for that one
>case really brings much clarity. It would rather bring "what if questions"
>to this patch...
>
>> >end writeback on inode1
>> >  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>> >
>> >__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>> >  wb->bw_time_stamp = now;
>> >
>> >end writeback on inode2
>> >  queue_delayed_work() - does nothing since work is already queued
>> >
>> >delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>> >< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>> >our estimates.
>>
>> Your example says the estimate based on inode2 is torpedoed by a random
>> update, and you are looking to make that estimate meaningful at the cost
>> of breaking the rule - how different is it from the current one if the
>> estimate is derived from 20ms-elapsed interval at inode2? Is it likely to
>> see another palpably different result at inode3 from 50ms-elapsed interval?
>
>I'm not sure I understand your question correctly but updates after shorter
>than 200ms interval should not disturb the estimates much.
>wb_update_write_bandwidth() effectively uses formula:
>
>	bandwidth = (written + bandwidth * (period - elapsed)) / period
>
>where 'period' is 3 seconds. So we compute average bandwidth over last 3
>seconds where amount written in 'elapsed' interval is 'written' pages. If
>'elapsed' is small, the influence of current sample on reducing estimated
>bandwidth is going to be small as well.

Correct. The 3s period, in combination with the filters there, keeps any
single sample, whether from a 20ms or a 200ms interval, from overturning the
estimated bandwidth. Without it, what is estimated after an hour would be a
mile from what disk vendors claim. And over a day I am inclined to ignore the
difference between the estimated BW and the bare metal BW, on the assumption
that no disk will be found idle while dirty pages are above the background
threshold.

Hillf
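
To put numbers on the formula quoted above, here is a minimal userspace
sketch in C. It is not the kernel code: estimate_bandwidth() and PERIOD_MS
are hypothetical names, time is kept in milliseconds instead of jiffies,
bandwidth is in pages per second, and the extra smoothing filters mentioned
above are left out. It only illustrates how a short sample moves the estimate
less than a long one.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the 3 s averaging period from the thread, in milliseconds. */
#define PERIOD_MS 3000u

/*
 * Hypothetical helper modelling
 *   bandwidth = (written + bandwidth * (period - elapsed)) / period
 * with 'written' in pages and 'bandwidth' in pages per second, so the
 * sample is scaled by 1000 to match the millisecond time base.
 */
static uint64_t estimate_bandwidth(uint64_t bandwidth, uint64_t written,
				   uint64_t elapsed_ms)
{
	/* A zero-length sample carries no rate information. */
	assert(elapsed_ms > 0);

	/* Sample covers the whole period: use its plain rate. */
	if (elapsed_ms >= PERIOD_MS)
		return written * 1000 / elapsed_ms;

	/* Weight the old estimate by the part of the period not sampled. */
	return (written * 1000 + bandwidth * (PERIOD_MS - elapsed_ms)) / PERIOD_MS;
}
```

Starting from 30000 pages/s, a 20ms sample with nothing written gives
estimate_bandwidth(30000, 0, 20) = 29800, while a 200ms sample with nothing
written gives 28000. A 20ms sample that matches the current rate (600 pages)
leaves the estimate at 30000.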