From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 45C86C433E1 for ; Fri, 15 May 2020 11:10:47 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 00A67206B6 for ; Fri, 15 May 2020 11:10:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 00A67206B6 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=suse.cz Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 8614E8E0008; Fri, 15 May 2020 07:10:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 819858E0005; Fri, 15 May 2020 07:10:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 74D958E0008; Fri, 15 May 2020 07:10:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0132.hostedemail.com [216.40.44.132]) by kanga.kvack.org (Postfix) with ESMTP id 5E9968E0005 for ; Fri, 15 May 2020 07:10:46 -0400 (EDT) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 26366180AD811 for ; Fri, 15 May 2020 11:10:46 +0000 (UTC) X-FDA: 76818685692.14.shoe91_6a4147857338 X-HE-Tag: shoe91_6a4147857338 X-Filterd-Recvd-Size: 7774 Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by imf17.hostedemail.com (Postfix) with ESMTP for ; Fri, 15 May 2020 11:10:45 +0000 (UTC) X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 3EC1AAE61; Fri, 15 May 2020 11:10:46 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 211BC1E056A; Fri, 15 May 2020 13:10:43 +0200 (CEST) Date: Fri, 15 May 2020 13:10:43 +0200 From: Jan Kara To: NeilBrown Cc: Jan Kara , Andrew Morton , Michal Hocko , linux-mm@kvack.org, linux-nfs@vger.kernel.org, LKML , Christoph Hellwig Subject: Re: [PATCH 1/2 V4] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE Message-ID: <20200515111043.GK9569@quack2.suse.cz> References: <87tv2b7q72.fsf@notabene.neil.brown.name> <87v9miydai.fsf@notabene.neil.brown.name> <87ftdgw58w.fsf@notabene.neil.brown.name> <87wo6gs26e.fsf@notabene.neil.brown.name> <87tv1ks24t.fsf@notabene.neil.brown.name> <20200416151906.GQ23739@quack2.suse.cz> <87zhb5r30c.fsf@notabene.neil.brown.name> <20200422124600.GH8775@quack2.suse.cz> <871rnob8z3.fsf@notabene.neil.brown.name> <87y2pw9udb.fsf@notabene.neil.brown.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87y2pw9udb.fsf@notabene.neil.brown.name> User-Agent: Mutt/1.10.1 (2018-07-13) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed 13-05-20 17:17:20, NeilBrown wrote: > > PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the > loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a > daemon needs to write to one bdi (the final bdi) in order to free up > writes queued to another bdi (the client bdi). > > The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty > pages, so that it can still dirty pages after other processses have been > throttled. The purpose of this is to avoid deadlock that happen when > the PF_LESS_THROTTLE process must write for any dirty pages to be freed, > but it is being thottled and cannot write. > > This approach was designed when all threads were blocked equally, > independently on which device they were writing to, or how fast it was. > Since that time the writeback algorithm has changed substantially with > different threads getting different allowances based on non-trivial > heuristics. This means the simple "add 25%" heuristic is no longer > reliable. > > The important issue is not that the daemon needs a *larger* dirty page > allowance, but that it needs a *private* dirty page allowance, so that > dirty pages for the "client" bdi that it is helping to clear (the bdi for > an NFS filesystem or loop block device etc) do not affect the throttling > of the deamon writing to the "final" bdi. > > This patch changes the heuristic so that the task is not throttled when > the bdi it is writing to has a dirty page count below below (or equal > to) the free-run threshold for that bdi. This ensures it will always be > able to have some pages in flight, and so will not deadlock. > > In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might > still be throttled by global threshold, but that is acceptable as it is > only the deadlock state that is interesting for this flag. > > This approach of "only throttle when target bdi is busy" is consistent > with the other use of PF_LESS_THROTTLE in current_may_throttle(), were > it causes attention to be focussed only on the target bdi. > > So this patch > - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, > - removes the 25% bonus that that flag gives, and > - If PF_LOCAL_THROTTLE is set, don't delay at all unless the > global and the local free-run thresholds are exceeded. > > Note that previously realtime threads were treated the same as > PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for > real-time threads, so it is now different from the behaviour of nfsd and > loop tasks. I don't know what is wanted for realtime. > > Acked-by: Chuck Lever (for nfsd change) > Signed-off-by: NeilBrown The idea looks good to me. It will still have the "threads may hit hard wall" behavior when BDI freerun threshold is crossed at the moment we are very close to the global limit but that should be rare. So I think for now this good enough. The patch looks good to me so feel free to add: Reviewed-by: Jan Kara Throttling patches usually go through mm tree so ask Andrew to pick them up. Honza > @@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && > (!mdtc || > m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { > - unsigned long intv = dirty_poll_interval(dirty, thresh); > - unsigned long m_intv = ULONG_MAX; > + unsigned long intv; > + unsigned long m_intv; > + > + free_running: > + intv = dirty_poll_interval(dirty, thresh); > + m_intv = ULONG_MAX; > > current->dirty_paused_when = now; > current->nr_dirtied = 0; > @@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > * Calculate global domain's pos_ratio and select the > * global dtc by default. > */ > - if (!strictlimit) > + if (!strictlimit) { > wb_dirty_limits(gdtc); > > + if ((current->flags & PF_LOCAL_THROTTLE) && > + gdtc->wb_dirty < > + dirty_freerun_ceiling(gdtc->wb_thresh, > + gdtc->wb_bg_thresh)) > + /* > + * LOCAL_THROTTLE tasks must not be throttled > + * when below the per-wb freerun ceiling. > + */ > + goto free_running; > + } > + > dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && > ((gdtc->dirty > gdtc->thresh) || strictlimit); > > @@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb, > * both global and memcg domains. Choose the one > * w/ lower pos_ratio. > */ > - if (!strictlimit) > + if (!strictlimit) { > wb_dirty_limits(mdtc); > > + if ((current->flags & PF_LOCAL_THROTTLE) && > + mdtc->wb_dirty < > + dirty_freerun_ceiling(mdtc->wb_thresh, > + mdtc->wb_bg_thresh)) > + /* > + * LOCAL_THROTTLE tasks must not be > + * throttled when below the per-wb > + * freerun ceiling. > + */ > + goto free_running; > + } > dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && > ((mdtc->dirty > mdtc->thresh) || strictlimit); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index a37c87b5aee2..b2f5deb3603c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > > /* > * If a kernel thread (such as nfsd for loop-back mounts) services > - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. > + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. > * In that case we should only throttle if the backing device it is > * writing to is congested. In other cases it is safe to throttle. > */ > static int current_may_throttle(void) > { > - return !(current->flags & PF_LESS_THROTTLE) || > + return !(current->flags & PF_LOCAL_THROTTLE) || > current->backing_dev_info == NULL || > bdi_write_congested(current->backing_dev_info); > } > -- > 2.26.2 > -- Jan Kara SUSE Labs, CR