From: Hillf Danton
To: Jan Kara
Cc: mm, fsdev, Andrew Morton, linux, Roman Gushchin, Tejun Heo, Johannes Weiner, Shakeel Butt, Fengguang Wu, Hillf Danton, Minchan Kim, Mel Gorman
Subject: Re: [RFC] writeback: add elastic bdi in cgwb bdp
Date: Tue, 15 Oct 2019 22:03:56 +0800
Message-Id: <20191015140356.9256-1-hdanton@sina.com>
In-Reply-To: <20191012132740.12968-1-hdanton@sina.com>
References: <20191012132740.12968-1-hdanton@sina.com>

Hey Jan

On Tue, 15 Oct 2019 12:22:10 +0200 Jan Kara wrote:
> Hello,
>
> On Sat 12-10-19 21:27:40, Hillf Danton wrote:
> >
> > The behaviors of the elastic bdi (ebdi) observed in the current cgwb
> > bandwidth measurement include
> >
> > 1, like spinning disks on market ebdi can do ~128MB/s IOs in consecutive
> > minutes in few scenarios, or higher like SSD, or lower like USB key.
> >
> > 2, with ebdi a bdi_writeback, wb-A, is able to do 80MB/s writeouts in the
> > current time window of 200ms, while it was 16MB/s in the previous one.
> >
> > 3, it will be either 100MB/s in the next time window if wb-B joins wb-A
> > writing pages out or 18MB/s if wb-C also decides to chime in.
> >
> > With the help of bandwidth gauged above, what is left in balancing dirty
> > pages, bdp, is to try to make wb-A's laundry speed catch up dirty speed in
> > every 200ms interval without knowing what wb-B is doing.
> >
> > No heuristic is added in this work because ebdi does bdp without it.
>
> Thanks for the patch but honestly, I have hard time understanding what is
> the purpose of this patch from the changelog.

Fault on my side. I will try to make it as clear as I can.

Under the cover of "behaviors of elastic bdi" I listed the difficulties in
the current writeback bandwidth measurement, particularly with
CONFIG_CGROUP_WRITEBACK enabled. The phrase ebdi abstracts one attribute of
hardware such as spinning disks, SSDs and USB storage: their physical
bandwidth is a constant. The difference between that constant and the
bandwidth currently measured comes, I think, from the IO pattern dispatched
to the hardware in each 200ms interval.

How much sense does it make to guide wb-A's IO in the next 200ms with no
idea of what the other wbs are doing? What should be modeled and built on
top of the measured bw value? Hard to say.

What will bdp, OTOH, look like on top of ebdi without the hard work of
measuring bw? A name came up before I started typing this message, though it
was not available when I sent the RFC. In essence, ebdi lays a brick for
applying the walk-dog method to bdp: let wb-A's laundry speed walk its dirty
speed the same way pet owners in Paris, Prague and other cities walk their
dogs every day, with a leash worth two dimes on average.

Is a $200 electronic walkmeter needed to have a good time walking a dog in
London? Nope, I think, because it makes ant-eyelash-size sense to first
gauge the walker's speed with a gadget prone to glitches, then teach the dog
to walk at that speed, and then build more on top of it.

The only reason I have to do walk-dog in bdp is that laundry speed
remarkably falls behind dirty speed in every case of bdp, with no exception.
And a leash is supposed to do the job the way it naturally should, even
though laundry speed changes in every 200ms interval and is usually hard to
predict beforehand under real workloads, with the two things below met in
every 200ms:

1, dirty pages in the system are clamped near the threshold that is
configurable in userspace,

2, dirty speed of every wb is glued as close to its laundry speed as
possible in the long run.

Should walk-dog be in place, we can then clean up the bw measurement and the
things that depend on it, one step after another.

Thanks
Hillf

> Some kind of writeback throttling?
> And why is this needed?
> Also some highlevel description of what
> your solution is would be good...
>
> Honza
>
> > Cc: Roman Gushchin
> > Cc: Tejun Heo
> > Cc: Jan Kara
> > Cc: Johannes Weiner
> > Cc: Shakeel Butt
> > Cc: Minchan Kim
> > Cc: Mel Gorman
> > Signed-off-by: Hillf Danton
> > ---
> >
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -157,6 +157,9 @@ struct bdi_writeback {
> >  	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
> >  	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
> >  
> > +#ifdef CONFIG_CGWB_BDP_WITH_EBDI
> > +	struct wait_queue_head bdp_waitq;
> > +#endif
> >  	union {
> >  		struct work_struct release_work;
> >  		struct rcu_head rcu;
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -324,6 +324,10 @@ static int wb_init(struct bdi_writeback
> >  		goto out_destroy_stat;
> >  	}
> >  
> > +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +		init_waitqueue_head(&wb->bdp_waitq);
> > +
> >  	return 0;
> >  
> >  out_destroy_stat:
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1551,6 +1551,45 @@ static inline void wb_dirty_limits(struc
> >  	}
> >  }
> >  
> > +#if defined(CONFIG_CGROUP_WRITEBACK) && defined(CONFIG_CGWB_BDP_WITH_EBDI)
> > +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> > +{
> > +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> > +
> > +	if (fatal_signal_pending(current))
> > +		return false;
> > +
> > +	gdtc.avail = global_dirtyable_memory();
> > +
> > +	domain_dirty_limits(&gdtc);
> > +
> > +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> > +			global_node_page_state(NR_UNSTABLE_NFS) +
> > +			global_node_page_state(NR_WRITEBACK);
> > +
> > +	if (gdtc.dirty < gdtc.bg_thresh)
> > +		return false;
> > +
> > +	if (!writeback_in_progress(wb))
> > +		wb_start_background_writeback(wb);
> > +
> > +	/*
> > +	 * throttle if laundry speed remarkably falls behind dirty speed
> > +	 * in the current time window of 200ms
> > +	 */
> > +	return gdtc.dirty > gdtc.thresh &&
> > +		wb_stat(wb, WB_DIRTIED) >
> > +		wb_stat(wb, WB_WRITTEN) +
> > +		wb_stat_error();
> > +}
> > +
> > +static inline void cgwb_bdp(struct bdi_writeback *wb)
> > +{
> > +	wait_event_interruptible_timeout(wb->bdp_waitq,
> > +			!cgwb_bdp_should_throttle(wb), HZ);
> > +}
> > +#endif
> > +
> >  /*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> > @@ -1910,7 +1949,11 @@ void balance_dirty_pages_ratelimited(str
> >  	preempt_enable();
> >  
> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > -		balance_dirty_pages(wb, current->nr_dirtied);
> > +		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +			cgwb_bdp(wb);
> > +		else
> > +			balance_dirty_pages(wb, current->nr_dirtied);
> >  
> >  	wb_put(wb);
> >  }
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -632,6 +632,11 @@ void wbc_detach_inode(struct writeback_c
> >  	if (!wb)
> >  		return;
> >  
> > +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +		if (waitqueue_active(&wb->bdp_waitq))
> > +			wake_up_all(&wb->bdp_waitq);
> > +
> >  	history = inode->i_wb_frn_history;
> >  	avg_time = inode->i_wb_frn_avg_time;
> >  
> > @@ -811,6 +816,9 @@ static long wb_split_bdi_pages(struct bd
> >  	if (nr_pages == LONG_MAX)
> >  		return LONG_MAX;
> >  
> > +	if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +	    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +		return nr_pages;
> >  	/*
> >  	 * This may be called on clean wb's and proportional distribution
> >  	 * may not make sense, just use the original @nr_pages in those
> > @@ -1599,6 +1607,10 @@ static long writeback_chunk_size(struct
> >  	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> >  		pages = LONG_MAX;
> >  	else {
> > +		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) &&
> > +		    IS_ENABLED(CONFIG_CGWB_BDP_WITH_EBDI))
> > +			return work->nr_pages;
> > +
> >  		pages = min(wb->avg_write_bandwidth / 2,
> >  			    global_wb_domain.dirty_limit / DIRTY_SCOPE);
> >  		pages = min(pages, work->nr_pages);
> > --
> >
> --
> Jan Kara
> SUSE Labs, CR
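
For readers following the thread, the throttle test in
cgwb_bdp_should_throttle() can be modeled in userspace C roughly as below.
This is a minimal sketch under assumptions, not the patch itself: struct
dirty_sample, struct wb_sample, should_throttle() and simulate() are
hypothetical names, stat_error stands in for wb_stat_error(), and the
fatal-signal check and the background writeback kick are left out.
simulate() further assumes a single wb that owns every dirty page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global numbers the patch reads. */
struct dirty_sample {
	long dirty;     /* dirty + writeback pages in the system */
	long bg_thresh; /* background writeback threshold */
	long thresh;    /* hard dirty threshold */
};

/* Hypothetical stand-in for the per-wb counters. */
struct wb_sample {
	long dirtied;   /* pages dirtied via this wb (cf. WB_DIRTIED) */
	long written;   /* pages written by this wb (cf. WB_WRITTEN) */
};

/*
 * Throttle only when the system is over the hard threshold AND this wb
 * has dirtied more pages than it has written, beyond the counter error.
 */
static bool should_throttle(const struct dirty_sample *g,
			    const struct wb_sample *wb, long stat_error)
{
	assert(stat_error >= 0);

	if (g->dirty < g->bg_thresh)
		return false;

	return g->dirty > g->thresh &&
	       wb->dirtied > wb->written + stat_error;
}

/*
 * Toy model: each tick the dirtier adds dirty_rate pages unless it is
 * throttled, then the flusher writes up to write_rate pages.
 * Returns the peak dirty count seen at the end of any tick.
 */
static long simulate(long ticks, long dirty_rate, long write_rate,
		     long thresh, long stat_error)
{
	struct dirty_sample g = {
		.dirty = 0, .bg_thresh = thresh / 2, .thresh = thresh,
	};
	struct wb_sample wb = { .dirtied = 0, .written = 0 };
	long peak = 0;

	for (long t = 0; t < ticks; t++) {
		if (!should_throttle(&g, &wb, stat_error)) {
			wb.dirtied += dirty_rate;
			g.dirty += dirty_rate;
		}

		long done = g.dirty < write_rate ? g.dirty : write_rate;

		wb.written += done;
		g.dirty -= done;

		if (g.dirty > peak)
			peak = g.dirty;
	}
	return peak;
}
```

In this toy model a call such as simulate(1000, 10, 4, 100, 0) has the
dirtier producing pages faster than the flusher cleans them, yet the dirty
count stays within thresh + dirty_rate because the dirtier is stalled on
every tick that starts above the threshold. No bandwidth estimate appears
anywhere in the loop, which is the point the walk-dog analogy makes.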