From: Fengguang Wu <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>,
vgoyal@redhat.com, Jens Axboe <axboe@kernel.dk>,
linux-mm@kvack.org, sjayaraman@suse.com, andrea@betterlinux.com,
jmoyer@redhat.com, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, kamezawa.hiroyu@jp.fujitsu.com,
lizefan@huawei.com, containers@lists.linux-foundation.org,
cgroups@vger.kernel.org, ctalbott@google.com, rni@google.com,
lsf@lists.linux-foundation.org
Subject: Re: [RFC] writeback and cgroup
Date: Wed, 18 Apr 2012 15:58:14 +0800 [thread overview]
Message-ID: <20120418075814.GA3809@localhost> (raw)
In-Reply-To: <20120418065720.GA21485@quack.suse.cz>
On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote:
> On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
> ...
> > > > > Let's please keep the layering clear. IO limitations will be applied
> > > > > at the block layer and pressure will be formed there and then
> > > > > propagated upwards eventually to the originator. Sure, exposing the
> > > > > whole information might result in better behavior for certain
> > > > > workloads, but down the road, say, in three or five years, devices
> > > > > which can be shared without worrying too much about seeks might be
> > > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > > and sadly various cgroup support seems to be a prominent source of
> > > > > such design failures.
> > > >
> > > > Super fast storage is coming, which will make us regret making the
> > > > IO path over-complex. Spinning disks are not going away anytime soon.
> > > > I doubt Google is willing to afford the disk seek costs on its
> > > > millions of disks or has the patience to wait until all of its
> > > > spinning disks are switched to SSD years later (if that ever happens).
> > >
> > > This is new. Let's keep the damn employer out of the discussion.
> > > While the area I work on is affected by my employment (writeback isn't
> > > even my area BTW), I'm not gonna do something adverse to upstream even
> > > if it's beneficial to google and I'm much more likely to do something
> > > which may hurt google a bit if it's gonna benefit upstream.
> > >
> > > As for the faster / newer storage argument, that is *exactly* why we
> > > want to keep the layering proper. Writeback works from the pressure
> > > from the IO stack. If IO technology changes, we update the IO stack
> > > and writeback still works from the pressure. It may need to be
> > > adjusted but the principles don't change.
> >
> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you step
> > in to add *double* throttling in the block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls. This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() works just as well when given that
> > information, why bother doing the throttling at the low layer and
> > "pushing back" the pressure all the way up?
> Fengguang, maybe we should first agree on some basics:
> The two main goals of balance_dirty_pages() are (and always have been,
> AFAIK) to limit the amount of dirty pages in memory and to keep enough
> dirty pages in memory to allow for efficient writeback. Secondary goals
> are to keep the amount of dirty pages somewhat fair among bdis and
> processes. Agreed?
Agreed. In fact, before the IO-less change, balance_dirty_pages() had
little explicit control over the dirty rate and fairness.
> Thus the shift to trying to control *IO throughput* (or even just
> buffered write throughput) from balance_dirty_pages() is a fundamental
> shift in the goals of balance_dirty_pages(), not just some tweak
> (although technically, it might be relatively easy to do for buffered
> writes given the current implementation).
Yes, it has been a big shift to rate-based dirty control.
> ...
> > > Well, I tried and I hope some of it got through. I also wrote a lot
> > > of questions, mainly regarding how what you have in mind is supposed
> > > to work through what path. Maybe I'm just not seeing what you're
> > > seeing but I just can't see where all the IOs would go through and
> > > come together. Can you please elaborate more on that?
> >
> > What I can see is, it looks pretty simple and natural to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> >
> > - add direct IO accounting at some convenient point in the IO path:
> >   the submission or completion point, either is fine
> >
> > - change several lines of the buffered write IO controller to
> > integrate the direct IO rate into the formula to fit the "total
> > IO" limit
> >
> > - in future, add more accounting as well as feedback control to make
> > balance_dirty_pages() work with IOPS and disk time
> Sorry Fengguang but I also think this is a wrong way to go.
> balance_dirty_pages() must primarily control the amount of dirty pages.
> Trying to bend it to control IO throughput by including direct IO and
> reads in the accounting will just make the logic even more complex than it
> already is.
Right, I have been adding too much complexity to balance_dirty_pages().
The control algorithms are pretty hard to understand and get right for
all cases.
OK, I'll post results of my experiments up to now, answer some
questions and take a comfortable break. Phooo..
Thanks,
Fengguang