From mboxrd@z Thu Jan  1 00:00:00 1970
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Subject: Re: [RFC] writeback and cgroup
Date: Tue, 17 Apr 2012 15:38:54 -0700
Message-ID: <20120417223854.GG19975__19490.8147248954$1334702353$gmane$org@google.com>
References: <20120403183655.GA23106@dhcp-172-17-108-109.mtv.corp.google.com>
	<20120404175124.GA8931@localhost>
	<20120404193355.GD29686@dhcp-172-17-108-109.mtv.corp.google.com>
	<20120406095934.GA10465@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Content-Disposition: inline
In-Reply-To: <20120406095934.GA10465@localhost>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/containers>,
	<mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/containers/>
List-Post: <mailto:containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
List-Help: <mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/containers>,
	<mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=subscribe>
Sender: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
To: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>, ctalbott-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>, rni-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, sjayaraman-IBi9RG/b67k@public.gmane.org, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
List-Id: containers.vger.kernel.org

Hello, Fengguang.

On Fri, Apr 06, 2012 at 02:59:34AM -0700, Fengguang Wu wrote:
> Fortunately, the above gap can be easily filled judging from the
> block/cfq IO controller code. By adding some direct IO accounting
> and changing several lines of my patches to make use of the collected
> stats, the semantics of the blkio.throttle.write_bps interfaces can be
> changed from "limit for direct IO" to "limit for direct+buffered IOs".
> Ditto for blkio.weight and blkio.write_iops, as long as some
> iops/device time stats are made available to balance_dirty_pages().
> 
> It would be a fairly *easy* change. :-) It's merely adding some
> accounting code and there is no need to change the block IO
> controlling algorithm at all. I'll do the work of accounting (which
> is basically independent of the IO controlling) and use the new stats
> in balance_dirty_pages().

I don't really understand how this can work.  For hard limits, maybe,
but for proportional IO, you have to know which cgroups have IOs
before assigning the proportions, so blkcg assigning IO bandwidth
without knowing async writes simply can't work.

For example, let's say cgroups A and B have 2:8 split.  If A has IOs
on queue and B doesn't, blkcg will assign all IO bandwidth to A.  I
can't wrap my head around how writeback is gonna make use of the
resulting stats but let's say it decides it needs to put out some IOs
out for both cgroups.  What happens then?  Do all the async writes go
through the root cgroup controlled by and affecting the ratio between
rootcg and cgroup A and B?  Or do they have to be accounted as part of
cgroups A and B?  If so, what if the added bandwidth goes over the
limit?  Let's say if we implement overcharge; then, I suppose we'll
have to communicate that upwards too, right?

This is still easy.  What about hierarchical propio?  What happens
then?  You can't do hierarchical proportional allocation without
knowing how much IOs are pending for which group.  How is that
information gonna be communicated between blkcg and writeback?  Are we
gonna have two separate hierarchical proportional IO allocators?  How
is that gonna work at all?  If we're gonna have single allocator in
block layer, writeback would have to feed the amount of IOs it may
generate into the allocator, get the resulting allocation and then
issue IO and then block layer again will have to account these to the
originating cgroups.  It's just crazy.

> The only problem I can see now, is that balance_dirty_pages() works
> per-bdi and blkcg works per-device. So the two ends may not match
> nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
> sdb is shared by lv0 and lv1. However it should be rare situations and
> be much more acceptable than the problems arise from the "push back"
> approach which impacts everyone.

I don't know.  What problems?  AFAICS, the biggest issue is writeback
of different inodes getting mixed resulting in poor performance, but
if you think about it, that's about the frequency of switching cgroups
and a problem which can and should be dealt with from block layer
(e.g. use larger time slice if all the pending IOs are async).

Writeback's duty is generating stream of async writes which can be
served efficiently for the *cgroup* and keeping the buffer filled as
necessary and chaining the backpressure from there to the actual
dirtier.  That's what writeback does without cgroup.  Nothing
fundamental changes with cgroup.  It's just finer grained.

> > No, no, it's not about standing in my way.  As Vivek said in the other
> > reply, it's that the "gap" that you filled was created *because*
> > writeback wasn't cgroup aware and now you're in turn filling that gap
> > by making writeback work around that "gap".  I mean, my mind boggles.
> > Doesn't yours?  I strongly believe everyone's should.
> 
> Heh. It's a hard problem indeed. I felt great pains in the IO-less
> dirty throttling work. I did a lot reasoning about it, and have in
> fact kept cgroup IO controller in mind since its early days. Now I'd
> say it's hands down for it to adapt to the gap between the total IO
> limit and what's carried out by the block IO controller.

You're not providing any valid counter arguments about the issues
being raised about the messed up design.  How is anything "hands down"
here?

> > There's where I'm confused.  How is the said split supposed to work?
> > They aren't independent.  I mean, who gets to decide what and where
> > are those decisions enforced?
> 
> Yeah it's not independent. It's about
> 
> - keep block IO cgroup untouched (in its current algorithm, for
>   throttling direct IO)
> 
> - let balance_dirty_pages() adapt to the throttling target
>   
>         buffered_write_limit = total_limit - direct_IOs

Think about proportional allocation.  You don't have a number until
you know who have pending IOs and how much.

> To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> It's always there doing 1:1 proportional throttling. Then you try to
> kick in to add *double* throttling in block/cfq layer. Now the low
> layer may enforce 10:1 throttling and push balance_dirty_pages() away
> from its balanced state, leading to large fluctuations and program
> stalls.

Just do the same 1:1 inside each cgroup.

>  This can be avoided by telling balance_dirty_pages(): "your
> balance goal is no longer 1:1, but 10:1". With this information
> balance_dirty_pages() will behave right. Then there is the question:
> if balance_dirty_pages() will work just well provided the information,
> why bother doing the throttling at low layer and "push back" the
> pressure all the way up?

Because splitting a resource into two pieces arbitrarily with
different amount of consumptions on each side and then applying the
same proportion on both doesn't mean anything?

> The balance_dirty_pages() is already deeply involved in dirty throttling.
> As you can see from this patchset, the same algorithms can be extended
> trivially to work with cgroup IO limits.
> 
> buffered write IO controller in balance_dirty_pages()
> https://lkml.org/lkml/2012/3/28/275

It is half broken thing with fundamental design flaws which can't be
corrected without complete reimplementation.  I don't know what to
say.

> In the "back pressure" scheme, memcg is a must because only it has all
> the infrastructure to track dirty pages upon which you can apply some
> dirty_limits. Don't tell me you want to account dirty pages in blkcg...

For now, per-inode tracking seems good enough.

> What I can see is, it looks pretty simple and nature to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting in some convenient point of the IO path
>   IO submission or completion point, either is fine.
> 
> - change several lines of the buffered write IO controller to
>   integrate the direct IO rate into the formula to fit the "total
>   IO" limit
> 
> - in future, add more accounting as well as feedback control to make
>   balance_dirty_pages() work with IOPS and disk time

To me, you seem to be not addressing the issues I've been raising at
all and just repeating the same points again and again.  If I'm
misunderstanding something, please point out.

Thanks.

-- 
tejun