From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC] writeback and cgroup
Date: Thu, 19 Apr 2012 14:31:18 -0400
Message-ID: <20120419183118.GM10216__39018.393882043$1334860306$gmane$org@redhat.com>
References: <20120403183655.GA23106@dhcp-172-17-108-109.mtv.corp.google.com>
	<20120404175124.GA8931@localhost>
	<20120404193355.GD29686@dhcp-172-17-108-109.mtv.corp.google.com>
	<20120406095934.GA10465@localhost>
	<20120417223854.GG19975@google.com>
	<20120419142343.GA12684@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Content-Disposition: inline
In-Reply-To: <20120419142343.GA12684@localhost>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/containers>,
	<mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/containers/>
List-Post: <mailto:containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
List-Help: <mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/containers>,
	<mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=subscribe>
Sender: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
To: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>, ctalbott-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>, rni-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, andrea-oIIqvOZpAevzfdHfmsDf5w@public.gmane.org, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, sjayaraman-IBi9RG/b67k@public.gmane.org, lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: containers.vger.kernel.org

On Thu, Apr 19, 2012 at 10:23:43PM +0800, Fengguang Wu wrote:

Hi Fengguang,

[..]
> > I don't know.  What problems?  AFAICS, the biggest issue is writeback
> > of different inodes getting mixed resulting in poor performance, but
> > if you think about it, that's about the frequency of switching cgroups
> > and a problem which can and should be dealt with from block layer
> > (e.g. use larger time slice if all the pending IOs are async).
> 
> Yeah increasing time slice would help that case. In general it's not
> merely the frequency of switching cgroup if take hard disk' writeback
> cache into account.  Think about some inodes with async IO: A1, A2,
> A3, .., and inodes with sync IO: D1, D2, D3, ..., all from different
> cgroups. So when the root cgroup holds all async inodes, the cfq may
> schedule IO interleavely like this
> 
>         A1,    A1,    A1,    A2,    A1,    A2,    ...
>            D1,    D2,    D3,    D4,    D5,    D6, ...
> 
> Now it becomes
> 
>         A1,    A2,    A3,    A4,    A5,    A6,    ...
>            D1,    D2,    D3,    D4,    D5,    D6, ...
> 
> The difference is that it's now switching the async inodes each time.
> At cfq level, the seek costs look the same, however the disk's
> writeback cache may help merge the data chunks from the same inode A1.
> Well, it may cost some latency for spin disks. But how about SSD? It
> can run deeper queue and benefit from large writes.

I am not sure what the point is here. Several things seem to be mixed up.

If we start putting async queues in separate groups (in an attempt to
provide fairness/service differentiation), then how much IO we dispatch
from one async inode will directly depend on the slice time of that
cgroup/queue. So if you want a longer dispatch from the same async inode,
increasing the slice time will help.

Also, the elevator merge logic increases the size of async IO requests
anyway, so big requests are submitted to the device.

If you expect that we keep dispatching requests from the same inode in
every dispatch cycle, then yes, that's not possible. Too long a slice
in the presence of sync IO is also not good. So if you are looking for
high throughput and are willing to sacrifice fairness, you can switch to
the mode where all async queues are put in a single root group. (Note: you
will have to switch between cgroups reasonably fast so that all the
cgroups are able to do some writeout within a time window.)

Writeback logic also submits a certain amount of writes from one inode
and then switches to the next inode in an attempt to provide fairness. The
same thing should be directly controllable by CFQ's notion of time slice,
that is, continue to dispatch async IO from a cgroup/inode for an extended
duration before switching. So what's the difference? One can achieve
equivalent behavior at either layer (writeback/CFQ).

> 
> > Writeback's duty is generating stream of async writes which can be
> > served efficiently for the *cgroup* and keeping the buffer filled as
> > necessary and chaining the backpressure from there to the actual
> > dirtier.  That's what writeback does without cgroup.  Nothing
> > fundamental changes with cgroup.  It's just finer grained.
> 
> Believe me, physically partitioning the dirty pages and async IO
> streams comes at big costs. It won't scale well in many ways.
> 
> For one instance, splitting the request queues will give rise to
> PG_writeback pages.  Those pages have been the biggest source of
> latency issues in the various parts of the system.

So PG_writeback pages are the ones which have been submitted for IO? Even
now we generate PG_writeback pages across multiple inodes as we submit
those pages for IO. By keeping the number of request descriptors per
group low, we can build back pressure early, and hence per inode/group
we will not have too many PG_writeback pages. IOW, the number of
PG_writeback pages will be controllable by the number of request
descriptors. So how does the situation become worse when CFQ puts them in
separate cgroups?
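
A rough sketch of the back-pressure argument, under loud simplifying
assumptions: the GroupQueue class is hypothetical, and it pretends that
one request descriptor carries exactly one page, which real requests do
not. The only point is that pages under writeback for a group cannot
exceed the descriptors that group is allowed to hold.

```python
class GroupQueue:
    """Hypothetical per-group request pool (not a kernel structure)."""

    def __init__(self, nr_requests):
        self.nr_requests = nr_requests  # descriptor limit for this group
        self.in_flight = 0              # stands in for PG_writeback pages

    def congested(self):
        # The group is congested once every descriptor is in use.
        return self.in_flight >= self.nr_requests

    def submit(self):
        """Try to submit one page; refuse when no descriptor is free."""
        if self.congested():
            return False  # back pressure: the submitter has to wait
        self.in_flight += 1
        return True

    def complete(self):
        """An IO finished: its page leaves writeback, a descriptor frees."""
        if self.in_flight > 0:
            self.in_flight -= 1


group = GroupQueue(nr_requests=4)
accepted = sum(group.submit() for _ in range(10))
print(accepted, group.in_flight)  # 4 4
```

Ten submission attempts result in only four pages under writeback; the
remaining six are pushed back to the submitter until IO completes.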

> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.

But we could still have a single flusher per bdi and just check the
write congestion state of each group and back off if it is congested.

So a single thread will still be doing IO submission. It will just
submit IO from multiple inodes/cgroups, which can cause additional seeks.
And that's the tradeoff of fairness. What I am not able to understand
is how you are avoiding this tradeoff by implementing things in the
writeback layer. To achieve more fairness among groups, even a flusher
thread will have to switch faster among cgroups/inodes.
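
The single-flusher idea could look roughly like the sketch below. This is
an illustration only, not the flusher thread's actual logic: flusher_pass,
the per-group dictionaries and the fixed batch size are assumptions I am
making up for the example. One thread walks all groups, skips the
congested ones and submits a bounded batch from the rest.

```python
def flusher_pass(pending, free_slots, batch):
    """One pass of a single, hypothetical flusher over all groups.

    pending:    dict mapping group name -> dirty pages waiting
    free_slots: dict mapping group name -> free request descriptors
    batch:      maximum pages submitted per group per pass
    Returns a list of (group, pages_submitted) in visiting order.
    """
    submitted = []
    for group in pending:
        if free_slots[group] == 0:
            continue  # group is congested: back off, try the next one
        count = min(batch, pending[group], free_slots[group])
        if count == 0:
            continue  # nothing dirty in this group
        pending[group] -= count
        free_slots[group] -= count
        submitted.append((group, count))
    return submitted


pending = {"cgA": 10, "cgB": 10, "cgC": 0}
free_slots = {"cgA": 0, "cgB": 3, "cgC": 5}
print(flusher_pass(pending, free_slots, batch=4))  # [('cgB', 3)]
```

A smaller batch means the thread switches between groups more often,
which is the same fairness versus seek tradeoff as before.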

Thanks
Vivek