linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
	Roman Gushchin <klamm@yandex-team.ru>, Jan Kara <jack@suse.cz>,
	Dave Chinner <david@fromorbit.com>,
	linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	linux-fsdevel@vger.kernel.org, koct9i@gmail.com
Subject: Re: [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller
Date: Fri, 16 Jan 2015 10:37:34 +0100	[thread overview]
Message-ID: <20150116093734.GD25884@quack.suse.cz> (raw)
In-Reply-To: <20150115180242.10450.92.stgit@buzz>

  Hello,

On Thu 15-01-15 21:49:10, Konstantin Khebnikov wrote:
> This is ressurection of my old RFC patch for dirty-set accounting cgroup [1]
> Now it's merged into memory cgroup and got bandwidth controller as a bonus.
> 
> That shows alternative solution: less accurate but much less monstrous than
> accurate page-based dirty-set controller from Tejun Heo.
> 
> Memory overhead: +1 pointer into struct address_space.
> Perfomance overhead is almost zero, no new locks added.
> 
> Idea is stright forward: link each inode to some cgroup when first dirty
> page appers and account all dirty pages to it. Writeback is implemented
> as single per-bdi writeback work which writes only inodes which belong
> to memory cgroups where amount of dirty memory is beyond thresholds.
> 
> Third patch adds trick for handling shared inodes which have dirty pages
> from several cgroups: it marks whole inode as shared and alters writeback
> filter for it.
> 
> The rest is an example of bandwith and iops controller build on top of that.
> Design is completely original, I bet nobody ever used task-works for that =)
  So I like the simplicity of your code but there are a few downsides too
(please correct me if I've got something wrong - btw the documentation of
high-level design would be welcome so that one doesn't have to understand
that from the patches):
1) The bandwidth controller simply accounts number of bytes submitted for
IO in submit_bio().  This doesn't reflect HW capabilities in any way. There
a huge difference between a process submitting single block random IO and a
process doing the same amount of sequential IO. This could be somewhat
dealt with by not accounting number of bytes but rather time it took to
complete a bio (but that somewhat complicates the code and block layer
already does similar counting so it would be good you used that).

2) The controller accounts bio to current task - that makes the limiting
useless for background writeback. You need to somehow propagate i_memcg
into submit_bio() so that IO is properly accounted.

3) The controller doesn't seem to guarantee any quality of service at the
IO level like blkcg does. The controller limits amount of IO userspace is
able to submit to kernel but only after we decide to submit the IO to disk.
So at that time cgroup may have generated lots of IO - e.g. by dirtying lots
of pages - and there's nothing to protect other cgroups from starvation
because of writeback of these pages.

Especially the last point seems to be essential to your approach (although
you could somewhat mitigate the issue by accounting the IO already when
it is entering the kernel) and I'm not sure whether that's really
acceptable for potential users of this feature.

								Honza
> [1] [PATCH RFC] fsio: filesystem io accounting cgroup
> http://marc.info/?l=linux-kernel&m=137331569501655&w=2
> 
> Patches also available here:
> https://github.com/koct9i/linux.git branch memcg_dirty_control
> 
> ---
> 
> Konstantin Khebnikov (6):
>       memcg: inode-based dirty and writeback pages accounting
>       memcg: dirty-set limiting and filtered writeback
>       memcg: track shared inodes with dirty pages
>       percpu_ratelimit: high-performance ratelimiting counter
>       delay-injection: resource management via procrastination
>       memcg: filesystem bandwidth controller
> 
> 
>  block/blk-core.c                 |    2 
>  fs/direct-io.c                   |    2 
>  fs/fs-writeback.c                |   22 ++
>  fs/inode.c                       |    1 
>  include/linux/backing-dev.h      |    1 
>  include/linux/fs.h               |   14 +
>  include/linux/memcontrol.h       |   27 +++
>  include/linux/percpu_ratelimit.h |   45 ++++
>  include/linux/sched.h            |    7 +
>  include/linux/writeback.h        |    1 
>  include/trace/events/sched.h     |    7 +
>  include/trace/events/writeback.h |    1 
>  kernel/sched/core.c              |   66 +++++++
>  kernel/sched/fair.c              |   12 +
>  lib/Makefile                     |    1 
>  lib/percpu_ratelimit.c           |  168 +++++++++++++++++
>  mm/memcontrol.c                  |  381 ++++++++++++++++++++++++++++++++++++++
>  mm/page-writeback.c              |   32 +++
>  mm/readahead.c                   |    2 
>  mm/truncate.c                    |    1 
>  mm/vmscan.c                      |    4 
>  21 files changed, 787 insertions(+), 10 deletions(-)
>  create mode 100644 include/linux/percpu_ratelimit.h
>  create mode 100644 lib/percpu_ratelimit.c
> 
> --
> Signature
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

  parent reply	other threads:[~2015-01-16  9:39 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-15 18:49 [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 1/6] memcg: inode-based dirty and writeback pages accounting Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 2/6] memcg: dirty-set limiting and filtered writeback Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 3/6] memcg: track shared inodes with dirty pages Konstantin Khebnikov
2015-01-15 18:55   ` Tejun Heo
2015-01-15 19:04     ` Konstantin Khlebnikov
2015-01-15 19:08       ` Tejun Heo
2015-01-15 18:49 ` [PATCH 4/6] percpu_ratelimit: high-performance ratelimiting counter Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 5/6] delay-injection: resource management via procrastination Konstantin Khebnikov
2015-01-15 18:49 ` [PATCH 6/6] memcg: filesystem bandwidth controller Konstantin Khebnikov
2015-01-16  9:37 ` Jan Kara [this message]
2015-01-16 12:33   ` [PATCHSET RFC 0/6] memcg: inode-based dirty-set controller Konstantin Khlebnikov
2015-01-16 14:25     ` Jan Kara
2015-01-29  1:21 ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150116093734.GD25884@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=cgroups@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=khlebnikov@yandex-team.ru \
    --cc=klamm@yandex-team.ru \
    --cc=koct9i@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).