[RFC] Making memcg track ownership per address_space or anon_vma

* [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-01-30  4:43 Tejun Heo
  2015-01-30  5:55 ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30  4:43 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner,
	Jens Axboe, Christoph Hellwig, Li Zefan, gthelen, hughd,
	Konstantin Khebnikov

Hello,

Since the cgroup writeback patchset[1] have been posted, several
people brought up concerns about the complexity of allowing an inode
to be dirtied against multiple cgroups is necessary for the purpose of
writeback and it is true that a significant amount of complexity (note
that bdi still needs to be split, so it's still not trivial) can be
removed if we assume that an inode always belongs to one cgroup for
the purpose of writeback.

However, as mentioned before, this issue is directly linked to whether
memcg needs to track the memory ownership per-page.  If there are
valid use cases where the pages of an inode must be tracked to be
owned by different cgroups, cgroup writeback must be able to handle
that situation properly.  If there aren't no such cases, the cgroup
writeback support can be simplified but again we should put memcg on
the same cadence and enforce per-inode (or per-anon_vma) ownership
from the beginning.  The conclusion can be either way - per-page or
per-inode - but both memcg and blkcg must be looking at the same
picture.  Deviating them is highly likely to lead to long-term issues
forcing us to look at this again anyway, only with far more baggage.

One thing to note is that the per-page tracking which is currently
employed by memcg seems to have been born more out of conveninence
rather than requirements for any actual use cases.  Per-page ownership
makes sense iff pages of an inode have to be associated with different
cgroups - IOW, when an inode is accessed by multiple cgroups; however,
currently, memcg assigns a page to its instantiating memcg and leaves
it at that till the page is released.  This means that if a page is
instantiated by one cgroup and then subsequently accessed only by a
different cgroup, whether the page's charge gets moved to the cgroup
which is actively using it is purely incidental.  If the page gets
reclaimed and released at some point, it'll be moved.  If not, it
won't.

AFAICS, the only case where the current per-page accounting works
properly is when disjoint sections of an inode are used by different
cgroups and the whole thing hinges on whether this use case justifies
all the added overhead including page->mem_cgroup pointer and the
extra complexity in the writeback layer.  FWIW, I'm doubtful.
Johannes, Michal, Greg, what do you guys think?

If the above use case - a huge file being actively accssed disjointly
by multiple cgroups - isn't significant enough and there aren't other
use cases that I missed which can benefit from the per-page tracking
that's currently implemented, it'd be logical to switch to per-inode
(or per-anon_vma or per-slab) ownership tracking.  For the short term,
even just adding an extra ownership information to those containing
objects and inherting those to page->mem_cgroup could work although
it'd definitely be beneficial to eventually get rid of
page->mem_cgroup.

As with per-page, when the ownership terminates is debatable w/
per-inode tracking.  Also, supporting some form of shared accounting
across different cgroups may be useful (e.g. shared library's memory
being equally split among anyone who accesses it); however, these
aren't likely to be major and trying to do something smart may affect
other use cases adversely, so it'd probably be best to just keep it
dumb and clear the ownership when the inode loses all pages (a cgroup
can disown such inode through FADV_DONTNEED if necessary).

What do you guys think?  If making memcg track ownership at per-inode
level, even for just the unified hierarchy, is the direction we can
take, I'll go ahead and simplify the cgroup writeback patchset.

Thanks.

-- 
tejun

[1] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email-tj@kernel.org

^ permalink raw reply	[flat|nested] 31+ messages in thread