linux-kernel.vger.kernel.org archive mirror
* [RFC] Making memcg track ownership per address_space or anon_vma
@ 2015-01-30  4:43 Tejun Heo
  2015-01-30  5:55 ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30  4:43 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: cgroups, linux-mm, linux-kernel, Jan Kara, Dave Chinner,
	Jens Axboe, Christoph Hellwig, Li Zefan, gthelen, hughd,
	Konstantin Khlebnikov

Hello,

Since the cgroup writeback patchset[1] was posted, several people
have raised concerns about its complexity, questioning whether
allowing an inode to be dirtied against multiple cgroups is necessary
for the purpose of writeback at all.  It is true that a significant
amount of complexity (note that the bdi still needs to be split, so
it's still not trivial) can be removed if we assume that an inode
always belongs to one cgroup for the purpose of writeback.

However, as mentioned before, this issue is directly linked to
whether memcg needs to track memory ownership per-page.  If there are
valid use cases where the pages of an inode must be tracked as owned
by different cgroups, cgroup writeback must be able to handle that
situation properly.  If there are no such cases, the cgroup writeback
support can be simplified, but then we should put memcg on the same
footing and enforce per-inode (or per-anon_vma) ownership from the
beginning.  The conclusion can be either way - per-page or per-inode
- but both memcg and blkcg must be looking at the same picture.
Letting the two deviate is highly likely to lead to long-term issues
forcing us to look at this again anyway, only with far more baggage.

One thing to note is that the per-page tracking currently employed
by memcg seems to have been born more out of convenience than out of
requirements from any actual use case.  Per-page ownership makes
sense iff pages of an inode have to be associated with different
cgroups - IOW, when an inode is accessed by multiple cgroups;
however, currently, memcg assigns a page to its instantiating memcg
and leaves it at that till the page is released.  This means that if
a page is instantiated by one cgroup and then subsequently accessed
only by a different cgroup, whether the page's charge gets moved to
the cgroup which is actively using it is purely incidental.  If the
page gets reclaimed and released at some point, the charge will move
to the new user when the page is faulted back in.  If not, it won't.
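
In pseudo-C, the current rule is simply (a sketch with made-up names,
not the actual memcg code):

	struct mem_cgroup;

	struct page_charge {
		struct mem_cgroup *memcg;	/* NULL until first fault */
	};

	/* the first cgroup to instantiate the page is charged and stays
	   charged until the page is freed, no matter who touches it
	   afterwards */
	static struct mem_cgroup *charge_on_touch(struct page_charge *pc,
						  struct mem_cgroup *toucher)
	{
		if (!pc->memcg)		/* instantiation (fault-in) */
			pc->memcg = toucher;
		return pc->memcg;	/* sticks until the page is freed */
	}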

AFAICS, the only case where the current per-page accounting works
properly is when disjoint sections of an inode are used by different
cgroups, and the whole thing hinges on whether this use case
justifies all the added overhead, including the page->mem_cgroup
pointer and the extra complexity in the writeback layer.  FWIW, I'm
doubtful.  Johannes, Michal, Greg, what do you guys think?

If the above use case - a huge file being actively accessed
disjointly by multiple cgroups - isn't significant enough and there
aren't other use cases I've missed which can benefit from the
per-page tracking that's currently implemented, it'd be logical to
switch to per-inode (or per-anon_vma or per-slab) ownership tracking.
For the short term, even just adding extra ownership information to
those containing objects and inheriting it into page->mem_cgroup
could work, although it'd definitely be beneficial to eventually get
rid of page->mem_cgroup.

As with per-page tracking, when the ownership terminates is
debatable w/ per-inode tracking.  Also, supporting some form of
shared accounting across different cgroups may be useful (e.g. a
shared library's memory being split equally among everyone who
accesses it); however, these aren't likely to be major cases and
trying to do something smart may affect other use cases adversely,
so it'd probably be best to just keep it dumb and clear the
ownership when the inode loses all its pages (a cgroup can disown
such an inode through FADV_DONTNEED if necessary).
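
In userspace that maps to posix_fadvise() (a minimal sketch; note
that POSIX_FADV_DONTNEED only drops clean pages, so dirty data would
need a sync first):

	#include <fcntl.h>
	#include <unistd.h>

	/* drop the whole file's page cache, and with it its charges */
	static int disown_file(const char *path)
	{
		int fd = open(path, O_RDONLY);
		int ret;

		if (fd < 0)
			return -1;
		/* offset 0, len 0 means "to the end of the file" */
		ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		close(fd);
		return ret;
	}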

What do you guys think?  If making memcg track ownership at per-inode
level, even for just the unified hierarchy, is the direction we can
take, I'll go ahead and simplify the cgroup writeback patchset.

Thanks.

-- 
tejun

[1] http://lkml.kernel.org/g/1420579582-8516-1-git-send-email-tj@kernel.org


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30  4:43 [RFC] Making memcg track ownership per address_space or anon_vma Tejun Heo
@ 2015-01-30  5:55 ` Greg Thelen
  2015-01-30  6:27   ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-01-30  5:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
	Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	hughd, Konstantin Khlebnikov


On Thu, Jan 29 2015, Tejun Heo wrote:

> Hello,
>
> Since the cgroup writeback patchset[1] was posted, several people
> have raised concerns about its complexity, questioning whether
> allowing an inode to be dirtied against multiple cgroups is necessary
> for the purpose of writeback at all.  It is true that a significant
> amount of complexity (note that the bdi still needs to be split, so
> it's still not trivial) can be removed if we assume that an inode
> always belongs to one cgroup for the purpose of writeback.
>
> However, as mentioned before, this issue is directly linked to
> whether memcg needs to track memory ownership per-page.  If there are
> valid use cases where the pages of an inode must be tracked as owned
> by different cgroups, cgroup writeback must be able to handle that
> situation properly.  If there are no such cases, the cgroup writeback
> support can be simplified, but then we should put memcg on the same
> footing and enforce per-inode (or per-anon_vma) ownership from the
> beginning.  The conclusion can be either way - per-page or per-inode
> - but both memcg and blkcg must be looking at the same picture.
> Letting the two deviate is highly likely to lead to long-term issues
> forcing us to look at this again anyway, only with far more baggage.
>
> One thing to note is that the per-page tracking currently employed
> by memcg seems to have been born more out of convenience than out of
> requirements from any actual use case.  Per-page ownership makes
> sense iff pages of an inode have to be associated with different
> cgroups - IOW, when an inode is accessed by multiple cgroups;
> however, currently, memcg assigns a page to its instantiating memcg
> and leaves it at that till the page is released.  This means that if
> a page is instantiated by one cgroup and then subsequently accessed
> only by a different cgroup, whether the page's charge gets moved to
> the cgroup which is actively using it is purely incidental.  If the
> page gets reclaimed and released at some point, the charge will move
> to the new user when the page is faulted back in.  If not, it won't.
>
> AFAICS, the only case where the current per-page accounting works
> properly is when disjoint sections of an inode are used by different
> cgroups, and the whole thing hinges on whether this use case
> justifies all the added overhead, including the page->mem_cgroup
> pointer and the extra complexity in the writeback layer.  FWIW, I'm
> doubtful.  Johannes, Michal, Greg, what do you guys think?
>
> If the above use case - a huge file being actively accessed
> disjointly by multiple cgroups - isn't significant enough and there
> aren't other use cases I've missed which can benefit from the
> per-page tracking that's currently implemented, it'd be logical to
> switch to per-inode (or per-anon_vma or per-slab) ownership tracking.
> For the short term, even just adding extra ownership information to
> those containing objects and inheriting it into page->mem_cgroup
> could work, although it'd definitely be beneficial to eventually get
> rid of page->mem_cgroup.
>
> As with per-page tracking, when the ownership terminates is
> debatable w/ per-inode tracking.  Also, supporting some form of
> shared accounting across different cgroups may be useful (e.g. a
> shared library's memory being split equally among everyone who
> accesses it); however, these aren't likely to be major cases and
> trying to do something smart may affect other use cases adversely,
> so it'd probably be best to just keep it dumb and clear the
> ownership when the inode loses all its pages (a cgroup can disown
> such an inode through FADV_DONTNEED if necessary).
>
> What do you guys think?  If making memcg track ownership at per-inode
> level, even for just the unified hierarchy, is the direction we can
> take, I'll go ahead and simplify the cgroup writeback patchset.
>
> Thanks.

I find simplification appealing.  But I'm not sure it will fly, if
for no other reason than the shared accounting.  I'm ignoring
intentional sharing, used by carefully crafted apps, and just
thinking about incidental sharing (e.g. libc).

Example:

$ mkdir small
$ echo 1M > small/memory.limit_in_bytes
$ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &

$ mkdir big
$ echo 10G > big/memory.limit_in_bytes
$ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &

Assuming big/mlockall_database mlocks all of libc, it will oom kill
the small memcg because libc is owned by small, which touched it
first.  It'd be hard to figure out what small did wrong to deserve
the oom kill.

FWIW we've been using memcg writeback where inodes have a memcg
writeback owner.  Once multiple memcgs write to an inode, the inode
becomes writeback-shared, which makes it more likely to be written.
Once cleaned, the inode is again able to be privately owned:
https://lkml.org/lkml/2011/8/17/200
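
A toy model of that scheme (illustrative names only, not the actual
patch code):

	struct mem_cgroup;

	/* an inode has one writeback owner until a second memcg
	   dirties it; then it's marked shared until cleaned again */
	struct inode_wb {
		struct mem_cgroup *wb_owner;	/* NULL while clean */
		int		   wb_shared;
	};

	static void note_dirty(struct inode_wb *iwb, struct mem_cgroup *memcg)
	{
		if (!iwb->wb_owner)
			iwb->wb_owner = memcg;	/* first dirtier owns it */
		else if (iwb->wb_owner != memcg)
			iwb->wb_shared = 1;	/* written back sooner */
	}

	static void note_clean(struct inode_wb *iwb)
	{
		iwb->wb_owner = NULL;		/* private ownership is */
		iwb->wb_shared = 0;		/* possible once again */
	}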


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30  5:55 ` Greg Thelen
@ 2015-01-30  6:27   ` Tejun Heo
  2015-01-30 16:07     ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30  6:27 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
	Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	hughd, Konstantin Khlebnikov

Hello, Greg.

On Thu, Jan 29, 2015 at 09:55:53PM -0800, Greg Thelen wrote:
> I find simplification appealing.  But I'm not sure it will fly, if
> for no other reason than the shared accounting.  I'm ignoring
> intentional sharing, used by carefully crafted apps, and just
> thinking about incidental sharing (e.g. libc).
> 
> Example:
> 
> $ mkdir small
> $ echo 1M > small/memory.limit_in_bytes
> $ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &
> 
> $ mkdir big
> $ echo 10G > big/memory.limit_in_bytes
> $ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &
> 
> Assuming big/mlockall_database mlocks all of libc, it will oom kill
> the small memcg because libc is owned by small, which touched it
> first.  It'd be hard to figure out what small did wrong to deserve
> the oom kill.

The previous behavior was pretty unpredictable in terms of shared file
ownership too.  I wonder whether the better thing to do here is either
charging cases like this to the common ancestor or splitting the
charge equally among the accessors, which might be doable for ro
files.

> FWIW we've been using memcg writeback where inodes have a memcg
> writeback owner.  Once multiple memcg write to an inode then the inode
> becomes writeback shared which makes it more likely to be written.  Once
> cleaned the inode is then again able to be privately owned:
> https://lkml.org/lkml/2011/8/17/200

The problem is that it introduces deviations between memcg and
writeback / blkcg which will mess up pressure propagation.  Writeback
pressure can't be determined without its associated memcg and neither
can dirty balancing.  We sure can simplify things by trading off
accuracy in places, but let's please try to do that throughout the
stack, not at some midpoint, so that we can say "if you do this,
it'll behave this way and you can see it showing up there".  The
thing is, if we leave it half-way, in time some will try to actively
exploit memcg's page granularity and we'll have to deal with
writeback behavior which is difficult to even characterize.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30  6:27   ` Tejun Heo
@ 2015-01-30 16:07     ` Tejun Heo
  2015-02-02 19:26       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-01-30 16:07 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
	Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	hughd, Konstantin Khlebnikov

Hey, again.

On Fri, Jan 30, 2015 at 01:27:37AM -0500, Tejun Heo wrote:
> The previous behavior was pretty unpredictable in terms of shared file
> ownership too.  I wonder whether the better thing to do here is either
> charging cases like this to the common ancestor or splitting the
> charge equally among the accessors, which might be doable for ro
> files.

I've been thinking more about this.  It's true that doing per-page
association allows for avoiding confronting the worst side effects of
inode sharing head-on, but it is a tradeoff with fairly weak
justifications.  The only thing we're gaining is side-stepping the
brunt of the problem in an awkward manner, and the loss of clarity in
taking this compromised position has nasty ramifications when we try
to connect it with the rest of the world.

I could be missing something major, but the more I think about it,
the more it looks to me that the right thing to do here is accounting
per-inode and charging shared inodes to the nearest common ancestor.
The resulting behavior would be way more logical and predictable than
the current one, which would make it straightforward to integrate
memcg with blkcg and writeback.

One of the problems that I can think of off the top of my head is
that it'd involve more regular use of charge moving; however, this is
an operation which is per-inode rather than per-page and still gonna
be fairly infrequent.  Another one is that if we move memcg over to
this behavior, it's likely to affect the behavior on the traditional
hierarchies too, as we sure as hell don't want to switch between the
two major behaviors dynamically; but given that behaviors on inode
sharing aren't very well supported yet, this can be an acceptable
change.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-01-30 16:07     ` Tejun Heo
@ 2015-02-02 19:26       ` Konstantin Khlebnikov
  2015-02-02 19:46         ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-02 19:26 UTC (permalink / raw)
  To: Tejun Heo, Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, cgroups, linux-mm, linux-kernel,
	Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	hughd

On 30.01.2015 19:07, Tejun Heo wrote:
> Hey, again.
>
> On Fri, Jan 30, 2015 at 01:27:37AM -0500, Tejun Heo wrote:
>> The previous behavior was pretty unpredictable in terms of shared file
>> ownership too.  I wonder whether the better thing to do here is either
>> charging cases like this to the common ancestor or splitting the
>> charge equally among the accessors, which might be doable for ro
>> files.
>
> I've been thinking more about this.  It's true that doing per-page
> association allows for avoiding confronting the worst side effects of
> inode sharing head-on, but it is a tradeoff with fairly weak
> justifications.  The only thing we're gaining is side-stepping the
> brunt of the problem in an awkward manner, and the loss of clarity in
> taking this compromised position has nasty ramifications when we try
> to connect it with the rest of the world.
>
> I could be missing something major, but the more I think about it,
> the more it looks to me that the right thing to do here is accounting
> per-inode and charging shared inodes to the nearest common ancestor.
> The resulting behavior would be way more logical and predictable than
> the current one, which would make it straightforward to integrate
> memcg with blkcg and writeback.
>
> One of the problems that I can think of off the top of my head is
> that it'd involve more regular use of charge moving; however, this is
> an operation which is per-inode rather than per-page and still gonna
> be fairly infrequent.  Another one is that if we move memcg over to
> this behavior, it's likely to affect the behavior on the traditional
> hierarchies too, as we sure as hell don't want to switch between the
> two major behaviors dynamically; but given that behaviors on inode
> sharing aren't very well supported yet, this can be an acceptable
> change.
>
> Thanks.
>

Well... that might work.

Per-inode/anon_vma memcg will be much more predictable for sure.

In some cases the memory cgroup for an inode might be assigned
statically.  For example, database files might be pinned to a special
cgroup and protected with a low limit (soft guarantee, or whatever
it's called nowadays).

For overlay-fs-like containers it might be reasonable to keep the
shared template area in a separate memory cgroup (keep a cgroup mark
on the bind-mounted vfsmount?).

Removing the memcg pointer from struct page might be tricky.  It's
not clear what to do with truncated pages: either link them into the
lru differently or remove them from the lru right at truncation.
Swap cache pages have the same problem.

The process of moving inodes from memcg to memcg is more or less
doable.  A possible solution: keep two memcg pointers at the inode,
"old" and "new".  Each page is accounted (and linked into the
corresponding lru) to one of them.  Separating "old" and "new" pages
could be done by a flag on struct page or by a border page index
stored in the inode: pages with index < border are accounted to the
new memcg, the rest to the old.
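
Roughly, as a sketch (hypothetical names, not the real kernel
structures):

	struct mem_cgroup;	/* opaque here */

	/* per-inode state while its pages migrate from old to new */
	struct inode_memcg_state {
		struct mem_cgroup *old;		/* being drained */
		struct mem_cgroup *new;		/* being filled */
		unsigned long	   border;	/* first index not yet moved */
	};

	/* which memcg owns (and lru-links) the page at this index */
	static struct mem_cgroup *owner_of(const struct inode_memcg_state *s,
					   unsigned long index)
	{
		return index < s->border ? s->new : s->old;
	}

	/* the mover recharges a chunk of pages and advances the border */
	static void move_step(struct inode_memcg_state *s, unsigned long chunk)
	{
		/* ... recharge pages [border, border + chunk) ... */
		s->border += chunk;
	}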


Keeping shared inodes in the common ancestor is reasonable.
We could schedule asynchronous moving when somebody opens or mmaps an
inode from outside of its current cgroup.  But it's not clear when an
inode should be moved in the opposite direction: when an inode should
become private again, and how to detect that it's no longer shared.

For example, each inode could keep yet another pointer to a memcg
tracking the subtree of cgroups from which it was accessed in the
past 5 minutes or so.  Periodically that information would be fed to
the moving thread.

Actually I don't see any options other than that time-based
estimation: tracking all cgroups for each inode is too expensive, and
moving pages from one lru to another is expensive too.  So moving
inodes back and forth at each access from the outside world is not an
option.  It should be a rare operation which runs in the background
or in the reclaimer.

-- 
Konstantin


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-02 19:26       ` Konstantin Khlebnikov
@ 2015-02-02 19:46         ` Tejun Heo
  2015-02-03 23:30           ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-02 19:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, hughd

Hey,

On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
> Removing the memcg pointer from struct page might be tricky.  It's
> not clear what to do with truncated pages: either link them into the
> lru differently or remove them from the lru right at truncation.
> Swap cache pages have the same problem.

Hmmm... idk, maybe play another trick with low bits of page->mapping
and make it point to the cgroup after truncation?  Do we even care
tho?  Can't we just push them to the root and forget about them?  They
are pretty transient after all, no?
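
Something like this, as a sketch (the kernel already tags low bits of
page->mapping, e.g. PAGE_MAPPING_ANON in bit 0; PAGE_MAPPING_MEMCG
below is made up for illustration):

	#include <stdint.h>

	#define PAGE_MAPPING_MEMCG	0x2UL	/* bit 0 is taken by ANON */

	/* store a memcg pointer in the mapping slot, tagged */
	static inline void *memcg_mapping(void *memcg)
	{
		return (void *)((uintptr_t)memcg | PAGE_MAPPING_MEMCG);
	}

	static inline int mapping_is_memcg(void *mapping)
	{
		return (uintptr_t)mapping & PAGE_MAPPING_MEMCG;
	}

	/* recover the untagged pointer; works because the structs are
	   word-aligned, so their low address bits are always zero */
	static inline void *mapping_ptr(void *mapping)
	{
		return (void *)((uintptr_t)mapping & ~PAGE_MAPPING_MEMCG);
	}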

> The process of moving inodes from memcg to memcg is more or less
> doable.  A possible solution: keep two memcg pointers at the inode,
> "old" and "new".  Each page is accounted (and linked into the
> corresponding lru) to one of them.  Separating "old" and "new" pages
> could be done by a flag on struct page or by a border page index
> stored in the inode: pages with index < border are accounted to the
> new memcg, the rest to the old.

Yeah, pretty much the same scheme that the per-page cgroup writeback
is using with the lower bits of page->mem_cgroup should work, with
those bits moved to page->flags.

> Keeping shared inodes in the common ancestor is reasonable.
> We could schedule asynchronous moving when somebody opens or mmaps an
> inode from outside of its current cgroup.  But it's not clear when an
> inode should be moved in the opposite direction: when an inode should
> become private again, and how to detect that it's no longer shared.
>
> For example, each inode could keep yet another pointer to a memcg
> tracking the subtree of cgroups from which it was accessed in the
> past 5 minutes or so.  Periodically that information would be fed to
> the moving thread.
>
> Actually I don't see any options other than that time-based
> estimation: tracking all cgroups for each inode is too expensive, and
> moving pages from one lru to another is expensive too.  So moving
> inodes back and forth at each access from the outside world is not an
> option.  It should be a rare operation which runs in the background
> or in the reclaimer.

Right, what strategy to use for migration is up for debate, even for
moving to the common ancestor - e.g. should we do that on the first
access?  In the other direction, it gets more interesting.  Let's say
we decide to move an inode back to a descendant - what if that
triggers an OOM condition?  Do we still go through with it and cause
OOM in the target?  Do we even want automatic moving in this
direction?

For explicit cases, userland can do FADV_DONTNEED, I suppose.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-02 19:46         ` Tejun Heo
@ 2015-02-03 23:30           ` Greg Thelen
  2015-02-04 10:49             ` Konstantin Khlebnikov
  2015-02-04 17:06             ` Tejun Heo
  0 siblings, 2 replies; 31+ messages in thread
From: Greg Thelen @ 2015-02-03 23:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <tj@kernel.org> wrote:
> Hey,
>
> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>
>> Keeping shared inodes in the common ancestor is reasonable.
>> We could schedule asynchronous moving when somebody opens or mmaps an
>> inode from outside of its current cgroup.  But it's not clear when an
>> inode should be moved in the opposite direction: when an inode should
>> become private again, and how to detect that it's no longer shared.
>>
>> For example, each inode could keep yet another pointer to a memcg
>> tracking the subtree of cgroups from which it was accessed in the
>> past 5 minutes or so.  Periodically that information would be fed to
>> the moving thread.
>>
>> Actually I don't see any options other than that time-based
>> estimation: tracking all cgroups for each inode is too expensive, and
>> moving pages from one lru to another is expensive too.  So moving
>> inodes back and forth at each access from the outside world is not an
>> option.  It should be a rare operation which runs in the background
>> or in the reclaimer.
>
> Right, what strategy to use for migration is up for debate, even for
> moving to the common ancestor - e.g. should we do that on the first
> access?  In the other direction, it gets more interesting.  Let's say
> we decide to move an inode back to a descendant - what if that
> triggers an OOM condition?  Do we still go through with it and cause
> OOM in the target?  Do we even want automatic moving in this
> direction?
>
> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>
> Thanks.
>
> --
> tejun

I don't have any killer objections; most of my worries are isolation concerns.

If a machine has several top-level memcgs trying to get some form of
isolation (using low, min, or soft limits), then a shared libc will
be moved to the root memcg where it's not protected from global
memory pressure.  At least with the current per-page accounting such
shared pages often land in some protected memcg.

If two cgroups collude they can use more memory than their limit and
oom the entire machine.  Admittedly the current per-page system isn't
perfect because deleting a memcg which contains mlocked memory
(referenced by a remote memcg) moves the mlocked memory to root
resulting in the same issue.  But I'd argue this is more likely with
the RFC because it doesn't involve the cgroup deletion/reparenting.  A
possible tweak to shore up the current system is to move such mlocked
pages to the memcg of the surviving locker.  When the machine is oom
it's often nice to examine memcg state to determine which container is
using the memory.  Tracking down who's contributing to a shared
container is non-trivial.

I actually have a set of patches which add a memcg=M mount option to
memory backed file systems.  I was planning on proposing them,
regardless of this RFC, and this discussion makes them even more
appealing.  If we go in this direction, then we'd need a similar
notion for disk based filesystems.  As Konstantin suggested, it'd be
really nice to specify charge policy on a per file, or directory, or
bind mount basis.  This allows shared files to be deterministically
charged to a known container.  We'd need to flesh out the policies:
e.g. if two bind mound each specify different charge targets for the
same inode, I guess we just pick one.  Though the nature of this
catch-all shared container is strange.  Presumably a machine manager
would need to create it as an unlimited container (or at least as big
as the sum of all shared files) so that any app which decided it wants
to mlock all shared files has a way to do so without ooming the shared
container.  In the current per-page approach it's possible to lock
shared libs.  But the machine manager would need to decide how much
system ram to set aside for this catch-all shared container.
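
Roughly what I have in mind (a sketch; the memcg= option is from the
unposted patches, so its name and semantics here are illustrative
only):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* charge every page of this tmpfs to the named memcg,
		   no matter which cgroup faults the pages in */
		if (mount("tmpfs", "/mnt/shared", "tmpfs", 0,
			  "size=1G,memcg=/sys/fs/cgroup/memory/shared"))
			perror("mount");
		return 0;
	}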

When there's large incidental sharing, then things get sticky.  A
periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
a small container would pull all pages to the root memcg where they
are exposed to root pressure, which breaks isolation.  This is
concerning.  Perhaps such accesses could be decorated with something
like O_NO_MOVEMEM.

So this RFC change will introduce significant change for user space
machine managers and perturb isolation.  Is the resulting system
better?  It's not clear; it's the devil known vs the devil unknown.
Maybe it'd be easier if the memcgs I'm talking about were not allowed
to share page cache (aka copy-on-read) even for files which are
jointly visible.  That would provide today's interface while avoiding
the problematic sharing.


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-03 23:30           ` Greg Thelen
@ 2015-02-04 10:49             ` Konstantin Khlebnikov
  2015-02-04 17:15               ` Tejun Heo
  2015-02-04 17:06             ` Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-04 10:49 UTC (permalink / raw)
  To: Greg Thelen, Tejun Heo
  Cc: Johannes Weiner, Michal Hocko, Cgroups, linux-mm, linux-kernel,
	Jan Kara, Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins, Roman Gushchin

On 04.02.2015 02:30, Greg Thelen wrote:
> On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo <tj@kernel.org> wrote:
>> Hey,
>>
>> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>>
>>> Keeping shared inodes in the common ancestor is reasonable.
>>> We could schedule asynchronous moving when somebody opens or mmaps an
>>> inode from outside of its current cgroup.  But it's not clear when an
>>> inode should be moved in the opposite direction: when an inode should
>>> become private again, and how to detect that it's no longer shared.
>>>
>>> For example, each inode could keep yet another pointer to a memcg
>>> tracking the subtree of cgroups from which it was accessed in the
>>> past 5 minutes or so.  Periodically that information would be fed to
>>> the moving thread.
>>>
>>> Actually I don't see any options other than that time-based
>>> estimation: tracking all cgroups for each inode is too expensive, and
>>> moving pages from one lru to another is expensive too.  So moving
>>> inodes back and forth at each access from the outside world is not an
>>> option.  It should be a rare operation which runs in the background
>>> or in the reclaimer.
>>
>> Right, what strategy to use for migration is up for debate, even for
>> moving to the common ancestor - e.g. should we do that on the first
>> access?  In the other direction, it gets more interesting.  Let's say
>> we decide to move an inode back to a descendant - what if that
>> triggers an OOM condition?  Do we still go through with it and cause
>> OOM in the target?  Do we even want automatic moving in this
>> direction?
>>
>> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>>
>> Thanks.
>>
>> --
>> tejun
>
> I don't have any killer objections; most of my worries are isolation concerns.
>
> If a machine has several top-level memcgs trying to get some form of
> isolation (using low, min, or soft limits), then a shared libc will
> be moved to the root memcg where it's not protected from global
> memory pressure.  At least with the current per-page accounting such
> shared pages often land in some protected memcg.
>
> If two cgroups collude they can use more memory than their limit and
> oom the entire machine.  Admittedly the current per-page system isn't
> perfect because deleting a memcg which contains mlocked memory
> (referenced by a remote memcg) moves the mlocked memory to root
> resulting in the same issue.  But I'd argue this is more likely with
> the RFC because it doesn't involve the cgroup deletion/reparenting.  A
> possible tweak to shore up the current system is to move such mlocked
> pages to the memcg of the surviving locker.  When the machine is oom
> it's often nice to examine memcg state to determine which container is
> using the memory.  Tracking down who's contributing to a shared
> container is non-trivial.
>
> I actually have a set of patches which add a memcg=M mount option to
> memory backed file systems.  I was planning on proposing them,
> regardless of this RFC, and this discussion makes them even more
> appealing.  If we go in this direction, then we'd need a similar
> notion for disk based filesystems.  As Konstantin suggested, it'd be
> really nice to specify charge policy on a per file, or directory, or
> bind mount basis.  This allows shared files to be deterministically
> charged to a known container.  We'd need to flesh out the policies:
> e.g. if two bind mound each specify different charge targets for the
> same inode, I guess we just pick one.  Though the nature of this
> catch-all shared container is strange.  Presumably a machine manager
> would need to create it as an unlimited container (or at least as big
> as the sum of all shared files) so that any app which decided it wants
> to mlock all shared files has a way to do so without ooming the shared
> container.  In the current per-page approach it's possible to lock
> shared libs.  But the machine manager would need to decide how much
> system ram to set aside for this catch-all shared container.
>
> When there's large incidental sharing, then things get sticky.  A
> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
> a small container would pull all pages to the root memcg where they
> are exposed to root pressure, which breaks isolation.  This is
> concerning.  Perhaps such accesses could be decorated with something
> like O_NO_MOVEMEM.
>
> So this RFC change will introduce significant change for user space
> machine managers and perturb isolation.  Is the resulting system
> better?  It's not clear; it's the devil known vs the devil unknown.
> Maybe it'd be easier if the memcgs I'm talking about were not allowed
> to share page cache (aka copy-on-read) even for files which are
> jointly visible.  That would provide today's interface while avoiding
> the problematic sharing.
>
>

I think important shared data must be handled and protected explicitly.
That 'catch-all' shared container could be separated into several
memory cgroups depending on the importance of the files: glibc
protected with a soft guarantee, less important stuff placed into
another cgroup where it cannot push top-priority libraries out of ram.

If shared files are free for use, then that 'shared' container must
be ready to keep them in memory.  Otherwise this needs to be fixed on
the container side: we could ignore mlock for shared inodes, or the
amount of such vmas might be limited on a per-container basis.

But sharing responsibility for a shared file is a vague concept: the
memory usage and limit of a container must depend only on its own
behavior, not on its neighbors on the same machine.


Generally, incidental sharing could be handled as temporary sharing:
the default policy (if the inode isn't pinned to a memory cgroup)
should after some time detect that the inode is no longer shared and
migrate it back to its original cgroup.  Of course a task could
provide a hint: O_NO_MOVEMEM, or the memory cgroup where it runs
could be marked as a "scanner" which shouldn't disturb memory
classification.

BTW, the same algorithm which determines who has used an inode
recently could tell who has used a shared inode even if it's pinned
to the shared container.

Another cool option which could fix false sharing after scanning is
FADV_NOREUSE, which says to keep page-cache pages that were used for
reading and writing via this file descriptor off the lru and to
remove them from the inode when the file descriptor closes.
Something like a private per-struct-file page cache.  Probably
somebody has already tried that?


I've missed an obvious solution for controlling the memory cgroup
for files: project ids.  This is a persistent integer id stored in
the file system.  For now it's implemented only for xfs and used for
quota, which is orthogonal to user/group quotas.  We could map some
project ids to memory cgroups.  That is more flexible than a
per-superblock mark and has no conflicts like a mark on a bind-mount.
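
For reference, a project id can be read and set from userspace via
the fsxattr ioctls (a sketch; on kernels of this era the ioctl is
XFS-specific, spelled XFS_IOC_FSGETXATTR/XFS_IOC_FSSETXATTR in the
xfs headers, and the project-id -> memcg mapping itself would be
entirely new code):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR */

	static int set_projid(const char *path, unsigned int projid)
	{
		struct fsxattr fsx;
		int fd = open(path, O_RDONLY);
		int ret = -1;

		if (fd < 0)
			return -1;
		if (!ioctl(fd, FS_IOC_FSGETXATTR, &fsx)) {
			fsx.fsx_projid = projid;
			ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
		}
		close(fd);
		return ret;
	}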

-- 
Konstantin


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-03 23:30           ` Greg Thelen
  2015-02-04 10:49             ` Konstantin Khlebnikov
@ 2015-02-04 17:06             ` Tejun Heo
  2015-02-04 23:51               ` Greg Thelen
  1 sibling, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-04 17:06 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hello,

On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
> If a machine has several top-level memcgs trying to get some form of
> isolation (using low, min, or soft limits), then a shared libc will
> be moved to the root memcg where it's not protected from global
> memory pressure.  At least with the current per-page accounting such
> shared pages often land in some protected memcg.

Yes, it becomes interesting with the low limit, as the pressure
direction is reversed; but at the same time, overcommitting low
limits doesn't lead to a sane setup to begin with, as it's asking for
global OOMs anyway.  That means things like libc would end up
competing at least fairly with other pages under global pressure and
should stay in memory under most circumstances, which may or may not
be sufficient.

Hmm.... need to think more about it, but this only becomes a problem
with the root cgroup because it doesn't have a min setting, which is
expected to be inclusive of all descendants, right?  Maybe the right
thing to do here is to treat the inodes which get pushed to the root
as a special case: we can implement a mechanism where the root
effectively borrows from the mins of its children, which doesn't have
to be completely correct - e.g. just charge it against all children
repeatedly and if any has min protection, put it under min
protection.  IOW, make it the baseload for all of them.

> If two cgroups collude they can use more memory than their limit and
> oom the entire machine.  Admittedly the current per-page system isn't
> perfect because deleting a memcg which contains mlocked memory
> (referenced by a remote memcg) moves the mlocked memory to root
> resulting in the same issue.  But I'd argue this is more likely with

Hmmm... why does it do that?  Can you point me to where it's
happening?

> the RFC because it doesn't involve the cgroup deletion/reparenting.  A

One approach could be expanding on the aforementioned scheme and
making all sharing cgroups get charged for the shared inodes they're
using, which should render such collusions entirely pointless.
e.g. let's say we start with the following.

	A   (usage=48M)
	+-B (usage=16M)
	\-C (usage=32M)

And let's say, C starts accessing an inode which is 8M and currently
associated with B.

	A   (usage=48M, hosted= 8M)
	+-B (usage= 8M, shared= 8M)
	\-C (usage=32M, shared= 8M)

The only extra charging that we'd be doing is charging C with an
extra 8M.  Let's say another cgroup D gets created and uses 4M.

	A   (usage=56M, hosted= 8M)
	+-B (usage= 8M, shared= 8M)
	+-C (usage=32M, shared= 8M)
	\-D (usage= 8M)

and it also accesses the inode.

	A   (usage=56M, hosted= 8M)
	+-B (usage= 8M, shared= 8M)
	+-C (usage=32M, shared= 8M)
	\-D (usage= 8M, shared= 8M)

We'd need to track the shared charges separately as they should count
only once in the parent but that shouldn't be too hard.  The problem
here is that we'd need to track which inodes are being accessed by
which children, which can get painful for things like libc.  Maybe we
can limit it to be level-by-level - track sharing only from the
immediate children and always move a shared inode at one level at a
time.  That would lose some ability to track the sharing beyond the
immediate children but it should be enough to solve the root case and
allow us to adapt to changing usage pattern over time.  Given that
sharing is mostly a corner case, this could be good enough.

Now, say D accesses a 4M area of the inode which hasn't been
accessed by others yet.  We'd want it to look like the following.

	A   (usage=64M, hosted=16M)
	+-B (usage= 8M, shared=16M)
	+-C (usage=32M, shared=16M)
	\-D (usage= 8M, shared=16M)

But charging it to B and C at the same time prolly wouldn't be
particularly convenient.  We can prolly just do D -> A charging and
let B and C sort themselves out later.  Note that such charging would
still maintain the overall integrity of the memory limits.  The only
thing which may overflow is the pseudo shared charges which keep
sharing in check, and dealing with them later when B and C try to
create further charges should be completely fine.

Note that we can also try to split the shared charge across the
users; however, charging the full amount seems like the better
approach to me.  We don't have any way to tell how the usage is
distributed anyway.  For use cases where this sort of sharing is
expected, I think it's perfectly reasonable to provision the sharing
children to have enough to accommodate the possible full size of the
shared resource.
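
To make the bookkeeping concrete, here's a toy model of the rule
sketched above (made-up names, ignoring locking and all page-level
details):

	/* a shared inode's pages live in (and are charged once to) the
	   common ancestor as "hosted"; every cgroup using the inode is
	   additionally charged the full size as "shared".  Shared
	   charges are tracked separately so they count only once in
	   the parent. */
	struct cgrp {
		struct cgrp  *parent;
		unsigned long usage;	/* private charges */
		unsigned long shared;	/* shared inodes this cgroup uses */
		unsigned long hosted;	/* shared inodes parked here */
	};

	/* what is compared against this cgroup's own limit */
	static unsigned long charge_total(const struct cgrp *c)
	{
		return c->usage + c->shared + c->hosted;
	}

	/* an inode of 'size' bytes moves up to the common ancestor */
	static void host_inode(struct cgrp *host, unsigned long size)
	{
		host->hosted += size;
	}

	/* a descendant starts using the inode: charge it in full, which
	   makes colluding to escape limits pointless */
	static void note_sharer(struct cgrp *sharer, unsigned long size)
	{
		sharer->shared += size;
	}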

> possible tweak to shore up the current system is to move such mlocked
> pages to the memcg of the surviving locker.  When the machine is oom
> it's often nice to examine memcg state to determine which container is
> using the memory.  Tracking down who's contributing to a shared
> container is non-trivial.
> 
> I actually have a set of patches which add a memcg=M mount option to
> memory backed file systems.  I was planning on proposing them,
> regardless of this RFC, and this discussion makes them even more
> appealing.  If we go in this direction, then we'd need a similar
> notion for disk based filesystems.  As Konstantin suggested, it'd be
> really nice to specify charge policy on a per file, or directory, or
> bind mount basis.  This allows shared files to be deterministically

I'm not too sure about that.  We might add that later if absolutely
justifiable but designing assuming that level of intervention from
userland may not be such a good idea.

> When there's large incidental sharing, then things get sticky.  A
> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
> a small container would pull all pages to the root memcg where they
> are exposed to root pressure, which breaks isolation.  This is
> concerning.  Perhaps such accesses could be decorated with something
> like O_NO_MOVEMEM.

If such a thing is really necessary, FADV_NOREUSE would be a better
indicator; however, yes, such incidental sharing is easier to handle
with a per-page scheme, as such a scanner can be limited in the
number of pages it can carry throughout its operation regardless of
which cgroup it's looking at.  It still has the nasty corner case
where random target cgroups can latch onto pages faulted in by the
scanner and keep accessing them tho, so, even now, FADV_NOREUSE would
be a good idea.  Note that such scanning, if repeated on cgroups
under high memory pressure, is *likely* to accumulate residue escaped
pages, and if such a management cgroup is transient, those escaped
pages will accumulate over time outside any limit in a way which is
unpredictable and invisible.

> So this RFC change will introduce significant change for user space
> machine managers and perturb isolation.  Is the resulting system
> better?  It's not clear; it's the devil known vs the devil unknown.
> Maybe it'd be easier if the memcgs I'm talking about were not allowed
> to share page cache (aka copy-on-read) even for files which are
> jointly visible.  That would provide today's interface while avoiding
> the problematic sharing.

Yeah, compatibility would be the stickiest part.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-04 10:49             ` Konstantin Khlebnikov
@ 2015-02-04 17:15               ` Tejun Heo
  2015-02-04 17:58                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-04 17:15 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Johannes Weiner, Michal Hocko, Cgroups, linux-mm,
	linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin

Hello,

On Wed, Feb 04, 2015 at 01:49:08PM +0300, Konstantin Khlebnikov wrote:
> I think important shared data must be handled and protected explicitly.
> That 'catch-all' shared container could be separated into several

I kinda disagree.  That'd be a major pain in the ass to use, and you
wouldn't know when you got something wrong unless it actually goes
wrong and you know enough about the inner workings to look for that.
Doesn't sound like a sound design to me.

> memory cgroups depending on the importance of the files: glibc
> protected with a soft guarantee, less important stuff placed into
> another cgroup where it cannot push top-priority libraries out of ram.

That sounds extremely painful.

> If shared files are free for use, then that 'shared' container must
> be ready to keep them in memory.  Otherwise this needs to be fixed on
> the container side: we could ignore mlock for shared inodes, or the
> amount of such vmas might be limited on a per-container basis.
>
> But sharing responsibility for a shared file is a vague concept: the
> memory usage and limit of a container must depend only on its own
> behavior, not on its neighbors on the same machine.
> 
> 
> Generally, incidental sharing could be handled as temporary sharing:
> the default policy (if the inode isn't pinned to a memory cgroup)
> should after some time detect that the inode is no longer shared and
> migrate it back to its original cgroup.  Of course a task could
> provide a hint: O_NO_MOVEMEM, or the memory cgroup where it runs
> could be marked as a "scanner" which shouldn't disturb memory
> classification.

Ditto for annotating each file individually.  Let's please try to stay
away from things like that.  That's mostly a cop-out which is unlikely
to actually benefit the majority of users.

> I've missed an obvious solution for controlling the memory cgroup
> for files: project ids.  This is a persistent integer id stored in
> the file system.  For now it's implemented only for xfs and used for
> quota, which is orthogonal to user/group quotas.  We could map some
> project ids to memory cgroups.  That is more flexible than a
> per-superblock mark and has no conflicts like a mark on a bind-mount.

Again, hell, no.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-04 17:15               ` Tejun Heo
@ 2015-02-04 17:58                 ` Konstantin Khlebnikov
  2015-02-04 18:28                   ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-04 17:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Thelen, Johannes Weiner, Michal Hocko, Cgroups, linux-mm,
	linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin

On 04.02.2015 20:15, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 04, 2015 at 01:49:08PM +0300, Konstantin Khlebnikov wrote:
>> I think important shared data must be handled and protected explicitly.
>> That 'catch-all' shared container could be separated into several
>
> I kinda disagree.  That'd be a major pain in the ass to use, and you
> wouldn't know when you got something wrong unless it actually goes
> wrong and you know enough about the inner workings to look for that.
> Doesn't sound like a sound design to me.
>
>> memory cgroups depending on the importance of the files: glibc
>> protected with a soft guarantee, less important stuff placed into
>> another cgroup where it cannot push top-priority libraries out of ram.
>
> That sounds extremely painful.

I mean this thing _could_ be controlled more precisely.  Even if the
default policy works for 99% of users, a manual override is still
required for the remaining 1%, or for when something goes wrong.

>
>> If shared files are free for use, then that 'shared' container must
>> be ready to keep them in memory.  Otherwise this needs to be fixed on
>> the container side: we could ignore mlock for shared inodes, or the
>> amount of such vmas might be limited on a per-container basis.
>>
>> But sharing responsibility for a shared file is a vague concept: the
>> memory usage and limit of a container must depend only on its own
>> behavior, not on its neighbors on the same machine.
>>
>>
>> Generally, incidental sharing could be handled as temporary sharing:
>> the default policy (if the inode isn't pinned to a memory cgroup)
>> should after some time detect that the inode is no longer shared and
>> migrate it back to its original cgroup.  Of course a task could
>> provide a hint: O_NO_MOVEMEM, or the memory cgroup where it runs
>> could be marked as a "scanner" which shouldn't disturb memory
>> classification.
>
> Ditto for annotating each file individually.  Let's please try to stay
> away from things like that.  That's mostly a cop-out which is unlikely
> to actually benefit the majority of users.

A process which scans all files once isn't such a rare use case.
Linux still sometimes cannot handle this pattern.

>
>> I've missed an obvious solution for controlling the memory cgroup
>> for files: project ids.  This is a persistent integer id stored in
>> the file system.  For now it's implemented only for xfs and used for
>> quota, which is orthogonal to user/group quotas.  We could map some
>> project ids to memory cgroups.  That is more flexible than a
>> per-superblock mark and has no conflicts like a mark on a bind-mount.
>
> Again, hell, no.
>
> Thanks.
>

-- 
Konstantin


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-04 17:58                 ` Konstantin Khlebnikov
@ 2015-02-04 18:28                   ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-02-04 18:28 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Johannes Weiner, Michal Hocko, Cgroups, linux-mm,
	linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins, Roman Gushchin

On Wed, Feb 04, 2015 at 08:58:21PM +0300, Konstantin Khlebnikov wrote:
> >>Generally, incidental sharing could be handled as temporary sharing:
> >>the default policy (if the inode isn't pinned to a memory cgroup)
> >>should after some time detect that the inode is no longer shared and
> >>migrate it back to its original cgroup.  Of course a task could
> >>provide a hint: O_NO_MOVEMEM, or the memory cgroup where it runs
> >>could be marked as a "scanner" which shouldn't disturb memory
> >>classification.
> >
> >Ditto for annotating each file individually.  Let's please try to stay
> >away from things like that.  That's mostly a cop-out which is unlikely
> >to actually benefit the majority of users.
> 
> A process which scans all files once isn't such a rare use case.
> Linux still sometimes cannot handle this pattern.

Yeah, sure, tagging usages with m/fadvise's is fine.  We can just
look at the policy and ignore them for the purpose of determining
who's using the inode, but let's stay away from tagging the files on
the filesystem if at all possible.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-04 17:06             ` Tejun Heo
@ 2015-02-04 23:51               ` Greg Thelen
  2015-02-05 13:15                 ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-02-04 23:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins


On Wed, Feb 04 2015, Tejun Heo wrote:

> Hello,
>
> On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
>> If a machine has several top-level memcgs trying to get some form of
>> isolation (using low, min, or soft limits), then a shared libc will
>> be moved to the root memcg where it's not protected from global
>> memory pressure.  At least with the current per-page accounting such
>> shared pages often land in some protected memcg.
>
> Yes, it becomes interesting with the low limit, as the pressure
> direction is reversed; but at the same time, overcommitting low
> limits doesn't lead to a sane setup to begin with, as it's asking for
> global OOMs anyway.  That means things like libc would end up
> competing at least fairly with other pages under global pressure and
> should stay in memory under most circumstances, which may or may not
> be sufficient.

I agree.  Clarification... I don't plan to overcommit low or min
limits.  On machines without overcommitted min limits the existing
system offers some protection for shared libs from global reclaim.
Pushing them to root doesn't.

> Hmm.... need to think more about it, but this only becomes a problem
> with the root cgroup because it doesn't have a min setting, which is
> expected to be inclusive of all descendants, right?  Maybe the right
> thing to do here is to treat the inodes which get pushed to the root
> as a special case: we can implement a mechanism where the root
> effectively borrows from the mins of its children, which doesn't have
> to be completely correct - e.g. just charge it against all children
> repeatedly and if any has min protection, put it under min
> protection.  IOW, make it the baseload for all of them.

I think the linux-next low (and the TBD min) limits also have this
problem for more than just the root memcg.  I'm thinking of a 2M file
shared between C and D below.  The file will be charged to the common
parent B.

	A
	+-B    (usage=2M lim=3M min=2M)
	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
	  \-E  (usage=0  lim=2M min=0)

The problem arises if A/B/E allocates more than 1M of private
reclaimable file data.  This pushes A/B into reclaim, which will
reclaim both the shared file from A/B and the private file from
A/B/E.  In contrast, the current per-page memcg would've protected
the shared file in either C or D, leaving A/B reclaim to attack only
A/B/E.

Pinning the shared file to either C or D, using a TBD policy such as
a mount option, would solve this for tightly shared files.  But for a
wide-fanout file (libc) the admin would need to assign a global
bucket, and this would be a pain to size given varying job
requirements.

>> If two cgroups collude they can use more memory than their limit and
>> oom the entire machine.  Admittedly the current per-page system isn't
>> perfect because deleting a memcg which contains mlocked memory
>> (referenced by a remote memcg) moves the mlocked memory to root
>> resulting in the same issue.  But I'd argue this is more likely with
>
> Hmmm... why does it do that?  Can you point me to where it's
> happening?

My mistake, I was thinking of older kernels which reparent memory.
Though I can't say v3.19-rc7 handles this collusion any better.
Instead of reparenting the mlocked memory, it's left in an invisible
(offline) memcg.  Unlike in older kernels, the memory doesn't appear
in root/memory.stat[unevictable]; instead it's buried in
root/memory.stat[total_unevictable], which includes mlocked memory in
visible (online) and invisible (offline) children.

>> the RFC because it doesn't involve the cgroup deletion/reparenting.  A
>
> One approach could be expanding on the aforementioned scheme and
> making all sharing cgroups get charged for the shared inodes they're
> using, which should render such collusions entirely pointless.
> e.g. let's say we start with the following.
>
> 	A   (usage=48M)
> 	+-B (usage=16M)
> 	\-C (usage=32M)
>
> And let's say, C starts accessing an inode which is 8M and currently
> associated with B.
>
> 	A   (usage=48M, hosted= 8M)
> 	+-B (usage= 8M, shared= 8M)
> 	\-C (usage=32M, shared= 8M)
>
> The only extra charging that we'd be doing is charging C with an
> extra 8M.  Let's say another cgroup D gets created and uses 4M.
>
> 	A   (usage=56M, hosted= 8M)
> 	+-B (usage= 8M, shared= 8M)
> 	+-C (usage=32M, shared= 8M)
> 	\-D (usage= 8M)
>
> and it also accesses the inode.
>
> 	A   (usage=56M, hosted= 8M)
> 	+-B (usage= 8M, shared= 8M)
> 	+-C (usage=32M, shared= 8M)
> 	\-D (usage= 8M, shared= 8M)
>
> We'd need to track the shared charges separately as they should count
> only once in the parent but that shouldn't be too hard.  The problem
> here is that we'd need to track which inodes are being accessed by
> which children, which can get painful for things like libc.  Maybe we
> can limit it to be level-by-level - track sharing only from the
> immediate children and always move a shared inode at one level at a
> time.  That would lose some ability to track the sharing beyond the
> immediate children but it should be enough to solve the root case and
> allow us to adapt to changing usage pattern over time.  Given that
> sharing is mostly a corner case, this could be good enough.
>
> Now, say D accesses a 4M area of the inode which hasn't been
> accessed by others yet.  We'd want it to look like the following.
>
> 	A   (usage=64M, hosted=16M)
> 	+-B (usage= 8M, shared=16M)
> 	+-C (usage=32M, shared=16M)
> 	\-D (usage= 8M, shared=16M)
>
> But charging it to B and C at the same time prolly wouldn't be
> particularly convenient.  We can prolly just do D -> A charging and
> let B and C sort themselves out later.  Note that such charging would
> still maintain the overall integrity of the memory limits.  The only
> thing which may overflow is the pseudo shared charges which keep
> sharing in check, and dealing with them later when B and C try to
> create further charges should be completely fine.
>
> Note that we can also try to split the shared charge across the
> users; however, charging the full amount seems like the better
> approach to me.  We don't have any way to tell how the usage is
> distributed anyway.  For use cases where this sort of sharing is
> expected, I think it's perfectly reasonable to provision the sharing
> children to have enough to accommodate the possible full size of the
> shared resource.
>
>> possible tweak to shore up the current system is to move such mlocked
>> pages to the memcg of the surviving locker.  When the machine is oom
>> it's often nice to examine memcg state to determine which container is
>> using the memory.  Tracking down who's contributing to a shared
>> container is non-trivial.
>> 
>> I actually have a set of patches which add a memcg=M mount option to
>> memory-backed filesystems.  I was planning on proposing them,
>> regardless of this RFC, and this discussion makes them even more
>> appealing.  If we go in this direction, then we'd need a similar
>> notion for disk-based filesystems.  As Konstantin suggested, it'd be
>> really nice to specify charge policy on a per-file, per-directory, or
>> per-bind-mount basis.  This allows shared files to be deterministically
>
> I'm not too sure about that.  We might add that later if absolutely
> justifiable but designing assuming that level of intervention from
> userland may not be such a good idea.
>
>> When there's large incidental sharing, then things get sticky.  A
>> periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in
>> a small container would pull all pages to the root memcg where they
>> are exposed to root pressure which breaks isolation.  This is
>> concerning.  Perhaps such accesses could be decorated with
>> (O_NO_MOVEMEM).
>
> If such a thing is really necessary, FADV_NOREUSE would be a better
> indicator; however, yes, such incidental sharing is easier to handle
> with a per-page scheme, as such a scanner can be limited in the number
> of pages it can carry throughout its operation regardless of which
> cgroup it's looking at.  It still has the nasty corner case where
> random target cgroups can latch onto pages faulted in by the scanner
> and keep accessing them, though, so, even now, FADV_NOREUSE would be a good
> idea.  Note that such scanning, if repeated on cgroups under high
> memory pressure, is *likely* to accumulate residue escaped pages and
> if such a management cgroup is transient, those escaped pages will
> accumulate over time outside any limit in a way which is unpredictable
> and invisible.
>
>> So this RFC change will introduce significant change to user space
>> machine managers and perturb isolation.  Is the resulting system
>> better?  It's not clear; it's the devil known vs the devil unknown.  Maybe
>> it'd be easier if the memcgs I'm talking about were not allowed to
>> share page cache (aka copy-on-read) even for files which are jointly
>> visible.  That would provide today's interface while avoiding the
>> problematic sharing.
>
> Yeah, compatibility would be the stickiest part.
>
> Thanks.



* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-04 23:51               ` Greg Thelen
@ 2015-02-05 13:15                 ` Tejun Heo
  2015-02-05 22:05                   ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-05 13:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hello, Greg.

On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
> I think the linux-next low (and the TBD min) limits also have the
> problem for more than just the root memcg.  I'm thinking of a 2M file
> shared between C and D below.  The file will be charged to common parent
> B.
> 
> 	A
> 	+-B    (usage=2M lim=3M min=2M)
> 	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
> 	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
> 	  \-E  (usage=0  lim=2M min=0)
> 
> The problem arises if A/B/E allocates more than 1M of private
> reclaimable file data.  This pushes A/B into reclaim which will reclaim
> both the shared file from A/B and private file from A/B/E.  In contrast,
> the current per-page memcg would've protected the shared file in either
> C or D leaving A/B reclaim to only attack A/B/E.
> 
> Pinning the shared file to either C or D, using a TBD policy such as a
> mount option, would solve this for tightly shared files.  But for a
> wide-fanout file (libc) the admin would need to assign a global bucket,
> and this would be a pain to size due to various job requirements.

Shouldn't we be able to handle it the same way as I proposed for
handling sharing?  The above would look like

 	A
 	+-B    (usage=2M lim=3M min=2M hosted_usage=2M)
 	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
 	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
 	  \-E  (usage=0  lim=2M min=0)

Now, we don't want to use B's min verbatim on the hosted inodes shared
by children; however, we're unconditionally charging the shared amount
to all sharing children, which means that we're eating into the min
settings of all participating children.  So we should be able to use
the sum of all sharing children's min-covered amounts as the inode's
min, which of course is to be contained inside the min of the parent.

Above, we're charging 2M to C and D, each of which has 1M min which is
being consumed by the shared charge (the shared part won't get
reclaimed from the internal pressure of the children, so we're really
taking that part away from them).  Summing them up, the shared inode
would have 2M protection which is honored as long as B as a whole is
under its 3M limit.  This is similar to creating a dedicated child for
each shared resource for low limits.  The downside is that we end up
guarding the shared inodes more than non-shared ones, but, after all,
we're charging it to everybody who's using it.
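
To make the arithmetic concrete, here's a minimal userspace C model of
that rule.  All struct fields and helper names are invented for
illustration and don't correspond to existing kernel symbols:

#include <stdio.h>

/* Hypothetical per-cgroup accounting, in bytes. */
struct cg {
	unsigned long usage;		/* private usage */
	unsigned long shared_usage;	/* charges for hosted inodes */
	unsigned long min;		/* configured min protection */
};

static unsigned long min2(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Effective min protection of an inode hosted in the parent: each
 * sharer contributes the part of its min actually consumed by the
 * shared charge, and the sum is clamped by the parent's own min.
 */
static unsigned long hosted_inode_min(const struct cg *parent,
				      const struct cg *const *sharers, int n)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += min2(sharers[i]->shared_usage, sharers[i]->min);

	return min2(sum, parent->min);
}

int main(void)
{
	const unsigned long M = 1UL << 20;
	struct cg b = { .min = 2 * M };
	struct cg c = { .shared_usage = 2 * M, .min = 1 * M };
	struct cg d = { .shared_usage = 2 * M, .min = 1 * M };
	const struct cg *const sharers[] = { &c, &d };

	/* C and D each cover 1M of the 2M shared charge -> 2M total. */
	printf("inode min = %luM\n", hosted_inode_min(&b, sharers, 2) / M);
	return 0;
}

With the B/C/D numbers above this prints 2M - the same protection a
dedicated child for the shared resource would have provided.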

Would something like this work?

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-05 13:15                 ` Tejun Heo
@ 2015-02-05 22:05                   ` Greg Thelen
  2015-02-05 22:25                     ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-02-05 22:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins


On Thu, Feb 05 2015, Tejun Heo wrote:

> Hello, Greg.
>
> On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
>> I think the linux-next low (and the TBD min) limits also have the
>> problem for more than just the root memcg.  I'm thinking of a 2M file
>> shared between C and D below.  The file will be charged to common parent
>> B.
>> 
>> 	A
>> 	+-B    (usage=2M lim=3M min=2M)
>> 	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
>> 	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
>> 	  \-E  (usage=0  lim=2M min=0)
>> 
>> The problem arises if A/B/E allocates more than 1M of private
>> reclaimable file data.  This pushes A/B into reclaim which will reclaim
>> both the shared file from A/B and private file from A/B/E.  In contrast,
>> the current per-page memcg would've protected the shared file in either
>> C or D leaving A/B reclaim to only attack A/B/E.
>> 
>> Pinning the shared file to either C or D, using a TBD policy such as a
>> mount option, would solve this for tightly shared files.  But for a
>> wide-fanout file (libc) the admin would need to assign a global bucket,
>> and this would be a pain to size due to various job requirements.
>
> Shouldn't we be able to handle it the same way as I proposed for
> handling sharing?  The above would look like
>
>  	A
>  	+-B    (usage=2M lim=3M min=2M hosted_usage=2M)
>  	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
>  	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
>  	  \-E  (usage=0  lim=2M min=0)
>
> Now, we don't want to use B's min verbatim on the hosted inodes shared
> by children; however, we're unconditionally charging the shared amount
> to all sharing children, which means that we're eating into the min
> settings of all participating children.  So we should be able to use
> the sum of all sharing children's min-covered amounts as the inode's
> min, which of course is to be contained inside the min of the parent.
>
> Above, we're charging 2M to C and D, each of which has 1M min which is
> being consumed by the shared charge (the shared part won't get
> reclaimed from the internal pressure of the children, so we're really
> taking that part away from them).  Summing them up, the shared inode
> would have 2M protection which is honored as long as B as a whole is
> under its 3M limit.  This is similar to creating a dedicated child for
> each shared resource for low limits.  The downside is that we end up
> guarding the shared inodes more than non-shared ones, but, after all,
> we're charging it to everybody who's using it.
>
> Would something like this work?

Maybe, but I want to understand more about how pressure works in the
child.  As C (or D) allocates non-shared memory, does it perform reclaim
to ensure that (C.usage + C.shared_usage) < C.lim?  Given that C's
shared_usage is linked into B.LRU, it wouldn't be naturally reclaimable
by C.  Are you thinking that charge failures on cgroups with non-zero
shared_usage would, as needed, induce reclaim of the parent's hosted_usage?


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-05 22:05                   ` Greg Thelen
@ 2015-02-05 22:25                     ` Tejun Heo
  2015-02-06  0:03                       ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-05 22:25 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hey,

On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
> >  	A
> >  	+-B    (usage=2M lim=3M min=2M hosted_usage=2M)
> >  	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
> >  	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
> >  	  \-E  (usage=0  lim=2M min=0)
...
> Maybe, but I want to understand more about how pressure works in the
> child.  As C (or D) allocates non-shared memory, does it perform reclaim
> to ensure that (C.usage + C.shared_usage) < C.lim?  Given that C's

Yes.

> shared_usage is linked into B.LRU, it wouldn't be naturally reclaimable
> by C.  Are you thinking that charge failures on cgroups with non-zero
> shared_usage would, as needed, induce reclaim of the parent's hosted_usage?

Hmmm.... I'm not really sure but why not?  If we properly account for
the low protection when pushing inodes to the parent, I don't think
it'd break anything.  IOW, allow the amount beyond the sum of low
limits to be reclaimed when one of the sharers is under pressure.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-05 22:25                     ` Tejun Heo
@ 2015-02-06  0:03                       ` Greg Thelen
  2015-02-06 14:17                         ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-02-06  0:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins


On Thu, Feb 05 2015, Tejun Heo wrote:

> Hey,
>
> On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
>> >  	A
>> >  	+-B    (usage=2M lim=3M min=2M hosted_usage=2M)
>> >  	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
>> >  	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
>> >  	  \-E  (usage=0  lim=2M min=0)
> ...
>> Maybe, but I want to understand more about how pressure works in the
>> child.  As C (or D) allocates non-shared memory, does it perform reclaim
>> to ensure that (C.usage + C.shared_usage) < C.lim?  Given that C's
>
> Yes.
>
>> shared_usage is linked into B.LRU, it wouldn't be naturally reclaimable
>> by C.  Are you thinking that charge failures on cgroups with non-zero
>> shared_usage would, as needed, induce reclaim of the parent's hosted_usage?
>
> Hmmm.... I'm not really sure but why not?  If we properly account for
> the low protection when pushing inodes to the parent, I don't think
> it'd break anything.  IOW, allow the amount beyond the sum of low
> limits to be reclaimed when one of the sharers is under pressure.
>
> Thanks.

I'm not saying that it'd break anything.  I think it's required that
children perform reclaim on shared data hosted in the parent.  The child
is limited by shared_usage, so it needs the ability to reclaim it.  So I
think we're in agreement.  A child will reclaim the parent's hosted_usage
when the child is charged for shared_usage.  Ideally the only parental
memory reclaimed in this situation would be shared.  But I think (though
I can't claim to have followed the new memcg philosophy discussions) that
internal nodes in the cgroup tree (i.e. parents) do not have any
resources charged directly to them.  All resources are charged to leaf
cgroups, which linger until the resources are uncharged.  Thus the LRUs
of the parent will only contain hosted (shared) memory.  This thankfully
focuses parental reclaim on shared pages.  Child pressure will,
unfortunately, reclaim shared pages used by any container.  But if
shared pages were charged to all sharing containers, then reclaiming
them will help relieve pressure in the caller.

So this is a system which charges all cgroups using a shared inode
(recharge on read) for all resident pages of that shared inode.  There's
only one copy of the page in memory, on just one LRU, but the page may be
charged to multiple containers' (shared_)usage.

Perhaps I missed it, but what happens when a child's limit is
insufficient to accept all pages shared by its siblings?  Example
starting with 2M cached of a shared file:

	A
	+-B    (usage=2M lim=3M hosted_usage=2M)
	  +-C  (usage=0  lim=2M shared_usage=2M)
	  +-D  (usage=0  lim=2M shared_usage=2M)
	  \-E  (usage=0  lim=1M shared_usage=0)

If E faults in a new 4K page within the shared file, then E is a sharing
participant, so it'd be charged the 2M+4K, which pushes E over its
limit.
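
To spell the overflow out, here's a tiny userspace C model of the
charge check.  The "join_sharing" helper and the field names are made
up for illustration, and the reclaim-or-OOM response is just printed
rather than performed:

#include <stdbool.h>
#include <stdio.h>

/* Model of "recharge on read": a cgroup touching any page of a shared
 * inode is charged for the inode's full resident size. */
struct cg {
	const char *name;
	unsigned long usage, shared_usage, limit;
};

static bool join_sharing(struct cg *cg, unsigned long inode_resident)
{
	if (cg->usage + cg->shared_usage + inode_resident > cg->limit) {
		printf("%s: charge would exceed limit -> reclaim or OOM\n",
		       cg->name);
		return false;
	}
	cg->shared_usage += inode_resident;
	return true;
}

int main(void)
{
	const unsigned long M = 1UL << 20, K = 1UL << 10;
	struct cg e = { "E", 0, 0, 1 * M };

	/* E faults one 4K page of a file with 2M already resident: it
	 * is charged 2M + 4K, which exceeds its 1M limit. */
	join_sharing(&e, 2 * M + 4 * K);
	return 0;
}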


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-06  0:03                       ` Greg Thelen
@ 2015-02-06 14:17                         ` Tejun Heo
  2015-02-06 23:43                           ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-06 14:17 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hello, Greg.

On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
> So this is a system which charges all cgroups using a shared inode
> (recharge on read) for all resident pages of that shared inode.  There's
> only one copy of the page in memory, on just one LRU, but the page may be
> charged to multiple containers' (shared_)usage.

Yeap.

> Perhaps I missed it, but what happens when a child's limit is
> insufficient to accept all pages shared by its siblings?  Example
> starting with 2M cached of a shared file:
> 
> 	A
> 	+-B    (usage=2M lim=3M hosted_usage=2M)
> 	  +-C  (usage=0  lim=2M shared_usage=2M)
> 	  +-D  (usage=0  lim=2M shared_usage=2M)
> 	  \-E  (usage=0  lim=1M shared_usage=0)
> 
> If E faults in a new 4K page within the shared file, then E is a sharing
> participant, so it'd be charged the 2M+4K, which pushes E over its
> limit.

OOM?  It shouldn't be participating in sharing of an inode if it can't
match the others' protection on the inode, I think.  What we're doing now
w/ page-based charging is kinda unfair because in situations like the
above, the one under pressure can end up siphoning off of the larger
cgroups' protection if they actually use overlapping areas; however,
for disjoint areas, per-page charging would behave correctly.

So, this part comes down to the same question - whether multiple
cgroups accessing disjoint areas of a single inode is an important
enough use case.  If we say yes to that, we better make writeback
support that too.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-06 14:17                         ` Tejun Heo
@ 2015-02-06 23:43                           ` Greg Thelen
  2015-02-07 14:38                             ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-02-06 23:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

On Fri, Feb 6, 2015 at 6:17 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Greg.
>
> On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
>> So this is a system which charges all cgroups using a shared inode
>> (recharge on read) for all resident pages of that shared inode.  There's
>> only one copy of the page in memory, on just one LRU, but the page may be
>> charged to multiple containers' (shared_)usage.
>
> Yeap.
>
>> Perhaps I missed it, but what happens when a child's limit is
>> insufficient to accept all pages shared by its siblings?  Example
>> starting with 2M cached of a shared file:
>>
>>       A
>>       +-B    (usage=2M lim=3M hosted_usage=2M)
>>         +-C  (usage=0  lim=2M shared_usage=2M)
>>         +-D  (usage=0  lim=2M shared_usage=2M)
>>         \-E  (usage=0  lim=1M shared_usage=0)
>>
>> If E faults in a new 4K page within the shared file, then E is a sharing
>> participant, so it'd be charged the 2M+4K, which pushes E over its
>> limit.
>
> OOM?  It shouldn't be participating in sharing of an inode if it can't
> match the others' protection on the inode, I think.  What we're doing now
> w/ page-based charging is kinda unfair because in situations like the
> above, the one under pressure can end up siphoning off of the larger
> cgroups' protection if they actually use overlapping areas; however,
> for disjoint areas, per-page charging would behave correctly.
>
> So, this part comes down to the same question - whether multiple
> cgroups accessing disjoint areas of a single inode is an important
> enough use case.  If we say yes to that, we better make writeback
> support that too.

If cgroups are about isolation then writing to shared files should be
rare, so I'm willing to say that we don't need to handle shared
writers well.  Shared readers seem like a more valuable use case
(thin provisioning).  I'm getting overwhelmed with the thought
exercise of automatically moving inodes to common ancestors and back
charging the sharers for shared_usage.  I haven't wrapped my head
around how these shared data pages will get protected.  It seems like
they'd no longer be protected by child min watermarks.

So I know this thread opened with the claim "both memcg and blkcg must
be looking at the same picture.  Deviating them is highly likely to
lead to long-term issues forcing us to look at this again anyway, only
with far more baggage."  But I'm still wondering if the following is
simpler:
(1) leave memcg as a per page controller.
(2) maintain a per-inode i_memcg which is set to the common dirtying
ancestor.  If not shared then it'll point to the memcg that the page
was charged to.
(3) when memcg dirtying page pressure is seen, walk up the cgroup tree
writing dirty inodes; this will write shared inodes using the blkcg
priority of the respective levels.
(4) background-limit wb_check_background_flush() and time-based
wb_check_old_data_flush() can feel free to attack shared inodes to
hopefully restore them to non-shared state.
For non-shared inodes, this should behave the same.  For shared inodes
it should only affect those in the hierarchy which is sharing.
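
A minimal userspace C sketch of point (2), with all names invented for
illustration: i_memcg only ever moves up the tree, toward the nearest
common ancestor of the dirtiers.

#include <stdio.h>

/* Toy memcg tree node; the kernel would use real mem_cgroup objects. */
struct memcg {
	const char *name;
	struct memcg *parent;
	int depth;
};

static struct memcg *common_ancestor(struct memcg *a, struct memcg *b)
{
	while (a->depth > b->depth)
		a = a->parent;
	while (b->depth > a->depth)
		b = b->parent;
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/* Called on each dirtying: a not-yet-shared inode keeps its dirtier. */
static void inode_account_dirty(struct memcg **i_memcg, struct memcg *dirtier)
{
	*i_memcg = *i_memcg ? common_ancestor(*i_memcg, dirtier) : dirtier;
}

int main(void)
{
	struct memcg a = { "A", NULL, 0 };
	struct memcg b = { "B", &a, 1 };
	struct memcg c = { "C", &b, 2 };
	struct memcg d = { "D", &b, 2 };
	struct memcg *i_memcg = NULL;

	inode_account_dirty(&i_memcg, &c);	/* not shared: i_memcg = C */
	inode_account_dirty(&i_memcg, &d);	/* shared: i_memcg = B */
	printf("i_memcg = %s\n", i_memcg->name);
	return 0;
}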


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-06 23:43                           ` Greg Thelen
@ 2015-02-07 14:38                             ` Tejun Heo
  2015-02-11  2:19                               ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-07 14:38 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hello, Greg.

On Fri, Feb 06, 2015 at 03:43:11PM -0800, Greg Thelen wrote:
> If cgroups are about isolation then writing to shared files should be
> rare, so I'm willing to say that we don't need to handle shared
> writers well.  Shared readers seem like a more valuable use case
> (thin provisioning).  I'm getting overwhelmed with the thought
> exercise of automatically moving inodes to common ancestors and back
> charging the sharers for shared_usage.  I haven't wrapped my head
> around how these shared data pages will get protected.  It seems like
> they'd no longer be protected by child min watermarks.

Yes, this is challenging.  My current thought is around taking the
maximum of the low settings of the sharing children, but I need to
think more about it.  One problem is that the shared inodes will
preemptively take away the shared amount from the children's low
protection.  They won't compete fairly with other inodes or anons, but
they can't really, as they don't really belong to any single sharer.

> So I know this thread opened with the claim "both memcg and blkcg must
> be looking at the same picture.  Deviating them is highly likely to
> lead to long-term issues forcing us to look at this again anyway, only
> with far more baggage."  But I'm still wondering if the following is
> simpler:
> (1) leave memcg as a per page controller.
> (2) maintain a per-inode i_memcg which is set to the common dirtying
> ancestor.  If not shared then it'll point to the memcg that the page
> was charged to.
> (3) when memcg dirtying page pressure is seen, walk up the cgroup tree
> writing dirty inodes; this will write shared inodes using the blkcg
> priority of the respective levels.
> (4) background-limit wb_check_background_flush() and time-based
> wb_check_old_data_flush() can feel free to attack shared inodes to
> hopefully restore them to non-shared state.
> For non-shared inodes, this should behave the same.  For shared inodes
> it should only affect those in the hierarchy which is sharing.

The thing which breaks when you decouple what memcg sees from the
rest of the stack is that the amount of memory which may be available
to a given cgroup, and how much of that is dirty, is the main linkage
propagating IO pressure to actual dirtying tasks.  If you decouple the
two worldviews, you lose the ability to propagate IO pressure to
dirtiers in a controlled manner, and that's why anything inside a memcg
currently always triggers the direct reclaim path instead of being
properly dirty throttled.

You can argue that an inode being actively dirtied from multiple
cgroups is a rare case which we can sweep under the rug and that
*might* be the case, but I have a nagging feeling that that would be a
decision made merely out of immediate convenience.  I would much prefer
having a well-defined model of sharing inodes and anons across cgroups,
so that the behaviors shown in those cases aren't mere accidental
consequences without any innate meaning.

If we can argue that memcg and blkcg having different views is
meaningful and characterize and justify the behaviors stemming from
the deviation, sure, that'd be fine, but I don't think we have that as
of now.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-07 14:38                             ` Tejun Heo
@ 2015-02-11  2:19                               ` Tejun Heo
  2015-02-11  7:32                                 ` Jan Kara
  2015-02-11 18:28                                 ` Greg Thelen
  0 siblings, 2 replies; 31+ messages in thread
From: Tejun Heo @ 2015-02-11  2:19 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hello, again.

On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
> If we can argue that memcg and blkcg having different views is
> meaningful and characterize and justify the behaviors stemming from
> the deviation, sure, that'd be fine, but I don't think we have that as
> of now.

If we assume that memcg and blkcg having different views is something
which represents an acceptable compromise considering the use cases
and implementation convenience - IOW, if we assume that read-sharing
is something which can happen regularly while write sharing is a
corner case and that while not completely correct the existing
self-corrective behavior from tracking ownership per-page at the point
of instantiation is good enough (as a memcg under pressure is likely
to give up shared pages to be re-instantiated by another sharer w/
more budget), we need to do the impedance matching between memcg and
blkcg at the writeback layer.

The main issue there is that the last link of IO pressure propagation
is realized by making individual dirtying tasks converge on a common
target dirty ratio point, which naturally depends on those tasks seeing
the same picture in terms of the current write bandwidth, the available
memory, and how much of it is dirty.  Tasks dirtying pages belonging to
the same memcg while some of them are mostly being written out by a
different blkcg would wreck the mechanism.  It won't be difficult for
one subset to make the other consider itself under severe IO pressure
when there actually isn't any in that group, possibly stalling and
starving those tasks unduly.  At a more basic level, it's just wrong
for one group to be writing out a significant amount for another.

These issues can persist indefinitely if we follow the same
instantiator-owns rule for inode writebacks.  Even if we reset the
ownership when an inode becomes clean, it wouldn't work, as it can be
dirtied over and over again while under writeback, and when things
like this happen, the behavior may become extremely difficult to
understand or characterize.  We don't have visibility into how
individual pages of an inode get distributed across multiple cgroups,
who's currently responsible for writing back a specific inode, or how
the dirty ratio mechanism is behaving in the face of the unexpected
combination of parameters.

Even if we assume that write sharing is a fringe case, we need
something better than a first-whatever rule when choosing which blkcg is
responsible for writing a shared inode out.  There needs to be a
constant corrective pressure so that incidental and temporary sharings
don't end up screwing up the mechanism for an extended period of time.

Greg mentioned choosing the closest ancestor of the sharers, which
basically pushes inode sharing policy implementation down to writeback
from memcg.  This could work, but we end up with the same collusion
problem as when this is used for memcg, and it's even more difficult to
solve at the writeback layer - we'd have to communicate the shared
state all the way down to the block layer and then implement a
mechanism there to take corrective measures, and even after that we're
likely to end up with a prolonged state where dirty ratio propagation
is essentially broken, as the dirtier and the writer would be seeing
different pictures.

So, based on the assumption that write sharing is mostly incidental
and temporary (ie. we're basically declaring that we don't support
persistent write sharing), how about something like the following?

1. memcg continues per-page tracking.

2. Each inode is associated with a single blkcg at a given time and
   written out by that blkcg.

3. While writing back, if the number of pages from foreign memcgs is
   higher than a certain ratio of the total written pages, the inode is
   marked as disowned and the writeback instance is optionally
   terminated early.  e.g. if the ratio of foreign pages is over 50%
   after writing out the number of pages matching 5s worth of write
   bandwidth for the bdi, mark the inode as disowned.

4. On the following dirtying of the inode, the inode is associated
   with the matching blkcg of the dirtied page.  Note that this could
   be the next cycle as the inode could already have been marked dirty
   by the time the above condition triggered.  In that case, the
   following writeback would be terminated early too.

This should provide sufficient corrective pressure so that incidental
and temporary sharing of an inode doesn't become a persistent issue
while keeping the complexity necessary for implementing such pressure
fairly minimal and self-contained.  Also, the changes necessary for
individual filesystems would be minimal.
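
As a rough userspace C sketch of points 3-4 (the counters, the budget,
and the 50% threshold below are illustrative stand-ins, not existing
kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Per-writeback-pass stats: count pages whose memcg doesn't match the
 * inode's owning blkcg, and disown once foreign pages dominate past a
 * bandwidth-derived budget. */
struct wb_stats {
	unsigned long written, foreign;
	unsigned long budget;	/* e.g. 5s worth of write bandwidth */
};

static bool wb_account_page(struct wb_stats *s, bool foreign)
{
	s->written++;
	if (foreign)
		s->foreign++;

	/* Past the budget, disown if more than 50% was foreign. */
	return s->written >= s->budget && s->foreign * 2 > s->written;
}

int main(void)
{
	/* Say 5s at 1MB/s in 4K pages = 1280 pages. */
	struct wb_stats s = { .budget = 1280 };
	unsigned long i;

	for (i = 0; i < 2000; i++) {
		if (wb_account_page(&s, i % 3 != 0)) {	/* ~66% foreign */
			printf("disown inode after %lu pages\n", s.written);
			break;	/* next dirtying re-associates the inode */
		}
	}
	return 0;
}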

I think this should work well enough as long as the aforementioned
assumptions are true - IOW, if we maintain that write sharing is
unsupported.

What do you think?

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11  2:19                               ` Tejun Heo
@ 2015-02-11  7:32                                 ` Jan Kara
  2015-02-11 18:28                                 ` Greg Thelen
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Kara @ 2015-02-11  7:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

  Hello Tejun,

On Tue 10-02-15 21:19:06, Tejun Heo wrote:
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
> > If we can argue that memcg and blkcg having different views is
> > meaningful and characterize and justify the behaviors stemming from
> > the deviation, sure, that'd be fine, but I don't think we have that as
> > of now.
...
> So, based on the assumption that write sharing is mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
> 
> 1. memcg continues per-page tracking.
> 
> 2. Each inode is associated with a single blkcg at a given time and
>    written out by that blkcg.
> 
> 3. While writing back, if the number of pages from foreign memcgs is
>    higher than a certain ratio of the total written pages, the inode is
>    marked as disowned and the writeback instance is optionally
>    terminated early.  e.g. if the ratio of foreign pages is over 50%
>    after writing out the number of pages matching 5s worth of write
>    bandwidth for the bdi, mark the inode as disowned.
> 
> 4. On the following dirtying of the inode, the inode is associated
>    with the matching blkcg of the dirtied page.  Note that this could
>    be the next cycle as the inode could already have been marked dirty
>    by the time the above condition triggered.  In that case, the
>    following writeback would be terminated early too.
> 
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained.  Also, the changes necessary for
> individual filesystems would be minimal.
  I like this proposal.  It looks simple enough, and when inodes aren't
permanently write-shared it converges to the blkcg that is currently
writing to the inode.  So ack from me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11  2:19                               ` Tejun Heo
  2015-02-11  7:32                                 ` Jan Kara
@ 2015-02-11 18:28                                 ` Greg Thelen
  2015-02-11 20:33                                   ` Tejun Heo
  1 sibling, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2015-02-11 18:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

On Tue, Feb 10, 2015 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, again.
>
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
>> If we can argue that memcg and blkcg having different views is
>> meaningful and characterize and justify the behaviors stemming from
>> the deviation, sure, that'd be fine, but I don't think we have that as
>> of now.
>
> If we assume that memcg and blkcg having different views is something
> which represents an acceptable compromise considering the use cases
> and implementation convenience - IOW, if we assume that read-sharing
> is something which can happen regularly while write sharing is a
> corner case and that while not completely correct the existing
> self-corrective behavior from tracking ownership per-page at the point
> of instantiation is good enough (as a memcg under pressure is likely
> to give up shared pages to be re-instantiated by another sharer w/
> more budget), we need to do the impedance matching between memcg and
> blkcg at the writeback layer.
>
> The main issue there is that the last link of IO pressure propagation
> is realized by making individual dirtying tasks converge on a common
> target dirty ratio point, which naturally depends on those tasks seeing
> the same picture in terms of the current write bandwidth, the available
> memory, and how much of it is dirty.  Tasks dirtying pages belonging to
> the same memcg while some of them are mostly being written out by a
> different blkcg would wreck the mechanism.  It won't be difficult for
> one subset to make the other consider itself under severe IO pressure
> when there actually isn't any in that group, possibly stalling and
> starving those tasks unduly.  At a more basic level, it's just wrong
> for one group to be writing out a significant amount for another.
>
> These issues can persist indefinitely if we follow the same
> instantiator-owns rule for inode writebacks.  Even if we reset the
> ownership when an inode becomes clean, it wouldn't work, as it can be
> dirtied over and over again while under writeback, and when things
> like this happen, the behavior may become extremely difficult to
> understand or characterize.  We don't have visibility into how
> individual pages of an inode get distributed across multiple cgroups,
> who's currently responsible for writing back a specific inode, or how
> the dirty ratio mechanism is behaving in the face of the unexpected
> combination of parameters.
>
> Even if we assume that write sharing is a fringe case, we need
> something better than a first-whatever rule when choosing which blkcg is
> responsible for writing a shared inode out.  There needs to be a
> constant corrective pressure so that incidental and temporary sharings
> don't end up screwing up the mechanism for an extended period of time.
>
> Greg mentioned choosing the closest ancestor of the sharers, which
> basically pushes inode sharing policy implementation down to writeback
> from memcg.  This could work, but we end up with the same collusion
> problem as when this is used for memcg, and it's even more difficult to
> solve at the writeback layer - we'd have to communicate the shared
> state all the way down to the block layer and then implement a
> mechanism there to take corrective measures, and even after that we're
> likely to end up with a prolonged state where dirty ratio propagation
> is essentially broken, as the dirtier and the writer would be seeing
> different pictures.
>
> So, based on the assumption that write sharing is mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
>
> 1. memcg continues per-page tracking.
>
> 2. Each inode is associated with a single blkcg at a given time and
>    written out by that blkcg.
>
> 3. While writing back, if the number of pages from foreign memcgs is
>    higher than a certain ratio of the total written pages, the inode is
>    marked as disowned and the writeback instance is optionally
>    terminated early.  e.g. if the ratio of foreign pages is over 50%
>    after writing out the number of pages matching 5s worth of write
>    bandwidth for the bdi, mark the inode as disowned.
>
> 4. On the following dirtying of the inode, the inode is associated
>    with the matching blkcg of the dirtied page.  Note that this could
>    be the next cycle as the inode could already have been marked dirty
>    by the time the above condition triggered.  In that case, the
>    following writeback would be terminated early too.
>
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained.  Also, the changes necessary for
> individual filesystems would be minimal.
>
> I think this should work well enough as long as the aforementioned
> assumptions are true - IOW, if we maintain that write sharing is
> unsupported.
>
> What do you think?
>
> Thanks.
>
> --
> tejun

This seems good.  I assume that blkcg writeback would query
corresponding memcg for dirty page count to determine if over
background limit.  And balance_dirty_pages() would query memcg's dirty
page count to throttle based on blkcg's bandwidth.  Note: memcg
doesn't yet have dirty page counts, but several of us have made
attempts at adding the counters.  And it shouldn't be hard to get them
merged.
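
For reference, the counter itself can be as simple as the following
userspace C model (the names are illustrative; the actual patches hook
the page clean->dirty and dirty->clean transitions):

#include <stdio.h>

/* Per-memcg dirty page counter, bumped when a page turns dirty and
 * dropped when it is written back or truncated, so both background
 * writeback and balance_dirty_pages() can compare the count against
 * memcg-scoped thresholds. */
struct memcg_stat {
	unsigned long nr_dirty;
};

static void memcg_page_dirtied(struct memcg_stat *m) { m->nr_dirty++; }
static void memcg_page_cleaned(struct memcg_stat *m) { m->nr_dirty--; }

int main(void)
{
	struct memcg_stat m = { 0 };

	memcg_page_dirtied(&m);
	memcg_page_dirtied(&m);
	memcg_page_cleaned(&m);		/* one page written back */
	printf("dirty pages: %lu\n", m.nr_dirty);
	return 0;
}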


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 18:28                                 ` Greg Thelen
@ 2015-02-11 20:33                                   ` Tejun Heo
  2015-02-11 21:22                                     ` Konstantin Khlebnikov
  2015-02-12  2:10                                     ` Greg Thelen
  0 siblings, 2 replies; 31+ messages in thread
From: Tejun Heo @ 2015-02-11 20:33 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

Hello, Greg.

On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote:
> This seems good.  I assume that blkcg writeback would query
> corresponding memcg for dirty page count to determine if over
> background limit.  And balance_dirty_pages() would query memcg's dirty

Yeah, available memory to the matching memcg and the number of dirty
pages in it.  It's gonna work the same way as the global case just
scoped to the cgroup.
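
In other words, something like this userspace C sketch (the function is
illustrative; it just mirrors the global dirty_ratio computation over
the cgroup's available memory):

#include <stdio.h>

/* Same ratio computation as the global case, but over the memory
 * available to the memcg rather than over total system memory. */
static unsigned long memcg_dirty_thresh(unsigned long avail_pages,
					unsigned int dirty_ratio)
{
	return avail_pages * dirty_ratio / 100;
}

int main(void)
{
	unsigned long avail = 25600;	/* say 100MB of 4K pages */
	unsigned int ratio = 20;	/* mirrors vm.dirty_ratio */

	printf("memcg dirty thresh: %lu pages\n",
	       memcg_dirty_thresh(avail, ratio));
	return 0;
}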

> page count to throttle based on blkcg's bandwidth.  Note: memcg
> doesn't yet have dirty page counts, but several of us have made
> attempts at adding the counters.  And it shouldn't be hard to get them
> merged.

Can you please post those?

So, cool, we're in agreement.  Working on it.  It shouldn't take too
long, hopefully.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 20:33                                   ` Tejun Heo
@ 2015-02-11 21:22                                     ` Konstantin Khlebnikov
  2015-02-11 21:46                                       ` Tejun Heo
  2015-02-12  2:10                                     ` Greg Thelen
  1 sibling, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 21:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

On Wed, Feb 11, 2015 at 11:33 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Greg.
>
> On Wed, Feb 11, 2015 at 10:28:44AM -0800, Greg Thelen wrote:
>> This seems good.  I assume that blkcg writeback would query
>> corresponding memcg for dirty page count to determine if over
>> background limit.  And balance_dirty_pages() would query memcg's dirty
>
> Yeah, available memory to the matching memcg and the number of dirty
> pages in it.  It's gonna work the same way as the global case just
> scoped to the cgroup.

That might be a problem: all dirty pages accounted to a cgroup must be
reachable by its own personal writeback, or balance_dirty_pages() will be
unable to satisfy memcg dirty memory thresholds. I've done accounting
for a per-inode owner, but there is another option: shared inodes might
be handled differently and be available to all (or related) cgroup
writebacks.

Another side is that the reclaimer now (mostly?) never triggers pageout.
The memcg reclaimer should do something if it finds a shared dirty page:
either move it into the right cgroup or make that inode reachable for
memcg writeback. I've sent a patch which marks shared dirty inodes
with a flag, I_DIRTY_SHARED or so.
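
Roughly like this userspace C model (the flag value, struct, and hook
are all illustrative stand-ins for the actual patch):

#include <stdio.h>

#define I_DIRTY_SHARED	(1 << 0)	/* illustrative flag value */

struct inode_model {
	unsigned long state;
	int owner_memcg_id;
};

/* Hook on page dirtying: flag the inode when the dirtier's memcg
 * differs from the inode's current owner, so reclaim/writeback can
 * treat the inode as reachable from all sharers. */
static void account_page_dirtied(struct inode_model *inode, int memcg_id)
{
	if (inode->owner_memcg_id != memcg_id)
		inode->state |= I_DIRTY_SHARED;
}

int main(void)
{
	struct inode_model inode = { .owner_memcg_id = 1 };

	account_page_dirtied(&inode, 1);	/* same owner: no flag */
	account_page_dirtied(&inode, 2);	/* foreign dirtier: flag */
	printf("shared: %s\n",
	       (inode.state & I_DIRTY_SHARED) ? "yes" : "no");
	return 0;
}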

>
>> page count to throttle based on blkcg's bandwidth.  Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters.  And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?
>
> So, cool, we're in agreement.  Working on it.  It shouldn't take too
> long, hopefully.

Good. As I see it, this design is almost equal to my proposal,
except maybe for that dumb first-owns-all-until-the-end rule.

>
> Thanks.
>
> --
> tejun
>


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 21:22                                     ` Konstantin Khlebnikov
@ 2015-02-11 21:46                                       ` Tejun Heo
  2015-02-11 21:57                                         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-11 21:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

Hello,

On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
> > Yeah, available memory to the matching memcg and the number of dirty
> > pages in it.  It's gonna work the same way as the global case just
> > scoped to the cgroup.
> 
> That might be a problem: all dirty pages accounted to a cgroup must be
> reachable by its own personal writeback, or balance_dirty_pages() will be
> unable to satisfy memcg dirty memory thresholds. I've done accounting

Yeah, it would.  Why wouldn't it?

> for a per-inode owner, but there is another option: shared inodes might
> be handled differently and be available to all (or related) cgroup
> writebacks.

I'm not following you at all.  The only reason this scheme can work is
because we exclude persistent shared write cases.  As the whole thing
is based on that assumption, special casing shared inodes doesn't make
any sense.  Doing things like allowing all cgroups to write shared
inodes without getting memcg on-board almost immediately breaks
pressure propagation while making shared writes a lot more attractive
and increasing implementation complexity substantially.  Am I missing
something?

> Another side is that the reclaimer now (mostly?) never triggers pageout.
> The memcg reclaimer should do something if it finds a shared dirty page:
> either move it into the right cgroup or make that inode reachable for
> memcg writeback. I've sent a patch which marks shared dirty inodes
> with a flag, I_DIRTY_SHARED or so.

It *might* make sense for memcg to drop pages being dirtied which
don't match the currently associated blkcg of the inode; however,
again, as we're basically declaring that shared writes aren't
supported, I'm skeptical about the usefulness.

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 21:46                                       ` Tejun Heo
@ 2015-02-11 21:57                                         ` Konstantin Khlebnikov
  2015-02-11 22:05                                           ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 21:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> > Yeah, available memory to the matching memcg and the number of dirty
>> > pages in it.  It's gonna work the same way as the global case just
>> > scoped to the cgroup.
>>
>> That might be a problem: all dirty pages accounted to a cgroup must be
>> reachable by its own personal writeback, or balance_dirty_pages() will be
>> unable to satisfy memcg dirty memory thresholds. I've done accounting
>
> Yeah, it would.  Why wouldn't it?

How do you plan to do per-memcg/blkcg writeback for balance_dirty_pages()?
Or are you thinking only about separating the writeback flow into blkio
cgroups without actual inode filtering? I mean delaying inode writeback
and keeping dirty pages as long as possible if their cgroups are far
from the threshold.

>
>> for a per-inode owner, but there is another option: shared inodes might
>> be handled differently and be available to all (or related) cgroup
>> writebacks.
>
> I'm not following you at all.  The only reason this scheme can work is
> because we exclude persistent shared write cases.  As the whole thing
> is based on that assumption, special casing shared inodes doesn't make
> any sense.  Doing things like allowing all cgroups to write shared
> inodes without getting memcg on-board almost immediately breaks
> pressure propagation while making shared writes a lot more attractive
> and increasing implementation complexity substantially.  Am I missing
> something?
>
>> Another side is that the reclaimer now (mostly?) never triggers pageout.
>> The memcg reclaimer should do something if it finds a shared dirty page:
>> either move it into the right cgroup or make that inode reachable for
>> memcg writeback. I've sent a patch which marks shared dirty inodes
>> with a flag, I_DIRTY_SHARED or so.
>
> It *might* make sense for memcg to drop pages being dirtied which
> don't match the currently associated blkcg of the inode; however,
> again, as we're basically declaring that shared writes aren't
> supported, I'm skeptical about the usefulness.
>
> Thanks.
>
> --
> tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 21:57                                         ` Konstantin Khlebnikov
@ 2015-02-11 22:05                                           ` Tejun Heo
  2015-02-11 22:15                                             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: Tejun Heo @ 2015-02-11 22:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote:
> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
> > Hello,
> >
> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
> >> > Yeah, available memory to the matching memcg and the number of dirty
> >> > pages in it.  It's gonna work the same way as the global case just
> >> > scoped to the cgroup.
> >>
> >> That might be a problem: all dirty pages accounted to a cgroup must be
> >> reachable by its own personal writeback, or balance_dirty_pages() will be
> >> unable to satisfy memcg dirty memory thresholds. I've done accounting
> >
> > Yeah, it would.  Why wouldn't it?
> 
> How do you plan to do per-memcg/blkcg writeback for balance_dirty_pages()?
> Or are you thinking only about separating the writeback flow into blkio
> cgroups without actual inode filtering? I mean delaying inode writeback
> and keeping dirty pages as long as possible if their cgroups are far
> from the threshold.

What?  The code was already in the previous patchset.  I'm just gonna
rip out the code which handles an inode being dirtied on multiple wb's.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 22:05                                           ` Tejun Heo
@ 2015-02-11 22:15                                             ` Konstantin Khlebnikov
  2015-02-11 22:30                                               ` Tejun Heo
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2015-02-11 22:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

On Thu, Feb 12, 2015 at 1:05 AM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Feb 12, 2015 at 01:57:04AM +0400, Konstantin Khlebnikov wrote:
>> On Thu, Feb 12, 2015 at 12:46 AM, Tejun Heo <tj@kernel.org> wrote:
>> > Hello,
>> >
>> > On Thu, Feb 12, 2015 at 12:22:34AM +0300, Konstantin Khlebnikov wrote:
>> >> > Yeah, available memory to the matching memcg and the number of dirty
>> >> > pages in it.  It's gonna work the same way as the global case just
>> >> > scoped to the cgroup.
>> >>
>> >> That might be a problem: all dirty pages accounted to a cgroup must be
>> >> reachable by its own personal writeback, or balance_dirty_pages() will be
>> >> unable to satisfy memcg dirty memory thresholds. I've done accounting
>> >
>> > Yeah, it would.  Why wouldn't it?
>>
>> How do you plan to do per-memcg/blkcg writeback for balance_dirty_pages()?
>> Or are you thinking only about separating the writeback flow into blkio
>> cgroups without actual inode filtering? I mean delaying inode writeback
>> and keeping dirty pages as long as possible if their cgroups are far
>> from the threshold.
>
> What?  The code was already in the previous patchset.  I'm just gonna
> rip out the code which handles an inode being dirtied on multiple wb's.

Well, ok. Even if shared writes are rare, they should be handled somehow
without relying on kupdate-like writeback. If a memcg has a lot of dirty
pages but their inodes accidentally belong to the wrong wb queues, then
tasks in that memcg shouldn't get stuck in balance_dirty_pages() until
somebody outside accidentally writes this data. That's all I wanted to say.

>
> --
> tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 22:15                                             ` Konstantin Khlebnikov
@ 2015-02-11 22:30                                               ` Tejun Heo
  0 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2015-02-11 22:30 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Greg Thelen, Konstantin Khlebnikov, Johannes Weiner,
	Michal Hocko, Cgroups, linux-mm, linux-kernel, Jan Kara,
	Dave Chinner, Jens Axboe, Christoph Hellwig, Li Zefan,
	Hugh Dickins

Hello,

On Thu, Feb 12, 2015 at 02:15:29AM +0400, Konstantin Khlebnikov wrote:
> Well, ok. Even if shared writes are rare, they should be handled somehow
> without relying on kupdate-like writeback. If a memcg has a lot of dirty pages

This only works iff we consider those cases to be marginal enough to
handle them in a pretty ghetto way.

> but their inodes accidentally belong to the wrong wb queues, then tasks in
> that memcg shouldn't get stuck in balance_dirty_pages() until somebody outside
> accidentally writes this data. That's all I wanted to say.

But, right, yeah, corner cases around this could be nasty if the writeout
interval is set really high.  I don't think it matters for the default
5s interval at all.  Maybe what we need is queueing a delayed per-wb
work w/ the default writeout interval when dirtying a foreign inode.
I'll think more about it.
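
Something along these lines, as a userspace C sketch (the structure and
the hook are illustrative; the kernel side would use an actual delayed
work item on the wb):

#include <stdbool.h>
#include <stdio.h>

/* When a foreign memcg dirties an inode, schedule a writeback work
 * item on the owning wb at the default writeout interval instead of
 * waiting for somebody else's writeback to pick the inode up. */
struct wb_model {
	bool work_pending;
};

static void wb_queue_delayed(struct wb_model *wb, unsigned int delay_cs)
{
	if (wb->work_pending)
		return;			/* work already scheduled */
	wb->work_pending = true;
	printf("queued wb work in %u centisecs\n", delay_cs);
}

static void dirtied_foreign_inode(struct wb_model *wb)
{
	wb_queue_delayed(wb, 500);	/* default 5s writeout interval */
}

int main(void)
{
	struct wb_model wb = { false };

	dirtied_foreign_inode(&wb);
	dirtied_foreign_inode(&wb);	/* no-op: work already pending */
	return 0;
}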

Thanks.

-- 
tejun


* Re: [RFC] Making memcg track ownership per address_space or anon_vma
  2015-02-11 20:33                                   ` Tejun Heo
  2015-02-11 21:22                                     ` Konstantin Khlebnikov
@ 2015-02-12  2:10                                     ` Greg Thelen
  1 sibling, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2015-02-12  2:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, Cgroups,
	linux-mm, linux-kernel, Jan Kara, Dave Chinner, Jens Axboe,
	Christoph Hellwig, Li Zefan, Hugh Dickins

On Wed, Feb 11, 2015 at 12:33 PM, Tejun Heo <tj@kernel.org> wrote:
[...]
>> page count to throttle based on blkcg's bandwidth.  Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters.  And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?

Will do.  Rebasing and testing needed, so it won't be today.

