* [PATCH V3 0/2] memcg softlimit reclaim rework
@ 2012-04-17 16:37 Ying Han
  2012-04-18 12:24 ` Johannes Weiner
  0 siblings, 1 reply; 25+ messages in thread
From: Ying Han @ 2012-04-17 16:37 UTC
  To: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Rik van Riel, Hillf Danton, Hugh Dickins, Dan Magenheimer,
	Andrew Morton
  Cc: linux-mm

The "soft_limit" was introduced in memcg to support over-committing the
memory resource on the host. Each cgroup configures its "hard_limit" where
it will be throttled or OOM killed by going over the limit. However, the
cgroup can go above the "soft_limit" as long as there is no system-wide
memory contention. So, the "soft_limit" is the kernel mechanism for
re-distributing system spare memory among cgroups.

This patch reworks the softlimit reclaim by hooking it into the new global
reclaim scheme. So the global reclaim path including direct reclaim and
background reclaim will respect the memcg softlimit.

v3..v2:
1. rebase the patch on 3.4-rc3
2. squash the commits that replace the old implementation with the new
implementation into one commit. This makes sure the tree is left in a
stable state between commits.
3. remove the commit which changes nr_to_reclaim for the global reclaim
case. The need for that patch is not obvious now.

Note:
1. the new implementation of softlimit reclaim is rather simple and is a
first step towards further optimizations. There is no memory pressure
balancing between memcgs for each zone, and that is something we would like
to add as a follow-up.

2. this patch is slightly different from the last one posted by Johannes
http://comments.gmane.org/gmane.linux.kernel.mm/72382
where his patch is closer to the reverted implementation by doing hierarchical
reclaim for each selected memcg. However, that is not the expected behavior
from the user's perspective. Consider the following example:

root (32G capacity)
--> A (hard limit 20G, soft limit 15G, usage 16G)
   --> A1 (soft limit 5G, usage 4G)
   --> A2 (soft limit 10G, usage 12G)
--> B (hard limit 20G, soft limit 10G, usage 16G)

Under global reclaim, we shouldn't add pressure on A1 although its parent (A)
exceeds its softlimit. This is what the admin expects by setting the softlimit
to the actual working set size: only reclaim pages under the softlimit if the
system has trouble reclaiming.
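
To make the example concrete, here is a minimal Python sketch of the targeted check described above. It is an illustrative model only, not kernel code and not part of the patchset; the helper name over_soft_limit is made up, and the usage figures are taken at face value from the example.

```python
# Illustrative model only; this is not kernel code and not part of the patch.
GIB = 1 << 30

# name -> (soft limit, usage) in bytes, taken from the example hierarchy above.
groups = {
    "A":  (15 * GIB, 16 * GIB),
    "A1": (5 * GIB, 4 * GIB),
    "A2": (10 * GIB, 12 * GIB),
    "B":  (10 * GIB, 16 * GIB),
}


def over_soft_limit(name):
    """Hypothetical helper: True if the group's usage exceeds its own soft limit."""
    soft, usage = groups[name]
    return usage > soft


# Targeted policy: only groups over their own soft limit are reclaim
# candidates; the state of the parent is not consulted.
candidates = sorted(name for name in groups if over_soft_limit(name))
print(candidates)
```

In this model A1 is not a candidate because it is below its own soft limit, even though its parent A is above its own.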

Test on 32G host:
The stats below come from memory.vmscan_stat, which I did not include in this
patchset. It exports per-memcg vmscan stats. The stat shown here is the number
of pages reclaimed under global pressure from each memcg. As far as I can see,
no pages are reclaimed from memcgs under their softlimit until some point
(case 3). In that case there are many reclaimers (20 containers + kswapds)
but fewer reclaimable memcgs (those above their softlimit), so the reclaim
priority jumps. That's why we see memcgs under their softlimit being reclaimed
as well.

1. 20 * cat 1G ramdisk containers (hardlimit = 512M, softlimit = 0 by default) + memory hog (for global pressure)
    $ for ((i=0; i<20; i++)); do cat /dev/cgroup/memory/$i/memory.vmscan_stat | grep total_freed_file_pages_by_system_under_hierarchy; done
    total_freed_file_pages_by_system_under_hierarchy 4431458
    total_freed_file_pages_by_system_under_hierarchy 4572150
    total_freed_file_pages_by_system_under_hierarchy 4260969
    total_freed_file_pages_by_system_under_hierarchy 4522491
    total_freed_file_pages_by_system_under_hierarchy 4467898
    total_freed_file_pages_by_system_under_hierarchy 4231144
    total_freed_file_pages_by_system_under_hierarchy 4467987
    total_freed_file_pages_by_system_under_hierarchy 4415137
    total_freed_file_pages_by_system_under_hierarchy 4537076
    total_freed_file_pages_by_system_under_hierarchy 4374586
    total_freed_file_pages_by_system_under_hierarchy 4238208
    total_freed_file_pages_by_system_under_hierarchy 4497263
    total_freed_file_pages_by_system_under_hierarchy 4401839
    total_freed_file_pages_by_system_under_hierarchy 4407700
    total_freed_file_pages_by_system_under_hierarchy 4291009
    total_freed_file_pages_by_system_under_hierarchy 4228416
    total_freed_file_pages_by_system_under_hierarchy 4126986
    total_freed_file_pages_by_system_under_hierarchy 4730479
    total_freed_file_pages_by_system_under_hierarchy 4316904
    total_freed_file_pages_by_system_under_hierarchy 4304469

2. 20 * cat 1G ramdisk containers (hardlimit = 512M, containers 1-5 softlimit = 512M) + memory hog (for global pressure)
    total_freed_file_pages_by_system_under_hierarchy 0
    total_freed_file_pages_by_system_under_hierarchy 0
    total_freed_file_pages_by_system_under_hierarchy 0
    total_freed_file_pages_by_system_under_hierarchy 0
    total_freed_file_pages_by_system_under_hierarchy 0
    total_freed_file_pages_by_system_under_hierarchy 4562418
    total_freed_file_pages_by_system_under_hierarchy 4630498
    total_freed_file_pages_by_system_under_hierarchy 4809946
    total_freed_file_pages_by_system_under_hierarchy 4767868
    total_freed_file_pages_by_system_under_hierarchy 4716920
    total_freed_file_pages_by_system_under_hierarchy 4828952
    total_freed_file_pages_by_system_under_hierarchy 4672482
    total_freed_file_pages_by_system_under_hierarchy 4593165
    total_freed_file_pages_by_system_under_hierarchy 4862157
    total_freed_file_pages_by_system_under_hierarchy 4639331
    total_freed_file_pages_by_system_under_hierarchy 4620658
    total_freed_file_pages_by_system_under_hierarchy 4880210
    total_freed_file_pages_by_system_under_hierarchy 4652485
    total_freed_file_pages_by_system_under_hierarchy 4633724
    total_freed_file_pages_by_system_under_hierarchy 4673583

3. 20 * cat 1G ramdisk containers (hardlimit = 512M, containers 1-10 softlimit = 512M) + memory hog (for global pressure)
   total_freed_file_pages_by_system_under_hierarchy 7318
   total_freed_file_pages_by_system_under_hierarchy 6612
   total_freed_file_pages_by_system_under_hierarchy 2900
   total_freed_file_pages_by_system_under_hierarchy 5740
   total_freed_file_pages_by_system_under_hierarchy 5353
   total_freed_file_pages_by_system_under_hierarchy 4707
   total_freed_file_pages_by_system_under_hierarchy 4252
   total_freed_file_pages_by_system_under_hierarchy 5518
   total_freed_file_pages_by_system_under_hierarchy 1431
   total_freed_file_pages_by_system_under_hierarchy 5722
   total_freed_file_pages_by_system_under_hierarchy 9538489
   total_freed_file_pages_by_system_under_hierarchy 9334518
   total_freed_file_pages_by_system_under_hierarchy 9727377
   total_freed_file_pages_by_system_under_hierarchy 9602573
   total_freed_file_pages_by_system_under_hierarchy 9771141
   total_freed_file_pages_by_system_under_hierarchy 9769589
   total_freed_file_pages_by_system_under_hierarchy 9610550
   total_freed_file_pages_by_system_under_hierarchy 9535241
   total_freed_file_pages_by_system_under_hierarchy 9912726
   total_freed_file_pages_by_system_under_hierarchy 9502706
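
For a rough sense of scale in case 3, the following Python sketch (not part of the patchset) totals the counters printed above. It assumes the first ten lines belong to the containers with softlimit = 512M and the last ten to the containers left at the default, which matches the loop order but is not stated explicitly.

```python
# Counters copied from the case 3 output above (pages freed by global reclaim).
# Assumption: the first ten lines are the containers with softlimit = 512M.
protected = [7318, 6612, 2900, 5740, 5353, 4707, 4252, 5518, 1431, 5722]
unprotected = [
    9538489, 9334518, 9727377, 9602573, 9771141,
    9769589, 9610550, 9535241, 9912726, 9502706,
]

protected_total = sum(protected)
unprotected_total = sum(unprotected)

# Fraction of all freed pages that came from the soft-limit-protected group.
share = protected_total / (protected_total + unprotected_total)

print(f"protected:   {protected_total} pages")
print(f"unprotected: {unprotected_total} pages")
print(f"share freed from protected containers: {share:.4%}")
```

Under that assumption, the protected containers account for roughly 0.05% of the pages freed in case 3.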

Ying Han (2):
  memcg: softlimit reclaim rework
  memcg: set soft_limit_in_bytes to 0 by default

 include/linux/memcontrol.h |   18 +--
 include/linux/swap.h       |    4 -
 kernel/res_counter.c       |    1 -
 mm/memcontrol.c            |  397 +-------------------------------------------
 mm/vmscan.c                |  113 +++++--------
 5 files changed, 55 insertions(+), 478 deletions(-)

-- 
1.7.7.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-17 16:37 [PATCH V3 0/2] memcg softlimit reclaim rework Ying Han
@ 2012-04-18 12:24 ` Johannes Weiner
  2012-04-18 18:00   ` Ying Han
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Weiner @ 2012-04-18 12:24 UTC
  To: Ying Han
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> [...]
> 
> 2. this patch is slightly different from the last one posted from Johannes
> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> where his patch is closer to the reverted implementation by doing hierarchical
> reclaim for each selected memcg. However, that is not expected behavior from
> user perspective. Considering the following example:
> 
> root (32G capacity)
> --> A (hard limit 20G, soft limit 15G, usage 16G)
>    --> A1 (soft limit 5G, usage 4G)
>    --> A2 (soft limit 10G, usage 12G)
> --> B (hard limit 20G, soft limit 10G, usage 16G)
> 
> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
> exceeds softlimit. This is what admin expects by setting softlimit to the
> actual working set size and only reclaim pages under softlimit if system has
> trouble to reclaim.

Actually, this is exactly what the admin expects when creating a
hierarchy, because she defines that A1 is a child of A and is
responsible for the memory situation in its parent.

That's the single point of having a hierarchy.  Why do you create them
if you don't want their behaviour?

And A does not have its own pages (usage is just the sum of its
children), what SHOULD its soft limit even mean in your example?

If you had

    A (hard 20G, usage 16G)
       A1 (soft  5G, usage  4G)
       A2 (soft 10G, usage 12G)
    B (hard 20G, soft 10G, usage 16G)

(i.e. no soft limit on A), you could reasonably make it so that on
global reclaim, only A2 and B would get reclaimed, like you want it
to, while still keeping the hierarchical properties of soft limits.
If you want soft limits applied to leaf nodes only, don't set them
anywhere else..?
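
The hierarchical rule can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not kernel code: the function name reclaim_eligible is made up, "no soft limit" is modelled as infinity, and usage is taken from the variant above.

```python
# Illustrative model only; this is not kernel code.
GIB = 1 << 30
UNLIMITED = float("inf")  # stands in for "no soft limit set"

# name -> (parent, soft limit, usage), from the variant above.
groups = {
    "A":  (None, UNLIMITED, 16 * GIB),
    "A1": ("A", 5 * GIB, 4 * GIB),
    "A2": ("A", 10 * GIB, 12 * GIB),
    "B":  (None, 10 * GIB, 16 * GIB),
}


def reclaim_eligible(name):
    """Hypothetical hierarchical rule: a group is eligible for soft limit
    reclaim if it, or any of its ancestors, exceeds its soft limit."""
    while name is not None:
        parent, soft, usage = groups[name]
        if usage > soft:
            return True
        name = parent
    return False


eligible = sorted(name for name in groups if reclaim_eligible(name))
print(eligible)
```

With no soft limit on A, neither A nor A1 is selected in this model, and only A2 and B remain.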

Ultimately, we want to support nesting memcgs within containers.  For
this reason, they need to be applied hierarchically, or the admin of
the host does not have soft limit control over untrusted guest groups:

    container A (hard 20G, soft 16G)
      group A-1 (soft 100G)
    container B (hard 20G, soft 16G)
      group B-1

In this case under global memory pressure, contrary to your claims, we
actually do want to reclaim from A-1, not just from B-1.  Otherwise, a
container could gain priority over another one by setting ridiculous
soft limits.
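
A small Python sketch of the difference between the two policies on this layout follows. It is only a model: the usage figures are invented for illustration (the example gives none), B-1 is assumed to sit at a soft limit of 0 (the default proposed in patch 2/2), and the helper names are made up.

```python
# Illustrative model only; this is not kernel code.
GIB = 1 << 30

# name -> (parent, soft limit, usage). Soft limits for A, A-1 and B come from
# the example above; the usage figures and B-1's soft limit of 0 are assumptions.
groups = {
    "A":   (None, 16 * GIB, 18 * GIB),
    "A-1": ("A", 100 * GIB, 18 * GIB),
    "B":   (None, 16 * GIB, 18 * GIB),
    "B-1": ("B", 0, 18 * GIB),
}


def over_soft_limit(name):
    """True if the group's usage exceeds its own soft limit."""
    _parent, soft, usage = groups[name]
    return usage > soft


def self_and_ancestors(name):
    """Yield the group itself, then each ancestor up to the root."""
    while name is not None:
        yield name
        name = groups[name][0]


# Targeted: look only at each group's own soft limit.
targeted = sorted(n for n in groups if over_soft_limit(n))

# Hierarchical: a group is also eligible when any ancestor is over its limit.
hierarchical = sorted(
    n for n in groups if any(over_soft_limit(a) for a in self_and_ancestors(n))
)

print("targeted:    ", targeted)
print("hierarchical:", hierarchical)
```

In this model A-1 escapes the targeted policy by advertising a 100G soft limit, while the hierarchical policy still selects it because container A is over its own 16G.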

We have been at this point a couple times.  Could you please explain
what you are trying to do in the first place, why you need
hierarchies, why you configure them like you do?

Thanks


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-18 12:24 ` Johannes Weiner
@ 2012-04-18 18:00   ` Ying Han
  2012-04-19 17:04     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Ying Han @ 2012-04-18 18:00 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> [...]
>>
>> 2. this patch is slightly different from the last one posted from Johannes
>> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> where his patch is closer to the reverted implementation by doing hierarchical
>> reclaim for each selected memcg. However, that is not expected behavior from
>> user perspective. Considering the following example:
>>
>> root (32G capacity)
>> --> A (hard limit 20G, soft limit 15G, usage 16G)
>>    --> A1 (soft limit 5G, usage 4G)
>>    --> A2 (soft limit 10G, usage 12G)
>> --> B (hard limit 20G, soft limit 10G, usage 16G)
>>
>> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> exceeds softlimit. This is what admin expects by setting softlimit to the
>> actual working set size and only reclaim pages under softlimit if system has
>> trouble to reclaim.
>
> Actually, this is exactly what the admin expects when creating a
> hierarchy, because she defines that A1 is a child of A and is
> responsible for the memory situation in its parent.

> That's the single point of having a hierarchy.  Why do you create them
> if you don't want their behaviour?

I agree with hierarchical reclaim, which pushes the pressure down
from A to A1 and A2. But that only applies naturally to hard_limit,
not to soft_limit.

One of the use cases for creating a hierarchy is to get finer-grained
accounting for subsets of processes while they share the same
hardlimit.

Imagine there were no A1 and A2 and all the processes ran directly
under A to start with. The problem with that is that they all share a
single accounting, and memcg naturally provides finer-grained
accounting by creating sub-cgroups under A. After setting
"use_hierarchy" to 1, direct reclaim from A (when A hits its
hard_limit) should also reclaim from A1 and A2 regardless of each
individual usage_in_bytes, since both A1 and A2 contribute to A's
charge.

However, we need to be more selective for soft_limit, since most users
set it to protect the cgroup's working set. We don't want to
reclaim A1's anon pages when reclaiming A2's cold page cache
pages could satisfy the page allocation.

Note that setting soft_limit is always optional, unlike hard_limit. Once
the admin chooses to set it, he/she wants to protect the hot memory of
each cgroup.

>
> And A does not have its own pages (usage is just the sum of its
> children), what SHOULD its soft limit even mean in your example?

A does have pages on its LRU: pages allocated for processes running
directly under A, and also the pages re-parented after rmdir of
A1/A2. The softlimit of A covers both cases.

>
> If you had
>
>    A (hard 20G, usage 16G)
>       A1 (soft  5G, usage  4G)
>       A2 (soft 10G, usage 12G)
>    B (hard 20G, soft 10G, usage 16G)
>
> (i.e. no soft limit on A), you could reasonably make it so that on
> global reclaim, only A2 and B would get reclaimed, like you want it
> to, while still keeping the hierarchical properties of soft limits.

> If you want soft limits applied to leaf nodes only, don't set them
> anywhere else..?

No softlimit on A means leaving it at the default value:

unlimited (now): pages linked to A's LRU will not get a chance to
be reclaimed at all under softlimit reclaim.

0 (after this patch): it will always end up reclaiming from A's children.
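
The effect of the two defaults under a hierarchical rule can be sketched in Python as follows. This is an illustrative model, not kernel code; the function name eligible_groups is made up, "unlimited" is modelled as infinity, and the numbers come from the A/A1/A2 example in this thread.

```python
# Illustrative model only; this is not kernel code.
GIB = 1 << 30
UNLIMITED = float("inf")  # stands in for the current default soft limit


def eligible_groups(a_soft_limit):
    """Return the groups selected by a hierarchical soft limit check
    for a given soft limit on A (hypothetical helper)."""
    # name -> (parent, soft limit, usage)
    groups = {
        "A":  (None, a_soft_limit, 16 * GIB),
        "A1": ("A", 5 * GIB, 4 * GIB),
        "A2": ("A", 10 * GIB, 12 * GIB),
    }

    def eligible(name):
        # Walk from the group up to the root, stopping at the first
        # group that is over its soft limit.
        while name is not None:
            parent, soft, usage = groups[name]
            if usage > soft:
                return True
            name = parent
        return False

    return sorted(name for name in groups if eligible(name))


print("A unlimited:", eligible_groups(UNLIMITED))
print("A at 0:     ", eligible_groups(0))
```

With A left unlimited, A itself is never selected in this model, so pages on its own LRU are skipped; with A at 0, A is always over its limit and every child is selected along with it.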

> Ultimately, we want to support nesting memcgs within containers.  For
> this reason, they need to be applied hierarchically, or the admin of
> the host does not have soft limit control over untrusted guest groups:
>
>    container A (hard 20G, soft 16G)
>      group A-1 (soft 100G)
>    container B (hard 20G, soft 16G)
>      group B-1
>
> In this case under global memory pressure, contrary to your claims, we
> actually do want to reclaim from A-1, not just from B-1.  Otherwise, a
> container could gain priority over another one by setting ridiculous
> soft limits.

This is a mis-configuration of softlimit, assuming the machine capacity
is < 100G. I am wondering if we should design the system to accommodate
the mis-configuration, with the drawback of breaking the expectations of
a properly configured system.

> We have been at this point a couple times.  Could you please explain
> what you are trying to do in the first place, why you need
> hierarchies, why you configure them like you do?

The hierarchy is needed for sharing one hard_limit while also getting
finer-grained accounting. The soft_limit is set to protect the working
set of each cgroup on the system, and it works purely as a filter that
prioritizes the reclaim order, and only once the whole system is under
memory contention.

In my mind, soft_limit should be optional and the admin only sets it if
they know what they want to do with it. The main use case we have for it
now is to protect the working set, and that is the expectation when
they choose to set it.

--Ying

>
> Thanks


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-18 18:00   ` Ying Han
@ 2012-04-19 17:04     ` Michal Hocko
  2012-04-19 17:47       ` Ying Han
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2012-04-19 17:04 UTC
  To: Ying Han
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Wed 18-04-12 11:00:40, Ying Han wrote:
> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> >> [...]
> >>
> >> 2. this patch is slightly different from the last one posted from Johannes
> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> >> where his patch is closer to the reverted implementation by doing hierarchical
> >> reclaim for each selected memcg. However, that is not expected behavior from
> >> user perspective. Considering the following example:
> >>
> >> root (32G capacity)
> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
> >>    --> A1 (soft limit 5G, usage 4G)
> >>    --> A2 (soft limit 10G, usage 12G)
> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
> >>
> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
> >> exceeds softlimit. This is what admin expects by setting softlimit to the
> >> actual working set size and only reclaim pages under softlimit if system has
> >> trouble to reclaim.
> >
> > Actually, this is exactly what the admin expects when creating a
> > hierarchy, because she defines that A1 is a child of A and is
> > responsible for the memory situation in its parent.

Hmm, I guess that both approaches have cons and pros.
* Hierarchical soft limit reclaim - reclaim the whole subtree of the over
  soft limit memcg
  + it is consistent with the hard limit reclaim
  + easier for top to bottom configuration - especially when you allow
    subgroups to create deeper hierarchies. Does anybody do that?
  - harder to set up if soft limit should act as a guarantee - might lead
    to an unexpected reclaim.

* Targeted soft limit reclaim - only reclaim LRUs of over limit memcgs
  + easier to set up for the working set guarantee because admin can focus
    on the working set of a single group and not the whole hierarchy
  - easier to construct soft unreclaimable hierarchies - whole subtree
    contributes but nobody wants to take the responsibility when we reach
    the limit.

Both approaches don't play very well with the default 0 limit because we
either reclaim unless we set up the whole hierarchy properly or we just
burn cycles by trying to reclaim groups with no or only a few pages.
The second approach leads to more expected results though because we do
not touch "leaf" groups unless they are over limit.
I have to think about that some more but it seems that the second approach
is much easier to implement and matches the "guarantee" expectations
more.
I guess we could converge both approaches if we could reclaim from the
leaf groups upwards to the root but I didn't think about this very much.

[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-19 17:04     ` Michal Hocko
@ 2012-04-19 17:47       ` Ying Han
  2012-04-19 22:33         ` Johannes Weiner
  2012-04-20  8:11         ` Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Ying Han @ 2012-04-19 17:47 UTC
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 18-04-12 11:00:40, Ying Han wrote:
>> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> >> [...]
>> >>
>> >> 2. this patch is slightly different from the last one posted from Johannes
>> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> >> where his patch is closer to the reverted implementation by doing hierarchical
>> >> reclaim for each selected memcg. However, that is not expected behavior from
>> >> user perspective. Considering the following example:
>> >>
>> >> root (32G capacity)
>> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
>> >>    --> A1 (soft limit 5G, usage 4G)
>> >>    --> A2 (soft limit 10G, usage 12G)
>> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
>> >>
>> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> >> exceeds softlimit. This is what admin expects by setting softlimit to the
>> >> actual working set size and only reclaim pages under softlimit if system has
>> >> trouble to reclaim.
>> >
>> > Actually, this is exactly what the admin expects when creating a
>> > hierarchy, because she defines that A1 is a child of A and is
>> > responsible for the memory situation in its parent.
>
> Hmm, I guess that both approaches have cons and pros.
> * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
>  soft limit memcg
>  + it is consistent with the hard limit reclaim

Not sure why we want them to be consistent. Soft_limit serves a
different purpose, and one of its main purposes is to preserve the
working set of the cgroup.

>  + easier for top to bottom configuration - especially when you allow
>    subgroups to create deeper hierarchies. Does anybody do that?

As far as I heard, most (if not all) are using flat configuration
where everything is running under root.

>  - harder to set up if soft limit should act as a guarantee - might lead
>    to an unexpected reclaim.
>
> * Targeted soft limit reclaim - only reclaim LRUs of over limit memcgs
>  + easier to set up for the working set guarantee because admin can focus
>    on the working set of a single group and not the whole hierarchy
This is true.

>  - easier to construct soft unreclaimable hierarchies - whole subtree
>    contributes but nobody wants to take the responsibility when we reach
>    the limit.
>
> Both approaches don't play very well with the default 0 limit because we
> either reclaim unless we set up the whole hierarchy properly or we just
> burn cycles by trying to reclaim groups with no or only a few pages.

Setting the default to 0 is a good optimization which makes everybody
eligible for reclaim if the admin doesn't do anything.

In reality, if the admin wants to preserve the working set of cgroups,
he/she has to set the softlimit. By doing that, it is easier to focus
only on the cgroup itself without looking up its ancestors.

> The second approach leads to more expected results though because we do
> not touch "leaf" groups unless they are over limit.
> I have to think about that some more but it seems that the second approach
> is much easier to implement and matches the "guarantee" expectations
> more.

Agree.

> I guess we could converge both approaches if we could reclaim from the
> leaf groups upwards to the root but I didn't think about this very much.

That is what the current patch does, which only considers softlimit
under global pressure :)

--Ying
>
> [...]
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-19 17:47       ` Ying Han
@ 2012-04-19 22:33         ` Johannes Weiner
  2012-04-19 22:51           ` Johannes Weiner
                             ` (2 more replies)
  2012-04-20  8:11         ` Michal Hocko
  1 sibling, 3 replies; 25+ messages in thread
From: Johannes Weiner @ 2012-04-19 22:33 UTC
  To: Ying Han
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Wed 18-04-12 11:00:40, Ying Han wrote:
> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> >> >> [...]
> >> >>
> >> >> 2. this patch is slightly different from the last one posted from Johannes
> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> >> >> where his patch is closer to the reverted implementation by doing hierarchical
> >> >> reclaim for each selected memcg. However, that is not expected behavior from
> >> >> user perspective. Considering the following example:
> >> >>
> >> >> root (32G capacity)
> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
> >> >>    --> A1 (soft limit 5G, usage 4G)
> >> >>    --> A2 (soft limit 10G, usage 12G)
> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
> >> >>
> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
> >> >> actual working set size and only reclaim pages under softlimit if system has
> >> >> trouble to reclaim.
> >> >
> >> > Actually, this is exactly what the admin expects when creating a
> >> > hierarchy, because she defines that A1 is a child of A and is
> >> > responsible for the memory situation in its parent.
> >
> > Hmm, I guess that both approaches have cons and pros.
> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
> >  soft limit memcg
> >  + it is consistent with the hard limit reclaim
> Not sure why we want them to be consistent. Soft_limit is serving
> different purpose and the one of the main purpose is to preserve the
> working set of the cgroup.

I'd argue, given the history of cgroups, one of the main purposes is
having a machine of containers where you overcommit their hard limit
and set the soft limit accordingly to provide fairness.

Yes, we don't want to reclaim hierarchies that are below their soft
limit as long as there are some in excess, of course.  This is a flaw
and needs fixing.  But it's something completely different than
changing how the soft limit is defined and suddenly allow child
groups, which you may not trust, to override rules defined by parental
groups.

It bothers me that we should add something that will almost certainly
bite us in the future while we are discussing on the cgroups list what
would stand in the way of getting sane hierarchy semantics across
controllers to provide consistency, nesting, etc.

To support a single use case, which I feel we still have not discussed
nearly enough to justify this change.

For example, I get that you want 'meta-groups' that group together
subgroups for common accounting and hard limiting.  But I don't see
why such meta-groups have their own processes.  Conceptually, I mean,
how does a process fit into A?  Is it superior to the tasks in A1 and
A2?  Why can't it live in A3?

So here is a proposal:

Would it make sense to try to keep those meta groups always free of
their own memory so that they don't /need/ soft limits with weird
semantics?  E.g. immediately free the unused memory on rmdir, OR add
mechanisms to migrate the memory to a dedicated group:

     A
       A1 (soft-limited)
       A2 (soft-limited)
     B
     unused (soft-limited)

Move all leftover memory from finished jobs to this 'unused' group.
You could set its soft limit to 0 so that it sticks around only until
you actually need the memory for something else.

Then you would get the benefits of accounting and limiting A1 and A2
under a single umbrella without the need for a soft limit in A.  We
could keep the consistent semantics for soft limits, because you would
only have to set it on leaf nodes.

Wouldn't this work for you?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-19 22:33         ` Johannes Weiner
@ 2012-04-19 22:51           ` Johannes Weiner
  2012-04-20  7:37           ` Ying Han
  2012-04-20  8:28           ` Michal Hocko
  2 siblings, 0 replies; 25+ messages in thread
From: Johannes Weiner @ 2012-04-19 22:51 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 12:33:18AM +0200, Johannes Weiner wrote:
> On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
> > On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Wed 18-04-12 11:00:40, Ying Han wrote:
> > >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> > >> >> The "soft_limit" was introduced in memcg to support over-committing the
> > >> >> memory resource on the host. Each cgroup configures its "hard_limit" where
> > >> >> it will be throttled or OOM killed by going over the limit. However, the
> > >> >> cgroup can go above the "soft_limit" as long as there is no system-wide
> > >> >> memory contention. So, the "soft_limit" is the kernel mechanism for
> > >> >> re-distributing system spare memory among cgroups.
> > >> >>
> > >> >> This patch reworks the softlimit reclaim by hooking it into the new global
> > >> >> reclaim scheme. So the global reclaim path including direct reclaim and
> > >> >> background reclaim will respect the memcg softlimit.
> > >> >>
> > >> >> v3..v2:
> > >> >> 1. rebase the patch on 3.4-rc3
> > >> >> 2. squash the commits of replacing the old implementation with new
> > >> >> implementation into one commit. This is to make sure to leave the tree
> > >> >> in stable state between each commit.
> > >> >> 3. removed the commit which changes the nr_to_reclaim for global reclaim
> > >> >> case. The need of that patch is not obvious now.
> > >> >>
> > >> >> Note:
> > >> >> 1. the new implementation of softlimit reclaim is rather simple and first
> > >> >> step for further optimizations. there is no memory pressure balancing between
> > >> >> memcgs for each zone, and that is something we would like to add as follow-ups.
> > >> >>
> > >> >> 2. this patch is slightly different from the last one posted from Johannes
> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> > >> >> where his patch is closer to the reverted implementation by doing hierarchical
> > >> >> reclaim for each selected memcg. However, that is not expected behavior from
> > >> >> user perspective. Considering the following example:
> > >> >>
> > >> >> root (32G capacity)
> > >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
> > >> >>    --> A1 (soft limit 5G, usage 4G)
> > >> >>    --> A2 (soft limit 10G, usage 12G)
> > >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
> > >> >>
> > >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
> > >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
> > >> >> actual working set size and only reclaim pages under softlimit if system has
> > >> >> trouble to reclaim.
> > >> >
> > >> > Actually, this is exactly what the admin expects when creating a
> > >> > hierarchy, because she defines that A1 is a child of A and is
> > >> > responsible for the memory situation in its parent.
> > >
> > > Hmm, I guess that both approaches have cons and pros.
> > > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
> > >  soft limit memcg
> > >  + it is consistent with the hard limit reclaim
> > Not sure why we want them to be consistent. Soft_limit is serving
> > different purpose and the one of the main purpose is to preserve the
> > working set of the cgroup.
> 
> I'd argue, given the history of cgroups, one of the main purposes is
> having a machine of containers where you overcommit their hard limit
> and set the soft limit accordingly to provide fairness.
> 
> Yes, we don't want to reclaim hierarchies that are below their soft
> limit as long as there are some in excess, of course.  This is a flaw
> and needs fixing.  But it's something completely different than
> changing how the soft limit is defined and suddenly allow child
> groups, which you may not trust, to override rules defined by parental
> groups.
> 
> It bothers me that we should add something that will almost certainly
> bite us in the future while we are discussing on the cgroups list what
> would stand in the way of getting sane hierarchy semantics across
> controllers to provide consistency, nesting, etc.
> 
> To support a single use case, which I feel we still have not discussed
> nearly enough to justify this change.
> 
> For example, I get that you want 'meta-groups' that group together
> subgroups for common accounting and hard limiting.  But I don't see
> why such meta-groups have their own processes.  Conceptually, I mean,
> how does a process fit into A?  Is it superior to the tasks in A1 and
> A2?  Why can't it live in A3?
> 
> So here is a proposal:
> 
> Would it make sense to try to keep those meta groups always free of
> their own memory so that they don't /need/ soft limits with weird
> semantics?  E.g. immediately free the unused memory on rmdir, OR add
> mechanisms to migrate the memory to a dedicated group:
> 
>      A
>        A1 (soft-limited)
>        A2 (soft-limited)
>      B
>      unused (soft-limited)
> 
> Move all leftover memory from finished jobs to this 'unused' group.
> You could set its soft limit to 0 so that it sticks around only until
> you actually need the memory for something else.
> 
> Then you would get the benefits of accounting and limiting A1 and A2
> under a single umbrella without the need for a soft limit in A.  We
> could keep the consistent semantics for soft limits, because you would
> only have to set it on leaf nodes.
> 
> Wouldn't this work for you?

Or, if the frequency of job creation and completion permits, just keep
the original groups around after completion, set their soft limit to
0, put a watch ("threshold notification") on their usage, and reap
them once global pressure has finally cleaned them out.
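
For illustration only, here is a rough C sketch of how such a watch could be
registered through the cgroup v1 memory controller's threshold interface
(an eventfd plus a "<event_fd> <usage_fd> <threshold>" line written to
cgroup.event_control). The helper names, the directory argument, and the
choice of threshold are assumptions of this sketch. It also assumes that the
descriptors used only for registration can be closed afterwards.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the "<event_fd> <usage_fd> <threshold>" registration line.
 * Returns the line length, or -1 if @buf is too small.
 * (Hypothetical helper, named for this sketch only.)
 */
static int format_threshold_line(char *buf, size_t len, int event_fd,
				 int usage_fd, unsigned long long threshold)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%d %d %llu", event_fd, usage_fd, threshold);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register a usage threshold on the memcg directory @dir.
 * Returns an eventfd the caller can read(2) or poll(2), or -1 on error.
 * (Hypothetical helper; @dir is wherever the memory controller is mounted.)
 */
int watch_usage_threshold(const char *dir, unsigned long long threshold)
{
	char path[4096], line[64];
	int efd, ufd, cfd, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", dir);
	ufd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	cfd = open(path, O_WRONLY);

	n = format_threshold_line(line, sizeof(line), efd, ufd, threshold);
	if (ufd < 0 || cfd < 0 || n < 0 || write(cfd, line, n) != n) {
		close(efd);
		efd = -1;
	}
	if (ufd >= 0)
		close(ufd);
	if (cfd >= 0)
		close(cfd);
	return efd;
}
```

A job manager could block in read(2) on the returned descriptor and rmdir
the finished group once the event fires and the usage has drained.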


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-19 22:33         ` Johannes Weiner
  2012-04-19 22:51           ` Johannes Weiner
@ 2012-04-20  7:37           ` Ying Han
  2012-04-20  8:21             ` KAMEZAWA Hiroyuki
  2012-04-20 13:17             ` Johannes Weiner
  2012-04-20  8:28           ` Michal Hocko
  2 siblings, 2 replies; 25+ messages in thread
From: Ying Han @ 2012-04-20  7:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Thu, Apr 19, 2012 at 3:33 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
>> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Wed 18-04-12 11:00:40, Ying Han wrote:
>> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> >> >> The "soft_limit" was introduced in memcg to support over-committing the
>> >> >> memory resource on the host. Each cgroup configures its "hard_limit" where
>> >> >> it will be throttled or OOM killed by going over the limit. However, the
>> >> >> cgroup can go above the "soft_limit" as long as there is no system-wide
>> >> >> memory contention. So, the "soft_limit" is the kernel mechanism for
>> >> >> re-distributing system spare memory among cgroups.
>> >> >>
>> >> >> This patch reworks the softlimit reclaim by hooking it into the new global
>> >> >> reclaim scheme. So the global reclaim path including direct reclaim and
>> >> >> background reclaim will respect the memcg softlimit.
>> >> >>
>> >> >> v3..v2:
>> >> >> 1. rebase the patch on 3.4-rc3
>> >> >> 2. squash the commits of replacing the old implementation with new
>> >> >> implementation into one commit. This is to make sure to leave the tree
>> >> >> in stable state between each commit.
>> >> >> 3. removed the commit which changes the nr_to_reclaim for global reclaim
>> >> >> case. The need of that patch is not obvious now.
>> >> >>
>> >> >> Note:
>> >> >> 1. the new implementation of softlimit reclaim is rather simple and first
>> >> >> step for further optimizations. there is no memory pressure balancing between
>> >> >> memcgs for each zone, and that is something we would like to add as follow-ups.
>> >> >>
>> >> >> 2. this patch is slightly different from the last one posted from Johannes
>> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> >> >> where his patch is closer to the reverted implementation by doing hierarchical
>> >> >> reclaim for each selected memcg. However, that is not expected behavior from
>> >> >> user perspective. Considering the following example:
>> >> >>
>> >> >> root (32G capacity)
>> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
>> >> >>    --> A1 (soft limit 5G, usage 4G)
>> >> >>    --> A2 (soft limit 10G, usage 12G)
>> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
>> >> >>
>> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
>> >> >> actual working set size and only reclaim pages under softlimit if system has
>> >> >> trouble to reclaim.
>> >> >
>> >> > Actually, this is exactly what the admin expects when creating a
>> >> > hierarchy, because she defines that A1 is a child of A and is
>> >> > responsible for the memory situation in its parent.
>> >
>> > Hmm, I guess that both approaches have cons and pros.
>> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
>> >  soft limit memcg
>> >  + it is consistent with the hard limit reclaim
>> Not sure why we want them to be consistent. Soft_limit is serving
>> different purpose and the one of the main purpose is to preserve the
>> working set of the cgroup.
>
> I'd argue, given the history of cgroups, one of the main purposes is
> having a machine of containers where you overcommit their hard limit
> and set the soft limit accordingly to provide fairness.
>
> Yes, we don't want to reclaim hierarchies that are below their soft
> limit as long as there are some in excess, of course.  This is a flaw
> and needs fixing.  But it's something completely different than
> changing how the soft limit is defined and suddenly allow child
> groups, which you may not trust, to override rules defined by parental
> groups.
>
> It bothers me that we should add something that will almost certainly
> bite us in the future while we are discussing on the cgroups list what
> would stand in the way of getting sane hierarchy semantics across
> controllers to provide consistency, nesting, etc.

I understand the concern here, and I don't want soft_limit reclaim to
drift far from the rest of the cgroup design down the road. On the
other hand, I don't think the current implementation goes against the
hierarchy semantics entirely. See the comment below :)

>
> To support a single use case, which I feel we still have not discussed
> nearly enough to justify this change.
>
> For example, I get that you want 'meta-groups' that group together
> subgroups for common accounting and hard limiting.  But I don't see
> why such meta-groups have their own processes.  Conceptually, I mean,
> how does a process fit into A?  Is it superior to the tasks in A1 and
> A2?  Why can't it live in A3?

For user processes, I can see that it is totally feasible for them to
live in A3. The case I was thinking of is kernel threads, where 1) we
don't want to limit their memory usage and 2) they serve the whole
group rather than individual jobs. Of course, we could put those kernel
threads in A3 and leave that cgroup unlimited, but I am not sure we
should constrain ourselves to not having any processes running under A.

>
> So here is a proposal:
>
> Would it make sense to try to keep those meta groups always free of
> their own memory so that they don't /need/ soft limits with weird
> semantics?  E.g. immediately free the unused memory on rmdir, OR add
> mechanisms to migrate the memory to a dedicated group:
>
>     A
>       A1 (soft-limited)
>       A2 (soft-limited)
>     B
>     unused (soft-limited)
>
> Move all leftover memory from finished jobs to this 'unused' group.
> You could set its soft limit to 0 so that it sticks around only until
> you actually need the memory for something else.
>
> Then you would get the benefits of accounting and limiting A1 and A2
> under a single umbrella without the need for a soft limit in A.  We
> could keep the consistent semantics for soft limits, because you would
> only have to set it on leaf nodes.
>
> Wouldn't this work for you?

To be frank, this sounds like a lot of extra work for the admin to
manage the system, and we still cannot totally prevent pages from
landing in A.

Back to the current proposal, there are two concerns that I can see so far:

1. skipping an "untrusted" cgroup in case it sets its soft_limit very high:
Here, we don't always skip the "untrusted" cgroup. We do reclaim from
it if not enough progress is made from the other cgroups above their
softlimit. So, I don't see a problem here.

2. not reclaiming based on hierarchy:
Here I am not checking the ancestor's soft_limit in
should_reclaim_mem_cgroup(). It will only make a difference if A is
under soft_limit and A1 is above soft_limit. You do agree that we
shouldn't reclaim from groups under their softlimit while there are
cgroups exceeding theirs. That leads me to think of something like
the following:

1. for priority > DEF_PRIORITY - 3, only reclaim memcgs above their softlimit
2. for priority <= DEF_PRIORITY - 3, besides 1), also look at the memcg's
ancestors: reclaim memcgs with an ancestor above its soft_limit
3. for priority == 0, reclaim everything.

Then it keeps the softlimit guarantee to a certain degree while also
falling back to hierarchical reclaim if the first few rounds don't
fulfill the request.
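
For illustration only, the three rules above could be sketched in C roughly
as follows. This is not the posted patch: struct memcg and
should_reclaim_memcg() are hypothetical simplifications of struct mem_cgroup
and should_reclaim_mem_cgroup(), and DEF_PRIORITY is assumed to be 12 as in
mm/vmscan.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* assumed to match mm/vmscan.c */

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct memcg {
	struct memcg *parent;		/* NULL for the root */
	unsigned long long usage;	/* bytes charged, children included */
	unsigned long long soft_limit;	/* bytes */
};

static bool over_soft_limit(const struct memcg *memcg)
{
	return memcg->usage > memcg->soft_limit;
}

/* True if any ancestor of @memcg exceeds its own soft limit. */
static bool ancestor_over_soft_limit(const struct memcg *memcg)
{
	const struct memcg *pos;

	for (pos = memcg->parent; pos; pos = pos->parent)
		if (over_soft_limit(pos))
			return true;
	return false;
}

/* Decide whether global reclaim at @priority should scan @memcg. */
static bool should_reclaim_memcg(const struct memcg *memcg, int priority)
{
	assert(memcg != NULL);

	if (priority == 0)			/* rule 3: reclaim everything */
		return true;
	if (over_soft_limit(memcg))		/* rule 1: over its own limit */
		return true;
	if (priority <= DEF_PRIORITY - 3)	/* rule 2: check the ancestors */
		return ancestor_over_soft_limit(memcg);
	return false;
}
```

With the example hierarchy from the cover letter, A1 (usage 4G, soft limit
5G) is skipped at DEF_PRIORITY but becomes eligible at DEF_PRIORITY - 3,
because its parent A (usage 16G, soft limit 15G) is over its soft limit.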

--Ying


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-19 17:47       ` Ying Han
  2012-04-19 22:33         ` Johannes Weiner
@ 2012-04-20  8:11         ` Michal Hocko
  2012-04-20 17:22           ` Ying Han
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2012-04-20  8:11 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Thu 19-04-12 10:47:27, Ying Han wrote:
> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Wed 18-04-12 11:00:40, Ying Han wrote:
> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> >> >> The "soft_limit" was introduced in memcg to support over-committing the
> >> >> memory resource on the host. Each cgroup configures its "hard_limit" where
> >> >> it will be throttled or OOM killed by going over the limit. However, the
> >> >> cgroup can go above the "soft_limit" as long as there is no system-wide
> >> >> memory contention. So, the "soft_limit" is the kernel mechanism for
> >> >> re-distributing system spare memory among cgroups.
> >> >>
> >> >> This patch reworks the softlimit reclaim by hooking it into the new global
> >> >> reclaim scheme. So the global reclaim path including direct reclaim and
> >> >> background reclaim will respect the memcg softlimit.
> >> >>
> >> >> v3..v2:
> >> >> 1. rebase the patch on 3.4-rc3
> >> >> 2. squash the commits of replacing the old implementation with new
> >> >> implementation into one commit. This is to make sure to leave the tree
> >> >> in stable state between each commit.
> >> >> 3. removed the commit which changes the nr_to_reclaim for global reclaim
> >> >> case. The need of that patch is not obvious now.
> >> >>
> >> >> Note:
> >> >> 1. the new implementation of softlimit reclaim is rather simple and first
> >> >> step for further optimizations. there is no memory pressure balancing between
> >> >> memcgs for each zone, and that is something we would like to add as follow-ups.
> >> >>
> >> >> 2. this patch is slightly different from the last one posted from Johannes
> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> >> >> where his patch is closer to the reverted implementation by doing hierarchical
> >> >> reclaim for each selected memcg. However, that is not expected behavior from
> >> >> user perspective. Considering the following example:
> >> >>
> >> >> root (32G capacity)
> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
> >> >>    --> A1 (soft limit 5G, usage 4G)
> >> >>    --> A2 (soft limit 10G, usage 12G)
> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
> >> >>
> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
> >> >> actual working set size and only reclaim pages under softlimit if system has
> >> >> trouble to reclaim.
> >> >
> >> > Actually, this is exactly what the admin expects when creating a
> >> > hierarchy, because she defines that A1 is a child of A and is
> >> > responsible for the memory situation in its parent.
> >
> > Hmm, I guess that both approaches have cons and pros.
> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
> >  soft limit memcg
> >  + it is consistent with the hard limit reclaim
> Not sure why we want them to be consistent. Soft_limit is serving
> different purpose and the one of the main purpose is to preserve the
> working set of the cgroup.

Well, the cgroups subsystem is moving towards unification, so all the
controllers should live in one hierarchy, and it would be nice if we had
a common view on what hard and soft limits mean wrt. hierarchies. It is
true that memcg is the only user of the soft limit at the moment, but it
would be better if we were prepared for future users as well and
didn't come up with one-shot solutions.

> >  + easier for top to bottom configuration - especially when you allow
> >    subgroups to create deeper hierarchies. Does anybody do that?
> 
> As far as I heard, most (if not all) are using flat configuration
> where everything is running under root.

Might be true for memcg but what about other controllers?

[...]
> > Both approaches don't play very well with the default 0 limit because we
> > either reclaim unless we set up the whole hierarchy properly or we just
> > burn cycles by trying to reclaim groups wit no or only few pages.
> 
> Setting the default to 0 is a good optimization which makes everybody
> to be eligible for reclaim if admin doesn't do anything.
> 
> In reality, if admin want to preserve working set of cgroups and
> he/she has to set the softlimit. By doing that, it is easier to only
> focus on the cgroup itself without looking up its ancestors.

I guess it is not that clear who should be responsible for setting the
limit. Should it be the admin or rather the workload owner? Because this
changes a lot.

> 
> > The second approach leads to more expected results though because we do
> > not touch "leaf" groups unless they are over limit.
> > I have to think about that some more but it seems that the second approach
> > is much easier to implement and matches the "guarantee" expectations
> > more.
> 
> Agree.
> 
> > I guess we could converge both approaches if we could reclaim from the
> > leaf groups upwards to the root but I didn't think about this very much.
> 
> That is what the current patch does, which only consider softlimit
> under global pressure :)

Not really, because your patch iterates sequentially from top to bottom.
I was thinking about iterating from the leaves and doing the hierarchical
reclaim from the first one which is over the limit. This would uncharge
from the parent as well, so it could get down under its limit, and if not
then we can hammer on the siblings. But, as I said, I didn't give this
much thought, and it surely comes with its own set of issues (including
inconsistency with the hard limit reclaim ;))
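
For illustration only, the leaf-first idea can be sketched in C as below.
The sketch assumes hierarchical accounting (a parent's usage includes its
children's), models reclaim as a plain subtraction, and uses hypothetical
names throughout. It is not taken from any posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct memcg {
	struct memcg *parent;		/* NULL for the root */
	unsigned long long usage;	/* bytes charged, children included */
	unsigned long long soft_limit;	/* bytes */
};

/* Bytes by which @memcg exceeds its soft limit, or 0. */
static unsigned long long soft_limit_excess(const struct memcg *memcg)
{
	return memcg->usage > memcg->soft_limit ?
	       memcg->usage - memcg->soft_limit : 0;
}

/* Pretend to reclaim @nr bytes from @memcg and uncharge every ancestor. */
static void reclaim_and_uncharge(struct memcg *memcg, unsigned long long nr)
{
	struct memcg *pos;

	assert(memcg != NULL && nr <= memcg->usage);
	for (pos = memcg; pos; pos = pos->parent)
		pos->usage -= nr;
}

/*
 * Reclaim the soft limit excess of @leaf, then report how far its parent
 * is still over its own soft limit. A result of 0 means the siblings can
 * be left alone.
 */
static unsigned long long reclaim_leaf_first(struct memcg *leaf)
{
	unsigned long long excess = soft_limit_excess(leaf);

	if (excess)
		reclaim_and_uncharge(leaf, excess);
	return leaf->parent ? soft_limit_excess(leaf->parent) : 0;
}
```

With the numbers from the cover letter, reclaiming the 2G excess from A2
brings A from 16G down to 14G, below its 15G soft limit, so A1 is never
touched.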

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20  7:37           ` Ying Han
@ 2012-04-20  8:21             ` KAMEZAWA Hiroyuki
  2012-04-20 14:17               ` Rik van Riel
  2012-04-20 13:17             ` Johannes Weiner
  1 sibling, 1 reply; 25+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-20  8:21 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, Michal Hocko, Mel Gorman, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

(2012/04/20 16:37), Ying Han wrote:

> On Thu, Apr 19, 2012 at 3:33 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
>>> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
>>>> On Wed 18-04-12 11:00:40, Ying Han wrote:
>>>>> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>>> On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>>>>>>> The "soft_limit" was introduced in memcg to support over-committing the
>>>>>>> memory resource on the host. Each cgroup configures its "hard_limit" where
>>>>>>> it will be throttled or OOM killed by going over the limit. However, the
>>>>>>> cgroup can go above the "soft_limit" as long as there is no system-wide
>>>>>>> memory contention. So, the "soft_limit" is the kernel mechanism for
>>>>>>> re-distributing system spare memory among cgroups.
>>>>>>>
>>>>>>> This patch reworks the softlimit reclaim by hooking it into the new global
>>>>>>> reclaim scheme. So the global reclaim path including direct reclaim and
>>>>>>> background reclaim will respect the memcg softlimit.
>>>>>>>
>>>>>>> v3..v2:
>>>>>>> 1. rebase the patch on 3.4-rc3
>>>>>>> 2. squash the commits of replacing the old implementation with new
>>>>>>> implementation into one commit. This is to make sure to leave the tree
>>>>>>> in stable state between each commit.
>>>>>>> 3. removed the commit which changes the nr_to_reclaim for global reclaim
>>>>>>> case. The need of that patch is not obvious now.
>>>>>>>
>>>>>>> Note:
>>>>>>> 1. the new implementation of softlimit reclaim is rather simple and first
>>>>>>> step for further optimizations. there is no memory pressure balancing between
>>>>>>> memcgs for each zone, and that is something we would like to add as follow-ups.
>>>>>>>
>>>>>>> 2. this patch is slightly different from the last one posted from Johannes
>>>>>>> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>>>>>>> where his patch is closer to the reverted implementation by doing hierarchical
>>>>>>> reclaim for each selected memcg. However, that is not expected behavior from
>>>>>>> user perspective. Considering the following example:
>>>>>>>
>>>>>>> root (32G capacity)
>>>>>>> --> A (hard limit 20G, soft limit 15G, usage 16G)
>>>>>>>    --> A1 (soft limit 5G, usage 4G)
>>>>>>>    --> A2 (soft limit 10G, usage 12G)
>>>>>>> --> B (hard limit 20G, soft limit 10G, usage 16G)
>>>>>>>
>>>>>>> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>>>>>>> exceeds softlimit. This is what admin expects by setting softlimit to the
>>>>>>> actual working set size and only reclaim pages under softlimit if system has
>>>>>>> trouble to reclaim.
>>>>>>
>>>>>> Actually, this is exactly what the admin expects when creating a
>>>>>> hierarchy, because she defines that A1 is a child of A and is
>>>>>> responsible for the memory situation in its parent.
>>>>
>>>> Hmm, I guess that both approaches have cons and pros.
>>>> * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
>>>>  soft limit memcg
>>>>  + it is consistent with the hard limit reclaim
>>> Not sure why we want them to be consistent. Soft_limit is serving
>>> different purpose and the one of the main purpose is to preserve the
>>> working set of the cgroup.
>>
>> I'd argue, given the history of cgroups, one of the main purposes is
>> having a machine of containers where you overcommit their hard limit
>> and set the soft limit accordingly to provide fairness.
>>
>> Yes, we don't want to reclaim hierarchies that are below their soft
>> limit as long as there are some in excess, of course.  This is a flaw
>> and needs fixing.  But it's something completely different than
>> changing how the soft limit is defined and suddenly allow child
>> groups, which you may not trust, to override rules defined by parental
>> groups.
>>
>> It bothers me that we should add something that will almost certainly
>> bite us in the future while we are discussing on the cgroups list what
>> would stand in the way of getting sane hierarchy semantics across
>> controllers to provide consistency, nesting, etc.
> 
> I understand the concern here and I don't want the soft_limit reclaim
> to be far away from the other part of the cgroup design down to the
> road. On the other hand, I don't think the current implementation is
> against the hierarchy semantics totally. See the comment below :)
> 
>>
>> To support a single use case, which I feel we still have not discussed
>> nearly enough to justify this change.
>>
>> For example, I get that you want 'meta-groups' that group together
>> subgroups for common accounting and hard limiting.  But I don't see
>> why such meta-groups have their own processes.  Conceptually, I mean,
>> how does a process fit into A?  Is it superior to the tasks in A1 and
>> A2?  Why can't it live in A3?
> 
> For user processes, I can see that is totally feasible to live in A3.
> The case I was thinking is kernel threads, which 1) we don't want to
> limit their memory usage 2) they  serve for the whole group unlike
> individual jobs. Of course, we could say that putting those kernel
> thread in A3 and leave the cgroup to unlimited, but not sure if we
> should constrain ourselves not having any processes running under A.
> 
>>
>> So here is a proposal:
>>
>> Would it make sense to try to keep those meta groups always free of
>> their own memory so that they don't /need/ soft limits with weird
>> semantics?  E.g. immediately free the unused memory on rmdir, OR add
>> mechanisms to migrate the memory to a dedicated group:
>>
>>     A
>>       A1 (soft-limited)
>>       A2 (soft-limited)
>>     B
>>     unused (soft-limited)
>>
>> Move all leftover memory from finished jobs to this 'unused' group.
>> You could set its soft limit to 0 so that it sticks around only until
>> you actually need the memory for something else.
>>
>> Then you would get the benefits of accounting and limiting A1 and A2
>> under a single umbrella without the need for a soft limit in A.  We
>> could keep the consistent semantics for soft limits, because you would
>> only have to set it on leaf nodes.
>>
>> Wouldn't this work for you?
> 
> To be frankly, this sounds a lot of extra work for admin to manage the
> system and we still can not prevent page being landed on A totally.
> 
> Back to the current proposal, there are two concerns that I can tell by far:
> 
> 1. skipping "not trust" cgroup in case it sets its soft_limit very high:
> Here, we don't skip the "not trust" cgroup always. We do reclaim from
> them if not enough progress made from other cgroups above the
> softlimit. So, I don't see a problem here.
> 
> 2. not reclaiming based on hierarchy:
> Here I am not checking the ancestor's soft_limit in
> should_reclaim_mem_cgroup(). And it will only make difference if A is
> under soft_limit and A1 is above soft_limit. Now you do agree that we
> shouldn't reclaim from those under softlimit groups if there are
> cgroup exeed their softlimit. Then it leads me to think something like
> the following:
> 
> 1. for priority > DEF_PRIORITY - 3, only reclaim memcg above their softlimit
> 2. for priority <= DEF_PRIORITY - 3, besides 1), also look at memcg's
> ancestor. reclaim memcgs whose ancestor above soft_limit
> 3. for priority == 0, reclaim everything.
> 
> Then it has the guarantee of the softlimit at certain level while also
> considers the hierarchy reclaim if the first few rounds doesn't
> fulfill the request.
> 


Seems complicated. I vote for "Hierarchical soft limit reclaim".


If you need smart victim selection under hierarchy, please implement a
victim scheduler which chooses A2 rather than A and A1. I think you
can do it.
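
A minimal user-space sketch of what such victim selection could look
like, assuming the "smart" choice is simply the group with the largest
excess over its own soft limit. The struct and both function names are
hypothetical and are not taken from the patch or from the kernel.

```c
#include <stddef.h>

/* Hypothetical user-space model of a memcg, not the kernel's struct mem_cgroup. */
struct memcg {
	const char *name;
	long long usage;      /* hierarchical usage, e.g. in GiB */
	long long soft_limit; /* soft limit, same unit as usage */
};

/* Excess over the soft limit; zero or negative when the group is within it. */
static long long soft_excess(const struct memcg *m)
{
	return m->usage - m->soft_limit;
}

/*
 * Pick the group with the largest positive excess from a subtree that is
 * passed in as a flat array. Returns NULL if no group is over its limit.
 */
static const struct memcg *pick_victim(const struct memcg *groups, size_t n)
{
	const struct memcg *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (soft_excess(&groups[i]) <= 0)
			continue;
		if (!victim || soft_excess(&groups[i]) > soft_excess(victim))
			victim = &groups[i];
	}
	return victim;
}
```

With the numbers from the example hierarchy (A 16G/15G, A1 4G/5G,
A2 12G/10G), A2 has the largest excess and would be picked ahead of
A and A1.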

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-19 22:33         ` Johannes Weiner
  2012-04-19 22:51           ` Johannes Weiner
  2012-04-20  7:37           ` Ying Han
@ 2012-04-20  8:28           ` Michal Hocko
  2 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2012-04-20  8:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri 20-04-12 00:33:18, Johannes Weiner wrote:
> On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
> > On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
[...]
> > > Hmm, I guess that both approaches have cons and pros.
> > > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
> > >  soft limit memcg
> > >  + it is consistent with the hard limit reclaim
> > Not sure why we want them to be consistent. Soft_limit is serving
> > different purpose and the one of the main purpose is to preserve the
> > working set of the cgroup.
> 
> I'd argue, given the history of cgroups, one of the main purposes is
> having a machine of containers where you overcommit their hard limit
> and set the soft limit accordingly to provide fairness.
> 
> Yes, we don't want to reclaim hierarchies that are below their soft
> limit as long as there are some in excess, of course.  This is a flaw
> and needs fixing.  But it's something completely different than
> changing how the soft limit is defined and suddenly allow child
> groups, which you may not trust, to override rules defined by parental
> groups.

As I wrote in the other email: who is allowed to set the limit? The owner
of the container? If yes, then how is the admin supposed to set the top
limit for the container? The default (0) will not work, right?

> 
> It bothers me that we should add something that will almost certainly
> bite us in the future while we are discussing on the cgroups list what
> would stand in the way of getting sane hierarchy semantics across
> controllers to provide consistency, nesting, etc.
> 
> To support a single use case, which I feel we still have not discussed
> nearly enough to justify this change.
> 
> For example, I get that you want 'meta-groups' that group together
> subgroups for common accounting and hard limiting.  But I don't see
> why such meta-groups have their own processes.  Conceptually, I mean,
> how does a process fit into A?  Is it superior to the tasks in A1 and
> A2?  Why can't it live in A3?

That was my thinking as well, but it will get harder if we really want to
have a unified hierarchy for all controllers.
Consider a school lab with per-user groups which basically limit cpu
bandwidth and the maximum amount of memory by hard limit (soft limit 0).
If a user would like to run a workload which would benefit from resident
memory, he could create a subgroup and set a soft limit. All other tasks
would be executed in his native group by default, because we probably do
not want him to think about cgroups for all tasks.

[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20  7:37           ` Ying Han
  2012-04-20  8:21             ` KAMEZAWA Hiroyuki
@ 2012-04-20 13:17             ` Johannes Weiner
  2012-04-20 17:44               ` Ying Han
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Weiner @ 2012-04-20 13:17 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 12:37:41AM -0700, Ying Han wrote:
> On Thu, Apr 19, 2012 at 3:33 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
> >> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> > On Wed 18-04-12 11:00:40, Ying Han wrote:
> >> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> >> >> >> 2. this patch is slightly different from the last one posted from Johannes
> >> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> >> >> >> where his patch is closer to the reverted implementation by doing hierarchical
> >> >> >> reclaim for each selected memcg. However, that is not expected behavior from
> >> >> >> user perspective. Considering the following example:
> >> >> >>
> >> >> >> root (32G capacity)
> >> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
> >> >> >>    --> A1 (soft limit 5G, usage 4G)
> >> >> >>    --> A2 (soft limit 10G, usage 12G)
> >> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
> >> >> >>
> >> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
> >> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
> >> >> >> actual working set size and only reclaim pages under softlimit if system has
> >> >> >> trouble to reclaim.
> >> >> >
> >> >> > Actually, this is exactly what the admin expects when creating a
> >> >> > hierarchy, because she defines that A1 is a child of A and is
> >> >> > responsible for the memory situation in its parent.
> >> >
> >> > Hmm, I guess that both approaches have cons and pros.
> >> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
> >> >  soft limit memcg
> >> >  + it is consistent with the hard limit reclaim
> >> Not sure why we want them to be consistent. Soft_limit is serving
> >> different purpose and the one of the main purpose is to preserve the
> >> working set of the cgroup.
> >
> > I'd argue, given the history of cgroups, one of the main purposes is
> > having a machine of containers where you overcommit their hard limit
> > and set the soft limit accordingly to provide fairness.
> >
> > Yes, we don't want to reclaim hierarchies that are below their soft
> > limit as long as there are some in excess, of course.  This is a flaw
> > and needs fixing.  But it's something completely different than
> > changing how the soft limit is defined and suddenly allow child
> > groups, which you may not trust, to override rules defined by parental
> > groups.
> >
> > It bothers me that we should add something that will almost certainly
> > bite us in the future while we are discussing on the cgroups list what
> > would stand in the way of getting sane hierarchy semantics across
> > controllers to provide consistency, nesting, etc.
> 
> I understand the concern here and I don't want the soft_limit reclaim
> to be far away from the other part of the cgroup design down to the
> road. On the other hand, I don't think the current implementation is
> against the hierarchy semantics totally. See the comment below :)
> 
> > To support a single use case, which I feel we still have not discussed
> > nearly enough to justify this change.
> >
> > For example, I get that you want 'meta-groups' that group together
> > subgroups for common accounting and hard limiting.  But I don't see
> > why such meta-groups have their own processes.  Conceptually, I mean,
> > how does a process fit into A?  Is it superior to the tasks in A1 and
> > A2?  Why can't it live in A3?
> 
> For user processes, I can see that is totally feasible to live in A3.
> The case I was thinking is kernel threads, which 1) we don't want to
> limit their memory usage 2) they  serve for the whole group unlike
> individual jobs. Of course, we could say that putting those kernel
> thread in A3 and leave the cgroup to unlimited, but not sure if we
> should constrain ourselves not having any processes running under A.

That's just handwaving.

> > So here is a proposal:
> >
> > Would it make sense to try to keep those meta groups always free of
> > their own memory so that they don't /need/ soft limits with weird
> > semantics?  E.g. immediately free the unused memory on rmdir, OR add
> > mechanisms to migrate the memory to a dedicated group:
> >
> >     A
> >       A1 (soft-limited)
> >       A2 (soft-limited)
> >     B
> >     unused (soft-limited)
> >
> > Move all leftover memory from finished jobs to this 'unused' group.
> > You could set its soft limit to 0 so that it sticks around only until
> > you actually need the memory for something else.
> >
> > Then you would get the benefits of accounting and limiting A1 and A2
> > under a single umbrella without the need for a soft limit in A.  We
> > could keep the consistent semantics for soft limits, because you would
> > only have to set it on leaf nodes.
> >
> > Wouldn't this work for you?
> 
> To be frankly, this sounds a lot of extra work for admin to manage the
> system and we still can not prevent page being landed on A totally.

Why not?

And what extra work are we talking about here?  As I wrote in the followup
mail: just keep the finished job groups around, set their soft limit
to 0.  Surely you have a userspace job scheduler that sets up these
groups in the first place and could be trivially extended to set soft
limits and watch for notifications.
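
As a rough illustration of the userspace side, the sketch below writes a
soft limit of 0 into a finished job's group. It assumes the cgroup v1
memory controller file memory.soft_limit_in_bytes; the directory layout,
the function names and the idea of a "retire" hook in the job scheduler
are hypothetical.

```c
#include <stdio.h>

/*
 * Write a soft limit (in bytes) into a cgroup v1 memory controller
 * directory. `cgroup_dir` is a hypothetical path such as
 * /sys/fs/cgroup/memory/jobs/job42; the mount point is an assumption.
 * Returns 0 on success, -1 on failure.
 */
static int set_soft_limit(const char *cgroup_dir, unsigned long long bytes)
{
	char path[4096];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/memory.soft_limit_in_bytes",
		     cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%llu\n", bytes) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Hypothetical hook: called by the job scheduler once a job has exited. */
static int retire_job_group(const char *cgroup_dir)
{
	/* Soft limit 0 makes the leftover memory the first reclaim candidate. */
	return set_soft_limit(cgroup_dir, 0);
}
```

The group itself is kept around, so its leftover page cache stays
accounted until global reclaim actually needs the memory.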

Let me repeat the pros here: no breaking of existing semantics.  No
introduction of unprecedented semantics into the cgroup mess.  No
changing of kernel code necessary (except what we want to tune
anyway).  No computational overhead for you or anyone else.

If your only counter argument to this is that you can't be bothered to
slightly adjust your setup, I'm no longer interested in this
discussion.

> Back to the current proposal, there are two concerns that I can tell by far:
> 
> 1. skipping "not trust" cgroup in case it sets its soft_limit very high:
> Here, we don't skip the "not trust" cgroup always. We do reclaim from
> them if not enough progress made from other cgroups above the
> softlimit. So, I don't see a problem here.

When you decide to reclaim from groups below their soft limit.

Which means that an untrusted group can force global reclaim to go for
the workingset in other groups.

> 2. not reclaiming based on hierarchy:
> Here I am not checking the ancestor's soft_limit in
> should_reclaim_mem_cgroup(). And it will only make difference if A is
> under soft_limit and A1 is above soft_limit. Now you do agree that we
> shouldn't reclaim from those under softlimit groups if there are
> cgroup exeed their softlimit. Then it leads me to think something like
> the following:
> 
> 1. for priority > DEF_PRIORITY - 3, only reclaim memcg above their softlimit
> 2. for priority <= DEF_PRIORITY - 3, besides 1), also look at memcg's
> ancestor. reclaim memcgs whose ancestor above soft_limit
> 3. for priority == 0, reclaim everything.
>
> Then it has the guarantee of the softlimit at certain level while also
> considers the hierarchy reclaim if the first few rounds doesn't
> fulfill the request.

You expect sane setups to pay the cost of uselessly consulting the res
counters of every existing memcg, twice, on every single reclaim cycle.
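
For reference, the three-tier check quoted above can be modelled in user
space roughly as follows. This is only a sketch under stated assumptions:
struct memcg and should_reclaim_tiered() are hypothetical names, not the
patch's should_reclaim_mem_cgroup(), and DEF_PRIORITY is assumed to be 12
as in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12 /* assumed to match the kernel's value */

/* Hypothetical user-space model of a memcg. */
struct memcg {
	struct memcg *parent; /* NULL for a top-level group */
	long long usage;
	long long soft_limit;
};

static bool over_soft_limit(const struct memcg *m)
{
	return m->usage > m->soft_limit;
}

/* Decide whether `m` is scanned at the given reclaim priority. */
static bool should_reclaim_tiered(const struct memcg *m, int priority)
{
	const struct memcg *p;

	/* Tier 3: last resort, reclaim from everything. */
	if (priority == 0)
		return true;

	/* Tier 1: groups above their own soft limit are always eligible. */
	if (over_soft_limit(m))
		return true;

	/* Early rounds stop here; ancestors are not consulted yet. */
	if (priority > DEF_PRIORITY - 3)
		return false;

	/* Tier 2: one extra counter read per ancestor, for every memcg. */
	for (p = m->parent; p; p = p->parent)
		if (over_soft_limit(p))
			return true;

	return false;
}
```

The ancestor walk in tier 2 is where the additional counter lookups come
from: it runs for every group that is below its own soft limit once the
priority has dropped far enough.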

Everyone has their agenda and their primary usecase, but this takes
the cake.


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20  8:21             ` KAMEZAWA Hiroyuki
@ 2012-04-20 14:17               ` Rik van Riel
  2012-04-20 16:56                 ` Ying Han
  0 siblings, 1 reply; 25+ messages in thread
From: Rik van Riel @ 2012-04-20 14:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Johannes Weiner, Michal Hocko, Mel Gorman,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On 04/20/2012 04:21 AM, KAMEZAWA Hiroyuki wrote:

> If you need smart victim selection under hierarchy, please implement
> victim scheduler which choose A2 rather than A and A1. I think you
> can do it.

Ying and I spent a few hours working out exactly how to do
this, a few weeks ago in San Francisco.

She might still have the pictures of all the stuff we drew
on the whiteboard.

-- 
All rights reversed


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 14:17               ` Rik van Riel
@ 2012-04-20 16:56                 ` Ying Han
  0 siblings, 0 replies; 25+ messages in thread
From: Ying Han @ 2012-04-20 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KAMEZAWA Hiroyuki, Johannes Weiner, Michal Hocko, Mel Gorman,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 937 bytes --]

On Fri, Apr 20, 2012 at 7:17 AM, Rik van Riel <riel@redhat.com> wrote:
> On 04/20/2012 04:21 AM, KAMEZAWA Hiroyuki wrote:
>
>> If you need smart victim selection under hierarchy, please implement
>> victim scheduler which choose A2 rather than A and A1. I think you
>> can do it.
>
>
> Ying and I spent a few hours working out exactly how to do
> this, a few weeks ago in San Francisco.
>
> She might still have the pictures of all the stuff we drew
> on the whiteboard.

Unfortunately, I do have that on my phone. See the attachment, for
those who might be interested.

Rik, Johannes and I were discussing how to make the soft_limit
reclaim smart about picking memcgs, using the same logic we currently
use in get_scan_count() to balance between the file/anon lru.

After the ground work in this patch is done, I do plan to make that
happen. But for now, I'd like to focus on the ground work as a starting
point.

--Ying


>
> --
> All rights reversed

[-- Attachment #2: soft_limit.JPG --]
[-- Type: image/jpeg, Size: 104741 bytes --]


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20  8:11         ` Michal Hocko
@ 2012-04-20 17:22           ` Ying Han
  0 siblings, 0 replies; 25+ messages in thread
From: Ying Han @ 2012-04-20 17:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 1:11 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 19-04-12 10:47:27, Ying Han wrote:
>> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Wed 18-04-12 11:00:40, Ying Han wrote:
>> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> >> >> The "soft_limit" was introduced in memcg to support over-committing the
>> >> >> memory resource on the host. Each cgroup configures its "hard_limit" where
>> >> >> it will be throttled or OOM killed by going over the limit. However, the
>> >> >> cgroup can go above the "soft_limit" as long as there is no system-wide
>> >> >> memory contention. So, the "soft_limit" is the kernel mechanism for
>> >> >> re-distributing system spare memory among cgroups.
>> >> >>
>> >> >> This patch reworks the softlimit reclaim by hooking it into the new global
>> >> >> reclaim scheme. So the global reclaim path including direct reclaim and
>> >> >> background reclaim will respect the memcg softlimit.
>> >> >>
>> >> >> v3..v2:
>> >> >> 1. rebase the patch on 3.4-rc3
>> >> >> 2. squash the commits of replacing the old implementation with new
>> >> >> implementation into one commit. This is to make sure to leave the tree
>> >> >> in stable state between each commit.
>> >> >> 3. removed the commit which changes the nr_to_reclaim for global reclaim
>> >> >> case. The need of that patch is not obvious now.
>> >> >>
>> >> >> Note:
>> >> >> 1. the new implementation of softlimit reclaim is rather simple and first
>> >> >> step for further optimizations. there is no memory pressure balancing between
>> >> >> memcgs for each zone, and that is something we would like to add as follow-ups.
>> >> >>
>> >> >> 2. this patch is slightly different from the last one posted from Johannes
>> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> >> >> where his patch is closer to the reverted implementation by doing hierarchical
>> >> >> reclaim for each selected memcg. However, that is not expected behavior from
>> >> >> user perspective. Considering the following example:
>> >> >>
>> >> >> root (32G capacity)
>> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
>> >> >>    --> A1 (soft limit 5G, usage 4G)
>> >> >>    --> A2 (soft limit 10G, usage 12G)
>> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
>> >> >>
>> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
>> >> >> actual working set size and only reclaim pages under softlimit if system has
>> >> >> trouble to reclaim.
>> >> >
>> >> > Actually, this is exactly what the admin expects when creating a
>> >> > hierarchy, because she defines that A1 is a child of A and is
>> >> > responsible for the memory situation in its parent.
>> >
>> > Hmm, I guess that both approaches have cons and pros.
>> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
>> >  soft limit memcg
>> >  + it is consistent with the hard limit reclaim
>> Not sure why we want them to be consistent. Soft_limit is serving
>> different purpose and the one of the main purpose is to preserve the
>> working set of the cgroup.
>
> Well, cgroups subsystem is moving towards unification so all the
> controllers should live in one hierarchy and it would be nice if we had
> a common view on what hard and soft limits mean wrt. hierarchies. It is
> true that memcg is the only user of the soft limit in the moment but it
> would be better if we were prepared for future users are well and
> wouldn't come up with one shot solutions.
>
>> >  + easier for top to bottom configuration - especially when you allow
>> >    subgroups to create deeper hierarchies. Does anybody do that?
>>
>> As far as I heard, most (if not all) are using flat configuration
>> where everything is running under root.
>
> Might be true for memcg but what about other controllers?

>
> [...]
>> > Both approaches don't play very well with the default 0 limit because we
>> > either reclaim unless we set up the whole hierarchy properly or we just
>> > burn cycles by trying to reclaim groups wit no or only few pages.
>>
>> Setting the default to 0 is a good optimization which makes everybody
>> to be eligible for reclaim if admin doesn't do anything.
>>
>> In reality, if admin want to preserve working set of cgroups and
>> he/she has to set the softlimit. By doing that, it is easier to only
>> focus on the cgroup itself without looking up its ancestors.
>
> I guess it is not that clear who should be responsible for setting the
> limit. Should it be admin or rather a workload owner? Because this
> changes a lot.

Today the model we have is letting the admin set it by monitoring each
cgroup's working set size. But I think it would also be a valid use case
to let the workload itself set it. Something like self-ballooning.


>
>>
>> > The second approach leads to more expected results though because we do
>> > not touch "leaf" groups unless they are over limit.
>> > I have to think about that some more but it seems that the second approach
>> > is much easier to implement and matches the "guarantee" expectations
>> > more.
>>
>> Agree.
>>
>> > I guess we could converge both approaches if we could reclaim from the
>> > leaf groups upwards to the root but I didn't think about this very much.
>>
>> That is what the current patch does, which only consider softlimit
>> under global pressure :)
>
> Not really, because your patch iterates sequentially from top to bottom.
> I was thinking about iteration from the leaves and do the hierarchical
> reclaim from the first one which is over the limit. This would uncharge
> from the parent as well so it could get down under its limit and if not
> then we can hammer on siblings. But, as I said, I did give this more
> thoughts, it sure comes with its own set of issues (including
> inconsistency with the hard limit reclaim ;))

I feel like we mixed two things together here:

1. per-memcg reclaim: This is triggered when A reaches its hard_limit,
and then we do hierarchical reclaim including A1 and A2.

2. global reclaim: This is triggered when root reaches its limit (root
doesn't have a limit, but we can say something like that), and then we do
hierarchical reclaim including all the cgroups on the system.

The soft_limit reclaim I have here (so far) only triggers under global
reclaim, so we follow the same rule as the existing global reclaim, except
for filtering out memcgs under their soft_limit to a certain degree.

If we are talking about adding soft_limit reclaim to per-memcg
reclaim, that is when we care about the limit of A when reclaiming
from A1. If we decide to do that, it should be added in the per-memcg
reclaim logic (a small change, removing the global reclaim check).
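
The distinction between the two cases can be sketched in user space as
follows. The names are hypothetical, and the priority-based fallback that
relaxes the filter under heavy pressure is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a memcg. */
struct memcg {
	struct memcg *parent; /* NULL for a top-level group */
	long long usage;
	long long soft_limit;
};

/* True if `m` is `root` itself or lives somewhere below it. */
static bool in_subtree(const struct memcg *m, const struct memcg *root)
{
	for (; m; m = m->parent)
		if (m == root)
			return true;
	return false;
}

/*
 * target != NULL: per-memcg reclaim, triggered by `target` hitting its
 * hard limit. The whole subtree is eligible; soft limits play no role.
 * target == NULL: global reclaim. Every group on the system is visited,
 * but groups within their own soft limit are filtered out.
 */
static bool eligible(const struct memcg *m, const struct memcg *target)
{
	if (target)
		return in_subtree(m, target);
	return m->usage > m->soft_limit;
}
```

With the example hierarchy, per-memcg reclaim on A covers A, A1 and A2
but not B, while global reclaim with the filter covers A, A2 and B and
skips A1.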

--Ying

>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 13:17             ` Johannes Weiner
@ 2012-04-20 17:44               ` Ying Han
  2012-04-20 18:58                 ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Ying Han @ 2012-04-20 17:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 6:17 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Fri, Apr 20, 2012 at 12:37:41AM -0700, Ying Han wrote:
>> On Thu, Apr 19, 2012 at 3:33 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
>> >> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> >> > On Wed 18-04-12 11:00:40, Ying Han wrote:
>> >> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> >> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> >> >> >> 2. this patch is slightly different from the last one posted from Johannes
>> >> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> >> >> >> where his patch is closer to the reverted implementation by doing hierarchical
>> >> >> >> reclaim for each selected memcg. However, that is not expected behavior from
>> >> >> >> user perspective. Considering the following example:
>> >> >> >>
>> >> >> >> root (32G capacity)
>> >> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
>> >> >> >>    --> A1 (soft limit 5G, usage 4G)
>> >> >> >>    --> A2 (soft limit 10G, usage 12G)
>> >> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
>> >> >> >>
>> >> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> >> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
>> >> >> >> actual working set size and only reclaim pages under softlimit if system has
>> >> >> >> trouble to reclaim.
>> >> >> >
>> >> >> > Actually, this is exactly what the admin expects when creating a
>> >> >> > hierarchy, because she defines that A1 is a child of A and is
>> >> >> > responsible for the memory situation in its parent.
>> >> >
>> >> > Hmm, I guess that both approaches have cons and pros.
>> >> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
>> >> >  soft limit memcg
>> >> >  + it is consistent with the hard limit reclaim
>> >> Not sure why we want them to be consistent. Soft_limit is serving
>> >> different purpose and the one of the main purpose is to preserve the
>> >> working set of the cgroup.
>> >
>> > I'd argue, given the history of cgroups, one of the main purposes is
>> > having a machine of containers where you overcommit their hard limit
>> > and set the soft limit accordingly to provide fairness.
>> >
>> > Yes, we don't want to reclaim hierarchies that are below their soft
>> > limit as long as there are some in excess, of course.  This is a flaw
>> > and needs fixing.  But it's something completely different than
>> > changing how the soft limit is defined and suddenly allow child
>> > groups, which you may not trust, to override rules defined by parental
>> > groups.
>> >
>> > It bothers me that we should add something that will almost certainly
>> > bite us in the future while we are discussing on the cgroups list what
>> > would stand in the way of getting sane hierarchy semantics across
>> > controllers to provide consistency, nesting, etc.
>>
>> I understand the concern here and I don't want the soft_limit reclaim
>> to be far away from the other part of the cgroup design down to the
>> road. On the other hand, I don't think the current implementation is
>> against the hierarchy semantics totally. See the comment below :)
>>
>> > To support a single use case, which I feel we still have not discussed
>> > nearly enough to justify this change.
>> >
>> > For example, I get that you want 'meta-groups' that group together
>> > subgroups for common accounting and hard limiting.  But I don't see
>> > why such meta-groups have their own processes.  Conceptually, I mean,
>> > how does a process fit into A?  Is it superior to the tasks in A1 and
>> > A2?  Why can't it live in A3?
>>
>> For user processes, I can see that is totally feasible to live in A3.
>> The case I was thinking is kernel threads, which 1) we don't want to
>> limit their memory usage 2) they  serve for the whole group unlike
>> individual jobs. Of course, we could say that putting those kernel
>> thread in A3 and leave the cgroup to unlimited, but not sure if we
>> should constrain ourselves not having any processes running under A.
>
> That's just handwaving.
>
>> > So here is a proposal:
>> >
>> > Would it make sense to try to keep those meta groups always free of
>> > their own memory so that they don't /need/ soft limits with weird
>> > semantics?  E.g. immediately free the unused memory on rmdir, OR add
>> > mechanisms to migrate the memory to a dedicated group:
>> >
>> >     A
>> >       A1 (soft-limited)
>> >       A2 (soft-limited)
>> >     B
>> >     unused (soft-limited)
>> >
>> > Move all leftover memory from finished jobs to this 'unused' group.
>> > You could set its soft limit to 0 so that it sticks around only until
>> > you actually need the memory for something else.
>> >
>> > Then you would get the benefits of accounting and limiting A1 and A2
>> > under a single umbrella without the need for a soft limit in A.  We
>> > could keep the consistent semantics for soft limits, because you would
>> > only have to set it on leaf nodes.
>> >
>> > Wouldn't this work for you?
>>
>> To be frankly, this sounds a lot of extra work for admin to manage the
>> system and we still can not prevent page being landed on A totally.
>
> Why not?
>
> And what extra work are we talking here?  As I wrote in the followup
> mail: just keep the finished job groups around, set their soft limit
> to 0.  Surely you have a userspace job scheduler that sets up these
> groups in the first place and could be trivially extended to set soft
> limits and watch for notifications.
>
> Let me repeat the pros here: no breaking of existing semantics.  No
> introduction of unprecedented semantics into the cgroup mess.  No
> changing of kernel code necessary (except what we want to tune
> anyway).  No computational overhead for you or anyone else.

>
> If your only counter argument to this is that you can't be bothered to
> slightly adjust your setup, I'm no longer interested in this
> discussion.

Before going further, I want to make sure there is no miscommunication
here. As I replied to Michal, I feel that we are mixing up global
reclaim and target reclaim policy here.

The way global reclaim works today is to scan all the mem cgroups to
fulfill the overall scan target per zone, and there is no bottom-up
lookup. My patch currently adds the softlimit reclaim under global
reclaim, and the difference is the filtering.

Is the soft_limit hierarchical reclaim we are discussing here meant for
target reclaim?

--Ying

>
>> Back to the current proposal, there are two concerns that I can see so far:
>>
>> 1. skipping an untrusted cgroup in case it sets its soft_limit very high:
>> Here, we don't always skip the untrusted cgroup. We do reclaim from
>> it if not enough progress is made from the other cgroups above their
>> softlimit. So, I don't see a problem here.
>
> When you decide to reclaim from groups below their soft limit.
>
> Which means that an untrusted group can force global reclaim to go for
> the workingset in other groups.
>
>> 2. not reclaiming based on hierarchy:
>> Here I am not checking the ancestor's soft_limit in
>> should_reclaim_mem_cgroup(). And it will only make a difference if A is
>> under its soft_limit and A1 is above its soft_limit. Now you do agree that
>> we shouldn't reclaim from groups under their softlimit if there are
>> cgroups that exceed their softlimit. That leads me to think of something
>> like the following:
>>
>> 1. for priority > DEF_PRIORITY - 3, only reclaim memcgs above their softlimit
>> 2. for priority <= DEF_PRIORITY - 3, besides 1), also look at the memcg's
>> ancestors. reclaim memcgs whose ancestor is above its soft_limit
>> 3. for priority == 0, reclaim everything.
>>
>> Then it keeps the softlimit guarantee up to a certain level while also
>> falling back to hierarchical reclaim if the first few rounds don't
>> fulfill the request.
>
> You expect sane setups to pay the cost of uselessly consulting the res
> counters of every existing memcg, twice, on every single reclaim cycle.
>
> Everyone has their agenda and their primary usecase, but this takes
> the cake.
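
The three priority tiers quoted above can be sketched as a small
user-space model. This is only an illustration of the proposal, not the
patch itself: the Group class and should_reclaim() are hypothetical
names, sizes are plain GB integers, and DEF_PRIORITY = 12 is the kernel's
default starting scan priority.

```python
from dataclasses import dataclass
from typing import Optional

DEF_PRIORITY = 12  # the kernel's default starting scan priority


@dataclass
class Group:
    """Hypothetical stand-in for a memcg; sizes are in GB for readability."""
    name: str
    soft_limit: int
    usage: int
    parent: Optional["Group"] = None

    def over_soft_limit(self) -> bool:
        return self.usage > self.soft_limit


def ancestor_over_soft_limit(group: Group) -> bool:
    """Walk up the tree and report whether any ancestor exceeds its soft limit."""
    node = group.parent
    while node is not None:
        if node.over_soft_limit():
            return True
        node = node.parent
    return False


def should_reclaim(group: Group, priority: int) -> bool:
    """Model of the proposed three-tier filter (priority counts down to 0)."""
    if priority == 0:
        return True                      # tier 3: reclaim everything
    if group.over_soft_limit():
        return True                      # tier 1: group itself is over
    if priority <= DEF_PRIORITY - 3:
        return ancestor_over_soft_limit(group)  # tier 2: consult ancestors
    return False


a = Group("A", soft_limit=16, usage=22)
a2 = Group("A2", soft_limit=6, usage=5, parent=a)
b = Group("B", soft_limit=16, usage=10)

for prio in (DEF_PRIORITY, DEF_PRIORITY - 3, 0):
    print(prio, should_reclaim(a2, prio), should_reclaim(b, prio))
```

In this model A2 is skipped during the first rounds, becomes eligible
once the priority drops to DEF_PRIORITY - 3 because its parent A is over
its soft limit, and B is only touched at priority 0. The ancestor walk in
tier 2 is the extra per-memcg lookup that the objection above is about.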

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 17:44               ` Ying Han
@ 2012-04-20 18:58                 ` Michal Hocko
  2012-04-20 22:50                   ` Ying Han
  2012-04-20 23:29                   ` Johannes Weiner
  0 siblings, 2 replies; 25+ messages in thread
From: Michal Hocko @ 2012-04-20 18:58 UTC (permalink / raw)
  To: Ying Han
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri 20-04-12 10:44:14, Ying Han wrote:
> On Fri, Apr 20, 2012 at 6:17 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Let me repeat the pros here: no breaking of existing semantics.  No
> > introduction of unprecedented semantics into the cgroup mess.  No
> > changing of kernel code necessary (except what we want to tune
> > anyway).  No computational overhead for you or anyone else.
> 
> >
> > If your only counter argument to this is that you can't be bothered to
> > slightly adjust your setup, I'm no longer interested in this
> > discussion.
> 
> Before going further, I wanna make sure there is no mis-communication
> here. As I replied to Michal, I feel that we are mixing up global
> reclaim and target reclaim policy here.

I was referring to the global reclaim and my understanding is that
Johannes did the same when talking about soft reclaim (even though it
makes some sense to apply the same rules to the hard limit reclaim as
well - but later to that one...)

The primary question is whether soft reclaim should be hierarchical or
not. That is what I tried to express in another email earlier in this
thread, where I (very briefly) compared the two approaches.
It currently _is_ hierarchical and your patch changes that, so we have to
be sure that this change in semantics is reasonable. The only workload
that you seem to consider is one where you have full control over the
machine, while Johannes is concerned about containers which might misuse
your approach to push out the working sets of their competitors...
My concern with the hierarchical approach is that it doesn't play well
with a default of 0 (which is needed if we want to make the soft limit a
guarantee, right?). I do agree with Johannes about the potential misuse
though.  So it seems that both approaches have serious issues with
configurability.
Does this summary clarify the issue a bit? Or am I confused as well ;)

I am more inclined towards selective soft reclaim, making configuration
the admin's responsibility (if you want some guarantee, the admin has to
approve that and set it for you). This, however, doesn't enable the
self-ballooning use case, but I am not entirely sure that would work
without global (admin) cooperation.

> The way global reclaim works today is to scan all the mem cgroups to
> fulfill the overall scan target per zone, and there is no bottom up
> look up. 

Bottom-up was just an idea with nothing concrete in hand, so let's put it
aside for now.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 18:58                 ` Michal Hocko
@ 2012-04-20 22:50                   ` Ying Han
  2012-04-20 22:56                     ` Rik van Riel
  2012-04-21  0:19                     ` Johannes Weiner
  2012-04-20 23:29                   ` Johannes Weiner
  1 sibling, 2 replies; 25+ messages in thread
From: Ying Han @ 2012-04-20 22:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 11:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 20-04-12 10:44:14, Ying Han wrote:
>> On Fri, Apr 20, 2012 at 6:17 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > Let me repeat the pros here: no breaking of existing semantics.  No
>> > introduction of unprecedented semantics into the cgroup mess.  No
>> > changing of kernel code necessary (except what we want to tune
>> > anyway).  No computational overhead for you or anyone else.
>>
>> >
>> > If your only counter argument to this is that you can't be bothered to
>> > slightly adjust your setup, I'm no longer interested in this
>> > discussion.
>>
>> Before going further, I wanna make sure there is no mis-communication
>> here. As I replied to Michal, I feel that we are mixing up global
>> reclaim and target reclaim policy here.
>
> I was referring to the global reclaim and my understanding is that
> Johannes did the same when talking about soft reclaim (even though it
> makes some sense to apply the same rules to the hard limit reclaim as
> well - but later to that one...)
>
> The primary question is whether soft reclaim should be hierarchical or
> not. That is what I've tried to express in other email earlier in this
> thread where I've tried (very briefly) to compare those approaches.
> It currently _is_ hierarchical and your patch changes that so we have to
> be sure that this change in semantic is reasonable.

Yes, after reading the other thread I suddenly realized what you
guys are talking about.

> The only workload
> that you seem to consider is when you have a full control over the
> machine while Johannes is considered about containers which might misuse
> your approach to push out working sets of concurrency...
> My concern with hierarchical approach is that it doesn't play well with
> 0 default (which is needed if we want to make soft limit a guarantee,
> right?). I do agree with Johannes about the potential misuse though.  So
> it seems that both approaches have serious issues with configurability.
> Does this summary clarify the issue a bit? Or I am confused as well ;)

Thank you for the good summary and now we are on the same page :)

Regarding the misuse case, here I am gonna lay out the ground rule for
setting up soft_limit:

"
Never over-commit the system by softlimit.
"

Consider the following:

root (32G, use_hierarchy = 1)
   -- A (soft: 16G, usage 22G)
       -- A1 (soft: 10G, usage 17G)
       -- A2 (soft: 6G, usage 5G)
   -- B (soft: 16G, usage 10G)

1) sum_of_softlimit(A + B) <= machine capacity
2) sum_of_softlimit(A1 + A2) <= softlimit(A)

So we have both A and A1 above their softlimit. If we follow the ground
rule when setting up the softlimit, we can be confident in saying that
"If A is above its softlimit, there must be cgroups under A that are also
above their softlimit". We can still leave the priority check in place in
case all the pages from A1 are hard to reclaim, and only then will we
look into A2.

I think it is reasonable to lay this out upfront, otherwise we cannot
get all the misuse cases right. And if we follow that route, lots of
things will become clear.
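
The two constraints above can be checked mechanically. The following is
a minimal sketch, assuming the hierarchy is available as a plain tree in
user space; Group and overcommitted() are hypothetical names, and the
root is modelled as a group whose "soft limit" is the machine capacity so
that rule 1) and rule 2) become the same check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    """Hypothetical stand-in for a memcg; soft limits are in GB."""
    name: str
    soft_limit_gb: int
    children: List["Group"] = field(default_factory=list)


def overcommitted(group: Group) -> List[str]:
    """Return the names of all groups whose children's soft limits
    add up to more than the group's own soft limit."""
    found = []
    child_sum = sum(child.soft_limit_gb for child in group.children)
    if group.children and child_sum > group.soft_limit_gb:
        found.append(group.name)
    for child in group.children:
        found.extend(overcommitted(child))
    return found


# The tree from the example; root's value is the machine capacity.
root = Group("root", 32, [
    Group("A", 16, [Group("A1", 10), Group("A2", 6)]),
    Group("B", 16),
])

print(overcommitted(root))   # no violations with 32G of capacity

root.soft_limit_gb = 24      # capacity shrinks under the same soft limits
print(overcommitted(root))   # the root level is now over-committed
```

The second call shows why the rule needs re-checking whenever the
capacity or any soft limit changes: the same soft limits that were fine
at 32G over-commit the machine at 24G.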

--Ying
>
> I am more inclined towards selective soft reclaim and make configuration
> admin's responsibility (if you want some guarantee, admin has to approve
> that and set it for you).
>
> This, however, doesn't enable self-ballooning
> use case but I am not entirely sure this would work without a global
> (admin) cooperation.
>
>> The way global reclaim works today is to scan all the mem cgroups to
>> fulfill the overall scan target per zone, and there is no bottom up
>> look up.
>
> bottom up was just an idea without anything in hands so let's put it
> aside for now.
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 22:50                   ` Ying Han
@ 2012-04-20 22:56                     ` Rik van Riel
  2012-04-20 23:14                       ` Ying Han
  2012-04-21  0:19                     ` Johannes Weiner
  1 sibling, 1 reply; 25+ messages in thread
From: Rik van Riel @ 2012-04-20 22:56 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On 04/20/2012 06:50 PM, Ying Han wrote:

> Regarding the misuse case, here I am gonna layout the ground rule for
> setting up soft_limit:
>
> "
> Never over-commit the system by softlimit.
> "

> I think it is reasonable to layout this upfront, otherwise we can not
> make all the misuse cases right. And if we follow that route, lots of
> things will become clear.

While that rule looks reasonable at first glance, I do not
believe it is possible to follow it in practice.

One reason is memory resizing through ballooning in virtual
machines. It is possible for the "physical" memory size to
shrink below the sum of the softlimits.

Another reason is memory zones and NUMA. It is possible for
one memory zone (or NUMA node) to only have cgroups that
are under their soft limit.

If this happens to be the one memory zone we can allocate
network buffers from, we could deadlock the system if we
refused to reclaim pages from a cgroup under its limit.

-- 
All rights reversed


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 22:56                     ` Rik van Riel
@ 2012-04-20 23:14                       ` Ying Han
  0 siblings, 0 replies; 25+ messages in thread
From: Ying Han @ 2012-04-20 23:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 3:56 PM, Rik van Riel <riel@redhat.com> wrote:
> On 04/20/2012 06:50 PM, Ying Han wrote:
>
>> Regarding the misuse case, here I am gonna layout the ground rule for
>> setting up soft_limit:
>>
>> "
>> Never over-commit the system by softlimit.
>> "
>
>
>> I think it is reasonable to layout this upfront, otherwise we can not
>> make all the misuse cases right. And if we follow that route, lots of
>> things will become clear.
>
>
> While that rule looks reasonable at first glance, I do not
> believe it is possible to follow it in practice.
>
> One reason is memory resizing through ballooning in virtual
> machines. It is possible for the "physical" memory size to
> shrink below the sum of the softlimits.

Hmm, can you give more details on that? I assume the soft_limit should
be adjusted at run-time based on the memory usage and, in your case,
the "physical" memory size.

This is different from hard_limit, which we can over-commit by setting it
once and living with it.

>
> Another reason is memory zones and NUMA. It is possible for
> one memory zone (or NUMA node) to only have cgroups that
> are under their soft limit.

>
> If this happens to be the one memory zone we can allocate
> network buffers from, we could deadlock the system if we
> refused to reclaim pages from a cgroup under its limit.

Yes, that is the problem we talked about during LSF. Having a
"per-memcg-per-zone softlimit" sounds too complicated and not
practical at all. To deal with that, my current patch identifies
the situation during the first round of scanning, and then skips the
soft_limit if that is the case.
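
The fallback described above can be sketched as a two-step selection.
This is a simplified model and not the patch: reclaim_candidates() is a
hypothetical name, each zone is reduced to a list of the groups that have
pages on it, and the real code decides when to give up on the soft limit
while scanning rather than up front.

```python
def reclaim_candidates(groups_on_zone):
    """groups_on_zone: list of (name, soft_limit, usage) tuples for the
    groups that have pages on one zone. Sizes are in GB."""
    # First round: only groups above their soft limit.
    over = [name for name, soft, usage in groups_on_zone if usage > soft]
    if over:
        return over
    # Nobody on this zone is over its soft limit: ignore the soft limit
    # so that the zone can still be reclaimed.
    return [name for name, _soft, _usage in groups_on_zone]


mixed_zone = [("A1", 10, 17), ("A2", 6, 5)]
quiet_zone = [("A2", 6, 5), ("B", 16, 10)]

print(reclaim_candidates(mixed_zone))  # only A1 is over its soft limit
print(reclaim_candidates(quiet_zone))  # falls back to every group
```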

--Ying

>
> --
> All rights reversed


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 18:58                 ` Michal Hocko
  2012-04-20 22:50                   ` Ying Han
@ 2012-04-20 23:29                   ` Johannes Weiner
  2012-04-23 13:59                     ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Weiner @ 2012-04-20 23:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ying Han, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 08:58:47PM +0200, Michal Hocko wrote:
> On Fri 20-04-12 10:44:14, Ying Han wrote:
> > On Fri, Apr 20, 2012 at 6:17 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > Let me repeat the pros here: no breaking of existing semantics.  No
> > > introduction of unprecedented semantics into the cgroup mess.  No
> > > changing of kernel code necessary (except what we want to tune
> > > anyway).  No computational overhead for you or anyone else.
> > 
> > >
> > > If your only counter argument to this is that you can't be bothered to
> > > slightly adjust your setup, I'm no longer interested in this
> > > discussion.
> > 
> > Before going further, I wanna make sure there is no mis-communication
> > here. As I replied to Michal, I feel that we are mixing up global
> > reclaim and target reclaim policy here.
> 
> I was referring to the global reclaim and my understanding is that
> Johannes did the same when talking about soft reclaim (even though it
> makes some sense to apply the same rules to the hard limit reclaim as
> well - but later to that one...)
> 
> The primary question is whether soft reclaim should be hierarchical or
> not. That is what I've tried to express in other email earlier in this
> thread where I've tried (very briefly) to compare those approaches.
> It currently _is_ hierarchical and your patch changes that so we have to
> be sure that this change in semantic is reasonable. The only workload
> that you seem to consider is when you have a full control over the
> machine while Johannes is considered about containers which might misuse
> your approach to push out working sets of concurrency...
> My concern with hierarchical approach is that it doesn't play well with
> 0 default (which is needed if we want to make soft limit a guarantee,
> right?). I do agree with Johannes about the potential misuse though.  So
> it seems that both approaches have serious issues with configurability.
> Does this summary clarify the issue a bit? Or I am confused as well ;)

Thanks for the nice summary!

A note on the default hierarchical soft limit:

Consider not making the default to be 0, but a special value.  We want
it to mean 'no guarantee' and 'every byte is in excess of the soft
limit', to keep the existing behaviour.  But at the same time, we
wouldn't have to make it inheritable:

    A (soft = default)
      A1 (soft = 10G)
      A2 (soft = 12G)

so in case of global reclaim, A itself would be eligible, but it would
not apply hierarchically to A1 and A2.  They would still only get
reclaimed if their usage would be above their respective soft limits.
Only if you set A's soft limit to 0 or higher will it apply
hierarchically, so that if a parent declares 'no guarantee', no child
is able to override it.

Maybe we can keep -1/~0UL and just treat it a bit differently.
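
To make the proposed semantics concrete, here is a rough user-space
model. It is only a sketch of the idea in this mail: SOFT_DEFAULT, Group
and eligible() are hypothetical names, None stands in for the special
default value, and usage is the hierarchical usage in GB.

```python
SOFT_DEFAULT = None  # stands in for the proposed special default value


class Group:
    """Hypothetical stand-in for a memcg; sizes are in GB."""

    def __init__(self, name, soft_limit, usage, parent=None):
        self.name = name
        self.soft_limit = soft_limit
        self.usage = usage
        self.parent = parent


def exceeds_explicit_limit(group):
    """True only if a soft limit was set explicitly and is exceeded."""
    return group.soft_limit is not SOFT_DEFAULT and group.usage > group.soft_limit


def eligible(group):
    """Is the group eligible for soft limit reclaim under global pressure?"""
    # An unset limit means 'no guarantee' for the group's own pages ...
    if group.soft_limit is SOFT_DEFAULT:
        return True
    # ... but only explicitly set limits apply hierarchically.
    node = group
    while node is not None:
        if exceeds_explicit_limit(node):
            return True
        node = node.parent
    return False


a = Group("A", SOFT_DEFAULT, usage=25)
a1 = Group("A1", 10, usage=8, parent=a)
a2 = Group("A2", 12, usage=14, parent=a)
print([eligible(g) for g in (a, a1, a2)])

a.soft_limit = 0  # the parent now explicitly declares 'no guarantee'
print([eligible(g) for g in (a, a1, a2)])
```

With A left at the default, only A itself and A2 (which is over its own
limit) are eligible. Once A is explicitly set to 0, A1 becomes eligible
as well, because the parent's declaration can no longer be overridden.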


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 22:50                   ` Ying Han
  2012-04-20 22:56                     ` Rik van Riel
@ 2012-04-21  0:19                     ` Johannes Weiner
  2012-04-21  0:48                       ` Johannes Weiner
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Weiner @ 2012-04-21  0:19 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 03:50:28PM -0700, Ying Han wrote:
> On Fri, Apr 20, 2012 at 11:58 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Fri 20-04-12 10:44:14, Ying Han wrote:
> >> On Fri, Apr 20, 2012 at 6:17 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> > Let me repeat the pros here: no breaking of existing semantics.  No
> >> > introduction of unprecedented semantics into the cgroup mess.  No
> >> > changing of kernel code necessary (except what we want to tune
> >> > anyway).  No computational overhead for you or anyone else.
> >>
> >> >
> >> > If your only counter argument to this is that you can't be bothered to
> >> > slightly adjust your setup, I'm no longer interested in this
> >> > discussion.
> >>
> >> Before going further, I wanna make sure there is no mis-communication
> >> here. As I replied to Michal, I feel that we are mixing up global
> >> reclaim and target reclaim policy here.
> >
> > I was referring to the global reclaim and my understanding is that
> > Johannes did the same when talking about soft reclaim (even though it
> > makes some sense to apply the same rules to the hard limit reclaim as
> > well - but later to that one...)
> >
> > The primary question is whether soft reclaim should be hierarchical or
> > not. That is what I've tried to express in other email earlier in this
> > thread where I've tried (very briefly) to compare those approaches.
> > It currently _is_ hierarchical and your patch changes that so we have to
> > be sure that this change in semantic is reasonable.
> 
> Yes, after reading the other thread and I suddenly realized what you
> guys are talking about.
> 
> The only workload
> > that you seem to consider is when you have a full control over the
> > machine while Johannes is considered about containers which might misuse
> > your approach to push out working sets of concurrency...
> > My concern with hierarchical approach is that it doesn't play well with
> > 0 default (which is needed if we want to make soft limit a guarantee,
> > right?). I do agree with Johannes about the potential misuse though.  So
> > it seems that both approaches have serious issues with configurability.
> > Does this summary clarify the issue a bit? Or I am confused as well ;)
> 
> Thank you for the good summary and now we are on the same page :)
> 
> Regarding the misuse case, here I am gonna layout the ground rule for
> setting up soft_limit:
> 
> "
> Never over-commit the system by softlimit.
> "

Which proves that we are not on the same page at all :-(

It's not about dealing with rare, non-sensical setups, it's about
suddenly trusting children to do the right thing.

And it's about suddenly REQUIRING all children to cooperate even for
the reasonable configuration case, instead of just having soft limits
apply hierarchically.

Meanwhile, you STILL haven't provided an argument why you couldn't
just fix your cgroup tree organization to make sense for the semantics
you require instead of pushing for such a bogus change.

It's like you're trying to redefine multiplication because you
accidentally used * instead of + in your equation.


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-21  0:19                     ` Johannes Weiner
@ 2012-04-21  0:48                       ` Johannes Weiner
  2012-04-23 22:19                         ` Ying Han
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Weiner @ 2012-04-21  0:48 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Sat, Apr 21, 2012 at 02:19:14AM +0200, Johannes Weiner wrote:
> It's like you're trying to redefine multiplication because you
> accidentally used * instead of + in your equation.

You could for example do this:

-> A (hard limit = 16G)
   -> A1 (hard limit = 10G)
   -> A2 (hard limit =  6G)

and say the same: you want to account A, A1, and A2 under the same
umbrella, so you want the same hierarchy.  And you want to limit the
memory in A (from finished jobs and tasks running directly in A), but
this limit should NOT apply to A1 and A2 when they have not reached
THEIR respective limits.

You can apply all your current arguments to this same case.  And yet,
you say hierarchical hard limits make sense while hierarchical soft
limits don't.  I hope this example makes it clear why this is not true
at all.
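
As an illustration of the hard limit side of this comparison, here is a
minimal sketch, assuming a charge is accounted to the group and to every
ancestor, as hierarchical accounting does with use_hierarchy enabled.
Group and try_charge() are hypothetical names and sizes are in GB.

```python
class Group:
    """Hypothetical stand-in for a memcg; sizes are in GB."""

    def __init__(self, name, hard_limit, parent=None):
        self.name = name
        self.hard_limit = hard_limit
        self.parent = parent
        self.usage = 0  # hierarchical usage: own pages plus all children


def try_charge(group, amount):
    """Charge `amount` to `group` and all of its ancestors.

    Returns None on success, or the name of the first group (walking
    upwards) whose hard limit would be exceeded. On failure nothing is
    charged."""
    chain = []
    node = group
    while node is not None:
        if node.usage + amount > node.hard_limit:
            return node.name
        chain.append(node)
        node = node.parent
    for node in chain:
        node.usage += amount
    return None


a = Group("A", 16)
a1 = Group("A1", 10, parent=a)
a2 = Group("A2", 6, parent=a)

try_charge(a, 4)    # memory sitting directly in A, e.g. finished jobs
try_charge(a1, 8)
try_charge(a2, 3)
print(a.usage, a1.usage, a2.usage)

print(try_charge(a1, 2))  # A1 would still fit its own limit of 10
```

The last charge is refused by A, not by A1: the 4G sitting directly in A
count against the children even though neither child has reached its own
limit.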

We have cases where we want the hierarchical limits.  Both hard limits
and soft limits.  You can easily fix your setup without taking away
this power from everyone else or introducing inconsistency.  Your
whole problem stems from a simple misconfiguration.

The solution to both cases is this: don't stick memory in these meta
groups and complain that their hierarchical limits apply to their
children.


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-20 23:29                   ` Johannes Weiner
@ 2012-04-23 13:59                     ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2012-04-23 13:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Sat 21-04-12 01:29:09, Johannes Weiner wrote:
> On Fri, Apr 20, 2012 at 08:58:47PM +0200, Michal Hocko wrote:
> > On Fri 20-04-12 10:44:14, Ying Han wrote:
> > > On Fri, Apr 20, 2012 at 6:17 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > Let me repeat the pros here: no breaking of existing semantics.  No
> > > > introduction of unprecedented semantics into the cgroup mess.  No
> > > > changing of kernel code necessary (except what we want to tune
> > > > anyway).  No computational overhead for you or anyone else.
> > > 
> > > >
> > > > If your only counter argument to this is that you can't be bothered to
> > > > slightly adjust your setup, I'm no longer interested in this
> > > > discussion.
> > > 
> > > Before going further, I wanna make sure there is no mis-communication
> > > here. As I replied to Michal, I feel that we are mixing up global
> > > reclaim and target reclaim policy here.
> > 
> > I was referring to the global reclaim and my understanding is that
> > Johannes did the same when talking about soft reclaim (even though it
> > makes some sense to apply the same rules to the hard limit reclaim as
> > well - but later to that one...)
> > 
> > The primary question is whether soft reclaim should be hierarchical or
> > not. That is what I've tried to express in other email earlier in this
> > thread where I've tried (very briefly) to compare those approaches.
> > It currently _is_ hierarchical and your patch changes that so we have to
> > be sure that this change in semantic is reasonable. The only workload
> > that you seem to consider is when you have a full control over the
> > machine while Johannes is considered about containers which might misuse
> > your approach to push out working sets of concurrency...
> > My concern with hierarchical approach is that it doesn't play well with
> > 0 default (which is needed if we want to make soft limit a guarantee,
> > right?). I do agree with Johannes about the potential misuse though.  So
> > it seems that both approaches have serious issues with configurability.
> > Does this summary clarify the issue a bit? Or I am confused as well ;)
> 
> Thanks for the nice summary!
> 
> A note on the default hierarchical soft limit:
> 
> Consider not making the default to be 0, but a special value.  We want
> it to mean 'no guarantee' and 'every byte is in excess of the soft
> limit', to keep the existing behaviour.  But at the same time, we
> wouldn't have to make it inheritable:
> 
>     A (soft = default)
>       A1 (soft = 10G)
>       A2 (soft = 12G)
> 
> so in case of global reclaim, A itself would be eligible, but it would
> not apply hierarchically to A1 and A2.  They would still only get
> reclaimed if their usage would be above their respective soft limits.
> Only if you set A's soft limit to 0 or higher it will apply
> hierarchically, so that if a parent declares 'no guarantee', no child
> is able to override it.

I was thinking about a special value for the local reclaim as well, but I
didn't like it much because then it would be not only a limit value but
also an API to switch between hierarchical and non-hierarchical
reclaim. So I am really not so sure about it and would rather go a
different way - if there is any...

> Maybe we can keep -1/~0UL and just treat it a bit differently.

I would rather see 0 as a special value, if this is the only way to go;
it would make life easier and it also makes more sense to me.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH V3 0/2] memcg softlimit reclaim rework
  2012-04-21  0:48                       ` Johannes Weiner
@ 2012-04-23 22:19                         ` Ying Han
  0 siblings, 0 replies; 25+ messages in thread
From: Ying Han @ 2012-04-23 22:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Hillf Danton, Hugh Dickins, Dan Magenheimer, Andrew Morton,
	linux-mm

On Fri, Apr 20, 2012 at 5:48 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Sat, Apr 21, 2012 at 02:19:14AM +0200, Johannes Weiner wrote:
>> It's like you're trying to redefine multiplication because you
>> accidentally used * instead of + in your equation.
>
> You could for example do this:
>
> -> A (hard limit = 16G)
>   -> A1 (hard limit = 10G)
>   -> A2 (hard limit =  6G)
>
> and say the same: you want to account A, A1, and A2 under the same
> umbrella, so you want the same hierarchy.  And you want to limit the
> memory in A (from finished jobs and tasks running directly in A), but
> this limit should NOT apply to A1 and A2 when they have not reached
> THEIR respective limits.
>
> You can apply all your current arguments to this same case.  And yet,
> you say hierarchical hard limits make sense while hierarchical soft
> limits don't.  I hope this example makes it clear why this is not true
> at all.

I understand the example above, in which the pressure from A propagates
down to A1 and A2 although neither of them has reached its hard_limit.

I am not against doing similar hierarchical reclaim on soft_limit, as
long as it solves the problem which the soft_limit is targeted
at. The admin sets up soft_limit to preserve the working set of
each cgroup, which means that reclaim below the soft_limit could hurt
the application's performance. I assume that expectation is slightly
different from hard_limit, and that's why we have two APIs instead of
one.

>
> We have cases where we want the hierarchical limits.  Both hard limits
> and soft limits.  You can easily fix your setup without taking away
> this power from everyone else or introducing inconsistency.  Your
> whole problem stems from a simple misconfiguration.

Let's see the following example:
A
 -- A1
 -- A2

There are three possibilities for how the soft_limit can be set.

Here I use X to represent the pages that are on A's own lru (re-parented
or from processes running directly under A) and that the admin wants to
preserve.
1. soft_limit(A) == soft_limit(A1) + soft_limit(A2) + X

// only reclaiming from A2 will bring the usage_in_bytes of A under
its soft_limit.
A (soft_limit == 31G, X=1G, usage_in_bytes = 35G)
  -- A1 (soft_limit == 15G, usage_in_bytes = 14G)
  -- A2 (soft_limit == 15G, usage_in_bytes = 20G)

2. soft_limit(A) > soft_limit(A1) + soft_limit(A2) + X

//only reclaiming from A2 and it is ok.
A (soft_limit == 40G, X=1G, usage_in_bytes = 35G)
  -- A1 (soft_limit == 15G, usage_in_bytes = 14G)
  -- A2 (soft_limit == 15G, usage_in_bytes = 20G)

3. soft_limit (A) < soft_limit(A1) + soft_limit(A2) + X

//only reclaiming from A2 doesn't help and we have to reclaim both A1 and A2.
A (soft_limit == 31G, X=1G, usage_in_bytes = 35G)
  -- A1 (soft_limit == 100G, usage_in_bytes = 14G)
  -- A2 (soft_limit == 15G, usage_in_bytes = 20G)

If I understand correctly, case 3 is where my patch works
differently from yours. The difference is that my patch won't reclaim
from A1, but yours will.
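
To illustrate the difference on case 3, here is a simplified model of
the two positions. It is neither patch: select_per_group() and
select_hierarchical() are hypothetical names, groups are plain tuples of
(soft_limit, usage, parent) in GB, and an entry "A" in the result stands
for the pages on A's own lru.

```python
# Case 3 from above: name -> (soft_limit, usage, parent)
case3 = {
    "A": (31, 35, None),
    "A1": (100, 14, "A"),
    "A2": (15, 20, "A"),
}


def over_soft_limit(groups, name):
    soft_limit, usage, _parent = groups[name]
    return usage > soft_limit


def select_per_group(groups):
    """Only groups that exceed their own soft limit are reclaimed."""
    return [name for name in groups if over_soft_limit(groups, name)]


def select_hierarchical(groups):
    """A group is reclaimed if it or any ancestor exceeds its soft limit."""
    selected = []
    for name in groups:
        node = name
        while node is not None:
            if over_soft_limit(groups, node):
                selected.append(name)
                break
            node = groups[node][2]  # move up to the parent
    return selected


print(select_per_group(case3))
print(select_hierarchical(case3))
```

Under these assumptions the per-group filter leaves A1 alone, while the
hierarchical one includes it because its parent A is over its soft limit.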

AFAIK, in most cases (if not all), case 1 is what the admin would
adopt, and that is what I've been trying to make work. On the other
hand, I agree with you that we shouldn't constrain ourselves to
supporting only one configuration. But here are my questions:

1. Do you agree that case 1 is the configuration that makes the most
sense for the admin?

2. If the answer to 1) is yes, do you agree that your proposal doesn't
work well with the admin's expectation?

Meanwhile, I haven't figured out whether case 3 would be a widely
adopted configuration. But let me guess why it would be configured
like this:

a) Does the admin want to guarantee that no pages in A1 are reclaimed?
If so, my patch works as expected.

b) Is it a misconfiguration?
If so, my patch doesn't work as expected. But since it is a
misconfiguration, there is really no expectation to meet. What we need
instead is to not break the system.

Overall, I would like to make sure the most popular use case works
while not breaking the system when there is a misconfiguration.
Hopefully this makes sense to you :)

--Ying

>
> The solution to both cases is this: don't stick memory in these meta
> groups and complain that their hierarchical limits apply to their
> children.
>



end of thread, other threads:[~2012-04-23 22:19 UTC | newest]

Thread overview: 25+ messages
2012-04-17 16:37 [PATCH V3 0/2] memcg softlimit reclaim rework Ying Han
2012-04-18 12:24 ` Johannes Weiner
2012-04-18 18:00   ` Ying Han
2012-04-19 17:04     ` Michal Hocko
2012-04-19 17:47       ` Ying Han
2012-04-19 22:33         ` Johannes Weiner
2012-04-19 22:51           ` Johannes Weiner
2012-04-20  7:37           ` Ying Han
2012-04-20  8:21             ` KAMEZAWA Hiroyuki
2012-04-20 14:17               ` Rik van Riel
2012-04-20 16:56                 ` Ying Han
2012-04-20 13:17             ` Johannes Weiner
2012-04-20 17:44               ` Ying Han
2012-04-20 18:58                 ` Michal Hocko
2012-04-20 22:50                   ` Ying Han
2012-04-20 22:56                     ` Rik van Riel
2012-04-20 23:14                       ` Ying Han
2012-04-21  0:19                     ` Johannes Weiner
2012-04-21  0:48                       ` Johannes Weiner
2012-04-23 22:19                         ` Ying Han
2012-04-20 23:29                   ` Johannes Weiner
2012-04-23 13:59                     ` Michal Hocko
2012-04-20  8:28           ` Michal Hocko
2012-04-20  8:11         ` Michal Hocko
2012-04-20 17:22           ` Ying Han
