linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/2] mm: memcg reclaim integration followups
@ 2012-01-10 15:02 Johannes Weiner
  2012-01-10 15:02 ` [patch 1/2] mm: memcg: per-memcg reclaim statistics Johannes Weiner
  2012-01-10 15:02 ` [patch 2/2] mm: memcg: hierarchical soft limit reclaim Johannes Weiner
  0 siblings, 2 replies; 29+ messages in thread
From: Johannes Weiner @ 2012-01-10 15:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han, cgroups,
	linux-mm, linux-kernel

Hi,

here are two patches based on memcg-aware global reclaim, which I
dropped from the initial series to focus on the exclusive-lru changes.

The first one is per-memcg reclaim statistics.  For now, they include
only pages scanned and pages reclaimed, separately for direct reclaim
and kswapd, as well as separately for internal pressure or reclaim due
to parental memcgs.

The second one is integrating soft limit reclaim into the now
memcg-aware global reclaim path.  It kills a lot of code and performs
better as far as I have tested it.  Furthermore, Ying is working on
turning soft limits into guarantees, as discussed in Prague, and this
patch is also in preparation for that.

Sorry for the odd point in time to submit this, I guess this will mean
3.4 at the earliest.  But the soft limit removal is a bit heavy weight
so it's probably easier conflict-wise to have it at the bottom of the
-mm stack.

 Documentation/cgroups/memory.txt |    4 +
 include/linux/memcontrol.h       |   28 ++-
 mm/memcontrol.c                  |  482 +++++++++-----------------------------
 mm/vmscan.c                      |   87 ++------
 4 files changed, 144 insertions(+), 457 deletions(-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 1/2] mm: memcg: per-memcg reclaim statistics
  2012-01-10 15:02 [patch 0/2] mm: memcg reclaim integration followups Johannes Weiner
@ 2012-01-10 15:02 ` Johannes Weiner
  2012-01-10 23:54   ` Ying Han
  2012-01-10 15:02 ` [patch 2/2] mm: memcg: hierarchical soft limit reclaim Johannes Weiner
  1 sibling, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-10 15:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han, cgroups,
	linux-mm, linux-kernel

With the single per-zone LRU gone and global reclaim scanning
individual memcgs, it's straight-forward to collect meaningful and
accurate per-memcg reclaim statistics.

This adds the following items to memory.stat:

pgreclaim
pgscan

  Number of pages reclaimed/scanned from that memcg due to its own
  hard limit (or physical limit in case of the root memcg) by the
  allocating task.

kswapd_pgreclaim
kswapd_pgscan

  Reclaim activity from kswapd due to the memcg's own limit.  Only
  applicable to the root memcg for now since kswapd is only triggered
  by physical limits, but kswapd-style reclaim based on memcg hard
  limits is being developped.

hierarchy_pgreclaim
hierarchy_pgscan
hierarchy_kswapd_pgreclaim
hierarchy_kswapd_pgscan

  Reclaim activity due to limitations in one of the memcg's parents.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memory.txt |    4 ++
 include/linux/memcontrol.h       |   10 +++++
 mm/memcontrol.c                  |   84 +++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                      |    7 +++
 4 files changed, 103 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index cc0ebc5..eb9e982 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -389,6 +389,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+pgreclaim	- # of pages reclaimed due to this memcg's limit
+pgscan		- # of pages scanned due to this memcg's limit
+kswapd_*	- # reclaim activity by background daemon due to this memcg's limit
+hierarchy_*	- # reclaim activity due to pressure from parental memcg
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bd3b102..6c1d69e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -121,6 +121,8 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
+void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *,
+				unsigned long, unsigned long, bool);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
@@ -347,6 +349,14 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return NULL;
 }
 
+static inline void mem_cgroup_account_reclaim(struct mem_cgroup *root,
+					      struct mem_cgroup *memcg,
+					      unsigned long nr_reclaimed,
+					      unsigned long nr_scanned,
+					      bool kswapd)
+{
+}
+
 static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8e2a80d..170dff4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };
 
+#define MEM_CGROUP_EVENTS_KSWAPD 2
+#define MEM_CGROUP_EVENTS_HIERARCHY 4
+
 enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
 	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
 	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
+	MEM_CGROUP_EVENTS_PGRECLAIM,
+	MEM_CGROUP_EVENTS_PGSCAN,
+	MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
+	MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
+	MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
+	MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
+	MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
+	MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
+/**
+ * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
+ * @root: memcg that triggered reclaim
+ * @memcg: memcg that is actually being scanned
+ * @nr_reclaimed: number of pages reclaimed from @memcg
+ * @nr_scanned: number of pages scanned from @memcg
+ * @kswapd: whether reclaiming task is kswapd or allocator itself
+ */
+void mem_cgroup_account_reclaim(struct mem_cgroup *root,
+				struct mem_cgroup *memcg,
+				unsigned long nr_reclaimed,
+				unsigned long nr_scanned,
+				bool kswapd)
+{
+	unsigned int offset = 0;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	if (kswapd)
+		offset += MEM_CGROUP_EVENTS_KSWAPD;
+	if (root != memcg)
+		offset += MEM_CGROUP_EVENTS_HIERARCHY;
+
+	preempt_disable();
+	__this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGRECLAIM + offset],
+		       nr_reclaimed);
+	__this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGSCAN + offset],
+		       nr_scanned);
+	preempt_enable();
+}
+
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *memcg;
@@ -1662,6 +1705,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
 
 	while (1) {
+		unsigned long nr_reclaimed;
+
 		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
 		if (!victim) {
 			loop++;
@@ -1687,8 +1732,11 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 		}
 		if (!mem_cgroup_reclaimable(victim, false))
 			continue;
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
+		nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
+							   zone, &nr_scanned);
+		mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
+					   nr_scanned, current_is_kswapd());
+		total += nr_reclaimed;
 		*total_scanned += nr_scanned;
 		if (!res_counter_soft_limit_excess(&root_memcg->res))
 			break;
@@ -4023,6 +4071,14 @@ enum {
 	MCS_SWAP,
 	MCS_PGFAULT,
 	MCS_PGMAJFAULT,
+	MCS_PGRECLAIM,
+	MCS_PGSCAN,
+	MCS_KSWAPD_PGRECLAIM,
+	MCS_KSWAPD_PGSCAN,
+	MCS_HIERARCHY_PGRECLAIM,
+	MCS_HIERARCHY_PGSCAN,
+	MCS_HIERARCHY_KSWAPD_PGRECLAIM,
+	MCS_HIERARCHY_KSWAPD_PGSCAN,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -4047,6 +4103,14 @@ struct {
 	{"swap", "total_swap"},
 	{"pgfault", "total_pgfault"},
 	{"pgmajfault", "total_pgmajfault"},
+	{"pgreclaim", "total_pgreclaim"},
+	{"pgscan", "total_pgscan"},
+	{"kswapd_pgreclaim", "total_kswapd_pgreclaim"},
+	{"kswapd_pgscan", "total_kswapd_pgscan"},
+	{"hierarchy_pgreclaim", "total_hierarchy_pgreclaim"},
+	{"hierarchy_pgscan", "total_hierarchy_pgscan"},
+	{"hierarchy_kswapd_pgreclaim", "total_hierarchy_kswapd_pgreclaim"},
+	{"hierarchy_kswapd_pgscan", "total_hierarchy_kswapd_pgscan"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -4079,6 +4143,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *memcg, struct mcs_total_stat *s)
 	s->stat[MCS_PGFAULT] += val;
 	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT);
 	s->stat[MCS_PGMAJFAULT] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGRECLAIM);
+	s->stat[MCS_PGRECLAIM] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGSCAN);
+	s->stat[MCS_PGSCAN] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM);
+	s->stat[MCS_KSWAPD_PGRECLAIM] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
+	s->stat[MCS_KSWAPD_PGSCAN] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM);
+	s->stat[MCS_HIERARCHY_PGRECLAIM] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN);
+	s->stat[MCS_HIERARCHY_PGSCAN] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM);
+	s->stat[MCS_HIERARCHY_KSWAPD_PGRECLAIM] += val;
+	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN);
+	s->stat[MCS_HIERARCHY_KSWAPD_PGSCAN] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c631234..e3fd8a7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2115,12 +2115,19 @@ static void shrink_zone(int priority, struct zone *zone,
 
 	memcg = mem_cgroup_iter(root, NULL, &reclaim);
 	do {
+		unsigned long nr_reclaimed = sc->nr_reclaimed;
+		unsigned long nr_scanned = sc->nr_scanned;
 		struct mem_cgroup_zone mz = {
 			.mem_cgroup = memcg,
 			.zone = zone,
 		};
 
 		shrink_mem_cgroup_zone(priority, &mz, sc);
+
+		mem_cgroup_account_reclaim(root, memcg,
+					   sc->nr_reclaimed - nr_reclaimed,
+					   sc->nr_scanned - nr_scanned,
+					   current_is_kswapd());
 		/*
 		 * Limit reclaim has historically picked one memcg and
 		 * scanned it with decreasing priority levels until
-- 
1.7.7.5


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-10 15:02 [patch 0/2] mm: memcg reclaim integration followups Johannes Weiner
  2012-01-10 15:02 ` [patch 1/2] mm: memcg: per-memcg reclaim statistics Johannes Weiner
@ 2012-01-10 15:02 ` Johannes Weiner
  2012-01-11 21:42   ` Ying Han
                     ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Johannes Weiner @ 2012-01-10 15:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han, cgroups,
	linux-mm, linux-kernel

Right now, memcg soft limits are implemented by having a sorted tree
of memcgs that are in excess of their limits.  Under global memory
pressure, kswapd first reclaims from the biggest excessor and then
proceeds to do regular global reclaim.  The result of this is that
pages are reclaimed from all memcgs, but more scanning happens against
those above their soft limit.

With global reclaim doing memcg-aware hierarchical reclaim by default,
this is a lot easier to implement: everytime a memcg is reclaimed
from, scan more aggressively (per tradition with a priority of 0) if
it's above its soft limit.  With the same end result of scanning
everybody, but soft limit excessors a bit more.

Advantages:

  o smoother reclaim: soft limit reclaim is a separate stage before
    global reclaim, whose result is not communicated down the line and
    so overreclaim of the groups in excess is very likely.  After this
    patch, soft limit reclaim is fully integrated into regular reclaim
    and each memcg is considered exactly once per cycle.

  o true hierarchy support: soft limits are only considered when
    kswapd does global reclaim, but after this patch, targetted
    reclaim of a memcg will mind the soft limit settings of its child
    groups.

  o code size: soft limit reclaim requires a lot of code to maintain
    the per-node per-zone rb-trees to quickly find the biggest
    offender, dedicated paths for soft limit reclaim etc. while this
    new implementation gets away without all that.

Test:

The test consists of two concurrent kernel build jobs in separate
source trees, the master and the slave.  The two jobs get along nicely
on 600MB of available memory, so this is the zero overcommit control
case.  When available memory is decreased, the overcommit is
compensated by decreasing the soft limit of the slave by the same
amount, in the hope that the slave takes the hit and the master stays
unaffected.

                                    600M-0M-vanilla         600M-0M-patched
Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)

In the control case, the differences in elapsed time, number of major
faults taken, and reclaim statistics are within the noise for both the
master and the slave job.

                                     600M-280M-vanilla      600M-280M-patched
Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)

Here, the available memory is limited to 320 MB, the machine is
overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
of the slave merely 20 MB.

Looking at the slave job first, it is much better off with the patched
kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
a third.  The result is much fewer major faults taken, which in turn
lets the job finish quicker.

It would be a zero-sum game if the improvement happened at the cost of
the master but looking at the numbers, even the master performs better
with the patched kernel.  In fact, the master job is almost unaffected
on the patched kernel compared to the control case.

This is an odd phenomenon, as the patch does not directly change how
the master is reclaimed.  An explanation for this is that the severe
overreclaim of the slave in the unpatched kernel results in the master
growing bigger than in the patched case.  Combining the fact that
memcgs are scanned according to their size with the increased refault
rate of the overreclaimed slave triggering global reclaim more often
means that overall pressure on the master job is higher in the
unpatched kernel.

At any rate, the patched kernel seems to do a much better job at both
overall resource allocation under soft limit overcommit as well as the
requested prioritization of the master job.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |   18 +--
 mm/memcontrol.c            |  412 ++++----------------------------------------
 mm/vmscan.c                |   80 +--------
 3 files changed, 48 insertions(+), 462 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c1d69e..72368b7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
+bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);
 void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *,
 				unsigned long, unsigned long, bool);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -155,9 +156,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-						gfp_t gfp_mask,
-						unsigned long *total_scanned);
 u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
 
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
@@ -362,22 +360,20 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+static inline bool
+mem_cgroup_over_softlimit(struct mem_cgroup *root, struct mem_cgroup *memcg)
 {
+	return false;
 }
 
-static inline void mem_cgroup_dec_page_stat(struct page *page,
+static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {
 }
 
-static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
 {
-	return 0;
 }
 
 static inline
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 170dff4..d4f7ae5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -35,7 +35,6 @@
 #include <linux/limits.h>
 #include <linux/export.h>
 #include <linux/mutex.h>
-#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -118,12 +117,10 @@ enum mem_cgroup_events_index {
  */
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
-	MEM_CGROUP_TARGET_SOFTLIMIT,
 	MEM_CGROUP_TARGET_NUMAINFO,
 	MEM_CGROUP_NTARGETS,
 };
 #define THRESHOLDS_EVENTS_TARGET (128)
-#define SOFTLIMIT_EVENTS_TARGET (1024)
 #define NUMAINFO_EVENTS_TARGET	(1024)
 
 struct mem_cgroup_stat_cpu {
@@ -149,12 +146,6 @@ struct mem_cgroup_per_zone {
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
 	struct zone_reclaim_stat reclaim_stat;
-	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long long	usage_in_excess;/* Set to the value by which */
-						/* the soft limit is exceeded*/
-	bool			on_tree;
-	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
-						/* use container_of	   */
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -167,26 +158,6 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
-	struct rb_root rb_root;
-	spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
-	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	u64 threshold;
@@ -343,7 +314,6 @@ static bool move_file(void)
  * limit reclaim to prevent infinite loops, if they ever occur.
  */
 #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
-#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -398,164 +368,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 	return mem_cgroup_zoneinfo(memcg, nid, zid);
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz,
-				unsigned long long new_usage_in_excess)
-{
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
-
-	if (mz->on_tree)
-		return;
-
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess)
-			p = &(*p)->rb_left;
-		/*
-		 * We can't avoid mem cgroups that are over their soft
-		 * limit by the same amount
-		 */
-		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
-			p = &(*p)->rb_right;
-	}
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	if (!mz->on_tree)
-		return;
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_remove_exceeded(memcg, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
-{
-	unsigned long long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-	mctz = soft_limit_tree_from_page(page);
-
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-		excess = res_counter_soft_limit_excess(&memcg->res);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(memcg, mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
-			spin_unlock(&mctz->lock);
-		}
-	}
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
-	int node, zone;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			mz = mem_cgroup_zoneinfo(memcg, node, zone);
-			mctz = soft_limit_tree_node_zone(node, zone);
-			mem_cgroup_remove_exceeded(memcg, mz, mctz);
-		}
-	}
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
-
-retry:
-	mz = NULL;
-	rightmost = rb_last(&mctz->rb_root);
-	if (!rightmost)
-		goto done;		/* Nothing to reclaim from */
-
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
-	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
-	 */
-	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
-		!css_tryget(&mz->mem->css))
-		goto retry;
-done:
-	return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct mem_cgroup_per_zone *mz;
-
-	spin_lock(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
-	return mz;
-}
-
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -696,9 +508,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 		case MEM_CGROUP_TARGET_THRESH:
 			next = val + THRESHOLDS_EVENTS_TARGET;
 			break;
-		case MEM_CGROUP_TARGET_SOFTLIMIT:
-			next = val + SOFTLIMIT_EVENTS_TARGET;
-			break;
 		case MEM_CGROUP_TARGET_NUMAINFO:
 			next = val + NUMAINFO_EVENTS_TARGET;
 			break;
@@ -718,13 +527,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
 	preempt_disable();
-	/* threshold event is triggered in finer grain than soft limit */
+	/* threshold event is triggered in finer grain than numa info */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
-		bool do_softlimit, do_numainfo;
+		bool do_numainfo;
 
-		do_softlimit = mem_cgroup_event_ratelimit(memcg,
-						MEM_CGROUP_TARGET_SOFTLIMIT);
 #if MAX_NUMNODES > 1
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
@@ -732,8 +539,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		preempt_enable();
 
 		mem_cgroup_threshold(memcg);
-		if (unlikely(do_softlimit))
-			mem_cgroup_update_tree(memcg, page);
 #if MAX_NUMNODES > 1
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 	return margin >> PAGE_SHIFT;
 }
 
+/**
+ * mem_cgroup_over_softlimit
+ * @root: hierarchy root
+ * @memcg: child of @root to test
+ *
+ * Returns %true if @memcg exceeds its own soft limit or contributes
+ * to the soft limit excess of one of its parents up to and including
+ * @root.
+ */
+bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
+			       struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		/* root_mem_cgroup does not have a soft limit */
+		if (memcg == root_mem_cgroup)
+			break;
+		if (res_counter_soft_limit_excess(&memcg->res))
+			return true;
+		if (memcg == root)
+			break;
+	}
+	return false;
+}
+
 int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
@@ -1687,64 +1522,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
-				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
-{
-	struct mem_cgroup *victim = NULL;
-	int total = 0;
-	int loop = 0;
-	unsigned long excess;
-	unsigned long nr_scanned;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = 0,
-	};
-
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
-	while (1) {
-		unsigned long nr_reclaimed;
-
-		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
-		if (!victim) {
-			loop++;
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total)
-					break;
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
-					break;
-			}
-			continue;
-		}
-		if (!mem_cgroup_reclaimable(victim, false))
-			continue;
-		nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-							   zone, &nr_scanned);
-		mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
-					   nr_scanned, current_is_kswapd());
-		total += nr_reclaimed;
-		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
-			break;
-	}
-	mem_cgroup_iter_break(root_memcg, victim);
-	return total;
-}
-
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
@@ -2507,8 +2284,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	unlock_page_cgroup(pc);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
-	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
-	 * if they exceeds softlimit.
 	 */
 	memcg_check_events(memcg, page);
 }
@@ -3578,98 +3353,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-	unsigned long nr_scanned;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
-						    gfp_mask, &nr_scanned);
-		nr_reclaimed += reclaimed;
-		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz)
-					css_put(&next_mz->mem->css);
-				else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->mem->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->mem->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->mem->css);
-	return nr_reclaimed;
-}
-
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4816,9 +4499,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
-		mz->mem = memcg;
 	}
 	memcg->info.nodeinfo[node] = pn;
 	return 0;
@@ -4872,7 +4552,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	int node;
 
-	mem_cgroup_remove_from_trees(memcg);
 	free_css_id(&mem_cgroup_subsys, &memcg->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -4927,31 +4606,6 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
-static int mem_cgroup_soft_limit_tree_init(void)
-{
-	struct mem_cgroup_tree_per_node *rtpn;
-	struct mem_cgroup_tree_per_zone *rtpz;
-	int tmp, node, zone;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		tmp = node;
-		if (!node_state(node, N_NORMAL_MEMORY))
-			tmp = -1;
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
-		if (!rtpn)
-			return 1;
-
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
-	}
-	return 0;
-}
-
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -4973,8 +4627,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = memcg;
-		if (mem_cgroup_soft_limit_tree_init())
-			goto free_out;
 		for_each_possible_cpu(cpu) {
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e3fd8a7..4279549 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
 			.mem_cgroup = memcg,
 			.zone = zone,
 		};
+		int epriority = priority;
+		/*
+		 * Put more pressure on hierarchies that exceed their
+		 * soft limit, to push them back harder than their
+		 * well-behaving siblings.
+		 */
+		if (mem_cgroup_over_softlimit(root, memcg))
+			epriority = 0;
 
-		shrink_mem_cgroup_zone(priority, &mz, sc);
+		shrink_mem_cgroup_zone(epriority, &mz, sc);
 
 		mem_cgroup_account_reclaim(root, memcg,
 					   sc->nr_reclaimed - nr_reclaimed,
@@ -2171,8 +2179,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 {
 	struct zoneref *z;
 	struct zone *zone;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	bool should_abort_reclaim = false;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -2205,19 +2211,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 					continue;
 				}
 			}
-			/*
-			 * This steals pages from memory cgroups over softlimit
-			 * and returns the number of reclaimed pages and
-			 * scanned pages. This works for global memory pressure
-			 * and balancing, not for a memcg's limit.
-			 */
-			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
-			sc->nr_reclaimed += nr_soft_reclaimed;
-			sc->nr_scanned += nr_soft_scanned;
-			/* need some check for avoid more shrink_zone() */
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -2393,48 +2386,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
-						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
-						unsigned long *nr_scanned)
-{
-	struct scan_control sc = {
-		.nr_scanned = 0,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.may_writepage = !laptop_mode,
-		.may_unmap = 1,
-		.may_swap = !noswap,
-		.order = 0,
-		.target_mem_cgroup = memcg,
-	};
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = memcg,
-		.zone = zone,
-	};
-
-	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
-			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-
-	trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
-						      sc.may_writepage,
-						      sc.gfp_mask);
-
-	/*
-	 * NOTE: Although we can get the priority field, using it
-	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
-	 * will pick up pages from other mem cgroup's as well. We hack
-	 * the priority and make it zero.
-	 */
-	shrink_mem_cgroup_zone(0, &mz, &sc);
-
-	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
-
-	*nr_scanned = sc.nr_scanned;
-	return sc.nr_reclaimed;
-}
-
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   gfp_t gfp_mask,
 					   bool noswap)
@@ -2609,8 +2560,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
@@ -2701,17 +2650,6 @@ loop_again:
 				continue;
 
 			sc.nr_scanned = 0;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-			total_scanned += nr_soft_scanned;
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
-- 
1.7.7.5


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
  2012-01-10 15:02 ` [patch 1/2] mm: memcg: per-memcg reclaim statistics Johannes Weiner
@ 2012-01-10 23:54   ` Ying Han
  2012-01-11  0:30     ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Ying Han @ 2012-01-10 23:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

Thank you for the patch and the stats looks reasonable to me, few
questions as below:

On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> With the single per-zone LRU gone and global reclaim scanning
> individual memcgs, it's straight-forward to collect meaningful and
> accurate per-memcg reclaim statistics.
>
> This adds the following items to memory.stat:

Some of the previous discussions including patches have similar stats
in memory.vmscan_stat API, which collects all the per-memcg vmscan
stats. I would like to understand more why we add into memory.stat
instead, and do we have plan to keep extending memory.stat for those
vmstat like stats?

>
> pgreclaim

Not sure if we want to keep this more consistent to /proc/vmstat, then
it will be "pgsteal"?

> pgscan
>
>  Number of pages reclaimed/scanned from that memcg due to its own
>  hard limit (or physical limit in case of the root memcg) by the
>  allocating task.
>
> kswapd_pgreclaim
> kswapd_pgscan

we have "pgscan_kswapd_*" in vmstat, so maybe ?
"pgsteal_kswapd"
"pgscan_kswapd"

>
>  Reclaim activity from kswapd due to the memcg's own limit.  Only
>  applicable to the root memcg for now since kswapd is only triggered
>  by physical limits, but kswapd-style reclaim based on memcg hard
>  limits is being developped.
>
> hierarchy_pgreclaim
> hierarchy_pgscan
> hierarchy_kswapd_pgreclaim
> hierarchy_kswapd_pgscan

"pgsteal_hierarchy"
"pgsteal_kswapd_hierarchy"
..

No strong option on the naming, but try to make it more consistent to
existing API.


>
>  Reclaim activity due to limitations in one of the memcg's parents.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/cgroups/memory.txt |    4 ++
>  include/linux/memcontrol.h       |   10 +++++
>  mm/memcontrol.c                  |   84 +++++++++++++++++++++++++++++++++++++-
>  mm/vmscan.c                      |    7 +++
>  4 files changed, 103 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index cc0ebc5..eb9e982 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -389,6 +389,10 @@ mapped_file        - # of bytes of mapped file (includes tmpfs/shmem)
>  pgpgin         - # of pages paged in (equivalent to # of charging events).
>  pgpgout                - # of pages paged out (equivalent to # of uncharging events).
>  swap           - # of bytes of swap usage
> +pgreclaim      - # of pages reclaimed due to this memcg's limit
> +pgscan         - # of pages scanned due to this memcg's limit
> +kswapd_*       - # reclaim activity by background daemon due to this memcg's limit
> +hierarchy_*    - # reclaim activity due to pressure from parental memcg
>  inactive_anon  - # of bytes of anonymous memory and swap cache memory on
>                LRU list.
>  active_anon    - # of bytes of anonymous and swap cache memory on active
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bd3b102..6c1d69e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,8 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>                                                      struct zone *zone);
>  struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> +void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *,
> +                               unsigned long, unsigned long, bool);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>                                        struct task_struct *p);
>  extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> @@ -347,6 +349,14 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
>        return NULL;
>  }
>
> +static inline void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> +                                             struct mem_cgroup *memcg,
> +                                             unsigned long nr_reclaimed,
> +                                             unsigned long nr_scanned,
> +                                             bool kswapd)
> +{
> +}
> +
>  static inline void
>  mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8e2a80d..170dff4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
>        MEM_CGROUP_STAT_NSTATS,
>  };
>
> +#define MEM_CGROUP_EVENTS_KSWAPD 2
> +#define MEM_CGROUP_EVENTS_HIERARCHY 4
> +
>  enum mem_cgroup_events_index {
>        MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
>        MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
>        MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
>        MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
>        MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> +       MEM_CGROUP_EVENTS_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_PGSCAN,
> +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
> +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
> +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
> +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,

missing comment here?
>        MEM_CGROUP_EVENTS_NSTATS,
>  };
>  /*
> @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>        return (memcg == root_mem_cgroup);
>  }
>
> +/**
> + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
> + * @root: memcg that triggered reclaim
> + * @memcg: memcg that is actually being scanned
> + * @nr_reclaimed: number of pages reclaimed from @memcg
> + * @nr_scanned: number of pages scanned from @memcg
> + * @kswapd: whether reclaiming task is kswapd or allocator itself
> + */
> +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> +                               struct mem_cgroup *memcg,
> +                               unsigned long nr_reclaimed,
> +                               unsigned long nr_scanned,
> +                               bool kswapd)
> +{
> +       unsigned int offset = 0;
> +
> +       if (!root)
> +               root = root_mem_cgroup;
> +
> +       if (kswapd)
> +               offset += MEM_CGROUP_EVENTS_KSWAPD;
> +       if (root != memcg)
> +               offset += MEM_CGROUP_EVENTS_HIERARCHY;

Just to be clear, here root cgroup has hierarchy_* stats always 0 ?
Also, we might want to consider renaming the root here, something like
target? The root is confusing with root_mem_cgroup.

--Ying

> +
> +       preempt_disable();
> +       __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGRECLAIM + offset],
> +                      nr_reclaimed);
> +       __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGSCAN + offset],
> +                      nr_scanned);
> +       preempt_enable();
> +}
> +
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  {
>        struct mem_cgroup *memcg;
> @@ -1662,6 +1705,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>        excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
>
>        while (1) {
> +               unsigned long nr_reclaimed;
> +
>                victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
>                if (!victim) {
>                        loop++;
> @@ -1687,8 +1732,11 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>                }
>                if (!mem_cgroup_reclaimable(victim, false))
>                        continue;
> -               total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> -                                                    zone, &nr_scanned);
> +               nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> +                                                          zone, &nr_scanned);
> +               mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
> +                                          nr_scanned, current_is_kswapd());
> +               total += nr_reclaimed;
>                *total_scanned += nr_scanned;
>                if (!res_counter_soft_limit_excess(&root_memcg->res))
>                        break;
> @@ -4023,6 +4071,14 @@ enum {
>        MCS_SWAP,
>        MCS_PGFAULT,
>        MCS_PGMAJFAULT,
> +       MCS_PGRECLAIM,
> +       MCS_PGSCAN,
> +       MCS_KSWAPD_PGRECLAIM,
> +       MCS_KSWAPD_PGSCAN,
> +       MCS_HIERARCHY_PGRECLAIM,
> +       MCS_HIERARCHY_PGSCAN,
> +       MCS_HIERARCHY_KSWAPD_PGRECLAIM,
> +       MCS_HIERARCHY_KSWAPD_PGSCAN,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -4047,6 +4103,14 @@ struct {
>        {"swap", "total_swap"},
>        {"pgfault", "total_pgfault"},
>        {"pgmajfault", "total_pgmajfault"},
> +       {"pgreclaim", "total_pgreclaim"},
> +       {"pgscan", "total_pgscan"},
> +       {"kswapd_pgreclaim", "total_kswapd_pgreclaim"},
> +       {"kswapd_pgscan", "total_kswapd_pgscan"},
> +       {"hierarchy_pgreclaim", "total_hierarchy_pgreclaim"},
> +       {"hierarchy_pgscan", "total_hierarchy_pgscan"},
> +       {"hierarchy_kswapd_pgreclaim", "total_hierarchy_kswapd_pgreclaim"},
> +       {"hierarchy_kswapd_pgscan", "total_hierarchy_kswapd_pgscan"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -4079,6 +4143,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *memcg, struct mcs_total_stat *s)
>        s->stat[MCS_PGFAULT] += val;
>        val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT);
>        s->stat[MCS_PGMAJFAULT] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGRECLAIM);
> +       s->stat[MCS_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGSCAN);
> +       s->stat[MCS_PGSCAN] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM);
> +       s->stat[MCS_KSWAPD_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_KSWAPD_PGSCAN);
> +       s->stat[MCS_KSWAPD_PGSCAN] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM);
> +       s->stat[MCS_HIERARCHY_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN);
> +       s->stat[MCS_HIERARCHY_PGSCAN] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM);
> +       s->stat[MCS_HIERARCHY_KSWAPD_PGRECLAIM] += val;
> +       val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN);
> +       s->stat[MCS_HIERARCHY_KSWAPD_PGSCAN] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON));
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c631234..e3fd8a7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2115,12 +2115,19 @@ static void shrink_zone(int priority, struct zone *zone,
>
>        memcg = mem_cgroup_iter(root, NULL, &reclaim);
>        do {
> +               unsigned long nr_reclaimed = sc->nr_reclaimed;
> +               unsigned long nr_scanned = sc->nr_scanned;
>                struct mem_cgroup_zone mz = {
>                        .mem_cgroup = memcg,
>                        .zone = zone,
>                };
>
>                shrink_mem_cgroup_zone(priority, &mz, sc);
> +
> +               mem_cgroup_account_reclaim(root, memcg,
> +                                          sc->nr_reclaimed - nr_reclaimed,
> +                                          sc->nr_scanned - nr_scanned,
> +                                          current_is_kswapd());
>                /*
>                 * Limit reclaim has historically picked one memcg and
>                 * scanned it with decreasing priority levels until
> --
> 1.7.7.5
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
  2012-01-10 23:54   ` Ying Han
@ 2012-01-11  0:30     ` Johannes Weiner
  2012-01-11 22:33       ` Ying Han
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-11  0:30 UTC (permalink / raw)
  To: Ying Han
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
> Thank you for the patch and the stats looks reasonable to me, few
> questions as below:
> 
> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > With the single per-zone LRU gone and global reclaim scanning
> > individual memcgs, it's straight-forward to collect meaningful and
> > accurate per-memcg reclaim statistics.
> >
> > This adds the following items to memory.stat:
> 
> Some of the previous discussions including patches have similar stats
> in memory.vmscan_stat API, which collects all the per-memcg vmscan
> stats. I would like to understand more why we add into memory.stat
> instead, and do we have plan to keep extending memory.stat for those
> vmstat like stats?

I think they were put into an extra file in particular to be able to
write to this file to reset the statistics.  But in my opinion, it's
trivial to calculate a delta from before and after running a workload,
so I didn't really like adding kernel code for that.

Did you have another reason for a separate file in mind?

> > pgreclaim
> 
> Not sure if we want to keep this more consistent to /proc/vmstat, then
> it will be "pgsteal"?

The problem with that was that we didn't like to call pages stolen
when they were reclaimed from within the cgroup, so we had pgfree for
inner reclaim and pgsteal for outer reclaim, respectively.

I found it cleaner to just go with pgreclaim, it's unambiguous and
straight-forward.  Outer reclaim is designated by the hierarchy_
prefix.

> > pgscan
> >
> >  Number of pages reclaimed/scanned from that memcg due to its own
> >  hard limit (or physical limit in case of the root memcg) by the
> >  allocating task.
> >
> > kswapd_pgreclaim
> > kswapd_pgscan
> 
> we have "pgscan_kswapd_*" in vmstat, so maybe ?
> "pgsteal_kswapd"
> "pgscan_kswapd"
> 
> >  Reclaim activity from kswapd due to the memcg's own limit.  Only
> >  applicable to the root memcg for now since kswapd is only triggered
> >  by physical limits, but kswapd-style reclaim based on memcg hard
> >  limits is being developped.
> >
> > hierarchy_pgreclaim
> > hierarchy_pgscan
> > hierarchy_kswapd_pgreclaim
> > hierarchy_kswapd_pgscan
> 
> "pgsteal_hierarchy"
> "pgsteal_kswapd_hierarchy"
> ..
> 
> No strong option on the naming, but try to make it more consistent to
> existing API.

I swear I tried, but the existing naming is pretty screwed up :(

For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
scan rates of direct reclaim vs. kswapd reclaim.  To get the total
number of pages reclaimed, you sum them up.

On the other hand, pgsteal_* does not differentiate between direct
reclaim and kswapd, so to get direct reclaim numbers, you add up the
pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
which is in turn not available at zone granularity.

> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4

These two function as namespaces, that's why I put hierarchy_ and
kswapd_ at the beginning of the names.

Given that we have kswapd_steal, would you be okay with doing it like
this?  I mean, at least my naming conforms to ONE of the standards in
/proc/vmstat, right? ;-)

> > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
> >        MEM_CGROUP_STAT_NSTATS,
> >  };
> >
> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
> > +
> >  enum mem_cgroup_events_index {
> >        MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
> >        MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
> >        MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
> >        MEM_CGROUP_EVENTS_PGFAULT,      /* # of page-faults */
> >        MEM_CGROUP_EVENTS_PGMAJFAULT,   /* # of major page-faults */
> > +       MEM_CGROUP_EVENTS_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_PGSCAN,
> > +       MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
> > +       MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
> 
> missing comment here?

As if the lines weren't long enough already ;-) I'll add some.

> >        MEM_CGROUP_EVENTS_NSTATS,
> >  };
> >  /*
> > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> >        return (memcg == root_mem_cgroup);
> >  }
> >
> > +/**
> > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
> > + * @root: memcg that triggered reclaim
> > + * @memcg: memcg that is actually being scanned
> > + * @nr_reclaimed: number of pages reclaimed from @memcg
> > + * @nr_scanned: number of pages scanned from @memcg
> > + * @kswapd: whether reclaiming task is kswapd or allocator itself
> > + */
> > +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
> > +                               struct mem_cgroup *memcg,
> > +                               unsigned long nr_reclaimed,
> > +                               unsigned long nr_scanned,
> > +                               bool kswapd)
> > +{
> > +       unsigned int offset = 0;
> > +
> > +       if (!root)
> > +               root = root_mem_cgroup;
> > +
> > +       if (kswapd)
> > +               offset += MEM_CGROUP_EVENTS_KSWAPD;
> > +       if (root != memcg)
> > +               offset += MEM_CGROUP_EVENTS_HIERARCHY;
> 
> Just to be clear, here root cgroup has hierarchy_* stats always 0 ?

That's correct, there can't be any hierarchical pressure on the
topmost parent.

> Also, we might want to consider renaming the root here, something like
> target? The root is confusing with root_mem_cgroup.

It's the same naming scheme I used for the iterator functions
(mem_cgroup_iter() and friends), so if we change it, I'd like to
change it consistently.

Having target and memcg as parameters is even more confusing and
non-descriptive, IMO.

Other places use mem_over_limit, which is a bit better, but quite
long.

Any other ideas for great names for parameters that designate a
hierarchy root and a memcg in that hierarchy?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-10 15:02 ` [patch 2/2] mm: memcg: hierarchical soft limit reclaim Johannes Weiner
@ 2012-01-11 21:42   ` Ying Han
  2012-01-12  8:59     ` Johannes Weiner
  2012-01-12  1:54   ` KAMEZAWA Hiroyuki
  2012-01-13 12:04   ` Michal Hocko
  2 siblings, 1 reply; 29+ messages in thread
From: Ying Han @ 2012-01-11 21:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.  The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
>
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
>
> Advantages:
>
>  o smoother reclaim: soft limit reclaim is a separate stage before
>    global reclaim, whose result is not communicated down the line and
>    so overreclaim of the groups in excess is very likely.  After this
>    patch, soft limit reclaim is fully integrated into regular reclaim
>    and each memcg is considered exactly once per cycle.
>
>  o true hierarchy support: soft limits are only considered when
>    kswapd does global reclaim, but after this patch, targetted
>    reclaim of a memcg will mind the soft limit settings of its child
>    groups.

Why we add soft limit reclaim into target reclaim?

Based on the discussions, my understanding is that the soft limit only
take effect while the whole machine is under memory contention. We
don't want to add extra pressure on a cgroup if there is free memory
on the system even the cgroup is above its limit.

>
>  o code size: soft limit reclaim requires a lot of code to maintain
>    the per-node per-zone rb-trees to quickly find the biggest
>    offender, dedicated paths for soft limit reclaim etc. while this
>    new implementation gets away without all that.
>
> Test:
>
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave.  The two jobs get along nicely
> on 600MB of available memory, so this is the zero overcommit control
> case.  When available memory is decreased, the overcommit is
> compensated by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
>
>                                    600M-0M-vanilla         600M-0M-patched
> Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
>
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.

What's the soft limit setting in the controlled case?

I assume it is the default RESOURCE_MAX. So both Master and Slave get
equal pressure before/after the patch, and no differences on the stats
should be observed.


>                                     600M-280M-vanilla      600M-280M-patched
> Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
>
> Here, the available memory is limited to 320 MB, the machine is
> overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
>
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> a third.  The result is much fewer major faults taken, which in turn
> lets the job finish quicker.

What's the setting of the hard limit here? Is the direct reclaim
referring to per-memcg directly reclaim or global one.

>
> It would be a zero-sum game if the improvement happened at the cost of
> the master but looking at the numbers, even the master performs better
> with the patched kernel.  In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.

It makes sense since the master job get less affected by the patch
than the slave job under the example. Under the control case, if both
master and slave have RESOURCE_MAX soft limit setting, they are under
equal memory pressure(priority = DEF_PRIORITY) . On the second
example, only the slave pressure being increased by priority = 0, and
the Master got scanned with same priority = DEF_PRIORITY pretty much.

So I would expect to see more reclaim activities happens in slave on
the patched kernel compared to the control case. It seems match the
testing result.

>
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed.  An explanation for this is that the severe
> overreclaim of the slave in the unpatched kernel results in the master
> growing bigger than in the patched case.  Combining the fact that
> memcgs are scanned according to their size with the increased refault
> rate of the overreclaimed slave triggering global reclaim more often
> means that overall pressure on the master job is higher in the
> unpatched kernel.

We can check the Master memory.usage_in_bytes while the job is running.

On the other hand, I don't see why we expect the Master being less
reclaimed in the controlled case? On the unpatched kernel, the Master
is being reclaimed under global pressure each time anyway since we
ignore the return value of softlimit.

>
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit as well as the
> requested prioritization of the master job.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |   18 +--
>  mm/memcontrol.c            |  412 ++++----------------------------------------
>  mm/vmscan.c                |   80 +--------
>  3 files changed, 48 insertions(+), 462 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c1d69e..72368b7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>                                                      struct zone *zone);
>  struct zone_reclaim_stat*
>  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);

Maybe something like "mem_cgroup_over_soft_limit()" ?

>  void mem_cgroup_account_reclaim(struct mem_cgroup *, struct mem_cgroup *,
>                                unsigned long, unsigned long, bool);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> @@ -155,9 +156,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>        mem_cgroup_update_page_stat(page, idx, -1);
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                               gfp_t gfp_mask,
> -                                               unsigned long *total_scanned);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
>
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
> @@ -362,22 +360,20 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> -                                           enum mem_cgroup_page_stat_item idx)
> +static inline bool
> +mem_cgroup_over_softlimit(struct mem_cgroup *root, struct mem_cgroup *memcg)
>  {
> +       return false;
>  }
>
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>                                            enum mem_cgroup_page_stat_item idx)
>  {
>  }
>
> -static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                           gfp_t gfp_mask,
> -                                           unsigned long *total_scanned)
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +                                           enum mem_cgroup_page_stat_item idx)
>  {
> -       return 0;
>  }
>
>  static inline
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 170dff4..d4f7ae5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -35,7 +35,6 @@
>  #include <linux/limits.h>
>  #include <linux/export.h>
>  #include <linux/mutex.h>
> -#include <linux/rbtree.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> @@ -118,12 +117,10 @@ enum mem_cgroup_events_index {
>  */
>  enum mem_cgroup_events_target {
>        MEM_CGROUP_TARGET_THRESH,
> -       MEM_CGROUP_TARGET_SOFTLIMIT,
>        MEM_CGROUP_TARGET_NUMAINFO,
>        MEM_CGROUP_NTARGETS,
>  };
>  #define THRESHOLDS_EVENTS_TARGET (128)
> -#define SOFTLIMIT_EVENTS_TARGET (1024)
>  #define NUMAINFO_EVENTS_TARGET (1024)
>
>  struct mem_cgroup_stat_cpu {
> @@ -149,12 +146,6 @@ struct mem_cgroup_per_zone {
>        struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
>
>        struct zone_reclaim_stat reclaim_stat;
> -       struct rb_node          tree_node;      /* RB tree node */
> -       unsigned long long      usage_in_excess;/* Set to the value by which */
> -                                               /* the soft limit is exceeded*/
> -       bool                    on_tree;
> -       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
> -                                               /* use container_of        */
>  };
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)      ((mz)->count[(idx)])
> @@ -167,26 +158,6 @@ struct mem_cgroup_lru_info {
>        struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
>  };
>
> -/*
> - * Cgroups above their limits are maintained in a RB-Tree, independent of
> - * their hierarchy representation
> - */
> -
> -struct mem_cgroup_tree_per_zone {
> -       struct rb_root rb_root;
> -       spinlock_t lock;
> -};
> -
> -struct mem_cgroup_tree_per_node {
> -       struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
> -};
> -
> -struct mem_cgroup_tree {
> -       struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
> -};
> -
> -static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> -
>  struct mem_cgroup_threshold {
>        struct eventfd_ctx *eventfd;
>        u64 threshold;
> @@ -343,7 +314,6 @@ static bool move_file(void)
>  * limit reclaim to prevent infinite loops, if they ever occur.
>  */
>  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
> -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)

You might need to remove the comment above as well.
>
>  enum charge_type {
>        MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> @@ -398,164 +368,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
>        return mem_cgroup_zoneinfo(memcg, nid, zid);
>  }
>
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_node_zone(int nid, int zid)
> -{
> -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_from_page(struct page *page)
> -{
> -       int nid = page_to_nid(page);
> -       int zid = page_zonenum(page);
> -
> -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static void
> -__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz,
> -                               unsigned long long new_usage_in_excess)
> -{
> -       struct rb_node **p = &mctz->rb_root.rb_node;
> -       struct rb_node *parent = NULL;
> -       struct mem_cgroup_per_zone *mz_node;
> -
> -       if (mz->on_tree)
> -               return;
> -
> -       mz->usage_in_excess = new_usage_in_excess;
> -       if (!mz->usage_in_excess)
> -               return;
> -       while (*p) {
> -               parent = *p;
> -               mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> -                                       tree_node);
> -               if (mz->usage_in_excess < mz_node->usage_in_excess)
> -                       p = &(*p)->rb_left;
> -               /*
> -                * We can't avoid mem cgroups that are over their soft
> -                * limit by the same amount
> -                */
> -               else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> -                       p = &(*p)->rb_right;
> -       }
> -       rb_link_node(&mz->tree_node, parent, p);
> -       rb_insert_color(&mz->tree_node, &mctz->rb_root);
> -       mz->on_tree = true;
> -}
> -
> -static void
> -__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       if (!mz->on_tree)
> -               return;
> -       rb_erase(&mz->tree_node, &mctz->rb_root);
> -       mz->on_tree = false;
> -}
> -
> -static void
> -mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       spin_lock(&mctz->lock);
> -       __mem_cgroup_remove_exceeded(memcg, mz, mctz);
> -       spin_unlock(&mctz->lock);
> -}
> -
> -
> -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
> -{
> -       unsigned long long excess;
> -       struct mem_cgroup_per_zone *mz;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -       int nid = page_to_nid(page);
> -       int zid = page_zonenum(page);
> -       mctz = soft_limit_tree_from_page(page);
> -
> -       /*
> -        * Necessary to update all ancestors when hierarchy is used.
> -        * because their event counter is not touched.
> -        */
> -       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> -               mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> -               excess = res_counter_soft_limit_excess(&memcg->res);
> -               /*
> -                * We have to update the tree if mz is on RB-tree or
> -                * mem is over its softlimit.
> -                */
> -               if (excess || mz->on_tree) {
> -                       spin_lock(&mctz->lock);
> -                       /* if on-tree, remove it */
> -                       if (mz->on_tree)
> -                               __mem_cgroup_remove_exceeded(memcg, mz, mctz);
> -                       /*
> -                        * Insert again. mz->usage_in_excess will be updated.
> -                        * If excess is 0, no tree ops.
> -                        */
> -                       __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
> -                       spin_unlock(&mctz->lock);
> -               }
> -       }
> -}
> -
> -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
> -{
> -       int node, zone;
> -       struct mem_cgroup_per_zone *mz;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -
> -       for_each_node_state(node, N_POSSIBLE) {
> -               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -                       mz = mem_cgroup_zoneinfo(memcg, node, zone);
> -                       mctz = soft_limit_tree_node_zone(node, zone);
> -                       mem_cgroup_remove_exceeded(memcg, mz, mctz);
> -               }
> -       }
> -}
> -
> -static struct mem_cgroup_per_zone *
> -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       struct rb_node *rightmost = NULL;
> -       struct mem_cgroup_per_zone *mz;
> -
> -retry:
> -       mz = NULL;
> -       rightmost = rb_last(&mctz->rb_root);
> -       if (!rightmost)
> -               goto done;              /* Nothing to reclaim from */
> -
> -       mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
> -       /*
> -        * Remove the node now but someone else can add it back,
> -        * we will to add it back at the end of reclaim to its correct
> -        * position in the tree.
> -        */
> -       __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -       if (!res_counter_soft_limit_excess(&mz->mem->res) ||
> -               !css_tryget(&mz->mem->css))
> -               goto retry;
> -done:
> -       return mz;
> -}
> -
> -static struct mem_cgroup_per_zone *
> -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       struct mem_cgroup_per_zone *mz;
> -
> -       spin_lock(&mctz->lock);
> -       mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -       spin_unlock(&mctz->lock);
> -       return mz;
> -}
> -
>  /*
>  * Implementation Note: reading percpu statistics for memcg.
>  *
> @@ -696,9 +508,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>                case MEM_CGROUP_TARGET_THRESH:
>                        next = val + THRESHOLDS_EVENTS_TARGET;
>                        break;
> -               case MEM_CGROUP_TARGET_SOFTLIMIT:
> -                       next = val + SOFTLIMIT_EVENTS_TARGET;
> -                       break;
>                case MEM_CGROUP_TARGET_NUMAINFO:
>                        next = val + NUMAINFO_EVENTS_TARGET;
>                        break;
> @@ -718,13 +527,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>  {
>        preempt_disable();
> -       /* threshold event is triggered in finer grain than soft limit */
> +       /* threshold event is triggered in finer grain than numa info */
>        if (unlikely(mem_cgroup_event_ratelimit(memcg,
>                                                MEM_CGROUP_TARGET_THRESH))) {
> -               bool do_softlimit, do_numainfo;
> +               bool do_numainfo;
>
> -               do_softlimit = mem_cgroup_event_ratelimit(memcg,
> -                                               MEM_CGROUP_TARGET_SOFTLIMIT);
>  #if MAX_NUMNODES > 1
>                do_numainfo = mem_cgroup_event_ratelimit(memcg,
>                                                MEM_CGROUP_TARGET_NUMAINFO);
> @@ -732,8 +539,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>                preempt_enable();
>
>                mem_cgroup_threshold(memcg);
> -               if (unlikely(do_softlimit))
> -                       mem_cgroup_update_tree(memcg, page);
>  #if MAX_NUMNODES > 1
>                if (unlikely(do_numainfo))
>                        atomic_inc(&memcg->numainfo_events);
> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>        return margin >> PAGE_SHIFT;
>  }
>
> +/**
> + * mem_cgroup_over_softlimit
> + * @root: hierarchy root
> + * @memcg: child of @root to test
> + *
> + * Returns %true if @memcg exceeds its own soft limit or contributes
> + * to the soft limit excess of one of its parents up to and including
> + * @root.
> + */
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> +                              struct mem_cgroup *memcg)
> +{
> +       if (mem_cgroup_disabled())
> +               return false;
> +
> +       if (!root)
> +               root = root_mem_cgroup;
> +
> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +               /* root_mem_cgroup does not have a soft limit */
> +               if (memcg == root_mem_cgroup)
> +                       break;
> +               if (res_counter_soft_limit_excess(&memcg->res))
> +                       return true;
> +               if (memcg == root)
> +                       break;
> +       }

Here it adds pressure on a cgroup if one of its parents exceeds soft
limit, although the cgroup itself is under soft limit. It does change
my understanding of soft limit, and might introduce regression of our
existing use cases.

Here is an example:

Machine capacity 32G and we over-commit by 8G.

root
  -> A (hard limit 20G, soft limit 15G, usage 16G)
       -> A1 (soft limit 5G, usage 4G)
       -> A2 (soft limit 10G, usage 12G)
  -> B (hard limit 20G, soft limit 10G, usage 16G)

under global reclaim, we don't want to add pressure on A1 although its
parent A exceeds its soft limit. Assume that if we set the soft limit
corresponding to each cgroup's working set size (hot memory), and it
will introduce regression to A1 in that case.

In my existing implementation, i am checking the cgroup's soft limit
standalone w/o looking its ancestors.

> +       return false;
> +}
> +
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>        struct cgroup *cgrp = memcg->css.cgroup;
> @@ -1687,64 +1522,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
>  }
>  #endif
>
> -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> -                                  struct zone *zone,
> -                                  gfp_t gfp_mask,
> -                                  unsigned long *total_scanned)
> -{
> -       struct mem_cgroup *victim = NULL;
> -       int total = 0;
> -       int loop = 0;
> -       unsigned long excess;
> -       unsigned long nr_scanned;
> -       struct mem_cgroup_reclaim_cookie reclaim = {
> -               .zone = zone,
> -               .priority = 0,
> -       };
> -
> -       excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
> -
> -       while (1) {
> -               unsigned long nr_reclaimed;
> -
> -               victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> -               if (!victim) {
> -                       loop++;
> -                       if (loop >= 2) {
> -                               /*
> -                                * If we have not been able to reclaim
> -                                * anything, it might because there are
> -                                * no reclaimable pages under this hierarchy
> -                                */
> -                               if (!total)
> -                                       break;
> -                               /*
> -                                * We want to do more targeted reclaim.
> -                                * excess >> 2 is not to excessive so as to
> -                                * reclaim too much, nor too less that we keep
> -                                * coming back to reclaim from this cgroup
> -                                */
> -                               if (total >= (excess >> 2) ||
> -                                       (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> -                                       break;
> -                       }
> -                       continue;
> -               }
> -               if (!mem_cgroup_reclaimable(victim, false))
> -                       continue;
> -               nr_reclaimed = mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> -                                                          zone, &nr_scanned);
> -               mem_cgroup_account_reclaim(root_mem_cgroup, victim, nr_reclaimed,
> -                                          nr_scanned, current_is_kswapd());
> -               total += nr_reclaimed;
> -               *total_scanned += nr_scanned;
> -               if (!res_counter_soft_limit_excess(&root_memcg->res))
> -                       break;
> -       }
> -       mem_cgroup_iter_break(root_memcg, victim);
> -       return total;
> -}
> -
>  /*
>  * Check OOM-Killer is already running under our hierarchy.
>  * If someone is running, return false.
> @@ -2507,8 +2284,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>        unlock_page_cgroup(pc);
>        /*
>         * "charge_statistics" updated event counter. Then, check it.
> -        * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> -        * if they exceeds softlimit.
>         */
>        memcg_check_events(memcg, page);
>  }
> @@ -3578,98 +3353,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>        return ret;
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                           gfp_t gfp_mask,
> -                                           unsigned long *total_scanned)
> -{
> -       unsigned long nr_reclaimed = 0;
> -       struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> -       unsigned long reclaimed;
> -       int loop = 0;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -       unsigned long long excess;
> -       unsigned long nr_scanned;
> -
> -       if (order > 0)
> -               return 0;
> -
> -       mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
> -       /*
> -        * This loop can run a while, specially if mem_cgroup's continuously
> -        * keep exceeding their soft limit and putting the system under
> -        * pressure
> -        */
> -       do {
> -               if (next_mz)
> -                       mz = next_mz;
> -               else
> -                       mz = mem_cgroup_largest_soft_limit_node(mctz);
> -               if (!mz)
> -                       break;
> -
> -               nr_scanned = 0;
> -               reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
> -                                                   gfp_mask, &nr_scanned);
> -               nr_reclaimed += reclaimed;
> -               *total_scanned += nr_scanned;
> -               spin_lock(&mctz->lock);
> -
> -               /*
> -                * If we failed to reclaim anything from this memory cgroup
> -                * it is time to move on to the next cgroup
> -                */
> -               next_mz = NULL;
> -               if (!reclaimed) {
> -                       do {
> -                               /*
> -                                * Loop until we find yet another one.
> -                                *
> -                                * By the time we get the soft_limit lock
> -                                * again, someone might have aded the
> -                                * group back on the RB tree. Iterate to
> -                                * make sure we get a different mem.
> -                                * mem_cgroup_largest_soft_limit_node returns
> -                                * NULL if no other cgroup is present on
> -                                * the tree
> -                                */
> -                               next_mz =
> -                               __mem_cgroup_largest_soft_limit_node(mctz);
> -                               if (next_mz == mz)
> -                                       css_put(&next_mz->mem->css);
> -                               else /* next_mz == NULL or other memcg */
> -                                       break;
> -                       } while (1);
> -               }
> -               __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -               excess = res_counter_soft_limit_excess(&mz->mem->res);
> -               /*
> -                * One school of thought says that we should not add
> -                * back the node to the tree if reclaim returns 0.
> -                * But our reclaim could return 0, simply because due
> -                * to priority we are exposing a smaller subset of
> -                * memory to reclaim from. Consider this as a longer
> -                * term TODO.
> -                */
> -               /* If excess == 0, no tree ops */
> -               __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
> -               spin_unlock(&mctz->lock);
> -               css_put(&mz->mem->css);
> -               loop++;
> -               /*
> -                * Could not reclaim anything and there are no more
> -                * mem cgroups to try or we seem to be looping without
> -                * reclaiming anything.
> -                */
> -               if (!nr_reclaimed &&
> -                       (next_mz == NULL ||
> -                       loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> -                       break;
> -       } while (!nr_reclaimed);
> -       if (next_mz)
> -               css_put(&next_mz->mem->css);
> -       return nr_reclaimed;
> -}
> -
>  /*
>  * This routine traverse page_cgroup in given list and drop them all.
>  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> @@ -4816,9 +4499,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
>                mz = &pn->zoneinfo[zone];
>                for_each_lru(l)
>                        INIT_LIST_HEAD(&mz->lruvec.lists[l]);
> -               mz->usage_in_excess = 0;
> -               mz->on_tree = false;
> -               mz->mem = memcg;
>        }
>        memcg->info.nodeinfo[node] = pn;
>        return 0;
> @@ -4872,7 +4552,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  {
>        int node;
>
> -       mem_cgroup_remove_from_trees(memcg);
>        free_css_id(&mem_cgroup_subsys, &memcg->css);
>
>        for_each_node_state(node, N_POSSIBLE)
> @@ -4927,31 +4606,6 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>
> -static int mem_cgroup_soft_limit_tree_init(void)
> -{
> -       struct mem_cgroup_tree_per_node *rtpn;
> -       struct mem_cgroup_tree_per_zone *rtpz;
> -       int tmp, node, zone;
> -
> -       for_each_node_state(node, N_POSSIBLE) {
> -               tmp = node;
> -               if (!node_state(node, N_NORMAL_MEMORY))
> -                       tmp = -1;
> -               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> -               if (!rtpn)
> -                       return 1;
> -
> -               soft_limit_tree.rb_tree_per_node[node] = rtpn;
> -
> -               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -                       rtpz = &rtpn->rb_tree_per_zone[zone];
> -                       rtpz->rb_root = RB_ROOT;
> -                       spin_lock_init(&rtpz->lock);
> -               }
> -       }
> -       return 0;
> -}
> -
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -4973,8 +4627,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>                enable_swap_cgroup();
>                parent = NULL;
>                root_mem_cgroup = memcg;
> -               if (mem_cgroup_soft_limit_tree_init())
> -                       goto free_out;
>                for_each_possible_cpu(cpu) {
>                        struct memcg_stock_pcp *stock =
>                                                &per_cpu(memcg_stock, cpu);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e3fd8a7..4279549 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>                        .mem_cgroup = memcg,
>                        .zone = zone,
>                };
> +               int epriority = priority;
> +               /*
> +                * Put more pressure on hierarchies that exceed their
> +                * soft limit, to push them back harder than their
> +                * well-behaving siblings.
> +                */
> +               if (mem_cgroup_over_softlimit(root, memcg))
> +                       epriority = 0;
>
> -               shrink_mem_cgroup_zone(priority, &mz, sc);
> +               shrink_mem_cgroup_zone(epriority, &mz, sc);
>
>                mem_cgroup_account_reclaim(root, memcg,
>                                           sc->nr_reclaimed - nr_reclaimed,
> @@ -2171,8 +2179,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
>  {
>        struct zoneref *z;
>        struct zone *zone;
> -       unsigned long nr_soft_reclaimed;
> -       unsigned long nr_soft_scanned;
>        bool should_abort_reclaim = false;
>
>        for_each_zone_zonelist_nodemask(zone, z, zonelist,
> @@ -2205,19 +2211,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
>                                        continue;
>                                }
>                        }
> -                       /*
> -                        * This steals pages from memory cgroups over softlimit
> -                        * and returns the number of reclaimed pages and
> -                        * scanned pages. This works for global memory pressure
> -                        * and balancing, not for a memcg's limit.
> -                        */
> -                       nr_soft_scanned = 0;
> -                       nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
> -                                               sc->order, sc->gfp_mask,
> -                                               &nr_soft_scanned);
> -                       sc->nr_reclaimed += nr_soft_reclaimed;
> -                       sc->nr_scanned += nr_soft_scanned;
> -                       /* need some check for avoid more shrink_zone() */
>                }
>
>                shrink_zone(priority, zone, sc);
> @@ -2393,48 +2386,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  }
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -
> -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
> -                                               gfp_t gfp_mask, bool noswap,
> -                                               struct zone *zone,
> -                                               unsigned long *nr_scanned)
> -{
> -       struct scan_control sc = {
> -               .nr_scanned = 0,
> -               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> -               .may_writepage = !laptop_mode,
> -               .may_unmap = 1,
> -               .may_swap = !noswap,
> -               .order = 0,
> -               .target_mem_cgroup = memcg,
> -       };
> -       struct mem_cgroup_zone mz = {
> -               .mem_cgroup = memcg,
> -               .zone = zone,
> -       };
> -
> -       sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> -                       (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> -
> -       trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
> -                                                     sc.may_writepage,
> -                                                     sc.gfp_mask);
> -
> -       /*
> -        * NOTE: Although we can get the priority field, using it
> -        * here is not a good idea, since it limits the pages we can scan.
> -        * if we don't reclaim here, the shrink_zone from balance_pgdat
> -        * will pick up pages from other mem cgroup's as well. We hack
> -        * the priority and make it zero.
> -        */
> -       shrink_mem_cgroup_zone(0, &mz, &sc);
> -
> -       trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
> -
> -       *nr_scanned = sc.nr_scanned;
> -       return sc.nr_reclaimed;
> -}
> -
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                           gfp_t gfp_mask,
>                                           bool noswap)
> @@ -2609,8 +2560,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>        int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */
>        unsigned long total_scanned;
>        struct reclaim_state *reclaim_state = current->reclaim_state;
> -       unsigned long nr_soft_reclaimed;
> -       unsigned long nr_soft_scanned;
>        struct scan_control sc = {
>                .gfp_mask = GFP_KERNEL,
>                .may_unmap = 1,
> @@ -2701,17 +2650,6 @@ loop_again:
>                                continue;
>
>                        sc.nr_scanned = 0;
> -
> -                       nr_soft_scanned = 0;
> -                       /*
> -                        * Call soft limit reclaim before calling shrink_zone.
> -                        */
> -                       nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
> -                                                       order, sc.gfp_mask,
> -                                                       &nr_soft_scanned);
> -                       sc.nr_reclaimed += nr_soft_reclaimed;
> -                       total_scanned += nr_soft_scanned;
> -
>                        /*
>                         * We put equal pressure on every zone, unless
>                         * one zone has way too many pages free
> --
> 1.7.7.5
>

--Ying

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
  2012-01-11  0:30     ` Johannes Weiner
@ 2012-01-11 22:33       ` Ying Han
  2012-01-12  9:17         ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Ying Han @ 2012-01-11 22:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
>> Thank you for the patch and the stats looks reasonable to me, few
>> questions as below:
>>
>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > With the single per-zone LRU gone and global reclaim scanning
>> > individual memcgs, it's straight-forward to collect meaningful and
>> > accurate per-memcg reclaim statistics.
>> >
>> > This adds the following items to memory.stat:
>>
>> Some of the previous discussions including patches have similar stats
>> in memory.vmscan_stat API, which collects all the per-memcg vmscan
>> stats. I would like to understand more why we add into memory.stat
>> instead, and do we have plan to keep extending memory.stat for those
>> vmstat like stats?
>
> I think they were put into an extra file in particular to be able to
> write to this file to reset the statistics.  But in my opinion, it's
> trivial to calculate a delta from before and after running a workload,
> so I didn't really like adding kernel code for that.
>
> Did you have another reason for a separate file in mind?

Another reason I had them in separate file is easier to extend. I
don't know if we have plan to have something like memory.vmstat, or
just keep adding stuff into memory.stat. In general, I wanted to keep
the memory.stat being reasonable size including only the basic
statistics. In my existing vmscan_stat path, i have breakdowns of
reclaim stats into file/anon which will make the memory.stat even
larger.

>> > pgreclaim
>>
>> Not sure if we want to keep this more consistent to /proc/vmstat, then
>> it will be "pgsteal"?
>
> The problem with that was that we didn't like to call pages stolen
> when they were reclaimed from within the cgroup, so we had pgfree for
> inner reclaim and pgsteal for outer reclaim, respectively.
>
> I found it cleaner to just go with pgreclaim, it's unambiguous and
> straight-forward.  Outer reclaim is designated by the hierarchy_
> prefix.
>
>> > pgscan
>> >
>> > áNumber of pages reclaimed/scanned from that memcg due to its own
>> > áhard limit (or physical limit in case of the root memcg) by the
>> > áallocating task.
>> >
>> > kswapd_pgreclaim
>> > kswapd_pgscan
>>
>> we have "pgscan_kswapd_*" in vmstat, so maybe ?
>> "pgsteal_kswapd"
>> "pgscan_kswapd"
>>
>> > áReclaim activity from kswapd due to the memcg's own limit. áOnly
>> > áapplicable to the root memcg for now since kswapd is only triggered
>> > áby physical limits, but kswapd-style reclaim based on memcg hard
>> > álimits is being developped.
>> >
>> > hierarchy_pgreclaim
>> > hierarchy_pgscan
>> > hierarchy_kswapd_pgreclaim
>> > hierarchy_kswapd_pgscan
>>
>> "pgsteal_hierarchy"
>> "pgsteal_kswapd_hierarchy"
>> ..
>>
>> No strong option on the naming, but try to make it more consistent to
>> existing API.
>
> I swear I tried, but the existing naming is pretty screwed up :(
>
> For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> number of pages reclaimed, you sum them up.
>
> On the other hand, pgsteal_* does not differentiate between direct
> reclaim and kswapd, so to get direct reclaim numbers, you add up the
> pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
> which is in turn not available at zone granularity.

agree and that always confuses me.

>
>> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
>> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
>
> These two function as namespaces, that's why I put hierarchy_ and
> kswapd_ at the beginning of the names.
>
> Given that we have kswapd_steal, would you be okay with doing it like
> this?  I mean, at least my naming conforms to ONE of the standards in
> /proc/vmstat, right? ;-)

I don't have much problem with the existing naming scheme, as long as
we well document it and make it less confusing.
>
>> > @@ -91,12 +91,23 @@ enum mem_cgroup_stat_index {
>> > á á á áMEM_CGROUP_STAT_NSTATS,
>> > á};
>> >
>> > +#define MEM_CGROUP_EVENTS_KSWAPD 2
>> > +#define MEM_CGROUP_EVENTS_HIERARCHY 4
>> > +
>> > áenum mem_cgroup_events_index {
>> > á á á áMEM_CGROUP_EVENTS_PGPGIN, á á á /* # of pages paged in */
>> > á á á áMEM_CGROUP_EVENTS_PGPGOUT, á á á/* # of pages paged out */
>> > á á á áMEM_CGROUP_EVENTS_COUNT, á á á á/* # of pages paged in/out */
>> > á á á áMEM_CGROUP_EVENTS_PGFAULT, á á á/* # of page-faults */
>> > á á á áMEM_CGROUP_EVENTS_PGMAJFAULT, á /* # of major page-faults */
>> > + á á á MEM_CGROUP_EVENTS_PGRECLAIM,
>> > + á á á MEM_CGROUP_EVENTS_PGSCAN,
>> > + á á á MEM_CGROUP_EVENTS_KSWAPD_PGRECLAIM,
>> > + á á á MEM_CGROUP_EVENTS_KSWAPD_PGSCAN,
>> > + á á á MEM_CGROUP_EVENTS_HIERARCHY_PGRECLAIM,
>> > + á á á MEM_CGROUP_EVENTS_HIERARCHY_PGSCAN,
>> > + á á á MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGRECLAIM,
>> > + á á á MEM_CGROUP_EVENTS_HIERARCHY_KSWAPD_PGSCAN,
>>
>> missing comment here?
>
> As if the lines weren't long enough already ;-) I'll add some.

Thanks.
>
>> > á á á áMEM_CGROUP_EVENTS_NSTATS,
>> > á};
>> > á/*
>> > @@ -889,6 +900,38 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>> > á á á áreturn (memcg == root_mem_cgroup);
>> > á}
>> >
>> > +/**
>> > + * mem_cgroup_account_reclaim - update per-memcg reclaim statistics
>> > + * @root: memcg that triggered reclaim
>> > + * @memcg: memcg that is actually being scanned
>> > + * @nr_reclaimed: number of pages reclaimed from @memcg
>> > + * @nr_scanned: number of pages scanned from @memcg
>> > + * @kswapd: whether reclaiming task is kswapd or allocator itself
>> > + */
>> > +void mem_cgroup_account_reclaim(struct mem_cgroup *root,
>> > + á á á á á á á á á á á á á á á struct mem_cgroup *memcg,
>> > + á á á á á á á á á á á á á á á unsigned long nr_reclaimed,
>> > + á á á á á á á á á á á á á á á unsigned long nr_scanned,
>> > + á á á á á á á á á á á á á á á bool kswapd)
>> > +{
>> > + á á á unsigned int offset = 0;
>> > +
>> > + á á á if (!root)
>> > + á á á á á á á root = root_mem_cgroup;
>> > +
>> > + á á á if (kswapd)
>> > + á á á á á á á offset += MEM_CGROUP_EVENTS_KSWAPD;
>> > + á á á if (root != memcg)
>> > + á á á á á á á offset += MEM_CGROUP_EVENTS_HIERARCHY;
>>
>> Just to be clear, here root cgroup has hierarchy_* stats always 0 ?
>
> That's correct, there can't be any hierarchical pressure on the
> topmost parent.

Thank you for clarifying.

>
>> Also, we might want to consider renaming the root here, something like
>> target? The root is confusing with root_mem_cgroup.
>
> It's the same naming scheme I used for the iterator functions
> (mem_cgroup_iter() and friends), so if we change it, I'd like to
> change it consistently.

That sounds good, and the change is separate from this effort.

>
> Having target and memcg as parameters is even more confusing and
> non-descriptive, IMO.
>
> Other places use mem_over_limit, which is a bit better, but quite
> long.
>
> Any other ideas for great names for parameters that designate a
> hierarchy root and a memcg in that hierarchy?

I don't have better name other than "target", which matches the naming
in scan_control as well. Or in this case, we can avoid passing both
target and memcg by doing something like:

+static inline void mem_cgroup_account_reclaim(
+                                             struct mem_cgroup *memcg,
+                                             unsigned long nr_reclaimed,
+                                             unsigned long nr_scanned,
+                                             bool kswapd,
+                                             bool hierarchy)
+{
+}
+

+               mem_cgroup_account_reclaim(victim, nr_reclaimed,
+                                          nr_scanned, current_is_kswapd(),
+                                          target != victim);

then we need to do something on the root_mem_cgroup before that. Just a thought.

--Ying

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-10 15:02 ` [patch 2/2] mm: memcg: hierarchical soft limit reclaim Johannes Weiner
  2012-01-11 21:42   ` Ying Han
@ 2012-01-12  1:54   ` KAMEZAWA Hiroyuki
  2012-01-13 12:16     ` Johannes Weiner
  2012-01-13 12:04   ` Michal Hocko
  2 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-12  1:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Balbir Singh, Ying Han, cgroups,
	linux-mm, linux-kernel

On Tue, 10 Jan 2012 16:02:52 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.  The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
> 
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
> 
> Advantages:
> 
>   o smoother reclaim: soft limit reclaim is a separate stage before
>     global reclaim, whose result is not communicated down the line and
>     so overreclaim of the groups in excess is very likely.  After this
>     patch, soft limit reclaim is fully integrated into regular reclaim
>     and each memcg is considered exactly once per cycle.
> 
>   o true hierarchy support: soft limits are only considered when
>     kswapd does global reclaim, but after this patch, targetted
>     reclaim of a memcg will mind the soft limit settings of its child
>     groups.
> 
>   o code size: soft limit reclaim requires a lot of code to maintain
>     the per-node per-zone rb-trees to quickly find the biggest
>     offender, dedicated paths for soft limit reclaim etc. while this
>     new implementation gets away without all that.
> 
> Test:
> 
> The test consists of two concurrent kernel build jobs in separate
> source trees, the master and the slave.  The two jobs get along nicely
> on 600MB of available memory, so this is the zero overcommit control
> case.  When available memory is decreased, the overcommit is
> compensated by decreasing the soft limit of the slave by the same
> amount, in the hope that the slave takes the hit and the master stays
> unaffected.
> 
>                                     600M-0M-vanilla         600M-0M-patched
> Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
> 
> In the control case, the differences in elapsed time, number of major
> faults taken, and reclaim statistics are within the noise for both the
> master and the slave job.
> 
>                                      600M-280M-vanilla      600M-280M-patched
> Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
> 
> Here, the available memory is limited to 320 MB, the machine is
> overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> of the slave merely 20 MB.
> 
> Looking at the slave job first, it is much better off with the patched
> kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> a third.  The result is much fewer major faults taken, which in turn
> lets the job finish quicker.
> 
> It would be a zero-sum game if the improvement happened at the cost of
> the master but looking at the numbers, even the master performs better
> with the patched kernel.  In fact, the master job is almost unaffected
> on the patched kernel compared to the control case.
> 
> This is an odd phenomenon, as the patch does not directly change how
> the master is reclaimed.  An explanation for this is that the severe
> overreclaim of the slave in the unpatched kernel results in the master
> growing bigger than in the patched case.  Combining the fact that
> memcgs are scanned according to their size with the increased refault
> rate of the overreclaimed slave triggering global reclaim more often
> means that overall pressure on the master job is higher in the
> unpatched kernel.
> 
> At any rate, the patched kernel seems to do a much better job at both
> overall resource allocation under soft limit overcommit as well as the
> requested prioritization of the master job.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you for your work and the result seems atractive and code is much
simpler. My small concerns are..

1. This approach may increase latency of direct-reclaim because of priority=0.
2. In a case numa-spread/interleave application run in its own container, 
   pages on a node may paged-out again and again becasue of priority=0
   if some other application runs in the node.
   It seems difficult to use soft-limit with numa-aware applications.
   Do you have suggestions ?


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-11 21:42   ` Ying Han
@ 2012-01-12  8:59     ` Johannes Weiner
  2012-01-13 21:31       ` Ying Han
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-12  8:59 UTC (permalink / raw)
  To: Ying Han
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Right now, memcg soft limits are implemented by having a sorted tree
> > of memcgs that are in excess of their limits.  Under global memory
> > pressure, kswapd first reclaims from the biggest excessor and then
> > proceeds to do regular global reclaim.  The result of this is that
> > pages are reclaimed from all memcgs, but more scanning happens against
> > those above their soft limit.
> >
> > With global reclaim doing memcg-aware hierarchical reclaim by default,
> > this is a lot easier to implement: everytime a memcg is reclaimed
> > from, scan more aggressively (per tradition with a priority of 0) if
> > it's above its soft limit.  With the same end result of scanning
> > everybody, but soft limit excessors a bit more.
> >
> > Advantages:
> >
> >  o smoother reclaim: soft limit reclaim is a separate stage before
> >    global reclaim, whose result is not communicated down the line and
> >    so overreclaim of the groups in excess is very likely.  After this
> >    patch, soft limit reclaim is fully integrated into regular reclaim
> >    and each memcg is considered exactly once per cycle.
> >
> >  o true hierarchy support: soft limits are only considered when
> >    kswapd does global reclaim, but after this patch, targetted
> >    reclaim of a memcg will mind the soft limit settings of its child
> >    groups.
> 
> Why we add soft limit reclaim into target reclaim?

	-> A hard limit 10G, usage 10G
	   -> A1 soft limit 8G, usage 5G
	   -> A2 soft limit 2G, usage 5G

When A hits its hard limit, A2 will experience more pressure than A1.

Soft limits are already applied hierarchically: the memcg that is
picked from the tree is reclaimed hierarchically.  What I wanted to
add is the soft limit also being /triggerable/ from non-global
hierarchy levels.

> Based on the discussions, my understanding is that the soft limit only
> take effect while the whole machine is under memory contention. We
> don't want to add extra pressure on a cgroup if there is free memory
> on the system even the cgroup is above its limit.

If a hierarchy is under pressure, we will reclaim that hierarchy.  We
allow groups to be prioritized under global pressure, why not allow it
for local pressure as well?

I am not quite sure what you are objecting to.

> >  o code size: soft limit reclaim requires a lot of code to maintain
> >    the per-node per-zone rb-trees to quickly find the biggest
> >    offender, dedicated paths for soft limit reclaim etc. while this
> >    new implementation gets away without all that.
> >
> > Test:
> >
> > The test consists of two concurrent kernel build jobs in separate
> > source trees, the master and the slave.  The two jobs get along nicely
> > on 600MB of available memory, so this is the zero overcommit control
> > case.  When available memory is decreased, the overcommit is
> > compensated by decreasing the soft limit of the slave by the same
> > amount, in the hope that the slave takes the hit and the master stays
> > unaffected.
> >
> >                                    600M-0M-vanilla         600M-0M-patched
> > Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> > Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> > Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> > Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> > Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> > Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> > Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> > Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> > Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> > Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> > Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> > Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> > Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> > Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> > Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> > Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> > Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> > Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> > Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> > Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> > Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> > Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
> >
> > In the control case, the differences in elapsed time, number of major
> > faults taken, and reclaim statistics are within the noise for both the
> > master and the slave job.
> 
> What's the soft limit setting in the controlled case?

300MB for both jobs.

> I assume it is the default RESOURCE_MAX. So both Master and Slave get
> equal pressure before/after the patch, and no differences on the stats
> should be observed.

Yes.  The control case demonstrates that both jobs can fit
comfortably, don't compete for space and that in general the patch
does not have unexpected negative impact (after all, it modifies
codepaths that were invoked regularly outside of reclaim).

> >                                     600M-280M-vanilla      600M-280M-patched
> > Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> > Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> > Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> > Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> > Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> > Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> > Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> > Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> > Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> > Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> > Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> > Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> > Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> > Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> > Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
> >
> > Here, the available memory is limited to 320 MB, the machine is
> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> > of the slave merely 20 MB.
> >
> > Looking at the slave job first, it is much better off with the patched
> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> > a third.  The result is much fewer major faults taken, which in turn
> > lets the job finish quicker.
> 
> What's the setting of the hard limit here? Is the direct reclaim
> referring to per-memcg directly reclaim or global one.

The machine's memory is limited to 600M, the hard limits are unset.
All reclaim is a result of global memory pressure.

With the patched kernel, I could have used a dedicated parent cgroup
and let master and slave run in children of this group, the soft
limits would be taken into account just the same.  But this does not
work on the unpatched kernel, as soft limits are only recognized on
the global level there.

> > It would be a zero-sum game if the improvement happened at the cost of
> > the master but looking at the numbers, even the master performs better
> > with the patched kernel.  In fact, the master job is almost unaffected
> > on the patched kernel compared to the control case.
> 
> It makes sense since the master job get less affected by the patch
> than the slave job under the example. Under the control case, if both
> master and slave have RESOURCE_MAX soft limit setting, they are under
> equal memory pressure(priority = DEF_PRIORITY) . On the second
> example, only the slave pressure being increased by priority = 0, and
> the Master got scanned with same priority = DEF_PRIORITY pretty much.
> 
> So I would expect to see more reclaim activities happens in slave on
> the patched kernel compared to the control case. It seems match the
> testing result.

Uhm,

> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)

Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%.

> > This is an odd phenomenon, as the patch does not directly change how
> > the master is reclaimed.  An explanation for this is that the severe
> > overreclaim of the slave in the unpatched kernel results in the master
> > growing bigger than in the patched case.  Combining the fact that
> > memcgs are scanned according to their size with the increased refault
> > rate of the overreclaimed slave triggering global reclaim more often
> > means that overall pressure on the master job is higher in the
> > unpatched kernel.
> 
> We can check the Master memory.usage_in_bytes while the job is running.

Yep, the plots of cache/rss over time confirmed exactly this.  The
unpatched kernel shows higher spikes in the size of the master job
followed by deeper pits when reclaim kicked in.  The patched kernel is
much smoother in that regard.

> On the other hand, I don't see why we expect the Master being less
> reclaimed in the controlled case? On the unpatched kernel, the Master
> is being reclaimed under global pressure each time anyway since we
> ignore the return value of softlimit.

I didn't expect that, I expected both jobs to perform equally in the
control case.  And in the pressurized case, the master being
unaffected and the slave taking the hit.  The patched kernel does
this, the unpatched one does not.

> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
> >                                                      struct zone *zone);
> >  struct zone_reclaim_stat*
> >  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);
> 
> Maybe something like "mem_cgroup_over_soft_limit()" ?

Probably more consistent, yeah.  Will do.

> > @@ -343,7 +314,6 @@ static bool move_file(void)
> >  * limit reclaim to prevent infinite loops, if they ever occur.
> >  */
> >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
> > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
> 
> You might need to remove the comment above as well.

Oops, will fix.

> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >        return margin >> PAGE_SHIFT;
> >  }
> >
> > +/**
> > + * mem_cgroup_over_softlimit
> > + * @root: hierarchy root
> > + * @memcg: child of @root to test
> > + *
> > + * Returns %true if @memcg exceeds its own soft limit or contributes
> > + * to the soft limit excess of one of its parents up to and including
> > + * @root.
> > + */
> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> > +                              struct mem_cgroup *memcg)
> > +{
> > +       if (mem_cgroup_disabled())
> > +               return false;
> > +
> > +       if (!root)
> > +               root = root_mem_cgroup;
> > +
> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > +               /* root_mem_cgroup does not have a soft limit */
> > +               if (memcg == root_mem_cgroup)
> > +                       break;
> > +               if (res_counter_soft_limit_excess(&memcg->res))
> > +                       return true;
> > +               if (memcg == root)
> > +                       break;
> > +       }
> 
> Here it adds pressure on a cgroup if one of its parents exceeds soft
> limit, although the cgroup itself is under soft limit. It does change
> my understanding of soft limit, and might introduce regression of our
> existing use cases.
> 
> Here is an example:
> 
> Machine capacity 32G and we over-commit by 8G.
> 
> root
>   -> A (hard limit 20G, soft limit 15G, usage 16G)
>        -> A1 (soft limit 5G, usage 4G)
>        -> A2 (soft limit 10G, usage 12G)
>   -> B (hard limit 20G, soft limit 10G, usage 16G)
> 
> under global reclaim, we don't want to add pressure on A1 although its
> parent A exceeds its soft limit. Assume that if we set the soft limit
> corresponding to each cgroup's working set size (hot memory), and it
> will introduce regression to A1 in that case.
> 
> In my existing implementation, i am checking the cgroup's soft limit
> standalone w/o looking its ancestors.

Why do you set the soft limit of A in the first place if you don't
want it to be enforced?

This is not really new behaviour, soft limit reclaim has always been
operating hierarchically on the biggest excessor.  In your case, the
excess of A is smaller than the excess of A2 and so that weird "only
pick the biggest excessor" behaviour hides it, but consider this:

	-> A soft 30G, usage 39G
	   -> A1 soft 5G, usage 4G
	   -> A2 soft 10G, usage 15G
	   -> A3 soft 15G, usage 20G

Upstream would pick A from the soft limit tree and reclaim its
children with priority 0, including A1.

On the other hand, if you don't consider ancestral soft limits, you
break perfectly reasonable setups like these

	-> A soft 10G, usage 20G
	   -> A1 usage 10G
	   -> A2 usage 10G
	-> B soft 10G, usage 11G

where upstream would pick A and reclaim it recursively, but your
version would only apply higher pressure to B.

If you would just not set the soft limit of A in your case:

	-> A (hard limit 20G, usage 16G)
           -> A1 (soft limit 5G, usage 4G)
           -> A2 (soft limit 10G, usage 12G)
	-> B (hard limit 20G, soft limit 10G, usage 16G)

only A2 and B would experience higher pressure upon global pressure.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/2] mm: memcg: per-memcg reclaim statistics
  2012-01-11 22:33       ` Ying Han
@ 2012-01-12  9:17         ` Johannes Weiner
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2012-01-12  9:17 UTC (permalink / raw)
  To: Ying Han
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Wed, Jan 11, 2012 at 02:33:59PM -0800, Ying Han wrote:
> On Tue, Jan 10, 2012 at 4:30 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Jan 10, 2012 at 03:54:05PM -0800, Ying Han wrote:
> >> Thank you for the patch and the stats looks reasonable to me, few
> >> questions as below:
> >>
> >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> > With the single per-zone LRU gone and global reclaim scanning
> >> > individual memcgs, it's straight-forward to collect meaningful and
> >> > accurate per-memcg reclaim statistics.
> >> >
> >> > This adds the following items to memory.stat:
> >>
> >> Some of the previous discussions including patches have similar stats
> >> in memory.vmscan_stat API, which collects all the per-memcg vmscan
> >> stats. I would like to understand more why we add into memory.stat
> >> instead, and do we have plan to keep extending memory.stat for those
> >> vmstat like stats?
> >
> > I think they were put into an extra file in particular to be able to
> > write to this file to reset the statistics.  But in my opinion, it's
> > trivial to calculate a delta from before and after running a workload,
> > so I didn't really like adding kernel code for that.
> >
> > Did you have another reason for a separate file in mind?
> 
> Another reason I had them in separate file is easier to extend. I
> don't know if we have plan to have something like memory.vmstat, or
> just keep adding stuff into memory.stat. In general, I wanted to keep
> the memory.stat being reasonable size including only the basic
> statistics. In my existing vmscan_stat path, i have breakdowns of
> reclaim stats into file/anon which will make the memory.stat even
> larger.

Do you think it's a problem of presentation, where we want to allow
admins to figure out the memcg parameters at a glance when looking at
memory.stat but be able to debug malfunction by looking at the more
extensive vmstat file?

> >> > áReclaim activity from kswapd due to the memcg's own limit. áOnly
> >> > áapplicable to the root memcg for now since kswapd is only triggered
> >> > áby physical limits, but kswapd-style reclaim based on memcg hard
> >> > álimits is being developped.
> >> >
> >> > hierarchy_pgreclaim
> >> > hierarchy_pgscan
> >> > hierarchy_kswapd_pgreclaim
> >> > hierarchy_kswapd_pgscan
> >>
> >> "pgsteal_hierarchy"
> >> "pgsteal_kswapd_hierarchy"
> >> ..
> >>
> >> No strong option on the naming, but try to make it more consistent to
> >> existing API.
> >
> > I swear I tried, but the existing naming is pretty screwed up :(
> >
> > For example, pgscan_direct_* and pgscan_kswapd_* allow you to compare
> > scan rates of direct reclaim vs. kswapd reclaim.  To get the total
> > number of pages reclaimed, you sum them up.
> >
> > On the other hand, pgsteal_* does not differentiate between direct
> > reclaim and kswapd, so to get direct reclaim numbers, you add up the
> > pgsteal_* counters and subtract kswapd_steal (notice the lack of pg?),
> > which is in turn not available at zone granularity.
> 
> agree and that always confuses me.

I just have scripts that present it as 'Direct page reclaimed' and
'Kswapd page reclaimed' when evaluating data so I don't have to
remember anymore :-)

But I think the wish for consistency is a bit misguided when we end up
with something like pgpgin that means something completely different
in memcg than it does on the global level.  Likewise, I don't want to
use pgsteal_* and pgsteal_kswapd_* because of their similarity to
/proc/vmstat while the numbers represent something different.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-10 15:02 ` [patch 2/2] mm: memcg: hierarchical soft limit reclaim Johannes Weiner
  2012-01-11 21:42   ` Ying Han
  2012-01-12  1:54   ` KAMEZAWA Hiroyuki
@ 2012-01-13 12:04   ` Michal Hocko
  2012-01-13 15:50     ` Johannes Weiner
  2 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2012-01-13 12:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han,
	cgroups, linux-mm, linux-kernel

On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> Right now, memcg soft limits are implemented by having a sorted tree
> of memcgs that are in excess of their limits.  Under global memory
> pressure, kswapd first reclaims from the biggest excessor and then
> proceeds to do regular global reclaim.  The result of this is that
> pages are reclaimed from all memcgs, but more scanning happens against
> those above their soft limit.
> 
> With global reclaim doing memcg-aware hierarchical reclaim by default,
> this is a lot easier to implement: everytime a memcg is reclaimed
> from, scan more aggressively (per tradition with a priority of 0) if
> it's above its soft limit.  With the same end result of scanning
> everybody, but soft limit excessors a bit more.
> 
> Advantages:
> 
>   o smoother reclaim: soft limit reclaim is a separate stage before
>     global reclaim, whose result is not communicated down the line and
>     so overreclaim of the groups in excess is very likely.  After this
>     patch, soft limit reclaim is fully integrated into regular reclaim
>     and each memcg is considered exactly once per cycle.
> 
>   o true hierarchy support: soft limits are only considered when
>     kswapd does global reclaim, but after this patch, targetted
>     reclaim of a memcg will mind the soft limit settings of its child
>     groups.

Yes it makes sense. At first I was thinking that soft limit should be
considered only under global mem. pressure (at least documentation says
so) but now it makes sense.
We can push on over-soft limit groups more because they told us they
could sacrifice something...  Anyway documentation needs an update as
well.

But we have to be little bit careful here. I am still quite confuses how
we should handle hierarchies vs. subtrees. See bellow.

> 
>   o code size: soft limit reclaim requires a lot of code to maintain
>     the per-node per-zone rb-trees to quickly find the biggest
>     offender, dedicated paths for soft limit reclaim etc. while this
>     new implementation gets away without all that.

on my i386 pae setup (including swap extension enabled):
Before
   text    data     bss     dec     hex filename
    310086   29970   35372  375428   5ba84 mm/built-in.o
After
size mm/built-in.o 
   text    data     bss     dec     hex filename
    309048   30030   35372  374450   5b6b2 mm/built-in.o

I would expect a bigger difference but still good.

> Test:

Will look into results later.
[...]

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |   18 +--
>  mm/memcontrol.c            |  412 ++++----------------------------------------
>  mm/vmscan.c                |   80 +--------
>  3 files changed, 48 insertions(+), 462 deletions(-)

Really nice to see

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 170dff4..d4f7ae5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  	return margin >> PAGE_SHIFT;
>  }
>  
> +/**
> + * mem_cgroup_over_softlimit
> + * @root: hierarchy root
> + * @memcg: child of @root to test
> + *
> + * Returns %true if @memcg exceeds its own soft limit or contributes
> + * to the soft limit excess of one of its parents up to and including
> + * @root.
> + */
> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> +			       struct mem_cgroup *memcg)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!root)
> +		root = root_mem_cgroup;
> +
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		/* root_mem_cgroup does not have a soft limit */
> +		if (memcg == root_mem_cgroup)
> +			break;
> +		if (res_counter_soft_limit_excess(&memcg->res))
> +			return true;
> +		if (memcg == root)
> +			break;
> +	}
> +	return false;
> +}

Well, this might be little bit tricky. We do not check whether memcg and
root are in a hierarchy (in terms of use_hierarchy) relation. 

If we are under global reclaim then we iterate over all memcgs and so
there is no guarantee that there is a hierarchical relation between the
given memcg and its parent. While, on the other hand, if we are doing
memcg reclaim then we have this guarantee.

Why should we punish a group (subtree) which is perfectly under its soft
limit just because some other subtree contributes to the common parent's
usage and makes it over its limit?
Should we check memcg->use_hierarchy here?

Does it even makes sense to setup soft limit on a parent group without
hierarchies?
Well I have to admit that hierarchies makes me headache.

> +
>  int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
>  	struct cgroup *cgrp = memcg->css.cgroup;
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e3fd8a7..4279549 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>  			.mem_cgroup = memcg,
>  			.zone = zone,
>  		};
> +		int epriority = priority;
> +		/*
> +		 * Put more pressure on hierarchies that exceed their
> +		 * soft limit, to push them back harder than their
> +		 * well-behaving siblings.
> +		 */
> +		if (mem_cgroup_over_softlimit(root, memcg))
> +			epriority = 0;

This sounds too aggressive to me. Shouldn't we just double the pressure
or something like that?
Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did
memcg reclaim but this is not the case now. For the kswapd we have
nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early
and we have to scan a lot.
Direct reclaim (shrink or hard limit) shouldn't be affected here.

>  
> -		shrink_mem_cgroup_zone(priority, &mz, sc);
> +		shrink_mem_cgroup_zone(epriority, &mz, sc);
>  
>  		mem_cgroup_account_reclaim(root, memcg,
>  					   sc->nr_reclaimed - nr_reclaimed,

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-12  1:54   ` KAMEZAWA Hiroyuki
@ 2012-01-13 12:16     ` Johannes Weiner
  2012-01-18  5:26       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-13 12:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Michal Hocko, Balbir Singh, Ying Han, cgroups,
	linux-mm, linux-kernel

On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 10 Jan 2012 16:02:52 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Right now, memcg soft limits are implemented by having a sorted tree
> > of memcgs that are in excess of their limits.  Under global memory
> > pressure, kswapd first reclaims from the biggest excessor and then
> > proceeds to do regular global reclaim.  The result of this is that
> > pages are reclaimed from all memcgs, but more scanning happens against
> > those above their soft limit.
> > 
> > With global reclaim doing memcg-aware hierarchical reclaim by default,
> > this is a lot easier to implement: everytime a memcg is reclaimed
> > from, scan more aggressively (per tradition with a priority of 0) if
> > it's above its soft limit.  With the same end result of scanning
> > everybody, but soft limit excessors a bit more.
> > 
> > Advantages:
> > 
> >   o smoother reclaim: soft limit reclaim is a separate stage before
> >     global reclaim, whose result is not communicated down the line and
> >     so overreclaim of the groups in excess is very likely.  After this
> >     patch, soft limit reclaim is fully integrated into regular reclaim
> >     and each memcg is considered exactly once per cycle.
> > 
> >   o true hierarchy support: soft limits are only considered when
> >     kswapd does global reclaim, but after this patch, targetted
> >     reclaim of a memcg will mind the soft limit settings of its child
> >     groups.
> > 
> >   o code size: soft limit reclaim requires a lot of code to maintain
> >     the per-node per-zone rb-trees to quickly find the biggest
> >     offender, dedicated paths for soft limit reclaim etc. while this
> >     new implementation gets away without all that.
> > 
> > Test:
> > 
> > The test consists of two concurrent kernel build jobs in separate
> > source trees, the master and the slave.  The two jobs get along nicely
> > on 600MB of available memory, so this is the zero overcommit control
> > case.  When available memory is decreased, the overcommit is
> > compensated by decreasing the soft limit of the slave by the same
> > amount, in the hope that the slave takes the hit and the master stays
> > unaffected.
> > 
> >                                     600M-0M-vanilla         600M-0M-patched
> > Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
> > Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
> > Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
> > Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
> > Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
> > Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
> > Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
> > Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
> > Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
> > Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
> > Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
> > Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
> > Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
> > Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
> > Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
> > Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
> > Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
> > Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
> > Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
> > Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
> > Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
> > Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
> > Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
> > 
> > In the control case, the differences in elapsed time, number of major
> > faults taken, and reclaim statistics are within the noise for both the
> > master and the slave job.
> > 
> >                                      600M-280M-vanilla      600M-280M-patched
> > Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
> > Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
> > Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
> > Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
> > Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
> > Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
> > Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
> > Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
> > Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
> > Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
> > Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
> > Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
> > Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
> > Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
> > Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
> > 
> > Here, the available memory is limited to 320 MB, the machine is
> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
> > of the slave merely 20 MB.
> > 
> > Looking at the slave job first, it is much better off with the patched
> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
> > a third.  The result is much fewer major faults taken, which in turn
> > lets the job finish quicker.
> > 
> > It would be a zero-sum game if the improvement happened at the cost of
> > the master but looking at the numbers, even the master performs better
> > with the patched kernel.  In fact, the master job is almost unaffected
> > on the patched kernel compared to the control case.
> > 
> > This is an odd phenomenon, as the patch does not directly change how
> > the master is reclaimed.  An explanation for this is that the severe
> > overreclaim of the slave in the unpatched kernel results in the master
> > growing bigger than in the patched case.  Combining the fact that
> > memcgs are scanned according to their size with the increased refault
> > rate of the overreclaimed slave triggering global reclaim more often
> > means that overall pressure on the master job is higher in the
> > unpatched kernel.
> > 
> > At any rate, the patched kernel seems to do a much better job at both
> > overall resource allocation under soft limit overcommit as well as the
> > requested prioritization of the master job.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Thank you for your work and the result seems atractive and code is much
> simpler. My small concerns are..
> 
> 1. This approach may increase latency of direct-reclaim because of priority=0.

I think strictly speaking yes, but note that with kswapd being less
likely to get stuck in hammering on one group, the need for allocators
to enter direct reclaim itself is reduced.

However, if this really becomes a problem in real world loads, the fix
is pretty easy: just ignore the soft limit for direct reclaim.  We can
still consider it from hard limit reclaim and kswapd.

> 2. In a case numa-spread/interleave application run in its own container, 
>    pages on a node may paged-out again and again becasue of priority=0
>    if some other application runs in the node.
>    It seems difficult to use soft-limit with numa-aware applications.
>    Do you have suggestions ?

This is a question about soft limits in general rather than about this
particular patch, right?

And if I understand correctly, the problem you are referring to is
this: an application and parts of a soft-limited container share a
node, the soft limit setting means that the container's pages on that
node are reclaimed harder.  At that point, the container's share on
that node becomes tiny, but since the soft limit is oblivious to
nodes, the expansion of the other application pushes the soft-limited
container off that node completely as long as the container stays
above its soft limit with the usage on other nodes.

What would you think about having node-local soft limits that take the
node size into account?

	local_soft_limit = soft_limit * node_size / memcg_size

The soft limit can be exceeded globally, but the container is no
longer pushed off a node on which it's only occupying a small share of
memory.

Putting it into proportion of the memcg size, not overall memory size
has the following advantages:

  1. if the container is sitting on only one of several available
  nodes without exceeding the limit globally, the memcg will not be
  reclaimed harder just because it has a relatively large share of the
  node.

  2. if the soft limit excess is ridiculously high, the local soft
  limits will be pushed down, so the tolerance for smaller shares on
  nodes goes down in proportion to the global soft limit excess.

Example:

	4G soft limit * 2G node / 4G container = 2G node-local limit

The container is globally within its soft limit, so the local limit is
at least the size of the node.  It's never reclaimed harder compared
to other applications on the node.

	4G soft limit * 2G node / 5G container = ~1.6G node-local limit

Here, it will experience more pressure initially, but it will level
off when the shrinking usage and the thereby increasing node-local
soft limit meet.  From that point on, the container and the competing
application will be treated equally during reclaim.

Finally, if the container is 16G in size, i.e. 300% in excess, the
per-node tolerance is at 512M node-local soft limit, which IMO strikes
a good balance between zero tolerance and still applying some stress
to the hugely oversized container when other applications (with
virtually unlimited soft limits) want to run on the same node.

What do you think?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 12:04   ` Michal Hocko
@ 2012-01-13 15:50     ` Johannes Weiner
  2012-01-13 16:34       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-13 15:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han,
	cgroups, linux-mm, linux-kernel

On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
> On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> > Right now, memcg soft limits are implemented by having a sorted tree
> > of memcgs that are in excess of their limits.  Under global memory
> > pressure, kswapd first reclaims from the biggest excessor and then
> > proceeds to do regular global reclaim.  The result of this is that
> > pages are reclaimed from all memcgs, but more scanning happens against
> > those above their soft limit.
> > 
> > With global reclaim doing memcg-aware hierarchical reclaim by default,
> > this is a lot easier to implement: everytime a memcg is reclaimed
> > from, scan more aggressively (per tradition with a priority of 0) if
> > it's above its soft limit.  With the same end result of scanning
> > everybody, but soft limit excessors a bit more.
> > 
> > Advantages:
> > 
> >   o smoother reclaim: soft limit reclaim is a separate stage before
> >     global reclaim, whose result is not communicated down the line and
> >     so overreclaim of the groups in excess is very likely.  After this
> >     patch, soft limit reclaim is fully integrated into regular reclaim
> >     and each memcg is considered exactly once per cycle.
> > 
> >   o true hierarchy support: soft limits are only considered when
> >     kswapd does global reclaim, but after this patch, targetted
> >     reclaim of a memcg will mind the soft limit settings of its child
> >     groups.
> 
> Yes it makes sense. At first I was thinking that soft limit should be
> considered only under global mem. pressure (at least documentation says
> so) but now it makes sense.
> We can push on over-soft limit groups more because they told us they
> could sacrifice something...  Anyway documentation needs an update as
> well.

You are right, I'll look into it.

> But we have to be little bit careful here. I am still quite confuses how
> we should handle hierarchies vs. subtrees. See bellow.

> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >  	return margin >> PAGE_SHIFT;
> >  }
> >  
> > +/**
> > + * mem_cgroup_over_softlimit
> > + * @root: hierarchy root
> > + * @memcg: child of @root to test
> > + *
> > + * Returns %true if @memcg exceeds its own soft limit or contributes
> > + * to the soft limit excess of one of its parents up to and including
> > + * @root.
> > + */
> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> > +			       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +
> > +	if (!root)
> > +		root = root_mem_cgroup;
> > +
> > +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > +		/* root_mem_cgroup does not have a soft limit */
> > +		if (memcg == root_mem_cgroup)
> > +			break;
> > +		if (res_counter_soft_limit_excess(&memcg->res))
> > +			return true;
> > +		if (memcg == root)
> > +			break;
> > +	}
> > +	return false;
> > +}
> 
> Well, this might be little bit tricky. We do not check whether memcg and
> root are in a hierarchy (in terms of use_hierarchy) relation. 
> 
> If we are under global reclaim then we iterate over all memcgs and so
> there is no guarantee that there is a hierarchical relation between the
> given memcg and its parent. While, on the other hand, if we are doing
> memcg reclaim then we have this guarantee.
> 
> Why should we punish a group (subtree) which is perfectly under its soft
> limit just because some other subtree contributes to the common parent's
> usage and makes it over its limit?
> Should we check memcg->use_hierarchy here?

We do, actually.  parent_mem_cgroup() checks the res_counter parent,
which is only set when ->use_hierarchy is also set.  The loop should
never walk upwards outside of a hierarchy.

And yes, if you have this:

        A
       / \
      B   C

and configured a soft limit for A, you asked for both B and C to be
responsible when this limit is exceeded, that's not new behaviour.

> Does it even makes sense to setup soft limit on a parent group without
> hierarchies?
> Well I have to admit that hierarchies makes me headache.

There is no parent without a hierarchy.  It is insofar pretty
confusing that you can actually create a directory hierarchy that does
not reflect a memcg hierarchy:

	# pwd
	/sys/fs/cgroup/memory/foo/bar
	# cat memory.usage_in_bytes 
	450560
	# cat ../memory.usage_in_bytes 
	0

there is no accounting/limiting/whatever parent-child relationship
between foo and bar.

> > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
> >  			.mem_cgroup = memcg,
> >  			.zone = zone,
> >  		};
> > +		int epriority = priority;
> > +		/*
> > +		 * Put more pressure on hierarchies that exceed their
> > +		 * soft limit, to push them back harder than their
> > +		 * well-behaving siblings.
> > +		 */
> > +		if (mem_cgroup_over_softlimit(root, memcg))
> > +			epriority = 0;
> 
> This sounds too aggressive to me. Shouldn't we just double the pressure
> or something like that?

That's the historical value.  When I tried priority - 1, it was not
aggressive enough.

> Previously we always had nr_to_reclaim == SWAP_CLUSTER_MAX when we did
> memcg reclaim but this is not the case now. For the kswapd we have
> nr_to_reclaim == ULONG_MAX so we will not break out of the reclaim early
> and we have to scan a lot.
> Direct reclaim (shrink or hard limit) shouldn't be affected here.

It took me a while: we had SWAP_CLUSTER_MAX in _soft limit reclaim_,
which means that even with priority 0 we would bail after reclaiming
SWAP_CLUSTER_MAX from each lru of a zone.  But it's now happening with
kswapd's own scan_control, so the overreclaim protection is gone.

That is indeed a change in behaviour I haven't noticed, good catch!

I will look into it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 15:50     ` Johannes Weiner
@ 2012-01-13 16:34       ` Michal Hocko
  2012-01-13 21:45         ` Ying Han
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2012-01-13 16:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh, Ying Han,
	cgroups, linux-mm, linux-kernel

On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
[...]
> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> > > +			       struct mem_cgroup *memcg)
> > > +{
> > > +	if (mem_cgroup_disabled())
> > > +		return false;
> > > +
> > > +	if (!root)
> > > +		root = root_mem_cgroup;
> > > +
> > > +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > > +		/* root_mem_cgroup does not have a soft limit */
> > > +		if (memcg == root_mem_cgroup)
> > > +			break;
> > > +		if (res_counter_soft_limit_excess(&memcg->res))
> > > +			return true;
> > > +		if (memcg == root)
> > > +			break;
> > > +	}
> > > +	return false;
> > > +}
> > 
> > Well, this might be little bit tricky. We do not check whether memcg and
> > root are in a hierarchy (in terms of use_hierarchy) relation. 
> > 
> > If we are under global reclaim then we iterate over all memcgs and so
> > there is no guarantee that there is a hierarchical relation between the
> > given memcg and its parent. While, on the other hand, if we are doing
> > memcg reclaim then we have this guarantee.
> > 
> > Why should we punish a group (subtree) which is perfectly under its soft
> > limit just because some other subtree contributes to the common parent's
> > usage and makes it over its limit?
> > Should we check memcg->use_hierarchy here?
> 
> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
> which is only set when ->use_hierarchy is also set.  

Of course I am blind.. We do not setup res_counter parent for
!use_hierarchy case. Sorry for noise...
Now it makes much better sense. I was wondering how !use_hierarchy could
ever work, this should be a signal that I am overlooking something
terribly.

[...]
> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
> > >  			.mem_cgroup = memcg,
> > >  			.zone = zone,
> > >  		};
> > > +		int epriority = priority;
> > > +		/*
> > > +		 * Put more pressure on hierarchies that exceed their
> > > +		 * soft limit, to push them back harder than their
> > > +		 * well-behaving siblings.
> > > +		 */
> > > +		if (mem_cgroup_over_softlimit(root, memcg))
> > > +			epriority = 0;
> > 
> > This sounds too aggressive to me. Shouldn't we just double the pressure
> > or something like that?
> 
> That's the historical value.  When I tried priority - 1, it was not
> aggressive enough.

Probably because we want to reclaim too much. Maybe we should do
reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
priority level as Ying suggested in her patchset.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-12  8:59     ` Johannes Weiner
@ 2012-01-13 21:31       ` Ying Han
  2012-01-13 22:44         ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Ying Han @ 2012-01-13 21:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > Right now, memcg soft limits are implemented by having a sorted tree
>> > of memcgs that are in excess of their limits.  Under global memory
>> > pressure, kswapd first reclaims from the biggest excessor and then
>> > proceeds to do regular global reclaim.  The result of this is that
>> > pages are reclaimed from all memcgs, but more scanning happens against
>> > those above their soft limit.
>> >
>> > With global reclaim doing memcg-aware hierarchical reclaim by default,
>> > this is a lot easier to implement: everytime a memcg is reclaimed
>> > from, scan more aggressively (per tradition with a priority of 0) if
>> > it's above its soft limit.  With the same end result of scanning
>> > everybody, but soft limit excessors a bit more.
>> >
>> > Advantages:
>> >
>> >  o smoother reclaim: soft limit reclaim is a separate stage before
>> >    global reclaim, whose result is not communicated down the line and
>> >    so overreclaim of the groups in excess is very likely.  After this
>> >    patch, soft limit reclaim is fully integrated into regular reclaim
>> >    and each memcg is considered exactly once per cycle.
>> >
>> >  o true hierarchy support: soft limits are only considered when
>> >    kswapd does global reclaim, but after this patch, targetted
>> >    reclaim of a memcg will mind the soft limit settings of its child
>> >    groups.
>>
>> Why we add soft limit reclaim into target reclaim?
>
>        -> A hard limit 10G, usage 10G
>           -> A1 soft limit 8G, usage 5G
>           -> A2 soft limit 2G, usage 5G
>
> When A hits its hard limit, A2 will experience more pressure than A1.
>
> Soft limits are already applied hierarchically: the memcg that is
> picked from the tree is reclaimed hierarchically.  What I wanted to
> add is the soft limit also being /triggerable/ from non-global
> hierarchy levels.
>
>> Based on the discussions, my understanding is that the soft limit only
>> take effect while the whole machine is under memory contention. We
>> don't want to add extra pressure on a cgroup if there is free memory
>> on the system even the cgroup is above its limit.
>
> If a hierarchy is under pressure, we will reclaim that hierarchy.  We
> allow groups to be prioritized under global pressure, why not allow it
> for local pressure as well?
>
> I am not quite sure what you are objecting to.

>
>> >  o code size: soft limit reclaim requires a lot of code to maintain
>> >    the per-node per-zone rb-trees to quickly find the biggest
>> >    offender, dedicated paths for soft limit reclaim etc. while this
>> >    new implementation gets away without all that.
>> >
>> > Test:
>> >
>> > The test consists of two concurrent kernel build jobs in separate
>> > source trees, the master and the slave.  The two jobs get along nicely
>> > on 600MB of available memory, so this is the zero overcommit control
>> > case.  When available memory is decreased, the overcommit is
>> > compensated by decreasing the soft limit of the slave by the same
>> > amount, in the hope that the slave takes the hit and the master stays
>> > unaffected.
>> >
>> >                                    600M-0M-vanilla         600M-0M-patched
>> > Master walltime (s)               552.65 (  +0.00%)       552.38 (  -0.05%)
>> > Master walltime (stddev)            1.25 (  +0.00%)         0.92 ( -14.66%)
>> > Master major faults               204.38 (  +0.00%)       205.38 (  +0.49%)
>> > Master major faults (stddev)       27.16 (  +0.00%)        13.80 ( -47.43%)
>> > Master reclaim                     31.88 (  +0.00%)        37.75 ( +17.87%)
>> > Master reclaim (stddev)            34.01 (  +0.00%)        75.88 (+119.59%)
>> > Master scan                        31.88 (  +0.00%)        37.75 ( +17.87%)
>> > Master scan (stddev)               34.01 (  +0.00%)        75.88 (+119.59%)
>> > Master kswapd reclaim           33922.12 (  +0.00%)     33887.12 (  -0.10%)
>> > Master kswapd reclaim (stddev)    969.08 (  +0.00%)       492.22 ( -49.16%)
>> > Master kswapd scan              34085.75 (  +0.00%)     33985.75 (  -0.29%)
>> > Master kswapd scan (stddev)      1101.07 (  +0.00%)       563.33 ( -48.79%)
>> > Slave walltime (s)                552.68 (  +0.00%)       552.12 (  -0.10%)
>> > Slave walltime (stddev)             0.79 (  +0.00%)         1.05 ( +14.76%)
>> > Slave major faults                212.50 (  +0.00%)       204.50 (  -3.75%)
>> > Slave major faults (stddev)        26.90 (  +0.00%)        13.17 ( -49.20%)
>> > Slave reclaim                      26.12 (  +0.00%)        35.00 ( +32.72%)
>> > Slave reclaim (stddev)             29.42 (  +0.00%)        74.91 (+149.55%)
>> > Slave scan                         31.38 (  +0.00%)        35.00 ( +11.20%)
>> > Slave scan (stddev)                33.31 (  +0.00%)        74.91 (+121.24%)
>> > Slave kswapd reclaim            34259.00 (  +0.00%)     33469.88 (  -2.30%)
>> > Slave kswapd reclaim (stddev)     925.15 (  +0.00%)       565.07 ( -38.88%)
>> > Slave kswapd scan               34354.62 (  +0.00%)     33555.75 (  -2.33%)
>> > Slave kswapd scan (stddev)        969.62 (  +0.00%)       581.70 ( -39.97%)
>> >
>> > In the control case, the differences in elapsed time, number of major
>> > faults taken, and reclaim statistics are within the noise for both the
>> > master and the slave job.
>>
>> What's the soft limit setting in the controlled case?
>
> 300MB for both jobs.
>
>> I assume it is the default RESOURCE_MAX. So both Master and Slave get
>> equal pressure before/after the patch, and no differences on the stats
>> should be observed.
>
> Yes.  The control case demonstrates that both jobs can fit
> comfortably, don't compete for space and that in general the patch
> does not have unexpected negative impact (after all, it modifies
> codepaths that were invoked regularly outside of reclaim).
>
>> >                                     600M-280M-vanilla      600M-280M-patched
>> > Master walltime (s)                  595.13 (  +0.00%)      553.19 (  -7.04%)
>> > Master walltime (stddev)               8.31 (  +0.00%)        2.57 ( -61.64%)
>> > Master major faults                 3729.75 (  +0.00%)      783.25 ( -78.98%)
>> > Master major faults (stddev)         258.79 (  +0.00%)      226.68 ( -12.36%)
>> > Master reclaim                       705.00 (  +0.00%)       29.50 ( -95.68%)
>> > Master reclaim (stddev)              232.87 (  +0.00%)       44.72 ( -80.45%)
>> > Master scan                          714.88 (  +0.00%)       30.00 ( -95.67%)
>> > Master scan (stddev)                 237.44 (  +0.00%)       45.39 ( -80.54%)
>> > Master kswapd reclaim                114.75 (  +0.00%)       50.00 ( -55.94%)
>> > Master kswapd reclaim (stddev)       128.51 (  +0.00%)        9.45 ( -91.93%)
>> > Master kswapd scan                   115.75 (  +0.00%)       50.00 ( -56.32%)
>> > Master kswapd scan (stddev)          130.31 (  +0.00%)        9.45 ( -92.04%)
>> > Slave walltime (s)                   631.18 (  +0.00%)      577.68 (  -8.46%)
>> > Slave walltime (stddev)                9.89 (  +0.00%)        3.63 ( -57.47%)
>> > Slave major faults                 28401.75 (  +0.00%)    14656.75 ( -48.39%)
>> > Slave major faults (stddev)         2629.97 (  +0.00%)     1911.81 ( -27.30%)
>> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
>> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
>> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
>> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
>> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
>> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
>> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
>> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
>> >
>> > Here, the available memory is limited to 320 MB, the machine is
>> > overcommitted by 280 MB.  The soft limit of the master is 300 MB, that
>> > of the slave merely 20 MB.
>> >
>> > Looking at the slave job first, it is much better off with the patched
>> > kernel: direct reclaim is almost gone, kswapd reclaim is decreased by
>> > a third.  The result is much fewer major faults taken, which in turn
>> > lets the job finish quicker.
>>
>> What's the setting of the hard limit here? Is the direct reclaim
>> referring to per-memcg directly reclaim or global one.
>
> The machine's memory is limited to 600M, the hard limits are unset.
> All reclaim is a result of global memory pressure.
>
> With the patched kernel, I could have used a dedicated parent cgroup
> and let master and slave run in children of this group, the soft
> limits would be taken into account just the same.  But this does not
> work on the unpatched kernel, as soft limits are only recognized on
> the global level there.
>
>> > It would be a zero-sum game if the improvement happened at the cost of
>> > the master but looking at the numbers, even the master performs better
>> > with the patched kernel.  In fact, the master job is almost unaffected
>> > on the patched kernel compared to the control case.
>>
>> It makes sense since the master job get less affected by the patch
>> than the slave job under the example. Under the control case, if both
>> master and slave have RESOURCE_MAX soft limit setting, they are under
>> equal memory pressure(priority = DEF_PRIORITY) . On the second
>> example, only the slave pressure being increased by priority = 0, and
>> the Master got scanned with same priority = DEF_PRIORITY pretty much.
>>
>> So I would expect to see more reclaim activities happens in slave on
>> the patched kernel compared to the control case. It seems match the
>> testing result.
>
> Uhm,
>
>> > Slave reclaim                      65400.62 (  +0.00%)     1479.62 ( -97.74%)
>> > Slave reclaim (stddev)             11623.02 (  +0.00%)     1482.13 ( -87.24%)
>> > Slave scan                       9050047.88 (  +0.00%)    95968.25 ( -98.94%)
>> > Slave scan (stddev)              1912786.94 (  +0.00%)    93390.71 ( -95.12%)
>> > Slave kswapd reclaim              327894.50 (  +0.00%)   227099.88 ( -30.74%)
>> > Slave kswapd reclaim (stddev)      22289.43 (  +0.00%)    16113.14 ( -27.71%)
>> > Slave kswapd scan               34987335.75 (  +0.00%)  1362367.12 ( -96.11%)
>> > Slave kswapd scan (stddev)       2523642.98 (  +0.00%)   156754.74 ( -93.79%)
>
> Direct reclaim _shrunk_ by 98%, kswapd reclaim by 31%.
>
>> > This is an odd phenomenon, as the patch does not directly change how
>> > the master is reclaimed.  An explanation for this is that the severe
>> > overreclaim of the slave in the unpatched kernel results in the master
>> > growing bigger than in the patched case.  Combining the fact that
>> > memcgs are scanned according to their size with the increased refault
>> > rate of the overreclaimed slave triggering global reclaim more often
>> > means that overall pressure on the master job is higher in the
>> > unpatched kernel.
>>
>> We can check the Master memory.usage_in_bytes while the job is running.
>
> Yep, the plots of cache/rss over time confirmed exactly this.  The
> unpatched kernel shows higher spikes in the size of the master job
> followed by deeper pits when reclaim kicked in.  The patched kernel is
> much smoother in that regard.
>
>> On the other hand, I don't see why we expect the Master being less
>> reclaimed in the controlled case? On the unpatched kernel, the Master
>> is being reclaimed under global pressure each time anyway since we
>> ignore the return value of softlimit.
>
> I didn't expect that, I expected both jobs to perform equally in the
> control case.  And in the pressurized case, the master being
> unaffected and the slave taking the hit.  The patched kernel does
> this, the unpatched one does not.
>
>> > @@ -121,6 +121,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
>> >                                                      struct zone *zone);
>> >  struct zone_reclaim_stat*
>> >  mem_cgroup_get_reclaim_stat_from_page(struct page *page);
>> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *, struct mem_cgroup *);
>>
>> Maybe something like "mem_cgroup_over_soft_limit()" ?
>
> Probably more consistent, yeah.  Will do.
>
>> > @@ -343,7 +314,6 @@ static bool move_file(void)
>> >  * limit reclaim to prevent infinite loops, if they ever occur.
>> >  */
>> >  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
>> > -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
>>
>> You might need to remove the comment above as well.
>
> Oops, will fix.
>
>> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>> >        return margin >> PAGE_SHIFT;
>> >  }
>> >
>> > +/**
>> > + * mem_cgroup_over_softlimit
>> > + * @root: hierarchy root
>> > + * @memcg: child of @root to test
>> > + *
>> > + * Returns %true if @memcg exceeds its own soft limit or contributes
>> > + * to the soft limit excess of one of its parents up to and including
>> > + * @root.
>> > + */
>> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> > +                              struct mem_cgroup *memcg)
>> > +{
>> > +       if (mem_cgroup_disabled())
>> > +               return false;
>> > +
>> > +       if (!root)
>> > +               root = root_mem_cgroup;
>> > +
>> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> > +               /* root_mem_cgroup does not have a soft limit */
>> > +               if (memcg == root_mem_cgroup)
>> > +                       break;
>> > +               if (res_counter_soft_limit_excess(&memcg->res))
>> > +                       return true;
>> > +               if (memcg == root)
>> > +                       break;
>> > +       }
>>
>> Here it adds pressure on a cgroup if one of its parents exceeds soft
>> limit, although the cgroup itself is under soft limit. It does change
>> my understanding of soft limit, and might introduce regression of our
>> existing use cases.
>>
>> Here is an example:
>>
>> Machine capacity 32G and we over-commit by 8G.
>>
>> root
>>   -> A (hard limit 20G, soft limit 15G, usage 16G)
>>        -> A1 (soft limit 5G, usage 4G)
>>        -> A2 (soft limit 10G, usage 12G)
>>   -> B (hard limit 20G, soft limit 10G, usage 16G)
>>
>> under global reclaim, we don't want to add pressure on A1 although its
>> parent A exceeds its soft limit. Assume that if we set the soft limit
>> corresponding to each cgroup's working set size (hot memory), and it
>> will introduce regression to A1 in that case.
>>
>> In my existing implementation, i am checking the cgroup's soft limit
>> standalone w/o looking its ancestors.
>
> Why do you set the soft limit of A in the first place if you don't
> want it to be enforced?

The soft limit should be enforced under certain condition, not always.
The soft limit of A is set to be enforced when the parent of A and B
is under memory pressure. For example:

Machine capacity 32G and we over-commit by 8G

root
-> A (hard limit 20G, soft limit 12G, usage 20G)
       -> A1 (soft limit 2G, usage 1G)
       -> A2 (soft limit 10G, usage 19G)
-> B (hard limit 20G, soft limit 10G, usage 0G)

Now, A is under memory pressure since the total usage is hitting its
hard limit. Then we start hierarchical reclaim under A, and each
cgroup under A also takes consideration of soft limit. In this case,
we should only set priority = 0 to A2 since it contributes to A's
charge as well as exceeding its own soft limit. Why punishing A1 (set
priority = 0) also which has usage under its soft limit ? I can
imagine it will introduce regression to existing environment which the
soft limit is set based on the working set size of the cgroup.

To answer the question why we set soft limit to A, it is used to
over-commit the host while sharing the resource with its sibling (B in
this case). If the machine is under memory contention, we would like
to push down memory to A or B depends on their usage and soft limit.

--Ying

>
> This is not really new behaviour, soft limit reclaim has always been
> operating hierarchically on the biggest excessor.  In your case, the
> excess of A is smaller than the excess of A2 and so that weird "only
> pick the biggest excessor" behaviour hides it, but consider this:
>
>        -> A soft 30G, usage 39G
>           -> A1 soft 5G, usage 4G
>           -> A2 soft 10G, usage 15G
>           -> A3 soft 15G, usage 20G
>
> Upstream would pick A from the soft limit tree and reclaim its
> children with priority 0, including A1.
>
> On the other hand, if you don't consider ancestral soft limits, you
> break perfectly reasonable setups like these
>
>        -> A soft 10G, usage 20G
>           -> A1 usage 10G
>           -> A2 usage 10G
>        -> B soft 10G, usage 11G
>
> where upstream would pick A and reclaim it recursively, but your
> version would only apply higher pressure to B.
>
> If you would just not set the soft limit of A in your case:
>
>        -> A (hard limit 20G, usage 16G)
>           -> A1 (soft limit 5G, usage 4G)
>           -> A2 (soft limit 10G, usage 12G)
>        -> B (hard limit 20G, soft limit 10G, usage 16G)
>
> only A2 and B would experience higher pressure upon global pressure.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 16:34       ` Michal Hocko
@ 2012-01-13 21:45         ` Ying Han
  2012-01-18  9:45           ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Ying Han @ 2012-01-13 21:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
>> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
>> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> [...]
>> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> > > +                        struct mem_cgroup *memcg)
>> > > +{
>> > > + if (mem_cgroup_disabled())
>> > > +         return false;
>> > > +
>> > > + if (!root)
>> > > +         root = root_mem_cgroup;
>> > > +
>> > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> > > +         /* root_mem_cgroup does not have a soft limit */
>> > > +         if (memcg == root_mem_cgroup)
>> > > +                 break;
>> > > +         if (res_counter_soft_limit_excess(&memcg->res))
>> > > +                 return true;
>> > > +         if (memcg == root)
>> > > +                 break;
>> > > + }
>> > > + return false;
>> > > +}
>> >
>> > Well, this might be little bit tricky. We do not check whether memcg and
>> > root are in a hierarchy (in terms of use_hierarchy) relation.
>> >
>> > If we are under global reclaim then we iterate over all memcgs and so
>> > there is no guarantee that there is a hierarchical relation between the
>> > given memcg and its parent. While, on the other hand, if we are doing
>> > memcg reclaim then we have this guarantee.
>> >
>> > Why should we punish a group (subtree) which is perfectly under its soft
>> > limit just because some other subtree contributes to the common parent's
>> > usage and makes it over its limit?
>> > Should we check memcg->use_hierarchy here?
>>
>> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
>> which is only set when ->use_hierarchy is also set.
>
> Of course I am blind.. We do not setup res_counter parent for
> !use_hierarchy case. Sorry for noise...
> Now it makes much better sense. I was wondering how !use_hierarchy could
> ever work, this should be a signal that I am overlooking something
> terribly.
>
> [...]
>> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>> > >                   .mem_cgroup = memcg,
>> > >                   .zone = zone,
>> > >           };
>> > > +         int epriority = priority;
>> > > +         /*
>> > > +          * Put more pressure on hierarchies that exceed their
>> > > +          * soft limit, to push them back harder than their
>> > > +          * well-behaving siblings.
>> > > +          */
>> > > +         if (mem_cgroup_over_softlimit(root, memcg))
>> > > +                 epriority = 0;
>> >
>> > This sounds too aggressive to me. Shouldn't we just double the pressure
>> > or something like that?
>>
>> That's the historical value.  When I tried priority - 1, it was not
>> aggressive enough.
>
> Probably because we want to reclaim too much. Maybe we should do
> reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
> priority level as Ying suggested in her patchset.

I plan to post that change on top of this, and this patch set does the
basic stuff to allow us doing further improvement.

I still like the design to skip over_soft_limit cgroups until certain
priority. One way to set up the soft limit for each cgroup is to base
on its actual working set size, and we prefer to punish A first with
lots of page cache ( cold file pages above soft limit) than reclaiming
anon pages from B ( below soft limit ). Unless we can not get enough
pages reclaimed from A, we will start reclaiming from B.

This might not be the ideal solution, but should be a good start. Thoughts?

--Ying

> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 21:31       ` Ying Han
@ 2012-01-13 22:44         ` Johannes Weiner
  2012-01-17 14:22           ` Sha
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-13 22:44 UTC (permalink / raw)
  To: Ying Han
  Cc: Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
> >> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> > @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >> >        return margin >> PAGE_SHIFT;
> >> >  }
> >> >
> >> > +/**
> >> > + * mem_cgroup_over_softlimit
> >> > + * @root: hierarchy root
> >> > + * @memcg: child of @root to test
> >> > + *
> >> > + * Returns %true if @memcg exceeds its own soft limit or contributes
> >> > + * to the soft limit excess of one of its parents up to and including
> >> > + * @root.
> >> > + */
> >> > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >> > +                              struct mem_cgroup *memcg)
> >> > +{
> >> > +       if (mem_cgroup_disabled())
> >> > +               return false;
> >> > +
> >> > +       if (!root)
> >> > +               root = root_mem_cgroup;
> >> > +
> >> > +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> > +               /* root_mem_cgroup does not have a soft limit */
> >> > +               if (memcg == root_mem_cgroup)
> >> > +                       break;
> >> > +               if (res_counter_soft_limit_excess(&memcg->res))
> >> > +                       return true;
> >> > +               if (memcg == root)
> >> > +                       break;
> >> > +       }
> >>
> >> Here it adds pressure on a cgroup if one of its parents exceeds soft
> >> limit, although the cgroup itself is under soft limit. It does change
> >> my understanding of soft limit, and might introduce regression of our
> >> existing use cases.
> >>
> >> Here is an example:
> >>
> >> Machine capacity 32G and we over-commit by 8G.
> >>
> >> root
> >>   -> A (hard limit 20G, soft limit 15G, usage 16G)
> >>        -> A1 (soft limit 5G, usage 4G)
> >>        -> A2 (soft limit 10G, usage 12G)
> >>   -> B (hard limit 20G, soft limit 10G, usage 16G)
> >>
> >> under global reclaim, we don't want to add pressure on A1 although its
> >> parent A exceeds its soft limit. Assume that if we set the soft limit
> >> corresponding to each cgroup's working set size (hot memory), and it
> >> will introduce regression to A1 in that case.
> >>
> >> In my existing implementation, i am checking the cgroup's soft limit
> >> standalone w/o looking its ancestors.
> >
> > Why do you set the soft limit of A in the first place if you don't
> > want it to be enforced?
> 
> The soft limit should be enforced under certain condition, not always.
> The soft limit of A is set to be enforced when the parent of A and B
> is under memory pressure. For example:
> 
> Machine capacity 32G and we over-commit by 8G
> 
> root
> -> A (hard limit 20G, soft limit 12G, usage 20G)
>        -> A1 (soft limit 2G, usage 1G)
>        -> A2 (soft limit 10G, usage 19G)
> -> B (hard limit 20G, soft limit 10G, usage 0G)
> 
> Now, A is under memory pressure since the total usage is hitting its
> hard limit. Then we start hierarchical reclaim under A, and each
> cgroup under A also takes consideration of soft limit. In this case,
> we should only set priority = 0 to A2 since it contributes to A's
> charge as well as exceeding its own soft limit. Why punishing A1 (set
> priority = 0) also which has usage under its soft limit ? I can
> imagine it will introduce regression to existing environment which the
> soft limit is set based on the working set size of the cgroup.
>
> To answer the question why we set soft limit to A, it is used to
> over-commit the host while sharing the resource with its sibling (B in
> this case). If the machine is under memory contention, we would like
> to push down memory to A or B depends on their usage and soft limit.

D'oh, I think the problem is just that we walk up the hierarchy one
too many when checking whether a group exceeds a soft limit.  The soft
limit is a signal to distribute pressure that comes from above, it's
meaningless and should indeed be ignored on the level the pressure
originates from.

Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
but not including root, wouldn't that do exactly what we both want?

Example:

1. If global memory is short, we reclaim with root=root_mem_cgroup.
   A1 and A2 get soft limit reclaimed because of A's soft limit
   excess, just like the current kernel would do.

2. If A hits its hard limit, we reclaim with root=A, so we only mind
   the soft limits of A1 and A2.  A1 is below its soft limit, all
   good.  A2 is above its soft limit, gets treated accordingly.  This
   is new behaviour, the current kernel would just reclaim them
   equally.

Code:

bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
			        struct mem_cgroup *memcg)
{
	if (mem_cgroup_disabled())
		return false;

	if (!root)
		root = root_mem_cgroup;

	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (memcg == root)
			break;
		if (res_counter_soft_limit_excess(&memcg->res))
			return true;
	}
	return false;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 22:44         ` Johannes Weiner
@ 2012-01-17 14:22           ` Sha
  2012-01-17 14:53             ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Sha @ 2012-01-17 14:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On 01/14/2012 06:44 AM, Johannes Weiner wrote:
> On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
>> On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner<hannes@cmpxchg.org>  wrote:
>>> On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>>>> On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner<hannes@cmpxchg.org>  wrote:
>>>>> @@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>>>>>         return margin>>  PAGE_SHIFT;
>>>>>   }
>>>>>
>>>>> +/**
>>>>> + * mem_cgroup_over_softlimit
>>>>> + * @root: hierarchy root
>>>>> + * @memcg: child of @root to test
>>>>> + *
>>>>> + * Returns %true if @memcg exceeds its own soft limit or contributes
>>>>> + * to the soft limit excess of one of its parents up to and including
>>>>> + * @root.
>>>>> + */
>>>>> +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>>>>> +                              struct mem_cgroup *memcg)
>>>>> +{
>>>>> +       if (mem_cgroup_disabled())
>>>>> +               return false;
>>>>> +
>>>>> +       if (!root)
>>>>> +               root = root_mem_cgroup;
>>>>> +
>>>>> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>>>> +               /* root_mem_cgroup does not have a soft limit */
>>>>> +               if (memcg == root_mem_cgroup)
>>>>> +                       break;
>>>>> +               if (res_counter_soft_limit_excess(&memcg->res))
>>>>> +                       return true;
>>>>> +               if (memcg == root)
>>>>> +                       break;
>>>>> +       }
>>>> Here it adds pressure on a cgroup if one of its parents exceeds soft
>>>> limit, although the cgroup itself is under soft limit. It does change
>>>> my understanding of soft limit, and might introduce regression of our
>>>> existing use cases.
>>>>
>>>> Here is an example:
>>>>
>>>> Machine capacity 32G and we over-commit by 8G.
>>>>
>>>> root
>>>>    ->  A (hard limit 20G, soft limit 15G, usage 16G)
>>>>         ->  A1 (soft limit 5G, usage 4G)
>>>>         ->  A2 (soft limit 10G, usage 12G)
>>>>    ->  B (hard limit 20G, soft limit 10G, usage 16G)
>>>>
>>>> under global reclaim, we don't want to add pressure on A1 although its
>>>> parent A exceeds its soft limit. Assume that if we set the soft limit
>>>> corresponding to each cgroup's working set size (hot memory), and it
>>>> will introduce regression to A1 in that case.
>>>>
>>>> In my existing implementation, i am checking the cgroup's soft limit
>>>> standalone w/o looking its ancestors.
>>> Why do you set the soft limit of A in the first place if you don't
>>> want it to be enforced?
>> The soft limit should be enforced under certain condition, not always.
>> The soft limit of A is set to be enforced when the parent of A and B
>> is under memory pressure. For example:
>>
>> Machine capacity 32G and we over-commit by 8G
>>
>> root
>> ->  A (hard limit 20G, soft limit 12G, usage 20G)
>>         ->  A1 (soft limit 2G, usage 1G)
>>         ->  A2 (soft limit 10G, usage 19G)
>> ->  B (hard limit 20G, soft limit 10G, usage 0G)
>>
>> Now, A is under memory pressure since the total usage is hitting its
>> hard limit. Then we start hierarchical reclaim under A, and each
>> cgroup under A also takes consideration of soft limit. In this case,
>> we should only set priority = 0 to A2 since it contributes to A's
>> charge as well as exceeding its own soft limit. Why punishing A1 (set
>> priority = 0) also which has usage under its soft limit ? I can
>> imagine it will introduce regression to existing environment which the
>> soft limit is set based on the working set size of the cgroup.
>>
>> To answer the question why we set soft limit to A, it is used to
>> over-commit the host while sharing the resource with its sibling (B in
>> this case). If the machine is under memory contention, we would like
>> to push down memory to A or B depends on their usage and soft limit.
> D'oh, I think the problem is just that we walk up the hierarchy one
> too many when checking whether a group exceeds a soft limit.  The soft
> limit is a signal to distribute pressure that comes from above, it's
> meaningless and should indeed be ignored on the level the pressure
> originates from.
>
> Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
> but not including root, wouldn't that do exactly what we both want?
>
> Example:
>
> 1. If global memory is short, we reclaim with root=root_mem_cgroup.
>     A1 and A2 get soft limit reclaimed because of A's soft limit
>     excess, just like the current kernel would do.
>
> 2. If A hits its hard limit, we reclaim with root=A, so we only mind
>     the soft limits of A1 and A2.  A1 is below its soft limit, all
>     good.  A2 is above its soft limit, gets treated accordingly.  This
>     is new behaviour, the current kernel would just reclaim them
>     equally.
>
> Code:
>
> bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
> 			        struct mem_cgroup *memcg)
> {
> 	if (mem_cgroup_disabled())
> 		return false;
>
> 	if (!root)
> 		root = root_mem_cgroup;
>
> 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> 		if (memcg == root)
> 			break;
> 		if (res_counter_soft_limit_excess(&memcg->res))
> 			return true;
> 	}
> 	return false;
> }
Hi Johannes,

I don't think it solve the root of the problem, example:
root
-> A (hard limit 20G, soft limit 12G, usage 20G)
     -> A1 ( soft limit 2G,   usage 1G)
     -> A2 ( soft limit 10G, usage 19G)
            ->B1 (soft limit 5G, usage 4G)
            ->B2 (soft limit 5G, usage 15G)

Now A is hitting its hard limit and start hierarchical reclaim under A.
If we choose B1 to go through mem_cgroup_over_soft_limit, it will
return true because its parent A2 has a large usage and will lead to
priority=0 reclaiming. But in fact it should be B2 to be punished.

IMHO, it may checking the cgroup's soft limit standalone without
looking up its ancestors just as Ying said.

Thanks,
Sha

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-17 14:22           ` Sha
@ 2012-01-17 14:53             ` Johannes Weiner
  2012-01-17 20:25               ` Ying Han
       [not found]               ` <CAFj3OHWY2Biw54gaGeH5fkxzgOhxn7NAibeYT_Jmga-_ypNSRg@mail.gmail.com>
  0 siblings, 2 replies; 29+ messages in thread
From: Johannes Weiner @ 2012-01-17 14:53 UTC (permalink / raw)
  To: Sha
  Cc: Ying Han, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote:
> On 01/14/2012 06:44 AM, Johannes Weiner wrote:
> >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
> >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner<hannes@cmpxchg.org>  wrote:
> >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
> >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner<hannes@cmpxchg.org>  wrote:
> >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >>>>>        return margin>>  PAGE_SHIFT;
> >>>>>  }
> >>>>>
> >>>>>+/**
> >>>>>+ * mem_cgroup_over_softlimit
> >>>>>+ * @root: hierarchy root
> >>>>>+ * @memcg: child of @root to test
> >>>>>+ *
> >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes
> >>>>>+ * to the soft limit excess of one of its parents up to and including
> >>>>>+ * @root.
> >>>>>+ */
> >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >>>>>+                              struct mem_cgroup *memcg)
> >>>>>+{
> >>>>>+       if (mem_cgroup_disabled())
> >>>>>+               return false;
> >>>>>+
> >>>>>+       if (!root)
> >>>>>+               root = root_mem_cgroup;
> >>>>>+
> >>>>>+       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >>>>>+               /* root_mem_cgroup does not have a soft limit */
> >>>>>+               if (memcg == root_mem_cgroup)
> >>>>>+                       break;
> >>>>>+               if (res_counter_soft_limit_excess(&memcg->res))
> >>>>>+                       return true;
> >>>>>+               if (memcg == root)
> >>>>>+                       break;
> >>>>>+       }
> >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft
> >>>>limit, although the cgroup itself is under soft limit. It does change
> >>>>my understanding of soft limit, and might introduce regression of our
> >>>>existing use cases.
> >>>>
> >>>>Here is an example:
> >>>>
> >>>>Machine capacity 32G and we over-commit by 8G.
> >>>>
> >>>>root
> >>>>   ->  A (hard limit 20G, soft limit 15G, usage 16G)
> >>>>        ->  A1 (soft limit 5G, usage 4G)
> >>>>        ->  A2 (soft limit 10G, usage 12G)
> >>>>   ->  B (hard limit 20G, soft limit 10G, usage 16G)
> >>>>
> >>>>under global reclaim, we don't want to add pressure on A1 although its
> >>>>parent A exceeds its soft limit. Assume that if we set the soft limit
> >>>>corresponding to each cgroup's working set size (hot memory), and it
> >>>>will introduce regression to A1 in that case.
> >>>>
> >>>>In my existing implementation, i am checking the cgroup's soft limit
> >>>>standalone w/o looking its ancestors.
> >>>Why do you set the soft limit of A in the first place if you don't
> >>>want it to be enforced?
> >>The soft limit should be enforced under certain condition, not always.
> >>The soft limit of A is set to be enforced when the parent of A and B
> >>is under memory pressure. For example:
> >>
> >>Machine capacity 32G and we over-commit by 8G
> >>
> >>root
> >>->  A (hard limit 20G, soft limit 12G, usage 20G)
> >>        ->  A1 (soft limit 2G, usage 1G)
> >>        ->  A2 (soft limit 10G, usage 19G)
> >>->  B (hard limit 20G, soft limit 10G, usage 0G)
> >>
> >>Now, A is under memory pressure since the total usage is hitting its
> >>hard limit. Then we start hierarchical reclaim under A, and each
> >>cgroup under A also takes consideration of soft limit. In this case,
> >>we should only set priority = 0 to A2 since it contributes to A's
> >>charge as well as exceeding its own soft limit. Why punishing A1 (set
> >>priority = 0) also which has usage under its soft limit ? I can
> >>imagine it will introduce regression to existing environment which the
> >>soft limit is set based on the working set size of the cgroup.
> >>
> >>To answer the question why we set soft limit to A, it is used to
> >>over-commit the host while sharing the resource with its sibling (B in
> >>this case). If the machine is under memory contention, we would like
> >>to push down memory to A or B depends on their usage and soft limit.
> >D'oh, I think the problem is just that we walk up the hierarchy one
> >too many when checking whether a group exceeds a soft limit.  The soft
> >limit is a signal to distribute pressure that comes from above, it's
> >meaningless and should indeed be ignored on the level the pressure
> >originates from.
> >
> >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
> >but not including root, wouldn't that do exactly what we both want?
> >
> >Example:
> >
> >1. If global memory is short, we reclaim with root=root_mem_cgroup.
> >    A1 and A2 get soft limit reclaimed because of A's soft limit
> >    excess, just like the current kernel would do.
> >
> >2. If A hits its hard limit, we reclaim with root=A, so we only mind
> >    the soft limits of A1 and A2.  A1 is below its soft limit, all
> >    good.  A2 is above its soft limit, gets treated accordingly.  This
> >    is new behaviour, the current kernel would just reclaim them
> >    equally.
> >
> >Code:
> >
> >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
> >			        struct mem_cgroup *memcg)
> >{
> >	if (mem_cgroup_disabled())
> >		return false;
> >
> >	if (!root)
> >		root = root_mem_cgroup;
> >
> >	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >		if (memcg == root)
> >			break;
> >		if (res_counter_soft_limit_excess(&memcg->res))
> >			return true;
> >	}
> >	return false;
> >}
> Hi Johannes,
> 
> I don't think it solve the root of the problem, example:
> root
> -> A (hard limit 20G, soft limit 12G, usage 20G)
>     -> A1 ( soft limit 2G,   usage 1G)
>     -> A2 ( soft limit 10G, usage 19G)
>            ->B1 (soft limit 5G, usage 4G)
>            ->B2 (soft limit 5G, usage 15G)
> 
> Now A is hitting its hard limit and start hierarchical reclaim under A.
> If we choose B1 to go through mem_cgroup_over_soft_limit, it will
> return true because its parent A2 has a large usage and will lead to
> priority=0 reclaiming. But in fact it should be B2 to be punished.

Because A2 is over its soft limit, the whole hierarchy below it should
be preferred over A1, so both B1 and B2 should be soft limit reclaimed
to be consistent with behaviour at the root level.

> IMHO, it may checking the cgroup's soft limit standalone without
> looking up its ancestors just as Ying said.

Again, this would be a regression as soft limits have been applied
hierarchically forever.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-17 14:53             ` Johannes Weiner
@ 2012-01-17 20:25               ` Ying Han
  2012-01-17 21:56                 ` Johannes Weiner
       [not found]               ` <CAFj3OHWY2Biw54gaGeH5fkxzgOhxn7NAibeYT_Jmga-_ypNSRg@mail.gmail.com>
  1 sibling, 1 reply; 29+ messages in thread
From: Ying Han @ 2012-01-17 20:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Sha, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote:
>> On 01/14/2012 06:44 AM, Johannes Weiner wrote:
>> >On Fri, Jan 13, 2012 at 01:31:16PM -0800, Ying Han wrote:
>> >>On Thu, Jan 12, 2012 at 12:59 AM, Johannes Weiner<hannes@cmpxchg.org>  wrote:
>> >>>On Wed, Jan 11, 2012 at 01:42:31PM -0800, Ying Han wrote:
>> >>>>On Tue, Jan 10, 2012 at 7:02 AM, Johannes Weiner<hannes@cmpxchg.org>  wrote:
>> >>>>>@@ -1318,6 +1123,36 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>> >>>>>        return margin>>  PAGE_SHIFT;
>> >>>>>  }
>> >>>>>
>> >>>>>+/**
>> >>>>>+ * mem_cgroup_over_softlimit
>> >>>>>+ * @root: hierarchy root
>> >>>>>+ * @memcg: child of @root to test
>> >>>>>+ *
>> >>>>>+ * Returns %true if @memcg exceeds its own soft limit or contributes
>> >>>>>+ * to the soft limit excess of one of its parents up to and including
>> >>>>>+ * @root.
>> >>>>>+ */
>> >>>>>+bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> >>>>>+                              struct mem_cgroup *memcg)
>> >>>>>+{
>> >>>>>+       if (mem_cgroup_disabled())
>> >>>>>+               return false;
>> >>>>>+
>> >>>>>+       if (!root)
>> >>>>>+               root = root_mem_cgroup;
>> >>>>>+
>> >>>>>+       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >>>>>+               /* root_mem_cgroup does not have a soft limit */
>> >>>>>+               if (memcg == root_mem_cgroup)
>> >>>>>+                       break;
>> >>>>>+               if (res_counter_soft_limit_excess(&memcg->res))
>> >>>>>+                       return true;
>> >>>>>+               if (memcg == root)
>> >>>>>+                       break;
>> >>>>>+       }
>> >>>>Here it adds pressure on a cgroup if one of its parents exceeds soft
>> >>>>limit, although the cgroup itself is under soft limit. It does change
>> >>>>my understanding of soft limit, and might introduce regression of our
>> >>>>existing use cases.
>> >>>>
>> >>>>Here is an example:
>> >>>>
>> >>>>Machine capacity 32G and we over-commit by 8G.
>> >>>>
>> >>>>root
>> >>>>   ->  A (hard limit 20G, soft limit 15G, usage 16G)
>> >>>>        ->  A1 (soft limit 5G, usage 4G)
>> >>>>        ->  A2 (soft limit 10G, usage 12G)
>> >>>>   ->  B (hard limit 20G, soft limit 10G, usage 16G)
>> >>>>
>> >>>>under global reclaim, we don't want to add pressure on A1 although its
>> >>>>parent A exceeds its soft limit. Assume that if we set the soft limit
>> >>>>corresponding to each cgroup's working set size (hot memory), and it
>> >>>>will introduce regression to A1 in that case.
>> >>>>
>> >>>>In my existing implementation, i am checking the cgroup's soft limit
>> >>>>standalone w/o looking its ancestors.
>> >>>Why do you set the soft limit of A in the first place if you don't
>> >>>want it to be enforced?
>> >>The soft limit should be enforced under certain condition, not always.
>> >>The soft limit of A is set to be enforced when the parent of A and B
>> >>is under memory pressure. For example:
>> >>
>> >>Machine capacity 32G and we over-commit by 8G
>> >>
>> >>root
>> >>->  A (hard limit 20G, soft limit 12G, usage 20G)
>> >>        ->  A1 (soft limit 2G, usage 1G)
>> >>        ->  A2 (soft limit 10G, usage 19G)
>> >>->  B (hard limit 20G, soft limit 10G, usage 0G)
>> >>
>> >>Now, A is under memory pressure since the total usage is hitting its
>> >>hard limit. Then we start hierarchical reclaim under A, and each
>> >>cgroup under A also takes consideration of soft limit. In this case,
>> >>we should only set priority = 0 to A2 since it contributes to A's
>> >>charge as well as exceeding its own soft limit. Why punishing A1 (set
>> >>priority = 0) also which has usage under its soft limit ? I can
>> >>imagine it will introduce regression to existing environment which the
>> >>soft limit is set based on the working set size of the cgroup.
>> >>
>> >>To answer the question why we set soft limit to A, it is used to
>> >>over-commit the host while sharing the resource with its sibling (B in
>> >>this case). If the machine is under memory contention, we would like
>> >>to push down memory to A or B depends on their usage and soft limit.
>> >D'oh, I think the problem is just that we walk up the hierarchy one
>> >too many when checking whether a group exceeds a soft limit.  The soft
>> >limit is a signal to distribute pressure that comes from above, it's
>> >meaningless and should indeed be ignored on the level the pressure
>> >originates from.
>> >
>> >Say mem_cgroup_over_soft_limit(root, memcg) would check everyone up to
>> >but not including root, wouldn't that do exactly what we both want?
>> >
>> >Example:
>> >
>> >1. If global memory is short, we reclaim with root=root_mem_cgroup.
>> >    A1 and A2 get soft limit reclaimed because of A's soft limit
>> >    excess, just like the current kernel would do.
>> >
>> >2. If A hits its hard limit, we reclaim with root=A, so we only mind
>> >    the soft limits of A1 and A2.  A1 is below its soft limit, all
>> >    good.  A2 is above its soft limit, gets treated accordingly.  This
>> >    is new behaviour, the current kernel would just reclaim them
>> >    equally.
>> >
>> >Code:
>> >
>> >bool mem_cgroup_over_soft_limit(struct mem_cgroup *root,
>> >                             struct mem_cgroup *memcg)
>> >{
>> >     if (mem_cgroup_disabled())
>> >             return false;
>> >
>> >     if (!root)
>> >             root = root_mem_cgroup;
>> >
>> >     for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >             if (memcg == root)
>> >                     break;
>> >             if (res_counter_soft_limit_excess(&memcg->res))
>> >                     return true;
>> >     }
>> >     return false;
>> >}
>> Hi Johannes,
>>
>> I don't think it solve the root of the problem, example:
>> root
>> -> A (hard limit 20G, soft limit 12G, usage 20G)
>>     -> A1 ( soft limit 2G,   usage 1G)
>>     -> A2 ( soft limit 10G, usage 19G)
>>            ->B1 (soft limit 5G, usage 4G)
>>            ->B2 (soft limit 5G, usage 15G)
>>
>> Now A is hitting its hard limit and start hierarchical reclaim under A.
>> If we choose B1 to go through mem_cgroup_over_soft_limit, it will
>> return true because its parent A2 has a large usage and will lead to
>> priority=0 reclaiming. But in fact it should be B2 to be punished.
>
> Because A2 is over its soft limit, the whole hierarchy below it should
> be preferred over A1, so both B1 and B2 should be soft limit reclaimed
> to be consistent with behaviour at the root level.
>
>> IMHO, it may checking the cgroup's soft limit standalone without
>> looking up its ancestors just as Ying said.
>
> Again, this would be a regression as soft limits have been applied
> hierarchically forever.

If we are comparing it to the current implementation, agree that the
soft reclaim is applied hierarchically. In the example above, A2 will
be picked for soft reclaim while A is hitting its hard limit, which in
turns reclaim from B1 and B2 regardless of their soft limit setting.
However, I haven't convinced myself this is how we are gonna use the
soft limit.

The soft limit setting for each cgroup is a hit for applying pressure
under memory contention. One way of setting the soft limit is based on
the cgroup's working set size. Thus, we allow cgroup to grow above its
soft limit with cold page cache unless there is a memory pressure
comes from above. Under the hierarchical reclaim, we will exam the
soft limit and only apply extra pressure to the ones above their soft
limit. Here the same example:

root
-> A (hard limit 20G, soft limit 12G, usage 20G)
   -> A1 ( soft limit 2G,   usage 1G)
   -> A2 ( soft limit 10G, usage 19G)

          ->B1 (soft limit 5G, usage 4G)
          ->B2 (soft limit 5G, usage 15G)

If A is hitting its hard limit, we will reclaim all the children under
A hierarchically but only adding extra pressure to the ones above
their soft limits (A2, B2). Adding extra pressure to B1 will introduce
known regression based on customer expectation since the 4G usage are
hot memory.

I am not aware of how the existing soft reclaim being used, i bet
there are not a lot. If we are making changes on the current
implementation, we should also take the opportunity to think about the
initial design as well. Thoughts?

--Ying

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-17 20:25               ` Ying Han
@ 2012-01-17 21:56                 ` Johannes Weiner
  2012-01-17 23:39                   ` Ying Han
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-17 21:56 UTC (permalink / raw)
  To: Ying Han
  Cc: Sha, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote:
> On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote:
> >> IMHO, it may checking the cgroup's soft limit standalone without
> >> looking up its ancestors just as Ying said.
> >
> > Again, this would be a regression as soft limits have been applied
> > hierarchically forever.
> 
> If we are comparing it to the current implementation, agree that the
> soft reclaim is applied hierarchically. In the example above, A2 will
> be picked for soft reclaim while A is hitting its hard limit, which in
> turns reclaim from B1 and B2 regardless of their soft limit setting.
> However, I haven't convinced myself this is how we are gonna use the
> soft limit.

Of course I'm comparing it to the current implementation, this is what
I'm changing!

> The soft limit setting for each cgroup is a hit for applying pressure
> under memory contention. One way of setting the soft limit is based on
> the cgroup's working set size. Thus, we allow cgroup to grow above its
> soft limit with cold page cache unless there is a memory pressure
> comes from above. Under the hierarchical reclaim, we will exam the
> soft limit and only apply extra pressure to the ones above their soft
> limit. Here the same example:
> 
> root
> -> A (hard limit 20G, soft limit 12G, usage 20G)
>    -> A1 ( soft limit 2G,   usage 1G)
>    -> A2 ( soft limit 10G, usage 19G)
> 
>           ->B1 (soft limit 5G, usage 4G)
>           ->B2 (soft limit 5G, usage 15G)
> 
> If A is hitting its hard limit, we will reclaim all the children under
> A hierarchically but only adding extra pressure to the ones above
> their soft limits (A2, B2). Adding extra pressure to B1 will introduce
> known regression based on customer expectation since the 4G usage are
> hot memory.

I can only repeat myself: A has a soft limit set, so the customer
expects global pressure to arise sooner or later.  If that happens, A
will be soft-limit reclaimed hierarchically in the _existing code_.
That's how the soft limit currently works and I don't mean to change
it _with this patch_.  The customer has to expect that B1 can be
reclaimed as a consequence of the soft limit in A or A2 today, so I
don't know where this expectation of different behaviour should even
come from.  How can this be a regression?!

> I am not aware of how the existing soft reclaim being used, i bet
> there are not a lot. If we are making changes on the current
> implementation, we should also take the opportunity to think about the
> initial design as well. Thoughts?

I agree that these semantics should be up for debate.  And I think
changing it to something like you have in mind is indeed a good idea;
to not have soft limits apply hierarchically but instead follow down
the whole chain and only soft limit reclaim those that are themselves
above their soft limit.  But it's an entirely different matter!

This patch is supposed to do only two things: 1. refactor the soft
limit implementation, staying as close as possible/practical to the
current semantics and 2. fix the inconsistency that soft limits are
ignored when pressure does not originate at the root_mem_cgroup.  If
that is too much change in semantics I can easily ditch 2., I just
didn't see the use of maintaining an inconsistency that resulted
purely from the limitations of the current implementation by re-adding
more code and because I think that this would not be surprising
behaviour.  It would be as simple as adding an extra check in reclaim
that only minds soft limits upon global pressure:

	if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg))
		/* resulting action */

and it would have nothing to do how soft limits are actually applied
once triggered.  I can include this in the next version, but it won't
fix the problem you seem to be having with the _existing_ behaviour.

I also don't think that my patch will get in the way of what you are
planning to do: in fact, you already have code that easily turns
mem_cgroup_over_soft_limit() into a non-hierarchical predicate.

Even more will change when you invert the soft limits to become actual
guarantees and skip reclaiming memcgs that are below their soft limits
but I don't think this patch is in the way of doing that, either.

I feel that these are all orthogonal changes.  So if possible, could
we take just one step at a time and leave hypothetical behaviour out
of it unless the proposed changes clearly get in the way of where we
agreed we want to go?

If I misunderstood everything completely and you actually believe this
patch will get in the way, could you tell me where and how?

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-17 21:56                 ` Johannes Weiner
@ 2012-01-17 23:39                   ` Ying Han
  0 siblings, 0 replies; 29+ messages in thread
From: Ying Han @ 2012-01-17 23:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Sha, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On Tue, Jan 17, 2012 at 1:56 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Jan 17, 2012 at 12:25:31PM -0800, Ying Han wrote:
>> On Tue, Jan 17, 2012 at 6:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Tue, Jan 17, 2012 at 10:22:16PM +0800, Sha wrote:
>> >> IMHO, it may checking the cgroup's soft limit standalone without
>> >> looking up its ancestors just as Ying said.
>> >
>> > Again, this would be a regression as soft limits have been applied
>> > hierarchically forever.
>>
>> If we are comparing it to the current implementation, agree that the
>> soft reclaim is applied hierarchically. In the example above, A2 will
>> be picked for soft reclaim while A is hitting its hard limit, which in
>> turns reclaim from B1 and B2 regardless of their soft limit setting.
>> However, I haven't convinced myself this is how we are gonna use the
>> soft limit.
>
> Of course I'm comparing it to the current implementation, this is what
> I'm changing!

Thank you for clarifying it. Apparently i confused myself by comparing
this patch with the one I had last time.

>> The soft limit setting for each cgroup is a hit for applying pressure
>> under memory contention. One way of setting the soft limit is based on
>> the cgroup's working set size. Thus, we allow cgroup to grow above its
>> soft limit with cold page cache unless there is a memory pressure
>> comes from above. Under the hierarchical reclaim, we will exam the
>> soft limit and only apply extra pressure to the ones above their soft
>> limit. Here the same example:
>>
>> root
>> -> A (hard limit 20G, soft limit 12G, usage 20G)
>>    -> A1 ( soft limit 2G,   usage 1G)
>>    -> A2 ( soft limit 10G, usage 19G)
>>
>>           ->B1 (soft limit 5G, usage 4G)
>>           ->B2 (soft limit 5G, usage 15G)
>>
>> If A is hitting its hard limit, we will reclaim all the children under
>> A hierarchically but only adding extra pressure to the ones above
>> their soft limits (A2, B2). Adding extra pressure to B1 will introduce
>> known regression based on customer expectation since the 4G usage are
>> hot memory.
>
> I can only repeat myself: A has a soft limit set, so the customer
> expects global pressure to arise sooner or later.  If that happens, A
> will be soft-limit reclaimed hierarchically in the _existing code_.
> That's how the soft limit currently works and I don't mean to change
> it _with this patch_.  The customer has to expect that B1 can be
> reclaimed as a consequence of the soft limit in A or A2 today, so I
> don't know where this expectation of different behaviour should even
> come from.  How can this be a regression?!

sorry for the confusion :(

I wasn't comparing this patch to the current implementation, maybe I
should. If the goal of this patch set is to bring as close as possible
to the current implementation, I don't have objections.
>
>> I am not aware of how the existing soft reclaim being used, i bet
>> there are not a lot. If we are making changes on the current
>> implementation, we should also take the opportunity to think about the
>> initial design as well. Thoughts?
>
> I agree that these semantics should be up for debate.  And I think
> changing it to something like you have in mind is indeed a good idea;
> to not have soft limits apply hierarchically but instead follow down
> the whole chain and only soft limit reclaim those that are themselves
> above their soft limit.  But it's an entirely different matter!

thanks, agree that patch could come after this.

>
> This patch is supposed to do only two things: 1. refactor the soft
> limit implementation, staying as close as possible/practical to the
> current semantics and 2. fix the inconsistency that soft limits are
> ignored when pressure does not originate at the root_mem_cgroup.  If
> that is too much change in semantics I can easily ditch 2.,

It would be nice to split the two into separate patches. The second
which adds soft reclaim into per-memcg reclaim is a new functionality
from the current implementation.

I just
> didn't see the use of maintaining an inconsistency that resulted
> purely from the limitations of the current implementation by re-adding
> more code and because I think that this would not be surprising
> behaviour.

agree.

It would be as simple as adding an extra check in reclaim
> that only minds soft limits upon global pressure:
>
>        if (global_reclaim(sc) && mem_cgroup_over_soft_limit(root, memcg))
>                /* resulting action */
>
> and it would have nothing to do how soft limits are actually applied
> once triggered.  I can include this in the next version, but it won't
> fix the problem you seem to be having with the _existing_ behaviour.

No, it won't solve all the problems but close. It looks pretty much as
what I have, except the priority part. We can leave it for the
following patch to further improve soft limit reclaim.

I have no strong opinion to whether include the global_reclaim() or
not, however it might bring your patch closer to the _existing_
implementation. (considers soft reclaim only under global reclaim ).

> I also don't think that my patch will get in the way of what you are
> planning to do: in fact, you already have code that easily turns
> mem_cgroup_over_soft_limit() into a non-hierarchical predicate.

>
> Even more will change when you invert the soft limits to become actual
> guarantees and skip reclaiming memcgs that are below their soft limits
> but I don't think this patch is in the way of doing that, either.

> I feel that these are all orthogonal changes.  So if possible, could
> we take just one step at a time and leave hypothetical behaviour out
> of it unless the proposed changes clearly get in the way of where we
> agreed we want to go?
>
> If I misunderstood everything completely and you actually believe this
> patch will get in the way, could you tell me where and how?

The change by itself is easy to apply on top of yours. The
hierarchical part took some of my time to understand, which now is
clear to make it as close as possible to the _existing_ code. Feel
free to post the updated patch whenever they are ready.

Thanks

--Ying

> Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 12:16     ` Johannes Weiner
@ 2012-01-18  5:26       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-18  5:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Balbir Singh, Ying Han, cgroups,
	linux-mm, linux-kernel

On Fri, 13 Jan 2012 13:16:56 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Jan 12, 2012 at 10:54:27AM +0900, KAMEZAWA Hiroyuki wrote:

> > Thank you for your work and the result seems atractive and code is much
> > simpler. My small concerns are..
> > 
> > 1. This approach may increase latency of direct-reclaim because of priority=0.
> 
> I think strictly speaking yes, but note that with kswapd being less
> likely to get stuck in hammering on one group, the need for allocators
> to enter direct reclaim itself is reduced.
> 
> However, if this really becomes a problem in real world loads, the fix
> is pretty easy: just ignore the soft limit for direct reclaim.  We can
> still consider it from hard limit reclaim and kswapd.
> 
> > 2. In a case numa-spread/interleave application run in its own container, 
> >    pages on a node may paged-out again and again becasue of priority=0
> >    if some other application runs in the node.
> >    It seems difficult to use soft-limit with numa-aware applications.
> >    Do you have suggestions ?
> 
> This is a question about soft limits in general rather than about this
> particular patch, right?
> 

Partially, yes. My concern is related to "1".

Assume an application is binded to some cpu/node and try to allocate memory.
If its memcg's usage is over softlimit, this application will play bad because
newly allocated memory will be reclaim target soon, again....


> And if I understand correctly, the problem you are referring to is
> this: an application and parts of a soft-limited container share a
> node, the soft limit setting means that the container's pages on that
> node are reclaimed harder.  At that point, the container's share on
> that node becomes tiny, but since the soft limit is oblivious to
> nodes, the expansion of the other application pushes the soft-limited
> container off that node completely as long as the container stays
> above its soft limit with the usage on other nodes.
> 
> What would you think about having node-local soft limits that take the
> node size into account?
> 
> 	local_soft_limit = soft_limit * node_size / memcg_size
> 
> The soft limit can be exceeded globally, but the container is no
> longer pushed off a node on which it's only occupying a small share of
> memory.
> 
Yes, I think this kind of care is required.
What is the 'node_size' here ? size of pgdat ?
size of per-node usage in the memcg ?


> Putting it into proportion of the memcg size, not overall memory size
> has the following advantages:
> 
>   1. if the container is sitting on only one of several available
>   nodes without exceeding the limit globally, the memcg will not be
>   reclaimed harder just because it has a relatively large share of the
>   node.
> 
>   2. if the soft limit excess is ridiculously high, the local soft
>   limits will be pushed down, so the tolerance for smaller shares on
>   nodes goes down in proportion to the global soft limit excess.
> 
> Example:
> 
> 	4G soft limit * 2G node / 4G container = 2G node-local limit
> 
> The container is globally within its soft limit, so the local limit is
> at least the size of the node.  It's never reclaimed harder compared
> to other applications on the node.
> 
> 	4G soft limit * 2G node / 5G container = ~1.6G node-local limit
> 



> Here, it will experience more pressure initially, but it will level
> off when the shrinking usage and the thereby increasing node-local
> soft limit meet.  From that point on, the container and the competing
> application will be treated equally during reclaim.
> 
> Finally, if the container is 16G in size, i.e. 300% in excess, the
> per-node tolerance is at 512M node-local soft limit, which IMO strikes
> a good balance between zero tolerance and still applying some stress
> to the hugely oversized container when other applications (with
> virtually unlimited soft limits) want to run on the same node.
> 
> What do you think?

I like the idea. Another idea is changing 'priority' based on per-node stats
if not too complicated...

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
       [not found]               ` <CAFj3OHWY2Biw54gaGeH5fkxzgOhxn7NAibeYT_Jmga-_ypNSRg@mail.gmail.com>
@ 2012-01-18  9:25                 ` Johannes Weiner
  2012-01-18 11:25                   ` Sha
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-18  9:25 UTC (permalink / raw)
  To: Sha
  Cc: Ying Han, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote:
> > > I don't think it solve the root of the problem, example:
> > > root
> > > -> A (hard limit 20G, soft limit 12G, usage 20G)
> > >   -> A1 ( soft limit 2G,   usage 1G)
> > >   -> A2 ( soft limit 10G, usage 19G)
> > >          ->B1 (soft limit 5G, usage 4G)
> > >          ->B2 (soft limit 5G, usage 15G)
> > >
> > > Now A is hitting its hard limit and start hierarchical reclaim under A.
> > > If we choose B1 to go through mem_cgroup_over_soft_limit, it will
> > > return true because its parent A2 has a large usage and will lead to
> > > priority=0 reclaiming. But in fact it should be B2 to be punished.
> 
> > Because A2 is over its soft limit, the whole hierarchy below it should
> > be preferred over A1, so both B1 and B2 should be soft limit reclaimed
> > to be consistent with behaviour at the root level.
> 
> Well it is just the behavior that I'm expecting actually. But with my
> humble comprehension, I can't catch the soft-limit-based hierarchical
> reclaiming under the target cgroup (A2) in the current implementation
> or after the patch. Both the current mem_cgroup_soft_reclaim or
> shrink_zone select victim sub-cgroup by mem_cgroup_iter, but it
> doesn't take soft limit into consideration, do I left anything ?

No, currently soft limits are ignored if pressure originates from
below root_mem_cgroup.

But iff soft limits are applied right now, they are applied
hierarchically, see mem_cgroup_soft_limit_reclaim().

In my opinion, the fact that soft limits are ignored when pressure is
triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so
I allowed soft limits to be taken into account below root_mem_cgroup.

But IMO, this is something different from how soft limit reclaim is
applied once triggered: currently, soft limit reclaim applies to a
whole hierarchy, including all children.  And this I left unchanged.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-13 21:45         ` Ying Han
@ 2012-01-18  9:45           ` Johannes Weiner
  2012-01-18 20:38             ` Ying Han
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2012-01-18  9:45 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote:
> On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
> >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
> >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
> > [...]
> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
> >> > > +                        struct mem_cgroup *memcg)
> >> > > +{
> >> > > + if (mem_cgroup_disabled())
> >> > > +         return false;
> >> > > +
> >> > > + if (!root)
> >> > > +         root = root_mem_cgroup;
> >> > > +
> >> > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> > > +         /* root_mem_cgroup does not have a soft limit */
> >> > > +         if (memcg == root_mem_cgroup)
> >> > > +                 break;
> >> > > +         if (res_counter_soft_limit_excess(&memcg->res))
> >> > > +                 return true;
> >> > > +         if (memcg == root)
> >> > > +                 break;
> >> > > + }
> >> > > + return false;
> >> > > +}
> >> >
> >> > Well, this might be little bit tricky. We do not check whether memcg and
> >> > root are in a hierarchy (in terms of use_hierarchy) relation.
> >> >
> >> > If we are under global reclaim then we iterate over all memcgs and so
> >> > there is no guarantee that there is a hierarchical relation between the
> >> > given memcg and its parent. While, on the other hand, if we are doing
> >> > memcg reclaim then we have this guarantee.
> >> >
> >> > Why should we punish a group (subtree) which is perfectly under its soft
> >> > limit just because some other subtree contributes to the common parent's
> >> > usage and makes it over its limit?
> >> > Should we check memcg->use_hierarchy here?
> >>
> >> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
> >> which is only set when ->use_hierarchy is also set.
> >
> > Of course I am blind.. We do not setup res_counter parent for
> > !use_hierarchy case. Sorry for noise...
> > Now it makes much better sense. I was wondering how !use_hierarchy could
> > ever work, this should be a signal that I am overlooking something
> > terribly.
> >
> > [...]
> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
> >> > >                   .mem_cgroup = memcg,
> >> > >                   .zone = zone,
> >> > >           };
> >> > > +         int epriority = priority;
> >> > > +         /*
> >> > > +          * Put more pressure on hierarchies that exceed their
> >> > > +          * soft limit, to push them back harder than their
> >> > > +          * well-behaving siblings.
> >> > > +          */
> >> > > +         if (mem_cgroup_over_softlimit(root, memcg))
> >> > > +                 epriority = 0;
> >> >
> >> > This sounds too aggressive to me. Shouldn't we just double the pressure
> >> > or something like that?
> >>
> >> That's the historical value.  When I tried priority - 1, it was not
> >> aggressive enough.
> >
> > Probably because we want to reclaim too much. Maybe we should do
> > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
> > priority level as Ying suggested in her patchset.
> 
> I plan to post that change on top of this, and this patch set does the
> basic stuff to allow us doing further improvement.
> 
> I still like the design to skip over_soft_limit cgroups until certain
> priority. One way to set up the soft limit for each cgroup is to base
> on its actual working set size, and we prefer to punish A first with
> lots of page cache ( cold file pages above soft limit) than reclaiming
> anon pages from B ( below soft limit ). Unless we can not get enough
> pages reclaimed from A, we will start reclaiming from B.
> 
> This might not be the ideal solution, but should be a good start. Thoughts?

I don't like this design at all because unless you add weird code to
detect if soft limits apply to any memcgs on the reclaimed hierarchy
you may iterate over the same bunch of memcgs doing nothing for
several times.  For example in the default case of no softlimits set
anywhere and you repeatedly walk ALL memcgs in the system doing jack
until you reach your threshold priority level.  Elegant is something
else in my book.

Once we invert soft limits to mean guarantees and make the default
soft limit not infinity but zero, then we can ignore memcgs below
their soft limit for a few priority levels just fine because being
below the soft limit is the exception.  But I don't really want to
make this quite invasive behavioural change a requirement for a
refactoring patch if possible.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-18  9:25                 ` Johannes Weiner
@ 2012-01-18 11:25                   ` Sha
  2012-01-18 15:27                     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Sha @ 2012-01-18 11:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, Andrew Morton, Michal Hocko, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On 01/18/2012 05:25 PM, Johannes Weiner wrote:
> On Wed, Jan 18, 2012 at 03:17:25PM +0800, Sha wrote:
>>>> I don't think it solve the root of the problem, example:
>>>> root
>>>> ->  A (hard limit 20G, soft limit 12G, usage 20G)
>>>>    ->  A1 ( soft limit 2G,   usage 1G)
>>>>    ->  A2 ( soft limit 10G, usage 19G)
>>>>           ->B1 (soft limit 5G, usage 4G)
>>>>           ->B2 (soft limit 5G, usage 15G)
>>>>
>>>> Now A is hitting its hard limit and start hierarchical reclaim under A.
>>>> If we choose B1 to go through mem_cgroup_over_soft_limit, it will
>>>> return true because its parent A2 has a large usage and will lead to
>>>> priority=0 reclaiming. But in fact it should be B2 to be punished.
>>> Because A2 is over its soft limit, the whole hierarchy below it should
>>> be preferred over A1, so both B1 and B2 should be soft limit reclaimed
>>> to be consistent with behaviour at the root level.
>> Well it is just the behavior that I'm expecting actually. But with my
>> humble comprehension, I can't catch the soft-limit-based hierarchical
>> reclaiming under the target cgroup (A2) in the current implementation
>> or after the patch. Both the current mem_cgroup_soft_reclaim or
>> shrink_zone select victim sub-cgroup by mem_cgroup_iter, but it
>> doesn't take soft limit into consideration, do I left anything ?
> No, currently soft limits are ignored if pressure originates from
> below root_mem_cgroup.
>
> But iff soft limits are applied right now, they are applied
> hierarchically, see mem_cgroup_soft_limit_reclaim().
Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed
choses the biggest soft-limit excessor first, but in the succeeding reclaim
mem_cgroup_hierarchical_reclaim just selects a child cgroup  by css_id
which has nothing to do with soft limit (see mem_cgroup_select_victim).
IMHO, it's not a genuine hierarchical reclaim.
I check this from the latest memcg-devel git tree (branch since-3.1)...

> In my opinion, the fact that soft limits are ignored when pressure is
> triggered sub-root_mem_cgroup is an artifact of the per-zone tree, so
> I allowed soft limits to be taken into account below root_mem_cgroup.
>
> But IMO, this is something different from how soft limit reclaim is
> applied once triggered: currently, soft limit reclaim applies to a
> whole hierarchy, including all children.  And this I left unchanged.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-18 11:25                   ` Sha
@ 2012-01-18 15:27                     ` Michal Hocko
  2012-01-19  6:38                       ` Sha
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2012-01-18 15:27 UTC (permalink / raw)
  To: Sha
  Cc: Johannes Weiner, Ying Han, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On Wed 18-01-12 19:25:27, Sha wrote:
[...]
> Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed
> choses the biggest soft-limit excessor first, but in the succeeding reclaim
> mem_cgroup_hierarchical_reclaim just selects a child cgroup  by css_id

mem_cgroup_soft_limit_reclaim picks up the hierarchy root (most
excessing one) and mem_cgroup_hierarchical_reclaim reclaims from that
subtree). It doesn't care who exceeds the soft limit under that
hierarchy it just tries to push the root under its limit as much as it
can. This is what Johannes tried to explain in the other email in the
thred.

> which has nothing to do with soft limit (see mem_cgroup_select_victim).
> IMHO, it's not a genuine hierarchical reclaim.

It is hierarchical because it iterates over hierarchy it is not and
never was recursively soft-hierarchical...

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-18  9:45           ` Johannes Weiner
@ 2012-01-18 20:38             ` Ying Han
  0 siblings, 0 replies; 29+ messages in thread
From: Ying Han @ 2012-01-18 20:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh,
	cgroups, linux-mm, linux-kernel

On Wed, Jan 18, 2012 at 1:45 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Fri, Jan 13, 2012 at 01:45:30PM -0800, Ying Han wrote:
>> On Fri, Jan 13, 2012 at 8:34 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Fri 13-01-12 16:50:01, Johannes Weiner wrote:
>> >> On Fri, Jan 13, 2012 at 01:04:06PM +0100, Michal Hocko wrote:
>> >> > On Tue 10-01-12 16:02:52, Johannes Weiner wrote:
>> > [...]
>> >> > > +bool mem_cgroup_over_softlimit(struct mem_cgroup *root,
>> >> > > +                        struct mem_cgroup *memcg)
>> >> > > +{
>> >> > > + if (mem_cgroup_disabled())
>> >> > > +         return false;
>> >> > > +
>> >> > > + if (!root)
>> >> > > +         root = root_mem_cgroup;
>> >> > > +
>> >> > > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >> > > +         /* root_mem_cgroup does not have a soft limit */
>> >> > > +         if (memcg == root_mem_cgroup)
>> >> > > +                 break;
>> >> > > +         if (res_counter_soft_limit_excess(&memcg->res))
>> >> > > +                 return true;
>> >> > > +         if (memcg == root)
>> >> > > +                 break;
>> >> > > + }
>> >> > > + return false;
>> >> > > +}
>> >> >
>> >> > Well, this might be little bit tricky. We do not check whether memcg and
>> >> > root are in a hierarchy (in terms of use_hierarchy) relation.
>> >> >
>> >> > If we are under global reclaim then we iterate over all memcgs and so
>> >> > there is no guarantee that there is a hierarchical relation between the
>> >> > given memcg and its parent. While, on the other hand, if we are doing
>> >> > memcg reclaim then we have this guarantee.
>> >> >
>> >> > Why should we punish a group (subtree) which is perfectly under its soft
>> >> > limit just because some other subtree contributes to the common parent's
>> >> > usage and makes it over its limit?
>> >> > Should we check memcg->use_hierarchy here?
>> >>
>> >> We do, actually.  parent_mem_cgroup() checks the res_counter parent,
>> >> which is only set when ->use_hierarchy is also set.
>> >
>> > Of course I am blind.. We do not setup res_counter parent for
>> > !use_hierarchy case. Sorry for noise...
>> > Now it makes much better sense. I was wondering how !use_hierarchy could
>> > ever work, this should be a signal that I am overlooking something
>> > terribly.
>> >
>> > [...]
>> >> > > @@ -2121,8 +2121,16 @@ static void shrink_zone(int priority, struct zone *zone,
>> >> > >                   .mem_cgroup = memcg,
>> >> > >                   .zone = zone,
>> >> > >           };
>> >> > > +         int epriority = priority;
>> >> > > +         /*
>> >> > > +          * Put more pressure on hierarchies that exceed their
>> >> > > +          * soft limit, to push them back harder than their
>> >> > > +          * well-behaving siblings.
>> >> > > +          */
>> >> > > +         if (mem_cgroup_over_softlimit(root, memcg))
>> >> > > +                 epriority = 0;
>> >> >
>> >> > This sounds too aggressive to me. Shouldn't we just double the pressure
>> >> > or something like that?
>> >>
>> >> That's the historical value.  When I tried priority - 1, it was not
>> >> aggressive enough.
>> >
>> > Probably because we want to reclaim too much. Maybe we should do
>> > reduce nr_to_reclaim (ugly) or reclaim only overlimit groups until certain
>> > priority level as Ying suggested in her patchset.
>>
>> I plan to post that change on top of this, and this patch set does the
>> basic stuff to allow us doing further improvement.
>>
>> I still like the design to skip over_soft_limit cgroups until certain
>> priority. One way to set up the soft limit for each cgroup is to base
>> on its actual working set size, and we prefer to punish A first with
>> lots of page cache ( cold file pages above soft limit) than reclaiming
>> anon pages from B ( below soft limit ). Unless we can not get enough
>> pages reclaimed from A, we will start reclaiming from B.
>>
>> This might not be the ideal solution, but should be a good start. Thoughts?
>
> I don't like this design at all because unless you add weird code to
> detect if soft limits apply to any memcgs on the reclaimed hierarchy
> you may iterate over the same bunch of memcgs doing nothing for
> several times.  For example in the default case of no softlimits set
> anywhere and you repeatedly walk ALL memcgs in the system doing jack
> until you reach your threshold priority level.  Elegant is something
> else in my book.

Agree that change isn't ready until the default soft limit is changed to "0".

> Once we invert soft limits to mean guarantees and make the default
> soft limit not infinity but zero, then we can ignore memcgs below
> their soft limit for a few priority levels just fine because being
> below the soft limit is the exception.  But I don't really want to
> make this quite invasive behavioural change a requirement for a
> refactoring patch if possible.

Sounds reasonable to me.

--Ying

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/2] mm: memcg: hierarchical soft limit reclaim
  2012-01-18 15:27                     ` Michal Hocko
@ 2012-01-19  6:38                       ` Sha
  0 siblings, 0 replies; 29+ messages in thread
From: Sha @ 2012-01-19  6:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Ying Han, Andrew Morton, KAMEZAWA Hiroyuki,
	Balbir Singh, cgroups, linux-mm, linux-kernel

On 01/18/2012 11:27 PM, Michal Hocko wrote:
> On Wed 18-01-12 19:25:27, Sha wrote:
> [...]
>> Er... I'm even more confused: mem_cgroup_soft_limit_reclaim indeed
>> choses the biggest soft-limit excessor first, but in the succeeding reclaim
>> mem_cgroup_hierarchical_reclaim just selects a child cgroup  by css_id
> mem_cgroup_soft_limit_reclaim picks up the hierarchy root (most
> excessing one) and mem_cgroup_hierarchical_reclaim reclaims from that
> subtree). It doesn't care who exceeds the soft limit under that
> hierarchy it just tries to push the root under its limit as much as it
> can. This is what Johannes tried to explain in the other email in the
> thred.
yeah, I finally twig what  he meant... I'm not quite familiar with this 
part.
Thanks a lot for the explanation. :-)

Sha
>> which has nothing to do with soft limit (see mem_cgroup_select_victim).
>> IMHO, it's not a genuine hierarchical reclaim.
> It is hierarchical because it iterates over hierarchy it is not and
> never was recursively soft-hierarchical...
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2012-01-19  6:38 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-10 15:02 [patch 0/2] mm: memcg reclaim integration followups Johannes Weiner
2012-01-10 15:02 ` [patch 1/2] mm: memcg: per-memcg reclaim statistics Johannes Weiner
2012-01-10 23:54   ` Ying Han
2012-01-11  0:30     ` Johannes Weiner
2012-01-11 22:33       ` Ying Han
2012-01-12  9:17         ` Johannes Weiner
2012-01-10 15:02 ` [patch 2/2] mm: memcg: hierarchical soft limit reclaim Johannes Weiner
2012-01-11 21:42   ` Ying Han
2012-01-12  8:59     ` Johannes Weiner
2012-01-13 21:31       ` Ying Han
2012-01-13 22:44         ` Johannes Weiner
2012-01-17 14:22           ` Sha
2012-01-17 14:53             ` Johannes Weiner
2012-01-17 20:25               ` Ying Han
2012-01-17 21:56                 ` Johannes Weiner
2012-01-17 23:39                   ` Ying Han
     [not found]               ` <CAFj3OHWY2Biw54gaGeH5fkxzgOhxn7NAibeYT_Jmga-_ypNSRg@mail.gmail.com>
2012-01-18  9:25                 ` Johannes Weiner
2012-01-18 11:25                   ` Sha
2012-01-18 15:27                     ` Michal Hocko
2012-01-19  6:38                       ` Sha
2012-01-12  1:54   ` KAMEZAWA Hiroyuki
2012-01-13 12:16     ` Johannes Weiner
2012-01-18  5:26       ` KAMEZAWA Hiroyuki
2012-01-13 12:04   ` Michal Hocko
2012-01-13 15:50     ` Johannes Weiner
2012-01-13 16:34       ` Michal Hocko
2012-01-13 21:45         ` Ying Han
2012-01-18  9:45           ` Johannes Weiner
2012-01-18 20:38             ` Ying Han

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).