* [patch v4] Soft limit rework
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Hi,

This is the fourth version of the patchset.

Summary of versions:
The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973
(lkml wasn't CCed at the time so I cannot find it in lwn.net
archives). There were no major objections. The second version
has been posted here http://lwn.net/Articles/548191/ as a part
of a longer and spicier thread which started after LSF here:
https://lwn.net/Articles/548192/
Version 3 was posted here http://lwn.net/Articles/550409/ where
Johannes was worried about setups with thousands of memcgs and the
tree walk overhead of the soft reclaim pass when no group is in excess.

Changes between RFC (aka V1) -> V2
As there were no major objections, there are only some minor cleanups
since the last version, and I have moved "memcg: Ignore soft limit
until it is explicitly specified" to the end of the series.

Changes between V2 -> V3
No changes in the code since the last version. I have just rebased the
series on top of the current mmotm tree. The most controversial part
has been dropped (the last patch "memcg: Ignore soft limit until it is
explicitly specified") so there are no semantic changes to the soft
limit behavior. This makes this work mostly a code cleanup and
reorganization. Nevertheless, it is enough to make the soft limit work
more efficiently according to my testing, and groups above the soft
limit are reclaimed much less as a result.

Changes between V3->V4
Added some Reviewed-bys but the biggest change comes from Johannes
concern about the tree traversal overhead with a huge number of memcgs
(http://thread.gmane.org/gmane.linux.kernel.cgroups/7307/focus=100326)
and this version addresses this problem by augmenting the memcg tree
with the number of over soft limit children at each level of the
hierarchy. See more bellow.

The basic idea is quite simple. Pull soft reclaim into shrink_zone in
the first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is now done in two passes. First it tries to do the soft
limit reclaim and falls back to the reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority, so the only time we waste is the memcg tree walk, which
has been updated in the third step to have only negligible overhead.
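
To illustrate, the new shrink_zone boils down to this (simplified from
patch 1 below, with only the two-pass logic kept):

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
        bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
        unsigned long nr_scanned = sc->nr_scanned;

        /* first pass: only groups eligible for the soft limit reclaim */
        __shrink_zone(zone, sc, do_soft_reclaim);

        /*
         * Nobody was over the soft limit, or those that were had no
         * pages in this zone - reclaim everybody at the same priority.
         */
        if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
                __shrink_zone(zone, sc, false);
}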

As a bonus we will get rid of a _lot_ of code this way, and soft
reclaim will not stand out like before, when it wasn't integrated into
the zone shrinking code and reclaimed at priority 0 (the testing
results show that some workloads suffer from such an aggressive
reclaim). The cleanup is in a separate patch because I felt it would be
easier to review that way.

The second step is soft limit reclaim integration into targeted
reclaim. It should be rather straightforward. The soft limit has been
used only for global reclaim so far, but it makes sense for any kind of
pressure coming from up the hierarchy, including targeted reclaim.

The third step (patches 4-8) addresses the tree walk overhead by
enhancing memcg iterators to enable skipping whole subtrees and by
tracking the number of over-soft-limit children at each level of the
hierarchy. This information is updated the same way the old soft limit
tree was updated (from memcg_check_events) so we shouldn't see any
additional overhead. In fact mem_cgroup_update_soft_limit is much
simpler than the tree manipulation done previously.
__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
mem_cgroup_iter so the decision whether a particular group should be
visited is made at the iterator level, which allows us to skip whole
subtrees as well (if there is no child in excess). This reduces the
tree walk overhead considerably.
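
Condensed from patch 4 below, the per-zone walk then looks roughly like
this:

        struct mem_cgroup *memcg = NULL;
        /* NULL filter means an unconditional walk as before */
        mem_cgroup_iter_filter filter = soft_reclaim ?
                mem_cgroup_soft_reclaim_eligible : NULL;

        while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
                struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

                shrink_lruvec(lruvec, sc);
        }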

My primary test case was a parallel kernel build with 2 groups (make
is running with -j4 with a distribution .config in a separate cgroup
without any hard limit) on an 8 CPU machine booted with 1GB memory. I
was mostly interested in 2 setups: the default one - no soft limit set -
and one with a 0 soft limit set for both groups.
The first one should tell us whether the rework regresses the default
behavior while the second one should show us improvements in an extreme
case where both workloads are always over the soft limit.

/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after a fresh boot, without any other load on
the system.

base is mmotm-2013-05-09-15-57
rework means patches 1-3
reworkoptim means patches 4-8

* No-limit
System
base: min: 242.55 max: 245.36 avg: 243.92 std: 1.17 runs: 6
rework: min: 240.74 [99.3%] max: 244.90 [99.8%] avg: 242.82 [99.5%] std: 1.45 runs: 6
reworkoptim: min: 242.07 [99.8%] max: 244.73 [99.7%] avg: 243.78 [99.9%] std: 0.91 runs: 6
Elapsed
base: min: 596.81 max: 620.04 avg: 605.52 std: 7.56 runs: 6
rework: min: 590.06 [98.9%] max: 602.11 [97.1%] avg: 596.09 [98.4%] std: 4.39 runs: 6
reworkoptim: min: 591.54 [99.1%] max: 613.47 [98.9%] avg: 599.60 [99.0%] std: 7.20 runs: 6

The numbers are within the stdev so it doesn't look like a regression.

* 0-limit
System
base: min: 248.40 max: 252.00 avg: 250.19 std: 1.38 runs: 6
rework: min: 240.31 [96.7%] max: 243.61 [96.7%] avg: 242.01 [96.7%] std: 1.18 runs: 6
reworkoptim: min: 242.67 [97.7%] max: 245.73 [97.5%] avg: 244.52 [97.7%] std: 1.00 runs: 6
Elapsed
base: min: 759.28 max: 805.30 avg: 784.87 std: 15.45 runs: 6
rework: min: 588.75 [77.5%] max: 606.30 [75.3%] avg: 597.07 [76.1%] std: 5.12 runs: 6
reworkoptim: min: 591.31 [77.9%] max: 612.52 [76.1%] avg: 601.08 [76.6%] std: 6.93 runs: 6

We can see a 2-3% decrease in System time, which is not a huge win but
sounds like a good outcome for a cleanup (rework), and the tracking
overhead is barely visible (within the noise).

It is even more interesting to check the Elapsed time numbers, which
show that the parallel load is much more effective. I haven't looked
into the specific reasons for this boost deeply, but I would guess that
the priority-0 reclaim done in the original implementation is a big
contributor.

Page fault statistics tell us at least part of the story:
Minor
base: min: 35941845.00 max: 36029788.00 avg: 35986860.17 std: 28288.66 runs: 6
rework: min: 35595937.00 [99.0%] max: 35690024.00 [99.1%] avg: 35654958.17 [99.1%] std: 31270.07 runs: 6
reworkoptim: min: 35596909.00 [99.0%] max: 35684640.00 [99.0%] avg: 35660804.67 [99.1%] std: 29918.93 runs: 6
Major
base: min: 25390.00 max: 33132.00 avg: 29961.83 std: 2476.58 runs: 6
rework: min: 451.00 [1.8%] max: 1600.00 [4.8%] avg: 814.17 [2.7%] std: 380.01 runs: 6
reworkoptim: min: 318.00 [1.3%] max: 2023.00 [6.1%] avg: 911.50 [3.0%] std: 632.83 runs: 6

While the minor faults are within the noise, the major faults are
reduced considerably. This looks like an effect of the aggressive
pageout during reclaim, and that pageout presumably hits the working
set.

While this looks like a huge win it is fair to say that there are some
workloads that actually benefit from reclaim at priority 0 (from
background reclaim). E.g. an aggressive streaming IO would like to get
rid of as many pages as possible and not block on pages under
writeback. This can lead to a higher System time, but the Elapsed time
I got was generally comparable.

The following results are from a 2-group configuration on an 8GB
machine (A running stream IO with 4*TotalMem size and a 0 soft limit,
B running a mem_eater which consumes TotalMem-1G without any limit).
System
base: min: 124.88 max: 136.97 avg: 130.77 std: 4.94 runs: 3
rework: min: 161.67 [129.5%] max: 196.80 [143.7%] avg: 174.18 [133.2%] std: 16.02 runs: 3
reworkoptim: min: 267.48 [214.2%] max: 319.64 [233.4%] avg: 300.43 [229.7%] std: 23.40 runs: 3
Elapsed
base: min: 398.86 max: 412.81 avg: 407.62 std: 6.23 runs: 3
rework: min: 423.41 [106.2%] max: 450.30 [109.1%] avg: 440.92 [108.2%] std: 12.39 runs: 3
reworkoptim: min: 379.91 [95.2%] max: 416.46 [100.9%] avg: 399.26 [97.9%] std: 15.00 runs: 3

The testing results for a bunch of cgroups with both stream IO and
kbuild loads can be found in "memcg: track children in soft limit
excess to improve soft limit".

The series has seen quite some testing and I guess it is in a state to
be merged into mmotm and hopefully get into 3.11. I would like to hear
back from Johannes and Kamezawa about this timing, though.

Shortlog says:
Michal Hocko (8):
      memcg: integrate soft reclaim tighter with zone shrinking code
      memcg: Get rid of soft-limit tree infrastructure
      vmscan, memcg: Do softlimit reclaim also for targeted reclaim
      memcg: enhance memcg iterator to support predicates
      memcg: track children in soft limit excess to improve soft limit
      memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything
      memcg: Track all children over limit in the root
      memcg, vmscan: do not fall into reclaim-all pass too quickly

And the diffstat is still promising I would say:
 include/linux/memcontrol.h |   46 +++-
 mm/memcontrol.c            |  537 +++++++++++---------------------------------
 mm/vmscan.c                |   83 ++++---
 3 files changed, 218 insertions(+), 448 deletions(-)


* [patch -v4 1/8] memcg: integrate soft reclaim tighter with zone shrinking code
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Memcg soft reclaim has traditionally been triggered from the global
reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
then picked up the group which exceeded the soft limit the most and
reclaimed it at priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.

The infrastructure requires per-node-zone trees which hold over-limit
groups, and keeping them up-to-date (via memcg_check_events) is not
cost free. Although this overhead hasn't turned out to be a bottleneck,
the implementation is suboptimal because mem_cgroup_update_tree has no
idea which zones consumed memory over the limit, so we could easily end
up with a group on a node-zone tree while it has only a few pages from
that node-zone.

This patch doesn't try to fix the node-zone tree management because
integrating soft reclaim into zone shrinking seems much easier and more
appropriate for several reasons.
First of all, priority-0 reclaim was a crude hack which might lead to
big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
of dirty/writeback pages).
Soft reclaim should also be applicable to targeted reclaim, which is
awkward right now without additional hacks.
Last but not least, the whole infrastructure eats quite some code.

After this patch shrink_zone is done in 2 passes. First it tries to do
the soft reclaim if appropriate (only for global reclaim for now to
stay compatible with the original state) and falls back to ignoring the
soft limit if no group is eligible for soft reclaim or nothing has been
scanned during the first pass. Only groups which are over their soft
limit, or which have a parent up the hierarchy over the limit, are
considered eligible during the first pass.

The soft limit tree, which is not necessary anymore, will be removed in
a follow-up patch to make this patch smaller and easier to review.

Changes since v2
- Do not try soft reclaim pass if mem_cgroup_disabled
Changes since v1
- __shrink_zone doesn't return the number of shrunk groups as the
  nr_scanned test covers both no groups scanned and no pages from the
  required zone, as pointed out by Johannes

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h |   10 +--
 mm/memcontrol.c            |  161 ++++++--------------------------------------
 mm/vmscan.c                |   62 ++++++++++-------
 3 files changed, 59 insertions(+), 174 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..1833c95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-						gfp_t gfp_mask,
-						unsigned long *total_scanned);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
 
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
 {
-	return 0;
+	return false;
 }
 
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f123..7e5d7a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2083,57 +2083,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
-				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
-{
-	struct mem_cgroup *victim = NULL;
-	int total = 0;
-	int loop = 0;
-	unsigned long excess;
-	unsigned long nr_scanned;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = 0,
-	};
+/*
+ * A group is eligible for the soft limit reclaim if it is
+ * 	a) is over its soft limit
+ * 	b) any parent up the hierarchy is over its soft limit
+ */
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = memcg;
 
-	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
-	while (1) {
-		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
-		if (!victim) {
-			loop++;
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total)
-					break;
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
-					break;
-			}
-			continue;
-		}
-		if (!mem_cgroup_reclaimable(victim, false))
-			continue;
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
-		*total_scanned += nr_scanned;
-		if (!res_counter_soft_limit_excess(&root_memcg->res))
-			break;
+	if (res_counter_soft_limit_excess(&memcg->res))
+		return true;
+
+	/*
+	 * If any parent up the hierarchy is over its soft limit then we
+	 * have to obey and reclaim from this group as well.
+	 */
+	while((parent = parent_mem_cgroup(parent))) {
+		if (res_counter_soft_limit_excess(&parent->res))
+			return true;
 	}
-	mem_cgroup_iter_break(root_memcg, victim);
-	return total;
+
+	return false;
 }
 
 /*
@@ -4751,98 +4722,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-	unsigned long nr_scanned;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
-						    gfp_mask, &nr_scanned);
-		nr_reclaimed += reclaimed;
-		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz)
-					css_put(&next_mz->memcg->css);
-				else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->memcg->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->memcg->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->memcg->css);
-	return nr_reclaimed;
-}
-
 /**
  * mem_cgroup_force_empty_list - clears LRU of a group
  * @memcg: group to clear
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa6a853..6682277 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,11 +139,21 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return !mem_cgroup_disabled() && global_reclaim(sc);
+}
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+	return false;
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1943,7 +1953,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static void
+__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 {
 	unsigned long nr_reclaimed, nr_scanned;
 
@@ -1962,6 +1973,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		do {
 			struct lruvec *lruvec;
 
+			if (soft_reclaim &&
+			    !mem_cgroup_soft_reclaim_eligible(memcg)) {
+				memcg = mem_cgroup_iter(root, memcg, &reclaim);
+				continue;
+			}
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
@@ -1992,6 +2009,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 					 sc->nr_scanned - nr_scanned, sc));
 }
 
+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
+	unsigned long nr_scanned = sc->nr_scanned;
+
+	__shrink_zone(zone, sc, do_soft_reclaim);
+
+	/*
+	 * No group is over the soft limit or those that are do not have
+	 * pages in the zone we are reclaiming so we have to reclaim everybody
+	 */
+	if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) {
+		__shrink_zone(zone, sc, false);
+		return;
+	}
+}
+
 /* Returns true if compaction should go ahead for a high-order request */
 static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
@@ -2053,8 +2088,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	bool aborted_reclaim = false;
 
 	/*
@@ -2094,18 +2127,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 					continue;
 				}
 			}
-			/*
-			 * This steals pages from memory cgroups over softlimit
-			 * and returns the number of reclaimed pages and
-			 * scanned pages. This works for global memory pressure
-			 * and balancing, not for a memcg's limit.
-			 */
-			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
-			sc->nr_reclaimed += nr_soft_reclaimed;
-			sc->nr_scanned += nr_soft_scanned;
 			/* need some check for avoid more shrink_zone() */
 		}
 
@@ -2628,8 +2649,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	unsigned long nr_soft_reclaimed;
-	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_unmap = 1,
@@ -2728,15 +2747,6 @@ loop_again:
 
 			sc.nr_scanned = 0;
 
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
-- 
1.7.10.4



* [patch -v4 2/8] memcg: Get rid of soft-limit tree infrastructure
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Now that the soft limit is integrated directly into the reclaim, the
whole soft-limit tree infrastructure is not needed anymore. Rip it out.

Changes v1->v2
- mem_cgroup_reclaimable is no longer used
- test_mem_cgroup_node_reclaimable is not used outside NUMA code

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c |  265 +------------------------------------------------------
 1 file changed, 2 insertions(+), 263 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e5d7a9..14981be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -39,7 +39,6 @@
 #include <linux/limits.h>
 #include <linux/export.h>
 #include <linux/mutex.h>
-#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -137,7 +136,6 @@ static const char * const mem_cgroup_lru_names[] = {
  */
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
-	MEM_CGROUP_TARGET_SOFTLIMIT,
 	MEM_CGROUP_TARGET_NUMAINFO,
 	MEM_CGROUP_NTARGETS,
 };
@@ -173,10 +171,6 @@ struct mem_cgroup_per_zone {
 
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
-	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long long	usage_in_excess;/* Set to the value by which */
-						/* the soft limit is exceeded*/
-	bool			on_tree;
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
@@ -189,26 +183,6 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[0];
 };
 
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
-	struct rb_root rb_root;
-	spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
-	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	u64 threshold;
@@ -533,7 +507,6 @@ static bool move_file(void)
  * limit reclaim to prevent infinite loops, if they ever occur.
  */
 #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
-#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -764,164 +737,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 	return mem_cgroup_zoneinfo(memcg, nid, zid);
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz,
-				unsigned long long new_usage_in_excess)
-{
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
-
-	if (mz->on_tree)
-		return;
-
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess)
-			p = &(*p)->rb_left;
-		/*
-		 * We can't avoid mem cgroups that are over their soft
-		 * limit by the same amount
-		 */
-		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
-			p = &(*p)->rb_right;
-	}
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	if (!mz->on_tree)
-		return;
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_remove_exceeded(memcg, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
-{
-	unsigned long long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-	mctz = soft_limit_tree_from_page(page);
-
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-		excess = res_counter_soft_limit_excess(&memcg->res);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(memcg, mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
-			spin_unlock(&mctz->lock);
-		}
-	}
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
-	int node, zone;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-
-	for_each_node(node) {
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			mz = mem_cgroup_zoneinfo(memcg, node, zone);
-			mctz = soft_limit_tree_node_zone(node, zone);
-			mem_cgroup_remove_exceeded(memcg, mz, mctz);
-		}
-	}
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
-
-retry:
-	mz = NULL;
-	rightmost = rb_last(&mctz->rb_root);
-	if (!rightmost)
-		goto done;		/* Nothing to reclaim from */
-
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
-	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
-	 */
-	__mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
-		!css_tryget(&mz->memcg->css))
-		goto retry;
-done:
-	return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct mem_cgroup_per_zone *mz;
-
-	spin_lock(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
-	return mz;
-}
-
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -1075,9 +890,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 		case MEM_CGROUP_TARGET_THRESH:
 			next = val + THRESHOLDS_EVENTS_TARGET;
 			break;
-		case MEM_CGROUP_TARGET_SOFTLIMIT:
-			next = val + SOFTLIMIT_EVENTS_TARGET;
-			break;
 		case MEM_CGROUP_TARGET_NUMAINFO:
 			next = val + NUMAINFO_EVENTS_TARGET;
 			break;
@@ -1100,11 +912,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
-		bool do_softlimit;
 		bool do_numainfo __maybe_unused;
 
-		do_softlimit = mem_cgroup_event_ratelimit(memcg,
-						MEM_CGROUP_TARGET_SOFTLIMIT);
 #if MAX_NUMNODES > 1
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
@@ -1112,8 +921,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		preempt_enable();
 
 		mem_cgroup_threshold(memcg);
-		if (unlikely(do_softlimit))
-			mem_cgroup_update_tree(memcg, page);
 #if MAX_NUMNODES > 1
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
@@ -1946,6 +1753,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 	return total;
 }
 
+#if MAX_NUMNODES > 1
 /**
  * test_mem_cgroup_node_reclaimable
  * @memcg: the target memcg
@@ -1968,7 +1776,6 @@ static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
 	return false;
 
 }
-#if MAX_NUMNODES > 1
 
 /*
  * Always updating the nodemask is not very good - even if we have an empty
@@ -2036,51 +1843,12 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	return node;
 }
 
-/*
- * Check all nodes whether it contains reclaimable pages or not.
- * For quick scan, we make use of scan_nodes. This will allow us to skip
- * unused nodes. But scan_nodes is lazily updated and may not cotain
- * enough new information. We need to do double check.
- */
-static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
-{
-	int nid;
-
-	/*
-	 * quick check...making use of scan_node.
-	 * We can skip unused nodes.
-	 */
-	if (!nodes_empty(memcg->scan_nodes)) {
-		for (nid = first_node(memcg->scan_nodes);
-		     nid < MAX_NUMNODES;
-		     nid = next_node(nid, memcg->scan_nodes)) {
-
-			if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
-				return true;
-		}
-	}
-	/*
-	 * Check rest of nodes.
-	 */
-	for_each_node_state(nid, N_MEMORY) {
-		if (node_isset(nid, memcg->scan_nodes))
-			continue;
-		if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
-			return true;
-	}
-	return false;
-}
-
 #else
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 {
 	return 0;
 }
 
-static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
-{
-	return test_mem_cgroup_node_reclaimable(memcg, 0, noswap);
-}
 #endif
 
 /*
@@ -2955,9 +2723,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	unlock_page_cgroup(pc);
 
 	/*
-	 * "charge_statistics" updated event counter. Then, check it.
-	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
-	 * if they exceeds softlimit.
+	 * "charge_statistics" updated event counter.
 	 */
 	memcg_check_events(memcg, page);
 }
@@ -6086,8 +5852,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		lruvec_init(&mz->lruvec);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
 		mz->memcg = memcg;
 	}
 	memcg->info.nodeinfo[node] = pn;
@@ -6143,7 +5907,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	int node;
 	size_t size = memcg_size();
 
-	mem_cgroup_remove_from_trees(memcg);
 	free_css_id(&mem_cgroup_subsys, &memcg->css);
 
 	for_each_node(node)
@@ -6225,29 +5988,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
-static void __init mem_cgroup_soft_limit_tree_init(void)
-{
-	struct mem_cgroup_tree_per_node *rtpn;
-	struct mem_cgroup_tree_per_zone *rtpz;
-	int tmp, node, zone;
-
-	for_each_node(node) {
-		tmp = node;
-		if (!node_state(node, N_NORMAL_MEMORY))
-			tmp = -1;
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
-		BUG_ON(!rtpn);
-
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
-	}
-}
-
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup *cont)
 {
@@ -7040,7 +6780,6 @@ static int __init mem_cgroup_init(void)
 {
 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	enable_swap_cgroup();
-	mem_cgroup_soft_limit_tree_init();
 	memcg_stock_init();
 	return 0;
 }
-- 
1.7.10.4



* [patch -v4 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Soft reclaim has been done only for global reclaim so far (both
background and direct). Since "memcg: integrate soft reclaim tighter
with zone shrinking code" there is no reason for this limitation
anymore, as the soft limit reclaim doesn't use any special code paths
and is a part of the zone shrinking code, which is used by both global
and targeted reclaim.

From a semantic point of view it is even natural to consider the soft
limit before touching all groups in the hierarchy tree which is hitting
the hard limit, because the soft limit tells us where to push back when
there is memory pressure. It is not important whether the pressure
comes from the limit or from imbalanced zones.

This patch simply enables soft reclaim unconditionally in
mem_cgroup_should_soft_reclaim so it is enabled for both global and
targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
about the root of the reclaim to know where to stop checking soft limit
state of parents up the hierarchy.
Say we have
A (over soft limit)
 \
  B (below s.l., hit the hard limit)
 / \
C   D (below s.l.)

B is now the source of the outside memory pressure for D, but we
shouldn't soft reclaim D because it is behaving well under the B
subtree and we can still reclaim from C (presumably it is over the
limit).
mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
hierarchy at B (the root of the memory pressure).

Changes since v1
- add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h |    6 ++++--
 mm/memcontrol.c            |   14 +++++++++-----
 mm/vmscan.c                |    4 ++--
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1833c95..80ed1b6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root);
 
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root)
 {
 	return false;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14981be..9f8dd37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1852,11 +1852,13 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 #endif
 
 /*
- * A group is eligible for the soft limit reclaim if it is
- * 	a) is over its soft limit
+ * A group is eligible for the soft limit reclaim under the given root
+ * hierarchy if
+ * 	a) it is over its soft limit
  * 	b) any parent up the hierarchy is over its soft limit
  */
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root)
 {
 	struct mem_cgroup *parent = memcg;
 
@@ -1864,12 +1866,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
 		return true;
 
 	/*
-	 * If any parent up the hierarchy is over its soft limit then we
-	 * have to obey and reclaim from this group as well.
+	 * If any parent up to the root in the hierarchy is over its soft limit
+	 * then we have to obey and reclaim from this group as well.
 	 */
 	while((parent = parent_mem_cgroup(parent))) {
 		if (res_counter_soft_limit_excess(&parent->res))
 			return true;
+		if (parent == root)
+			break;
 	}
 
 	return false;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6682277..e21c1aa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc)
 
 static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
 {
-	return !mem_cgroup_disabled() && global_reclaim(sc);
+	return !mem_cgroup_disabled();
 }
 #else
 static bool global_reclaim(struct scan_control *sc)
@@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 			struct lruvec *lruvec;
 
 			if (soft_reclaim &&
-			    !mem_cgroup_soft_reclaim_eligible(memcg)) {
+			    !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
 				memcg = mem_cgroup_iter(root, memcg, &reclaim);
 				continue;
 			}
-- 
1.7.10.4



* [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

The caller of the iterator might know that some nodes or even subtrees
should be skipped, but there is no way to tell the iterator about that,
so the only choice left is to let the iterator visit each node and do
the selection outside of the iterating code. This, however, doesn't
scale well with hierarchies with many groups where only a few groups
are interesting.

This patch adds a mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node. There are three
possible ways the callback can influence the walk: the node is visited,
it is skipped but the tree walk continues down the tree, or the whole
subtree of the current group is skipped.
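
For illustration only - the filter below is a made-up example (the real
user is mem_cgroup_soft_reclaim_eligible in a later patch of this
series):

static enum mem_cgroup_filter_t
visit_root_only(struct mem_cgroup *memcg, struct mem_cgroup *root)
{
        /* visit the reclaim root itself but prune all of its descendants */
        return (memcg == root) ? VISIT : SKIP_TREE;
}

        struct mem_cgroup *memcg = NULL;

        /* NULL reclaim cookie - a plain full walk, pruned by the callback */
        while ((memcg = mem_cgroup_iter_cond(root, memcg, NULL, visit_root_only))) {
                /* only groups for which the filter returned VISIT get here */
        }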

TODO is it correct to reclaim + cond together? What if the cache simply
skips interesting nodes which another predicate would find interesting?

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |   40 +++++++++++++++++++++----
 mm/memcontrol.c            |   69 ++++++++++++++++++++++++++++++++++----------
 mm/vmscan.c                |   16 ++++------
 3 files changed, 92 insertions(+), 33 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 80ed1b6..811967a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -41,6 +41,15 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+enum mem_cgroup_filter_t {
+	VISIT,
+	SKIP,
+	SKIP_TREE,
+};
+
+typedef enum mem_cgroup_filter_t
+(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root);
+
 #ifdef CONFIG_MEMCG
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -107,9 +116,18 @@ mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 	struct page *oldpage, struct page *newpage, bool migration_ok);
 
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
-				   struct mem_cgroup *,
-				   struct mem_cgroup_reclaim_cookie *);
+struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
+				   struct mem_cgroup *prev,
+				   struct mem_cgroup_reclaim_cookie *reclaim,
+				   mem_cgroup_iter_filter cond);
+
+static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+				   struct mem_cgroup *prev,
+				   struct mem_cgroup_reclaim_cookie *reclaim)
+{
+	return mem_cgroup_iter_cond(root, prev, reclaim, NULL);
+}
+
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
 /*
@@ -179,7 +197,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
 
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
@@ -294,6 +313,14 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 		struct page *oldpage, struct page *newpage, bool migration_ok)
 {
 }
+static inline struct mem_cgroup *
+mem_cgroup_iter_cond(struct mem_cgroup *root,
+		struct mem_cgroup *prev,
+		struct mem_cgroup_reclaim_cookie *reclaim,
+		mem_cgroup_iter_filter cond)
+{
+	return NULL;
+}
 
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
@@ -357,10 +384,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
-	return false;
+	return VISIT;
 }
 
 static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f8dd37..da7ea80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -969,6 +969,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
+static enum mem_cgroup_filter_t
+mem_cgroup_filter(struct mem_cgroup *memcg, struct mem_cgroup *root,
+		mem_cgroup_iter_filter cond)
+{
+	if (!cond)
+		return VISIT;
+	return cond(memcg, root);
+}
+
 /*
  * Returns a next (in a pre-order walk) alive memcg (with elevated css
  * ref. count) or NULL if the whole root's subtree has been visited.
@@ -976,7 +985,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
  * helper function to be used by mem_cgroup_iter
  */
 static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
-		struct mem_cgroup *last_visited)
+		struct mem_cgroup *last_visited, mem_cgroup_iter_filter cond)
 {
 	struct cgroup *prev_cgroup, *next_cgroup;
 
@@ -984,10 +993,18 @@ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
 	 * Root is not visited by cgroup iterators so it needs an
 	 * explicit visit.
 	 */
-	if (!last_visited)
-		return root;
+	if (!last_visited) {
+		switch(mem_cgroup_filter(root, root, cond)) {
+		case VISIT:
+			return root;
+		case SKIP:
+			break;
+		case SKIP_TREE:
+			return NULL;
+		}
+	}
 
-	prev_cgroup = (last_visited == root) ? NULL
+	prev_cgroup = (last_visited == root || !last_visited) ? NULL
 		: last_visited->css.cgroup;
 skip_node:
 	next_cgroup = cgroup_next_descendant_pre(
@@ -1003,11 +1020,22 @@ skip_node:
 	if (next_cgroup) {
 		struct mem_cgroup *mem = mem_cgroup_from_cont(
 				next_cgroup);
-		if (css_tryget(&mem->css))
-			return mem;
-		else {
+
+		switch (mem_cgroup_filter(mem, root, cond)) {
+		case SKIP:
 			prev_cgroup = next_cgroup;
 			goto skip_node;
+		case SKIP_TREE:
+			prev_cgroup = cgroup_rightmost_descendant(next_cgroup);
+			goto skip_node;
+		case VISIT:
+			if (css_tryget(&mem->css))
+				return mem;
+			else {
+				prev_cgroup = next_cgroup;
+				goto skip_node;
+			}
+			break;
 		}
 	}
 
@@ -1019,6 +1047,7 @@ skip_node:
  * @root: hierarchy root
  * @prev: previously returned memcg, NULL on first invocation
  * @reclaim: cookie for shared reclaim walks, NULL for full walks
+ * @cond: filter for visited nodes, NULL for no filter
  *
  * Returns references to children of the hierarchy below @root, or
  * @root itself, or %NULL after a full round-trip.
@@ -1031,9 +1060,10 @@ skip_node:
  * divide up the memcgs in the hierarchy among all concurrent
  * reclaimers operating on the same zone and priority.
  */
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
 				   struct mem_cgroup *prev,
-				   struct mem_cgroup_reclaim_cookie *reclaim)
+				   struct mem_cgroup_reclaim_cookie *reclaim,
+				   mem_cgroup_iter_filter cond)
 {
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup *last_visited = NULL;
@@ -1051,7 +1081,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (!root->use_hierarchy && root != root_mem_cgroup) {
 		if (prev)
 			goto out_css_put;
-		return root;
+		if (mem_cgroup_filter(root, root, cond) == VISIT)
+			return root;
+		return NULL;
 	}
 
 	rcu_read_lock();
@@ -1094,7 +1126,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			}
 		}
 
-		memcg = __mem_cgroup_iter_next(root, last_visited);
+		memcg = __mem_cgroup_iter_next(root, last_visited, cond);
 
 		if (reclaim) {
 			if (last_visited)
@@ -1110,7 +1142,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				reclaim->generation = iter->generation;
 		}
 
-		if (prev && !memcg)
+		/*
+		 * We have finished the whole tree walk or no group has been
+		 * visited because filter told us to skip the root node.
+		 */
+		if (!memcg && (prev || (cond && !last_visited)))
 			goto out_unlock;
 	}
 out_unlock:
@@ -1857,13 +1893,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
  * 	a) it is over its soft limit
  * 	b) any parent up the hierarchy is over its soft limit
  */
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
 	struct mem_cgroup *parent = memcg;
 
 	if (res_counter_soft_limit_excess(&memcg->res))
-		return true;
+		return VISIT;
 
 	/*
 	 * If any parent up to the root in the hierarchy is over its soft limit
@@ -1871,12 +1908,12 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 	 */
 	while((parent = parent_mem_cgroup(parent))) {
 		if (res_counter_soft_limit_excess(&parent->res))
-			return true;
+			return VISIT;
 		if (parent == root)
 			break;
 	}
 
-	return false;
+	return SKIP;
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e21c1aa..10bcbc2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1964,21 +1964,16 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 			.zone = zone,
 			.priority = sc->priority,
 		};
-		struct mem_cgroup *memcg;
+		struct mem_cgroup *memcg = NULL;
+		mem_cgroup_iter_filter filter = (soft_reclaim) ?
+			mem_cgroup_soft_reclaim_eligible : NULL;
 
 		nr_reclaimed = sc->nr_reclaimed;
 		nr_scanned = sc->nr_scanned;
 
-		memcg = mem_cgroup_iter(root, NULL, &reclaim);
-		do {
+		while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
 			struct lruvec *lruvec;
 
-			if (soft_reclaim &&
-			    !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
-				memcg = mem_cgroup_iter(root, memcg, &reclaim);
-				continue;
-			}
-
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
@@ -1998,8 +1993,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 				mem_cgroup_iter_break(root, memcg);
 				break;
 			}
-			memcg = mem_cgroup_iter(root, memcg, &reclaim);
-		} while (memcg);
+		}
 
 		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
 			   sc->nr_scanned - nr_scanned,
-- 
1.7.10.4



* [patch -v4 5/8] memcg: track children in soft limit excess to improve soft limit
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Soft limit reclaim has to check the whole reclaim hierarchy while doing
the first pass of the reclaim. This leads to a higher system time which
can be visible especially when there are many groups in the hierarchy.

This patch adds a per-memcg counter of children in excess. It also
restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit
for proper batching.
If a group crosses its soft limit for the first time it increases its
parents' children_in_excess up the hierarchy. Similarly, if a group
gets back below the limit it decreases the counter. The transition
phase is recorded in the soft_contributed flag.

mem_cgroup_soft_reclaim_eligible then uses this information to better
decide whether to skip only the node or the whole subtree. The rule is
simple: skip the node if it has children in excess (so the walk can
continue down to them), otherwise skip the whole subtree.
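
Roughly, the predicate then becomes something like the sketch below
(children_in_excess is the counter described above; treating it as an
atomic_t member of struct mem_cgroup is an assumption of this sketch,
see the patch itself for the exact implementation):

enum mem_cgroup_filter_t
mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
                struct mem_cgroup *root)
{
        struct mem_cgroup *parent = memcg;

        /* the group itself is in excess - reclaim it */
        if (res_counter_soft_limit_excess(&memcg->res))
                return VISIT;

        /* any parent up to the reclaim root in excess forces reclaim too */
        while ((parent = parent_mem_cgroup(parent))) {
                if (res_counter_soft_limit_excess(&parent->res))
                        return VISIT;
                if (parent == root)
                        break;
        }

        /*
         * Nobody above us is in excess. If some child down the tree is,
         * skip only this node so the walk can reach it, otherwise the
         * whole subtree can be skipped.
         */
        if (atomic_read(&memcg->children_in_excess) > 0)
                return SKIP;
        return SKIP_TREE;
}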

This has been tested by a stream IO (dd if=/dev/zero of=file with
4*MemTotal size) which is quite sensitive to overhead during reclaim.
The load is running in a group, once with the soft limit set to 0 and
once without any limit. Apart from that there was a hierarchy with
~500, 2k and 8k groups (two groups on each level) without any pages in
them. base denotes the kernel the whole series is based on, rework is
the kernel before this patch and reworkoptim is with this patch
applied:

* Run with soft limit set to 0
Elapsed
0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
System
0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
Elapsed
0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
System
2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
Elapsed
2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
System
8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
Elapsed
8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3

System time is increased by 30-40% but it is reduced a lot compared to
the kernel without this patch. The higher System time can be explained
by the fact that the original soft reclaim scanned at priority 0, so it
was much more effective for this workload (which is basically touch
once and writeback). The Elapsed time looks better though (~20%).

* Run with no soft limit set
System
0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
Elapsed
0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
System
0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
Elapsed
0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
System
2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
Elapsed
2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
System
8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
Elapsed
8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3

Both System and Elapsed are within stddev of the base kernel for all
configurations except for 8k, where both System and Elapsed are up by
~35%. I do not have a good explanation for this because there is no soft
reclaim pass going on, as no group is above the limit, which is checked
in mem_cgroup_should_soft_reclaim.

Then I have tested a kernel build with the same configuration to see
the behavior with a more general workload.

* Soft limit set to 0 for the build
System
0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
Elapsed
0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
System
0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
Elapsed
0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
System
2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
Elapsed
2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
System
8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
Elapsed
8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3

Apart from 8k, the system time is comparable with the base kernel, while
Elapsed is up to 20% better with all configurations.

* No soft limit set
System
0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
Elapsed
0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
System
0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
Elapsed
0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
System
2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
Elapsed
2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
System
8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
Elapsed
8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3

and these numbers look good as well. System time is around 100% of the
base (surprisingly better for the 8k case) and Elapsed copies that
trend.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da7ea80..90495d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,6 +136,7 @@ static const char * const mem_cgroup_lru_names[] = {
  */
 enum mem_cgroup_events_target {
 	MEM_CGROUP_TARGET_THRESH,
+	MEM_CGROUP_TARGET_SOFTLIMIT,
 	MEM_CGROUP_TARGET_NUMAINFO,
 	MEM_CGROUP_NTARGETS,
 };
@@ -355,6 +356,10 @@ struct mem_cgroup {
 	atomic_t	numainfo_updating;
 #endif
 
+	spinlock_t soft_lock;
+	bool soft_contributed;
+	atomic_t children_in_excess;
+
 	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
@@ -890,6 +895,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 		case MEM_CGROUP_TARGET_THRESH:
 			next = val + THRESHOLDS_EVENTS_TARGET;
 			break;
+		case MEM_CGROUP_TARGET_SOFTLIMIT:
+			next = val + SOFTLIMIT_EVENTS_TARGET;
+			break;
 		case MEM_CGROUP_TARGET_NUMAINFO:
 			next = val + NUMAINFO_EVENTS_TARGET;
 			break;
@@ -902,6 +910,34 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 	return false;
 }
 
+static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
+{
+	unsigned long long excess = res_counter_soft_limit_excess(&memcg->res);
+	struct mem_cgroup *parent = memcg;
+	int delta = 0;
+
+	spin_lock(&memcg->soft_lock);
+	if (excess) {
+		if (!memcg->soft_contributed) {
+			delta = 1;
+			memcg->soft_contributed = true;
+		}
+	} else {
+		if (memcg->soft_contributed) {
+			delta = -1;
+			memcg->soft_contributed = false;
+		}
+	}
+
+	/*
+	 * Necessary to update all ancestors when hierarchy is used
+	 * because their event counter is not touched.
+	 */
+	while (delta && (parent = parent_mem_cgroup(parent)))
+		atomic_add(delta, &parent->children_in_excess);
+	spin_unlock(&memcg->soft_lock);
+}
+
 /*
  * Check events in order.
  *
@@ -912,8 +948,11 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
+		bool do_softlimit;
 		bool do_numainfo __maybe_unused;
 
+		do_softlimit = mem_cgroup_event_ratelimit(memcg,
+						MEM_CGROUP_TARGET_SOFTLIMIT);
 #if MAX_NUMNODES > 1
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
@@ -921,6 +960,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		preempt_enable();
 
 		mem_cgroup_threshold(memcg);
+		if (unlikely(do_softlimit))
+			mem_cgroup_update_soft_limit(memcg);
 #if MAX_NUMNODES > 1
 		if (unlikely(do_numainfo))
 			atomic_inc(&memcg->numainfo_events);
@@ -1892,6 +1933,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
  * hierarchy if
  * 	a) it is over its soft limit
  * 	b) any parent up the hierarchy is over its soft limit
+ *
+ * If the given group doesn't have any children over the limit then it
+ * doesn't make any sense to iterate its subtree.
  */
 enum mem_cgroup_filter_t
 mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
@@ -1913,6 +1957,8 @@ mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 			break;
 	}
 
+	if (!atomic_read(&memcg->children_in_excess))
+		return SKIP_TREE;
 	return SKIP;
 }
 
@@ -6059,6 +6105,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 	mutex_init(&memcg->thresholds_lock);
 	spin_lock_init(&memcg->move_lock);
 	vmpressure_init(&memcg->vmpressure);
+	spin_lock_init(&memcg->soft_lock);
 
 	return &memcg->css;
 
@@ -6148,6 +6195,10 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
+	if (memcg->soft_contributed) {
+		while ((memcg = parent_mem_cgroup(memcg)))
+			atomic_dec(&memcg->children_in_excess);
+	}
 	mem_cgroup_destroy_all_caches(memcg);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [patch -v4 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything
  2013-06-03 10:18 [patch v4] Soft limit rework Michal Hocko
                   ` (4 preceding siblings ...)
  2013-06-03 10:18 ` [patch -v4 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko
@ 2013-06-03 10:18 ` Michal Hocko
  2013-06-03 10:18 ` [patch -v4 7/8] memcg: Track all children over limit in the root Michal Hocko
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass
is done, and it currently always says yes. The memcg iterators are
already quite efficient at skipping nodes that are not soft
reclaimable, but mem_cgroup_should_soft_reclaim can be smarter still
and not start the soft reclaim pass at all if it knows that nothing
would be scanned anyway.

In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
the target group of the reclaim and allow the pass only if the whole
subtree wouldn't be skipped.

Changes since v1
- do not export mem_cgroup_root and teach mem_cgroup_soft_reclaim_eligible
  to handle NULL memcg as mem_cgroup_root

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |    6 +++++-
 mm/vmscan.c     |    4 +++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90495d5..8ff9366 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1941,7 +1941,11 @@ enum mem_cgroup_filter_t
 mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
-	struct mem_cgroup *parent = memcg;
+	struct mem_cgroup *parent;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
+	parent = memcg;
 
 	if (res_counter_soft_limit_excess(&memcg->res))
 		return VISIT;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10bcbc2..dc78f07 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -142,7 +142,9 @@ static bool global_reclaim(struct scan_control *sc)
 
 static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
 {
-	return !mem_cgroup_disabled();
+	struct mem_cgroup *root = sc->target_mem_cgroup;
+	return !mem_cgroup_disabled() &&
+		mem_cgroup_soft_reclaim_eligible(root, root) != SKIP_TREE;
 }
 #else
 static bool global_reclaim(struct scan_control *sc)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [patch -v4 7/8] memcg: Track all children over limit in the root
  2013-06-03 10:18 [patch v4] Soft limit rework Michal Hocko
                   ` (5 preceding siblings ...)
  2013-06-03 10:18 ` [patch -v4 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko
@ 2013-06-03 10:18 ` Michal Hocko
  2013-06-03 10:18 ` [patch -v4 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Children in soft limit excess are currently tracked up the hierarchy
in memcg->children_in_excess. Nevertheless there still might exist
tons of groups that are not in a hierarchical relation to the root
cgroup (e.g. all first level groups if root_mem_cgroup->use_hierarchy
== false).

As the whole tree walk has to be done when the iteration starts at
root_mem_cgroup, the iterator should be able to skip the walk if there
is no child above the limit, without iterating over all of them. This
can be done easily if the root tracks all children rather than only
hierarchical children. This is done by this patch, which updates
root_mem_cgroup's children_in_excess if root_mem_cgroup->use_hierarchy
== false so the root knows about all children in excess.

Please note that this is not an issue for inner memcgs which have
use_hierarchy == false because then only the single group is visited so
no special optimization is necessary.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ff9366..91740f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -932,9 +932,15 @@ static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
 	/*
 	 * Necessary to update all ancestors when hierarchy is used
 	 * because their event counter is not touched.
+	 * We track children even outside the hierarchy for the root
+	 * cgroup because a tree walk starting at root should visit
+	 * all cgroups and we want to prevent a pointless tree
+	 * walk if no child is above the limit.
 	 */
 	while (delta && (parent = parent_mem_cgroup(parent)))
 		atomic_add(delta, &parent->children_in_excess);
+	if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy)
+		atomic_add(delta, &root_mem_cgroup->children_in_excess);
 	spin_unlock(&memcg->soft_lock);
 }
 
@@ -6202,6 +6208,9 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
 	if (memcg->soft_contributed) {
 		while ((memcg = parent_mem_cgroup(memcg)))
 			atomic_dec(&memcg->children_in_excess);
+
+		if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy)
+			atomic_dec(&root_mem_cgroup->children_in_excess);
 	}
 	mem_cgroup_destroy_all_caches(memcg);
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [patch -v4 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly
  2013-06-03 10:18 [patch v4] Soft limit rework Michal Hocko
                   ` (6 preceding siblings ...)
  2013-06-03 10:18 ` [patch -v4 7/8] memcg: Track all children over limit in the root Michal Hocko
@ 2013-06-03 10:18 ` Michal Hocko
  2013-06-04 16:27 ` [patch v4] Soft limit rework Balbir Singh
  2013-06-11 15:43 ` Michal Hocko
  9 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-03 10:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

shrink_zone starts with the soft reclaim pass first and then falls back
to regular reclaim if nothing has been scanned. This behavior is natural
but there is a catch. Memcg iterators, when used with the reclaim
cookie, are designed to help prevent over-reclaim by interleaving
reclaimers (per node-zone-priority), so the tree walk might miss many
(even all) nodes in the hierarchy, e.g. when there are direct reclaimers
racing with each other or with kswapd in the global case, or multiple
allocators reaching the limit in the targeted reclaim case.
To make it even more complicated, targeted reclaim doesn't do the whole
tree walk because it stops reclaiming once it has reclaimed sufficient
pages. As a result groups over the limit might be missed, nothing is
scanned, and reclaim would fall back to the reclaim-all mode.

This patch checks for the incomplete tree walk in shrink_zone. If no
group has been visited and the hierarchy is soft reclaimable then we
must have missed some groups, in which case __shrink_zone is called
again. This doesn't guarantee there will be any progress, of course,
because the current reclaimer might still be racing with others, but it
at least gives a chance to restart the walk without a big risk of
reclaim latencies.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc78f07..72b428d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1955,10 +1955,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static void
+static int
 __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 {
 	unsigned long nr_reclaimed, nr_scanned;
+	int groups_scanned = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1976,6 +1977,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 		while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
 			struct lruvec *lruvec;
 
+			groups_scanned++;
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
@@ -2003,6 +2005,8 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
 
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
+
+	return groups_scanned;
 }
 
 
@@ -2010,8 +2014,19 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
 	unsigned long nr_scanned = sc->nr_scanned;
+	int scanned_groups;
 
-	__shrink_zone(zone, sc, do_soft_reclaim);
+	scanned_groups = __shrink_zone(zone, sc, do_soft_reclaim);
+	/*
+	 * The memcg iterator might race with other reclaimers or start
+	 * from an incomplete tree walk so the tree walk in __shrink_zone
+	 * might have missed groups that are above the soft limit. Try
+	 * another loop to catch up with others. Do it just once to
+	 * prevent reclaim latencies when other reclaimers always
+	 * preempt this one.
+	 */
+	if (do_soft_reclaim && !scanned_groups)
+		__shrink_zone(zone, sc, do_soft_reclaim);
 
 	/*
 	 * No group is over the soft limit or those that are do not have
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-03 10:18 ` [patch -v4 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko
@ 2013-06-04  1:07   ` Tejun Heo
  2013-06-04 13:45     ` Michal Hocko
  2013-06-10  7:48   ` Michal Hocko
  1 sibling, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-04  1:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Mon, Jun 03, 2013 at 12:18:51PM +0200, Michal Hocko wrote:
> The caller of the iterator might know that some nodes or even subtrees
> should be skipped but there is no way to tell iterators about that so
> the only choice left is to let iterators to visit each node and do the
> selection outside of the iterating code. This, however, doesn't scale
> well with hierarchies with many groups where only few groups are
> interesting.
> 
> This patch adds mem_cgroup_iter_cond variant of the iterator with a
> callback which gets called for every visited node. There are three
> possible ways how the callback can influence the walk. Either the node
> is visited, it is skipped but the tree walk continues down the tree or
> the whole subtree of the current group is skipped.
> 
> TODO is it correct to reclaim + cond together? What if the cache simply
> skips interesting nodes which another predicate would find interesting?

I don't know.  Maybe it's just taste but it looks pretty ugly to me.
Why does everything have to be folded into the iteration function?
The iteration only depends on the current position.  Can't you factor
out the skipping part outside the function rather than rolling it into
this monster of a thing with a predicate callback?  Just test the
condition outside and call a function to skip whatever is necessary?

Also, cgroup_rightmost_descendant() can be pretty expensive depending
on what your tree looks like.  It travels to the rightmost child at
each level until it can't.  In extreme cases, you can travel the whole
subtree.  This usually isn't a problem for its usecases but yours may
be different, I'm not sure.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-04  1:07   ` Tejun Heo
@ 2013-06-04 13:45     ` Michal Hocko
  2013-06-04 19:36       ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-04 13:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Mon 03-06-13 18:07:37, Tejun Heo wrote:
> On Mon, Jun 03, 2013 at 12:18:51PM +0200, Michal Hocko wrote:
> > The caller of the iterator might know that some nodes or even subtrees
> > should be skipped but there is no way to tell iterators about that so
> > the only choice left is to let iterators to visit each node and do the
> > selection outside of the iterating code. This, however, doesn't scale
> > well with hierarchies with many groups where only few groups are
> > interesting.
> > 
> > This patch adds mem_cgroup_iter_cond variant of the iterator with a
> > callback which gets called for every visited node. There are three
> > possible ways how the callback can influence the walk. Either the node
> > is visited, it is skipped but the tree walk continues down the tree or
> > the whole subtree of the current group is skipped.
> > 
> > TODO is it correct to reclaim + cond together? What if the cache simply
> > skips interesting nodes which another predicate would find interesting?
> 
> I don't know.  Maybe it's just taste but it looks pretty ugly to me.
> Why does everything have to be folded into the iteration function?

There are basically two options: factor out the skipping logic into
something like memcg_cgroup_iter_skip_node and
memcg_cgroup_iter_skip_tree, or bundle the generic predicate into the
iterators.
I find the in-iterator version more convenient to use because the caller
doesn't have to handle the skip cases explicitly. All the users would
have to do the same thing (which is now hidden in the iterator) anyway.
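
Roughly, the two call patterns would look like this (an illustrative
sketch only - every identifier below is made up, this is not the real
memcg interface):

/*
 * Sketch of the two call patterns; all names are made up for
 * illustration and do not correspond to the real memcg API.
 */
struct group;

struct group *iter_next(struct group *root, struct group *prev);
struct group *iter_next_cond(struct group *root, struct group *prev,
			     int (*eligible)(struct group *grp));
void iter_skip_subtree(struct group *root, struct group *prev);
int soft_reclaim_eligible(struct group *grp);
void reclaim_one(struct group *grp);

/* Option 1: every caller open-codes the skip handling. */
void walk_with_caller_side_skipping(struct group *root)
{
	struct group *grp = NULL;

	while ((grp = iter_next(root, grp))) {
		if (!soft_reclaim_eligible(grp)) {
			/* the caller has to know how to skip the subtree */
			iter_skip_subtree(root, grp);
			continue;
		}
		reclaim_one(grp);
	}
}

/* Option 2: the predicate is handed to the iterator, callers stay simple. */
void walk_with_in_iterator_skipping(struct group *root)
{
	struct group *grp = NULL;

	while ((grp = iter_next_cond(root, grp, soft_reclaim_eligible)))
		reclaim_one(grp);
}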

Besides that it helps reduce the memcg code in vmscan (which we try to
keep to a minimum). I already feel guilty for the code this patch set
adds.

Is this something that you find serious enough to block this series?
I do not want to push hard but I would like to settle with something
finally. This is taking way longer than I would like.

> The iteration only depends on the current position.  Can't you factor
> out skipping part outside the function rather than rolling into this
> monstery thing with predicate callback?  Just test the condition
> outside and call a function to skip whatever is necessary?
> 
> Also, cgroup_rightmost_descendant() can be pretty expensive depending
> on how your tree looks like. 

I have no problem using something else. This was just the easiest to
use and it behaves more-or-less well for hierarchies which are more or
less balanced. If this turns out to be a problem we can introduce a new
cgroup_skip_subtree which would go to last->sibling or walk up the
parent chain until there is a non-NULL sibling. But what would be the
next selling point here if we made it perfect right now? ;)
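
The skip itself is simple in generic form; here is a toy userspace
sketch of that idea (a plain tree structure, nothing to do with the
real cgroup internals or locking):

#include <stdio.h>
#include <stddef.h>

/* Toy tree node with the three links a pre-order walk needs. */
struct node {
	const char *name;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/*
 * "Skip the whole subtree under pos": take pos's next sibling if there
 * is one, otherwise climb the parent chain (stopping at root) until an
 * ancestor has a sibling.  Returns NULL when the walk is finished.
 */
static struct node *skip_subtree(struct node *pos, struct node *root)
{
	while (pos && pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

int main(void)
{
	struct node root = { "root", NULL, NULL, NULL };
	struct node a    = { "A",  &root, NULL, NULL };
	struct node a1   = { "A1", &a,    NULL, NULL };
	struct node b    = { "B",  &root, NULL, NULL };

	root.first_child = &a;
	a.next_sibling = &b;
	a.first_child = &a1;

	/* Skipping A's subtree jumps straight to B, not into A1. */
	printf("next after skipping A: %s\n", skip_subtree(&a, &root)->name);
	return 0;
}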

> It travels to the rightmost child at
> each level until it can't.  In extreme cases, you can travel the whole
> subtree.  This usually isn't a problem for its usecases but yours may
> be different, I'm not sure.

But that would only be the case if the hierarchy were right-wing heavy
(aka politically incorrect for some :P)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch v4] Soft limit rework
  2013-06-03 10:18 [patch v4] Soft limit rework Michal Hocko
                   ` (7 preceding siblings ...)
  2013-06-03 10:18 ` [patch -v4 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko
@ 2013-06-04 16:27 ` Balbir Singh
  2013-06-04 16:38   ` Michal Hocko
  2013-06-11 15:43 ` Michal Hocko
  9 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2013-06-04 16:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Tejun Heo

On Mon, Jun 3, 2013 at 3:48 PM, Michal Hocko <mhocko@suse.cz> wrote:
> Hi,
>
> This is the fourth version of the patchset.
>
> Summary of versions:
> The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973
> (lkml wasn't CCed at the time so I cannot find it in lwn.net
> archives). There were no major objections. The second version
> has been posted here http://lwn.net/Articles/548191/ as a part
> of a longer and spicier thread which started after LSF here:
> https://lwn.net/Articles/548192/
> Version number 3 has been posted here http://lwn.net/Articles/550409/
> Johannes was worried about setups with thousands of memcgs and the
> tree walk overhead for the soft reclaim pass without anybody in excess.
>
> Changes between RFC (aka V1) -> V2
> As there were no major objections there were only some minor cleanups
> since the last version and I have moved "memcg: Ignore soft limit until
> it is explicitly specified" to the end of the series.
>
> Changes between V2 -> V3
> No changes in the code since the last version. I have just rebased the
> series on top of the current mmotm tree. The most controversial part
> has been dropped (the last patch "memcg: Ignore soft limit until it is
> explicitly specified") so there are no semantical changes to the soft
> limit behavior. This makes this work mostly a code clean up and code
> reorganization. Nevertheless, this is enough to make the soft limit work
> more efficiently according to my testing and groups above the soft limit
> are reclaimed much less as a result.
>
> Changes between V3->V4
> Added some Reviewed-bys but the biggest change comes from Johannes
> concern about the tree traversal overhead with a huge number of memcgs
> (http://thread.gmane.org/gmane.linux.kernel.cgroups/7307/focus=100326)
> and this version addresses this problem by augmenting the memcg tree
> with the number of over soft limit children at each level of the
> hierarchy. See more bellow.
>
> The basic idea is quite simple. Pull soft reclaim into shrink_zone in
> the first step and get rid of the previous soft reclaim infrastructure.
> shrink_zone is done in two passes now. First it tries to do the soft
> limit reclaim and it falls back to reclaim-all mode if no group is over
> the limit or no pages have been scanned. The second pass happens at the
> same priority so the only time we waste is the memcg tree walk which
> has been updated in the third step to have only negligible overhead.
>

Hi, Michal

I've just looked at this (I am yet to review the series), but the
intention of the changes does not come across clearly. Or maybe I am
quite outdated on the subject :)

Balbir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch v4] Soft limit rework
  2013-06-04 16:27 ` [patch v4] Soft limit rework Balbir Singh
@ 2013-06-04 16:38   ` Michal Hocko
  2013-06-04 17:57     ` Balbir Singh
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-04 16:38 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Tejun Heo

On Tue 04-06-13 21:57:56, Balbir Singh wrote:
> On Mon, Jun 3, 2013 at 3:48 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > Hi,
> >
> > This is the fourth version of the patchset.
> >
> > Summary of versions:
> > The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973
> > (lkml wasn't CCed at the time so I cannot find it in lwn.net
> > archives). There were no major objections. The second version
> > has been posted here http://lwn.net/Articles/548191/ as a part
> > of a longer and spicier thread which started after LSF here:
> > https://lwn.net/Articles/548192/
> > Version number 3 has been posted here http://lwn.net/Articles/550409/
> > Johannes was worried about setups with thousands of memcgs and the
> > tree walk overhead for the soft reclaim pass without anybody in excess.
> >
> > Changes between RFC (aka V1) -> V2
> > As there were no major objections there were only some minor cleanups
> > since the last version and I have moved "memcg: Ignore soft limit until
> > it is explicitly specified" to the end of the series.
> >
> > Changes between V2 -> V3
> > No changes in the code since the last version. I have just rebased the
> > series on top of the current mmotm tree. The most controversial part
> > has been dropped (the last patch "memcg: Ignore soft limit until it is
> > explicitly specified") so there are no semantical changes to the soft
> > limit behavior. This makes this work mostly a code clean up and code
> > reorganization. Nevertheless, this is enough to make the soft limit work
> > more efficiently according to my testing and groups above the soft limit
> > are reclaimed much less as a result.
> >
> > Changes between V3->V4
> > Added some Reviewed-bys but the biggest change comes from Johannes
> > concern about the tree traversal overhead with a huge number of memcgs
> > (http://thread.gmane.org/gmane.linux.kernel.cgroups/7307/focus=100326)
> > and this version addresses this problem by augmenting the memcg tree
> > with the number of over soft limit children at each level of the
> > hierarchy. See more bellow.
> >
> > The basic idea is quite simple. Pull soft reclaim into shrink_zone in
> > the first step and get rid of the previous soft reclaim infrastructure.
> > shrink_zone is done in two passes now. First it tries to do the soft
> > limit reclaim and it falls back to reclaim-all mode if no group is over
> > the limit or no pages have been scanned. The second pass happens at the
> > same priority so the only time we waste is the memcg tree walk which
> > has been updated in the third step to have only negligible overhead.
> >
> 
> Hi, Michal
> 
> I've just looked at this (I am yet to review the series), but the
> intention of the changes do not read out clearly. Or may be I quite
> outdated on the subject :)

OK, let me summarize. The primary intention is to get rid of the
current soft reclaim infrastructure, which basically bypasses the
standard reclaim, and tie soft reclaim directly into the shrink_zone
code. This also means that soft reclaim no longer reclaims at priority 0
and that it is also active for the targeted (aka limit) reclaim.
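
Schematically, the reworked flow looks something like this (a
simplified sketch of the idea only, not the exact code):

static void shrink_zone_sketch(struct zone *zone, struct scan_control *sc)
{
	unsigned long nr_scanned = sc->nr_scanned;

	/* first pass: visit only groups in soft limit excess */
	__shrink_zone(zone, sc, true);

	/*
	 * fall back to reclaim-all mode, still at the same priority,
	 * if the soft limit pass scanned nothing
	 */
	if (sc->nr_scanned == nr_scanned)
		__shrink_zone(zone, sc, false);
}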

Does this help?
 
> Balbir

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch v4] Soft limit rework
  2013-06-04 16:38   ` Michal Hocko
@ 2013-06-04 17:57     ` Balbir Singh
  2013-06-04 18:08       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2013-06-04 17:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Tejun Heo

> OK, let me summarize. The primary intention is to get rid of the current
> soft reclaim infrastructure which basically bypasses the standard
> reclaim and tight it directly into shrink_zone code. This also means
> that the soft reclaim doesn't reclaim at priority 0 and that it is
> active also for the targeted (aka limit) reclaim.
>
> Does this help?
>

Yes. What are the limitations of no-priority 0 reclaim? I'll also look
at the patches

Thanks,
Balbir Singh.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch v4] Soft limit rework
  2013-06-04 17:57     ` Balbir Singh
@ 2013-06-04 18:08       ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-04 18:08 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Tejun Heo

On Tue 04-06-13 23:27:05, Balbir Singh wrote:
> > OK, let me summarize. The primary intention is to get rid of the current
> > soft reclaim infrastructure which basically bypasses the standard
> > reclaim and tight it directly into shrink_zone code. This also means
> > that the soft reclaim doesn't reclaim at priority 0 and that it is
> > active also for the targeted (aka limit) reclaim.
> >
> > Does this help?
> >
> 
> Yes. What are the limitations of no-priority 0 reclaim?

I am not sure I understand the question. What do you mean by
limitations?

The priority-0 scan was always a crude hack. With a lot of pages on
the LRU it might cause huge stalls during direct reclaim. There are
workloads which benefited from such an aggressive reclaim - e.g.
streaming IO - but that doesn't justify this kind of reclaim.
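
To illustrate why priority 0 is so aggressive: each reclaim pass scans
roughly lru_size >> priority pages, so priority 0 means "scan the whole
LRU in one go". A simplified standalone model (not the vmscan code):

#include <stdio.h>

int main(void)
{
	unsigned long lru_size = 1UL << 20;	/* 1M pages on the LRU */
	int priority;

	/* DEF_PRIORITY is 12; a lower priority means a bigger scan target */
	for (priority = 12; priority >= 0; priority--)
		printf("priority %2d -> scan target ~%lu pages\n",
		       priority, lru_size >> priority);
	return 0;
}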
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-04 13:45     ` Michal Hocko
@ 2013-06-04 19:36       ` Tejun Heo
  2013-06-04 20:48         ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-04 19:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

Hey, Michal.

On Tue, Jun 04, 2013 at 03:45:23PM +0200, Michal Hocko wrote:
> Is this something that you find serious enough to block this series?
> I do not want to push hard but I would like to settle with something
> finally. This is taking way longer than I would like.

I really don't think memcg can afford to add more mess than there
already is.  Let's try to get things right with each change, please.
Can we please see what the other approach would look like?  I have a
suspicion that it's likely to be simpler but the devil is in the
details and all...

> > The iteration only depends on the current position.  Can't you factor
> > out skipping part outside the function rather than rolling into this
> > monstery thing with predicate callback?  Just test the condition
> > outside and call a function to skip whatever is necessary?
> > 
> > Also, cgroup_rightmost_descendant() can be pretty expensive depending
> > on how your tree looks like. 
> 
> I have no problem using something else. This was just the easiest to
> use and it behaves more-or-less good for hierarchies which are more or
> less balanced. If this turns out to be a problem we can introduce a
> new cgroup_skip_subtree which would get to last->sibling or go up the
> parent chain until there is non-NULL sibling. But what would be the next
> selling point here if we made it perfect right now? ;)

Yeah, sure thing.  I was just worried because the skipping here might
not be as good as the code seems to indicate.  There will be cases,
which aren't too uncommon, where the skipping doesn't save much
compared to just continuing the pre-order walk, so....  And nobody
would really notice it unless [s]he looks really hard for it, which is
the more worrisome part for me.  Maybe just stick a comment there
explaining that we probably want something better in the future?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-04 19:36       ` Tejun Heo
@ 2013-06-04 20:48         ` Michal Hocko
  2013-06-04 20:54           ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-04 20:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Tue 04-06-13 12:36:19, Tejun Heo wrote:
> Hey, Michal.
> 
> On Tue, Jun 04, 2013 at 03:45:23PM +0200, Michal Hocko wrote:
> > Is this something that you find serious enough to block this series?
> > I do not want to push hard but I would like to settle with something
> > finally. This is taking way longer than I would like.
> 
> I really don't think memcg can afford to add more mess than there
> already is.  Let's try to get things right with each change, please.

Is this really about inside vs. outside skipping? I think this is a
general improvement to the code. I really prefer not duplicating common
code, and the skip handling is exactly such code (we have a visitor
which can control the walk). A side bonus is that it doesn't pollute
vmscan more than necessary.

Please be more specific about _what_ is so ugly about this interface
that it matters so much.

> Can we please see how the other approach would look like?  I have a
> suspicion that it's likely be simpler but the devils are in the
> details and all...
>
> > > The iteration only depends on the current position.  Can't you factor
> > > out skipping part outside the function rather than rolling into this
> > > monstery thing with predicate callback?  Just test the condition
> > > outside and call a function to skip whatever is necessary?
> > > 
> > > Also, cgroup_rightmost_descendant() can be pretty expensive depending
> > > on how your tree looks like. 
> > 
> > I have no problem using something else. This was just the easiest to
> > use and it behaves more-or-less good for hierarchies which are more or
> > less balanced. If this turns out to be a problem we can introduce a
> > new cgroup_skip_subtree which would get to last->sibling or go up the
> > parent chain until there is non-NULL sibling. But what would be the next
> > selling point here if we made it perfect right now? ;)
> 
> Yeah, sure thing.  I was just worried because the skipping here might
> not be as good as the code seems to indicate.  There will be cases,
> which aren't too uncommon, where the skipping doesn't save much
> compared to just continuing the pre-order walk, so....  And nobody
> would really notice it unless [s]he looks really hard for it, which is
> the more worrisome part for me.  Maybe just stick a comment there
> explaining that we probably want something better in the future?

Sure thing. I will stick a comment there:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91740f7..43e955a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1073,6 +1073,14 @@ skip_node:
 			prev_cgroup = next_cgroup;
 			goto skip_node;
 		case SKIP_TREE:
+			/*
+			 * cgroup_rightmost_descendant is not an optimal way to
+			 * skip through a subtree (especially for imbalanced
+			 * trees leaning to the right) but that's what we have
+			 * right now. A more effective solution would be to
+			 * traverse right and up to the first non-NULL sibling
+			 * without calling cgroup_next_descendant_pre afterwards.
+			 */
 			prev_cgroup = cgroup_rightmost_descendant(next_cgroup);
 			goto skip_node;
 		case VISIT:

Better?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-04 20:48         ` Michal Hocko
@ 2013-06-04 20:54           ` Tejun Heo
  2013-06-05  7:37             ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-04 20:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

Hey,

On Tue, Jun 04, 2013 at 10:48:07PM +0200, Michal Hocko wrote:
> > I really don't think memcg can afford to add more mess than there
> > already is.  Let's try to get things right with each change, please.
> 
> Is this really about inside vs. outside skipping? I think this is a
> general improvement to the code. I really prefer not duplicating common
> code and skipping handling is such a code (we have a visitor which can
> control the walk). With a side bonus that it doesn't have to pollute
> vmscan more than necessary.
> 
> Please be more specific about _what_ is so ugly about this interface so
> that it matters so much.

Can you please try the other approach and see how it looks?  It's just
my general experience that you usually end up with something much
uglier when you try to do too much inside an iterator, and having to
add callbacks which need to communicate through enums is usually a
pretty good sign that things took a wrong turn somewhere.  There sure
are cases where such an approach is necessary but I really don't see it
here.  So, it'd be really great if you could give it a shot.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-04 20:54           ` Tejun Heo
@ 2013-06-05  7:37             ` Michal Hocko
  2013-06-05  8:05               ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-05  7:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Tue 04-06-13 13:54:26, Tejun Heo wrote:
> Hey,
> 
> On Tue, Jun 04, 2013 at 10:48:07PM +0200, Michal Hocko wrote:
> > > I really don't think memcg can afford to add more mess than there
> > > already is.  Let's try to get things right with each change, please.
> > 
> > Is this really about inside vs. outside skipping? I think this is a
> > general improvement to the code. I really prefer not duplicating common
> > code and skipping handling is such a code (we have a visitor which can
> > control the walk). With a side bonus that it doesn't have to pollute
> > vmscan more than necessary.
> > 
> > Please be more specific about _what_ is so ugly about this interface so
> > that it matters so much.
> 
> Can you please try the other approach and see how it looks? 

Tejun, I do not have an infinite amount of time and this is hardly a
priority for the patchset. The core part is to be able to skip
nodes/subtrees which are not worth reclaiming, remember?

I have already expressed my preference for keeping the skipping
decisions inside the iterator. You are just throwing out handwavy
"let's try a different way" suggestions. I have no problem pulling the
skip logic outside of the iterators if more people think that this is
_really_ important. But until then I take it as a really low priority
that shouldn't delay the patchset without a good reason.

So please try to focus on the technical parts of the patchset if you
want to help with the review. I really appreciate suggestions but
please do not get down to bike shedding.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-05  7:37             ` Michal Hocko
@ 2013-06-05  8:05               ` Tejun Heo
  2013-06-05  8:52                 ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-05  8:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

Hey, Michal.

On Wed, Jun 05, 2013 at 09:37:28AM +0200, Michal Hocko wrote:
> Tejun, I do not have infinite amount of time and this is barely a
> priority for the patchset. The core part is to be able to skip
> nodes/subtrees which are not worth reclaiming, remember?
>
> I have already expressed my priorities for inside skipping
> decisions. You are just throwing "let's try a different way" handwavy
> suggestions. I have no problem to pull the skip logic outside of
> iterators if more people think that this is _really_ important. But
> until then I take it as a really low priority that shouldn't delay the
> patchset without a good reason.
> 
> So please try to focus on the technical parts of the patchset if you
> want to help with the review. I really appreciate suggestions but please
> do not get down to bike scheding.

Well, so, I know I've been a pain in the ass but here's the thing.  I
don't think you've been doing a good job of maintaining memcg.  Among
the code pieces that I look at, it really ranks very close to the
bottom in terms of readability and general messiness.  One of the core
jobs of being a maintainer is ensuring the code stays in a readable and
maintainable state.

Maybe memcg is really really really special and it does require the
level of complication that you've been adding; however, I can't see
that.  Not yet anyway.  What I see is a subsystem which is slurping in
complexity without properly evaluating the benefits such complexity
brings and its overhead.  Why do you have several archaic barrier
dances buried in memcg?  If you really do need them - and yes, I can
see that you might - build proper abstractions and update the code
properly, even if that takes more time, because otherwise we'll end
up with something which is painful to maintain, and you're never gonna
get enough review and testing for the scary stuff you buried inside
memcg.

If you think I'm going overboard with the barrier stuff, what about
the css and memcg refcnts?  memcg had and still has this layered
refcnting, which is obviously bonkers if you just take one step back
and look at it.  What about the css_id addition?  Why has memcg added
so many fundamentally broken things to memcg itself and cgroup core?

At this point, I'm fairly doubtful that memcg is being properly
maintained and hope that someone else will take control of what code
goes in.  You probably had good technical reasons when you were
committing all that broken stuff.  I don't know why it's been going
this way.  It almost feels like you can see the details but never the
larger picture.

So, yes, I agree this isn't the biggest technical point in the series
and understand that you could be frustrated with me throwing in
wrenches, but I think going a bit slower in itself could be helpful.
Have you tried just implementing a skipping interface?  Really, I'm
almost sure it's gonna be much more readable than the predicate thing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-05  8:05               ` Tejun Heo
@ 2013-06-05  8:52                 ` Michal Hocko
  2013-06-05  8:58                   ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-05  8:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Wed 05-06-13 01:05:45, Tejun Heo wrote:
> Hey, Michal.
> 
> On Wed, Jun 05, 2013 at 09:37:28AM +0200, Michal Hocko wrote:
> > Tejun, I do not have infinite amount of time and this is barely a
> > priority for the patchset. The core part is to be able to skip
> > nodes/subtrees which are not worth reclaiming, remember?
> >
> > I have already expressed my priorities for inside skipping
> > decisions. You are just throwing "let's try a different way" handwavy
> > suggestions. I have no problem to pull the skip logic outside of
> > iterators if more people think that this is _really_ important. But
> > until then I take it as a really low priority that shouldn't delay the
> > patchset without a good reason.
> > 
> > So please try to focus on the technical parts of the patchset if you
> > want to help with the review. I really appreciate suggestions but please
> > do not get down to bike scheding.
> 
> Well, so, I know I've been pain in the ass but here's the thing.  I
> don't think you've been doing a good job of maintaining memcg.  Among
> the code pieces that I look at, it really ranks very close to the
> bottom in terms of readability and general messiness. 

Something, something, something, something and other similar things....
Is this really an argument? C'mon, Tejun.

I _really_ do not want to let this turn into a flame war and I will not
respond to any other emails that are not related to the patchset. I do
not care and do not have time for that!

> One of the core jobs of being a maintainer is ensuring the code stays
> in readable and maintainable state.

As you might know I have been playing the maintainer role for around a
year and a half and many improvements have been merged since then (and
some faults as well, of course).
There is a lot of room for improvement and I work on areas as time
permits, focusing mostly on reviewing what other people are willing to
do.

[...]

Please stop distracting from the main purpose of this discussion with
side tracks and personal things.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-05  8:52                 ` Michal Hocko
@ 2013-06-05  8:58                   ` Tejun Heo
  2013-06-05  9:07                     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-05  8:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

Hey, Michal.

On Wed, Jun 05, 2013 at 10:52:39AM +0200, Michal Hocko wrote:
> > One of the core jobs of being a maintainer is ensuring the code stays
> > in readable and maintainable state.
> 
> As you might know I am playing the maintainer role for around year and a
> half and there were many improvemtns merged since then (and some faults
> as well of course).
> There is a lot of space for improvements and I work at areas as time
> permits focusing more at reviews for other people are willing to do.

I see.  Yeah, maybe I was attributing too many things to you.  Sorry
about that.

> [...]
> 
> Please stop distracting from the main purpose of this discussion with
> side tracks and personal things.

I'm not really trying to distract you.  I suppose my points are:

* Let's please pay more attention to each change which goes in.

* Let's prioritize cleanups over new features because, whatever the
  history, memcg needs quite some cleanups.

I really don't have a personal vendetta against you.  I'm mostly just
very frustrated about memcg.  Maybe you are too.

Anyways, so you aren't gonna try the skipping thing?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-05  8:58                   ` Tejun Heo
@ 2013-06-05  9:07                     ` Michal Hocko
  2013-06-05  9:09                       ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-05  9:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Wed 05-06-13 01:58:49, Tejun Heo wrote:
[...]
> Anyways, so you aren't gonna try the skipping thing?

As I said, I do not consider this a priority for the reasons already
stated (I will not repeat them).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-05  9:07                     ` Michal Hocko
@ 2013-06-05  9:09                       ` Tejun Heo
  2013-06-07  0:48                         ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-05  9:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Wed, Jun 05, 2013 at 11:07:39AM +0200, Michal Hocko wrote:
> On Wed 05-06-13 01:58:49, Tejun Heo wrote:
> [...]
> > Anyways, so you aren't gonna try the skipping thing?
> 
> As I said. I do not consider this a priority for the said reasons (i
> will not repeat them).

That's a weird way to respond.  Alright, whatever, let me give it a
shot then.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-05  9:09                       ` Tejun Heo
@ 2013-06-07  0:48                         ` Tejun Heo
  2013-06-07  8:25                           ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-06-07  0:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Wed, Jun 05, 2013 at 02:09:38AM -0700, Tejun Heo wrote:
> On Wed, Jun 05, 2013 at 11:07:39AM +0200, Michal Hocko wrote:
> > On Wed 05-06-13 01:58:49, Tejun Heo wrote:
> > [...]
> > > Anyways, so you aren't gonna try the skipping thing?
> > 
> > As I said. I do not consider this a priority for the said reasons (i
> > will not repeat them).
> 
> That's a weird way to respond.  Alright, whatever, let me give it a
> shot then.

So, there were some private exchanges and here's my main issue with
the addition of a predicate callback to mem_cgroup_iter_cond().

There are two common patterns that are used to implement iteration.
One is the good ol' callback-based one - i.e. a call_fn_on_each(fn)
type interface.  The other is something which can be used as part of
flow control by the user - be it a give_me_next_elem() or a for_each()
type loop macro.  In the majority of cases, especially for anything
generic, the latter is considered the better choice because, while
usually a bit more challenging to implement, it's a lot less
cumbersome for the users of the interface.
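
For illustration, a minimal C sketch of the two shapes; the struct and
all names here are made up for the example, not taken from the kernel:

struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/* Shape 1: callback-based -- the walker owns the loop. */
typedef int (*visit_fn)(struct node *n, void *data);

static void walk_tree(struct node *root, visit_fn visit, void *data)
{
	struct node *c;

	if (!root || visit(root, data))	/* nonzero: do not descend below this node */
		return;
	for (c = root->first_child; c; c = c->next_sibling)
		walk_tree(c, visit, data);
}

/* Shape 2: iterator-based -- the caller owns the loop. */
static struct node *tree_next(struct node *root, struct node *prev)
{
	if (!prev)
		return root;			/* first call: start at the root */
	if (prev->first_child)
		return prev->first_child;	/* descend */
	while (prev != root) {
		if (prev->next_sibling)
			return prev->next_sibling;	/* next sibling */
		prev = prev->parent;			/* back up */
	}
	return NULL;				/* walk finished */
}

#define for_each_node(pos, root) \
	for ((pos) = tree_next((root), NULL); (pos); \
	     (pos) = tree_next((root), (pos)))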

mem_cgroup_iter_cond() seems icky to me because the predicate callback
is essentially a visit callback, so now we end up with
give_me_next_elem() plus a visit callback, which is fundamentally
superfluous.  If it were properly call_fn_on_each(fn), the return
values would be CONTINUE, SKIP_SUBTREE or ABORT, which makes more
sense to me.  Sure, it can be said that the predicate callback is for
a different purpose, but it really doesn't change that the interface
now visits the same node in two different places.  If it were
something remotely widely used, it wouldn't take much time before
braindamaged usages develop where part is done inside the predicate
callback and the rest outside without a clear reason why, just
because of natural code growth.  I don't think this is the type of
construct that we want in the kernel in general.
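
A rough sketch of that call_fn_on_each() shape, reusing the struct node
from the sketch above; the WALK_* names are only illustrative, not
proposed identifiers:

/* Return codes for a richer visit callback. */
enum walk_ret { WALK_CONTINUE, WALK_SKIP_SUBTREE, WALK_ABORT };

typedef enum walk_ret (*walk_visit)(struct node *n, void *data);

static enum walk_ret walk_subtree(struct node *n, walk_visit visit, void *data)
{
	struct node *c;
	enum walk_ret ret = visit(n, data);

	if (ret != WALK_CONTINUE)
		return ret;	/* skip this node's children; ABORT unwinds the whole walk */
	for (c = n->first_child; c; c = c->next_sibling) {
		ret = walk_subtree(c, visit, data);
		if (ret == WALK_ABORT)
			return WALK_ABORT;
	}
	return WALK_CONTINUE;
}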

That said, it is also true that doing this is the shortest path to
implementing subtree skipping given how the iterator is currently put
together, and the series as a whole reduces a significant amount of
complexity, so it is an acceptable tradeoff to proceed with this
implementation and restructure the iterator later.

So, let's go ahead as proposed.  I'll try to rework the iterator on
top of it, and my apologies to Michal for being over-the-top.

Thanks.

-- 
tejun

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-07  0:48                         ` Tejun Heo
@ 2013-06-07  8:25                           ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-07  8:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki, linux-mm,
	cgroups, linux-kernel, Ying Han, Hugh Dickins, Glauber Costa,
	Michel Lespinasse, Greg Thelen, Balbir Singh

On Thu 06-06-13 17:48:24, Tejun Heo wrote:
> On Wed, Jun 05, 2013 at 02:09:38AM -0700, Tejun Heo wrote:
> > On Wed, Jun 05, 2013 at 11:07:39AM +0200, Michal Hocko wrote:
> > > On Wed 05-06-13 01:58:49, Tejun Heo wrote:
> > > [...]
> > > > Anyways, so you aren't gonna try the skipping thing?
> > > 
> > > As I said, I do not consider this a priority, for the reasons already
> > > stated (I will not repeat them).
> > 
> > That's a weird way to respond.  Alright, whatever, let me give it a
> > shot then.
> 
> So, there were some private exchanges and here's my main issue with
> the addition of a predicate callback to mem_cgroup_iter_cond().
> 
> There are two common patterns that are used to implement iteration.
> One is the good ol' callback-based one - i.e. a call_fn_on_each(fn)
> type interface.  The other is something which can be used as part of
> flow control by the user - be it a give_me_next_elem() or a for_each()
> type loop macro.  In the majority of cases, especially for anything
> generic, the latter is considered the better choice because, while
> usually a bit more challenging to implement, it's a lot less
> cumbersome for the users of the interface.
> 
> mem_cgroup_iter_cond() seems icky to me because the predicate callback
> is essentially a visit callback,

OK, I thought that the predicate signature made it clear that its
purpose is to _check_ whether visiting makes sense rather than to
_visit_ that node and work with it. That is the reason why I didn't
include a state parameter, which would be expected for a full visitor.
Maybe using const would make it even clearer. I can update the
documentation for the predicate to spell this out.
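
For instance, with the mem_cgroup_iter_filter signature from the doc
update later in this thread, a check-only predicate might look roughly
like this; the two helpers are hypothetical and only illustrate the
shape:

static enum mem_cgroup_filter_t
soft_limit_check(struct mem_cgroup *memcg, struct mem_cgroup *root)
{
	/*
	 * Purely a check: no reclaim state is passed in, so the predicate
	 * can only decide how the walk proceeds, not act on the node.
	 */
	if (memcg_over_soft_limit(memcg))		/* hypothetical helper */
		return VISIT;
	if (memcg_has_children_over_soft_limit(memcg))	/* hypothetical helper */
		return SKIP;		/* keep descending, a child may qualify */
	return SKIP_TREE;		/* nothing below is over its soft limit */
}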

> so now we end up with give_me_next_elem() plus a visit callback, which
> is fundamentally superfluous.  If it were properly call_fn_on_each(fn),
> the return values would be CONTINUE, SKIP_SUBTREE or ABORT, which
> makes more sense to me.  Sure, it can be said that the predicate
> callback is for a different purpose, but it really doesn't change that
> the interface now visits the same node in two different places.
> If it were something remotely widely used, it wouldn't take much time
> before braindamaged usages develop where part is done inside the
> predicate callback and the rest outside without a clear reason why,
> just because of natural code growth.  I don't think this is the
> type of construct that we want in the kernel in general.
> 
> That said, it is also true that doing this is the shortest path to
> implementing subtree skipping given how the iterator is currently put
> together, and the series as a whole reduces a significant amount of
> complexity, so it is an acceptable tradeoff to proceed with this
> implementation and restructure the iterator later.

Good. As I have said many times, the memcg iterators could use some cleanups.

> So, let's go ahead as proposed.  

Thanks!

> I'll try to rework the iterator on top of it, and my apologies to
> Michal for being over-the-top.

Thanks!

-- 
Michal Hocko
SUSE Labs

* Re: [patch -v4 4/8] memcg: enhance memcg iterator to support predicates
  2013-06-03 10:18 ` [patch -v4 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko
  2013-06-04  1:07   ` Tejun Heo
@ 2013-06-10  7:48   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-10  7:48 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

Just for the record, I will squash the following doc update into the
patch in the next version.
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 811967a..9ca85ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,11 +42,19 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 enum mem_cgroup_filter_t {
-	VISIT,
-	SKIP,
-	SKIP_TREE,
+	VISIT,		/* visit current node */
+	SKIP,		/* skip the current node and continue traversal */
+	SKIP_TREE,	/* skip the whole subtree and continue traversal */
 };
 
+/*
+ * A mem_cgroup_iter_filter predicate may instruct mem_cgroup_iter_cond how
+ * to iterate through the hierarchy tree. Each tree element is checked by
+ * the predicate before it is returned by the iterator. If a filter returns
+ * SKIP or SKIP_TREE then the iterator continues the traversal (with the
+ * next node down the hierarchy for SKIP, or with the next node outside of
+ * the skipped memcg's subtree for SKIP_TREE).
+ */
 typedef enum mem_cgroup_filter_t
 (*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91740f7..43e955a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1073,6 +1073,14 @@ skip_node:
 			prev_cgroup = next_cgroup;
 			goto skip_node;
 		case SKIP_TREE:
+			/*
+			 * cgroup_rightmost_descendant is not an optimal way to
+			 * skip over a subtree (especially for imbalanced trees
+			 * leaning to the right) but it is what we have right
+			 * now. A more effective solution would be to traverse
+			 * right-up to the first non-NULL node without calling
+			 * cgroup_next_descendant_pre afterwards.
+			 */
 			prev_cgroup = cgroup_rightmost_descendant(next_cgroup);
 			goto skip_node;
 		case VISIT:
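
For context, a rough sketch of how a caller might drive the filtered
walk. The mem_cgroup_iter_cond() argument order below (root, previously
returned group, reclaim cookie, filter) is assumed from this discussion
rather than taken from the patch; 'cookie' stands for the usual struct
mem_cgroup_reclaim_cookie and soft_limit_check() for whatever predicate
the caller passes in:

struct mem_cgroup *group;

for (group = mem_cgroup_iter_cond(root, NULL, &cookie, soft_limit_check);
     group;
     group = mem_cgroup_iter_cond(root, group, &cookie, soft_limit_check)) {
	/*
	 * Only groups the filter returned VISIT for show up here;
	 * SKIP and SKIP_TREE decisions stay inside the iterator.
	 */
	/* ... reclaim from 'group' ... */
}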
-- 
Michal Hocko
SUSE Labs

* Re: [patch v4] Soft limit rework
  2013-06-03 10:18 [patch v4] Soft limit rework Michal Hocko
                   ` (8 preceding siblings ...)
  2013-06-04 16:27 ` [patch v4] Soft limit rework Balbir Singh
@ 2013-06-11 15:43 ` Michal Hocko
  2013-06-17 14:01   ` Michal Hocko
  9 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2013-06-11 15:43 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Glauber Costa, Michel Lespinasse, Greg Thelen, Tejun Heo,
	Balbir Singh

JFYI, I have rebased the series on top of the current mmotm tree to
catch up with Mel's changes in reclaim and other small things here and
there. To be sure that things are still good I have started my tests
again, which will take some time.

There will be only documentation updates in the new series.
-- 
Michal Hocko
SUSE Labs

* Re: [patch v4] Soft limit rework
  2013-06-11 15:43 ` Michal Hocko
@ 2013-06-17 14:01   ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2013-06-17 14:01 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki
  Cc: linux-mm, cgroups, linux-kernel, Ying Han, Hugh Dickins,
	Michel Lespinasse, Greg Thelen, Tejun Heo, Balbir Singh,
	Glauber Costa

On Tue 11-06-13 17:43:53, Michal Hocko wrote:
> JFYI, I have rebased the series on top of the current mmotm tree to
> catch up with Mel's changes in reclaim and other small things here and
> there. To be sure that things are still good I have started my tests
> again, which will take some time.

And it took way more time than I would have liked, but the current
mmotm is broken: it crashes, hangs and misbehaves in strange ways. At
first I thought it was my -mm git tree that was broken, but when I
started testing with linux-next I could see the issues as well. I was
able to narrow the search space down to the slab shrinkers rework, but
bisection led to nothing reasonable. I will report those issues in a
separate email once I have collected all the necessary information.

In the meantime I will retest on top of my -mm tree without the slab
shrinkers patches applied. This should be OK for the soft reclaim work
because the memcg shrinkers are still not merged into the mm tree and
they shouldn't be affected by the soft reclaim, as the reclaim is done
per shrink_zone now.

-- 
Michal Hocko
SUSE Labs

end of thread

Thread overview: 29+ messages
2013-06-03 10:18 [patch v4] Soft limit rework Michal Hocko
2013-06-03 10:18 ` [patch -v4 1/8] memcg: integrate soft reclaim tighter with zone shrinking code Michal Hocko
2013-06-03 10:18 ` [patch -v4 2/8] memcg: Get rid of soft-limit tree infrastructure Michal Hocko
2013-06-03 10:18 ` [patch -v4 3/8] vmscan, memcg: Do softlimit reclaim also for targeted reclaim Michal Hocko
2013-06-03 10:18 ` [patch -v4 4/8] memcg: enhance memcg iterator to support predicates Michal Hocko
2013-06-04  1:07   ` Tejun Heo
2013-06-04 13:45     ` Michal Hocko
2013-06-04 19:36       ` Tejun Heo
2013-06-04 20:48         ` Michal Hocko
2013-06-04 20:54           ` Tejun Heo
2013-06-05  7:37             ` Michal Hocko
2013-06-05  8:05               ` Tejun Heo
2013-06-05  8:52                 ` Michal Hocko
2013-06-05  8:58                   ` Tejun Heo
2013-06-05  9:07                     ` Michal Hocko
2013-06-05  9:09                       ` Tejun Heo
2013-06-07  0:48                         ` Tejun Heo
2013-06-07  8:25                           ` Michal Hocko
2013-06-10  7:48   ` Michal Hocko
2013-06-03 10:18 ` [patch -v4 5/8] memcg: track children in soft limit excess to improve soft limit Michal Hocko
2013-06-03 10:18 ` [patch -v4 6/8] memcg, vmscan: Do not attempt soft limit reclaim if it would not scan anything Michal Hocko
2013-06-03 10:18 ` [patch -v4 7/8] memcg: Track all children over limit in the root Michal Hocko
2013-06-03 10:18 ` [patch -v4 8/8] memcg, vmscan: do not fall into reclaim-all pass too quickly Michal Hocko
2013-06-04 16:27 ` [patch v4] Soft limit rework Balbir Singh
2013-06-04 16:38   ` Michal Hocko
2013-06-04 17:57     ` Balbir Singh
2013-06-04 18:08       ` Michal Hocko
2013-06-11 15:43 ` Michal Hocko
2013-06-17 14:01   ` Michal Hocko
