* [RFC 0/4] memcg: Low-limit reclaim
@ 2013-12-11 14:15 Michal Hocko
  2013-12-11 14:15 ` [RFC 1/4] memcg, mm: introduce lowlimit reclaim Michal Hocko
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-11 14:15 UTC (permalink / raw)
  To: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

Hi,
previous discussions have shown that soft limits cannot be reformed
(http://lwn.net/Articles/555249/). This series introduces an alternative
approach to protecting memory allocated to processes executing within
a memory cgroup controller. It is based on a new tunable that was
discussed with Johannes and Tejun during the last kernel summit.

This patchset introduces such a low limit, which is functionally similar
to a minimum guarantee. Memcgs which are under their low limit are not
considered eligible for reclaim (neither global nor hard-limit reclaim).
The default value of the limit is 0 so all groups are eligible by
default and an interested party has to set the limit explicitly.

The primary use case is to protect an amount of memory allocated to a
workload without it being reclaimed by an unrelated activity. In some
cases this requirement can be fulfilled by mlock, but mlock is not
suitable for many loads and generally requires application awareness.
Such awareness can be complex; it effectively forbids the use of memory
overcommit because the application must manage memory residency
explicitly.
With low limits, such workloads can be placed in a memcg with a low
limit that protects the estimated working set.

Another use case might be unreclaimable groups. Some loads might be so
sensitive to reclaim that it is better to kill them and start again (or
resume from a checkpoint) rather than let them thrash. This would be
trivial with the low limit set to unlimited; the OOM killer would then
handle the situation as required (e.g. kill and restart).

The hierarchical behavior of the low limit is described in the first
patch. It is followed by a direct reclaim fix which is necessary to
handle the situation when no group is eligible because all groups are
below their low limit. This is not a big deal for hard-limit reclaim
because we simply retry the reclaim a few times and then trigger the
memcg OOM killer path. It would blow up in the global case, where we
would loop without making any progress and without ever triggering the
OOM killer. I would consider a configuration leading to this state
invalid, but we should handle it gracefully.

The third patch finally allows setting the lowlimit.

The last patch expedites OOM if it is clear that no group is eligible
for reclaim. It basically breaks out of the loops in direct reclaim and
lets kswapd sleep because it wouldn't make any progress anyway.

Thoughts?

Short log says:
Michal Hocko (4):
      memcg, mm: introduce lowlimit reclaim
      mm, memcg: allow OOM if no memcg is eligible during direct reclaim
      memcg: Allow setting low_limit
      mm, memcg: expedite OOM if no memcg is reclaimable

And a diffstat
 include/linux/memcontrol.h  | 14 +++++++++++
 include/linux/res_counter.h | 40 ++++++++++++++++++++++++++++++
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 60 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                 | 59 +++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 170 insertions(+), 5 deletions(-)


* [RFC 1/4] memcg, mm: introduce lowlimit reclaim
  2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
@ 2013-12-11 14:15 ` Michal Hocko
  2013-12-11 14:15 ` [RFC 2/4] mm, memcg: allow OOM if no memcg is eligible during direct reclaim Michal Hocko
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-11 14:15 UTC (permalink / raw)
  To: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

This patch introduces low limit reclaim. The low_limit acts as a reclaim
protection because groups which are under their low_limit are considered
ineligible for reclaim. While the hard limit protects against using more
memory than allowed, the low limit protects against the group's usage
being pushed below the amount assigned to it by external memory
pressure.

More precisely, a group is considered eligible for reclaim under a
specific hierarchy (represented by its root) only if the group is above
its low limit and the same holds for all of its parents up the hierarchy
to the root.

Consider the following hierarchy, with memory pressure coming from
group A (hard limit reclaim; l = low_limit_in_bytes, u = usage_in_bytes,
h = limit_in_bytes):
		root_mem_cgroup
			.
		  _____/
		 /
		A (l = 80 u=90 h=90)
	       /
	      / \_________
	     /            \
	    B (l=0 u=50)   C (l=50 u=40)
	                    \
			     D (l=0 u=30)

A and B are reclaimable but C and D are not (D is protected by C).
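
To make the walk concrete, here is a small userspace model of the check
for the example hierarchy above (illustration only: the struct, helper
names and the main() driver are made up for this sketch; the real kernel
check is mem_cgroup_reclaim_eligible() added below):

/*
 * Illustrative userspace model of the eligibility walk for the example
 * hierarchy above (reclaim root is A).
 */
#include <stdbool.h>
#include <stdio.h>

struct group {
	const char *name;
	unsigned long long usage, low_limit;
	struct group *parent;
};

static unsigned long long low_limit_excess(struct group *g)
{
	return g->usage <= g->low_limit ? 0 : g->usage - g->low_limit;
}

static bool reclaim_eligible(struct group *g, struct group *root)
{
	do {
		if (!low_limit_excess(g))
			return false;
		if (g == root)
			break;
	} while ((g = g->parent));

	return true;
}

int main(void)
{
	struct group A = { "A", 90, 80, NULL };
	struct group B = { "B", 50,  0, &A };
	struct group C = { "C", 40, 50, &A };
	struct group D = { "D", 30,  0, &C };
	struct group *all[] = { &A, &B, &C, &D };
	int i;

	for (i = 0; i < 4; i++)
		printf("%s: %s\n", all[i]->name,
		       reclaim_eligible(all[i], &A) ?
		       "eligible" : "protected");
	/* A and B come out eligible; C and D are protected. */
	return 0;
}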

The low_limit is 0 by default so every group is eligible. This patch
doesn't provide a way to set the limit yet although the core
infrastructure is there already.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h  |  8 ++++++++
 include/linux/res_counter.h | 27 +++++++++++++++++++++++++++
 mm/memcontrol.c             | 23 +++++++++++++++++++++++
 mm/vmscan.c                 | 14 ++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b3e7a667e03c..6841e591718d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,8 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
+extern bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -289,6 +291,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root)
+{
+	return true;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 201a69749659..c7e7dfeca847 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,6 +40,11 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit under which the usage cannot be pushed
+	 * due to external pressure.
+	 */
+	unsigned long long low_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -179,6 +184,28 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+/**
+ * Get the difference between the usage and the low limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to low limit
+ * The difference between usage and low limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_low_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->low_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->low_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1fded477ef6..a1cfee4491bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2851,6 +2851,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
+/**
+ * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
+ * reclaim
+ * @memcg: target memcg for the reclaim
+ * @root: root of the reclaim hierarchy (null for the global reclaim)
+ *
+ * The given group is reclaimable if it is above its low limit and the same
+ * applies for all parents up the hierarchy until root (including).
+ */
+bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root)
+{
+	do {
+		if (!res_counter_low_limit_excess(&memcg->res))
+			return false;
+		if (memcg == root)
+			break;
+
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	return true;
+}
+
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d9cff6..1c9ce5f97872 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2186,6 +2186,20 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		do {
 			struct lruvec *lruvec;
 
+			/*
+			 * Memcg might be under its low limit so we have to
+			 * skip it.
+			 */
+			if (!mem_cgroup_reclaim_eligible(memcg, root)) {
+				/*
+				 * It would be more optimal to skip the memcg
+				 * subtree now but we do not have a memcg iter
+				 * helper for that. Anyone?
+				 */
+				memcg = mem_cgroup_iter(root, memcg, &reclaim);
+				continue;
+			}
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
-- 
1.8.4.4



* [RFC 2/4] mm, memcg: allow OOM if no memcg is eligible during direct reclaim
  2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
  2013-12-11 14:15 ` [RFC 1/4] memcg, mm: introduce lowlimit reclaim Michal Hocko
@ 2013-12-11 14:15 ` Michal Hocko
  2013-12-11 14:15 ` [RFC 3/4] memcg: Allow setting low_limit Michal Hocko
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-11 14:15 UTC (permalink / raw)
  To: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

If there is no memcg eligible for reclaim then global direct reclaim
would end up in an endless loop, because the zones in the zonelists are
not considered unreclaimable (as per all_unreclaimable), so the OOM
killer would never fire and direct reclaim would keep being triggered
with no chance of reclaiming anything.

Memcg reclaim doesn't suffer from this because the OOM killer is
triggered after a few unsuccessful reclaim attempts.

Fix this by checking the number of scanned pages (which is obviously 0
if nobody is eligible), verifying that no group in the whole hierarchy
is eligible, and in that case telling the OOM killer it can go ahead.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 10 ++++++++++
 mm/vmscan.c                |  7 +++++++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6841e591718d..4ae6a9838a26 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,6 +94,7 @@ bool task_in_mem_cgroup(struct task_struct *task,
 
 extern bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
+extern bool mem_cgroup_reclaim_no_eligible(struct mem_cgroup *root);
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -297,6 +298,11 @@ static inline bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
 	return true;
 }
 
+static inline bool mem_cgroup_reclaim_no_eligible(struct mem_cgroup *root)
+{
+	return false;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a1cfee4491bf..102e2da9ec8d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2874,6 +2874,16 @@ bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
 	return true;
 }
 
+bool mem_cgroup_reclaim_no_eligible(struct mem_cgroup *root)
+{
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, root)
+		if (mem_cgroup_reclaim_eligible(iter, root))
+			return false;
+	return true;
+}
+
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1c9ce5f97872..234d1690563a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2481,6 +2481,13 @@ out:
 	if (aborted_reclaim)
 		return 1;
 
+	/*
+	 * If the target memcg is not eligible for reclaim then we have no option
+	 * but OOM
+	 */
+	if (!sc->nr_scanned && mem_cgroup_reclaim_no_eligible(sc->target_mem_cgroup))
+		return 0;
+
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
-- 
1.8.4.4



* [RFC 3/4] memcg: Allow setting low_limit
  2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
  2013-12-11 14:15 ` [RFC 1/4] memcg, mm: introduce lowlimit reclaim Michal Hocko
  2013-12-11 14:15 ` [RFC 2/4] mm, memcg: allow OOM if no memcg is eligible during direct reclaim Michal Hocko
@ 2013-12-11 14:15 ` Michal Hocko
  2013-12-11 14:15 ` [RFC 4/4] mm, memcg: expedite OOM if no memcg is reclaimable Michal Hocko
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-11 14:15 UTC (permalink / raw)
  To: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

Export the memory.low_limit_in_bytes knob with the same rules as the
hard limit represented by the limit_in_bytes knob (e.g. no limit may be
set for the root cgroup). There is no memsw alternative for
low_limit_in_bytes because the primary motivation behind this limit is
to protect the working set of the group, so considering swap doesn't
make much sense. There is also no kmem variant exported because we do
not have any easy way to protect kernel allocations now.

Please note that the low limit might exceed the hard limit, which
basically means that the group is never reclaimable. If the hard limit
is reached with such a setting then the memcg OOM killer is triggered to
sort out the situation.
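
For illustration, a minimal userspace sketch of driving the new knob
(only memory.low_limit_in_bytes itself comes from this patch; the cgroup
mount point and group name are made-up examples and error handling is
reduced to a perror):

#include <stdio.h>

/* Write a value string to a memcg control file. */
static int write_knob(const char *group, const char *knob, const char *val)
{
	char path[256];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path), "%s/%s", group, knob);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", val) < 0)
		ret = -1;
	if (fclose(f))
		ret = -1;
	return ret;
}

int main(void)
{
	/* Protect roughly the estimated working set of the group. */
	if (write_knob("/sys/fs/cgroup/memory/workload",
		       "memory.low_limit_in_bytes", "512M"))
		perror("memory.low_limit_in_bytes");
	return 0;
}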

TODO: update Documentation/cgroups/memory.txt

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/res_counter.h | 13 +++++++++++++
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 27 ++++++++++++++++++++++++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c7e7dfeca847..7befcf3c2ee2 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -93,6 +93,7 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_LIMIT,
 };
 
 /*
@@ -251,4 +252,16 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_low_limit(struct res_counter *cnt,
+				unsigned long long low_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_limit = low_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4aa8a305aede..c57daf997d9d 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -135,6 +135,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_LIMIT:
+		return &counter->low_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 102e2da9ec8d..afe7c84d823f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1694,8 +1694,9 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 	pr_cont(" as a result of limit of %s\n", memcg_name);
 done:
 
-	pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
+	pr_info("memory: usage %llukB, low_limit %llukB, limit %llukB, failcnt %llu\n",
 		res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
+		res_counter_read_u64(&memcg->res, RES_LOW_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->res, RES_FAILCNT));
 	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
@@ -5300,6 +5301,24 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		else
 			return -EINVAL;
 		break;
+	case RES_LOW_LIMIT:
+		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
+			ret = -EINVAL;
+			break;
+		}
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		if (type == _MEM) {
+			ret = res_counter_set_low_limit(&memcg->res, val);
+			break;
+		}
+		/*
+		 * memsw low limit doesn't make any sense and kmem is not
+		 * implemented yet - if ever
+		 */
+		return -EINVAL;
+
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
 		if (ret)
@@ -6013,6 +6032,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read = mem_cgroup_read,
 	},
 	{
+		.name = "low_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LOW_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read = mem_cgroup_read,
+	},
+	{
 		.name = "soft_limit_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
 		.write_string = mem_cgroup_write,
-- 
1.8.4.4



* [RFC 4/4] mm, memcg: expedite OOM if no memcg is reclaimable
  2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
                   ` (2 preceding siblings ...)
  2013-12-11 14:15 ` [RFC 3/4] memcg: Allow setting low_limit Michal Hocko
@ 2013-12-11 14:15 ` Michal Hocko
  2014-01-24 11:07 ` [RFC 0/4] memcg: Low-limit reclaim Roman Gushchin
  2014-01-29 19:08 ` Greg Thelen
  5 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-11 14:15 UTC (permalink / raw)
  To: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

Let shrink_zone tell us whether at least one group was eligible for
reclaim so that the caller knows when further reclaim attempts will not
help.

All shrink_zone callers should back off in such a case. Direct reclaim
should hand over to OOM as soon as possible, kswapd should not raise the
priority and should preferably go to sleep, and zone reclaim should just
give up without retrying at a higher priority.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 234d1690563a..b9e21df2751a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2167,9 +2167,15 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+/*
+ * Returns true if at least one lruvec has been scanned.
+ * Always true for !CONFIG_MEMCG; otherwise at least one eligible
+ * memcg has to exist (see mem_cgroup_reclaim_eligible).
+ */
+static bool shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long nr_reclaimed, nr_scanned;
+	int groups_reclaimed = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2200,6 +2206,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 				continue;
 			}
 
+			groups_reclaimed++;
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 			shrink_lruvec(lruvec, sc);
@@ -2226,8 +2233,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 			   sc->nr_scanned - nr_scanned,
 			   sc->nr_reclaimed - nr_reclaimed);
 
+		if (!groups_reclaimed)
+			break;
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
+
+	return groups_reclaimed > 0;
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
@@ -2347,7 +2358,9 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		shrink_zone(zone, sc);
+		/* No memcg to reclaim from so bail out */
+		if (!shrink_zone(zone, sc))
+			break;
 	}
 
 	return aborted_reclaim;
@@ -2442,6 +2455,17 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			goto out;
 
 		/*
+		 * If the target memcg is not eligible for reclaim then we have
+		 * no option but OOM
+		 */
+		if (!sc->nr_scanned &&
+				mem_cgroup_reclaim_no_eligible(
+					sc->target_mem_cgroup)) {
+			delayacct_freepages_end();
+			return 0;
+		}
+
+		/*
 		 * If we're getting trouble reclaiming, start doing
 		 * writepage even in laptop mode.
 		 */
@@ -2481,13 +2505,6 @@ out:
 	if (aborted_reclaim)
 		return 1;
 
-	/*
-	 * If the target memcg is not eligible for reclaim then we have no option
-	 * but OOM
-	 */
-	if (!sc->nr_scanned && mem_cgroup_reclaim_no_eligible(sc->target_mem_cgroup))
-		return 0;
-
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
@@ -2772,6 +2789,17 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long balanced_pages = 0;
 	int i;
 
+
+	/*
+	 * If no memcg is eligible to reclaim then we end up scanning nothing
+	 * and so it doesn't make any sense to do any scanning in the first
+	 * place. So let's go to sleep and pretend everything is balanced.
+	 * We rely on the direct reclaim to eventually sort out the situation
+	 * e.g. by triggering OOM killer.
+	 */
+	if (mem_cgroup_reclaim_no_eligible(NULL))
+		return true;
+
 	/* Check the watermark levels */
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -3115,7 +3143,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || !sc.nr_reclaimed)
+		if (raise_priority || (sc.nr_scanned && !sc.nr_reclaimed))
 			sc.priority--;
 	} while (sc.priority >= 1 &&
 		 !pgdat_balanced(pgdat, order, *classzone_idx));
@@ -3572,7 +3600,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_zone(zone, &sc);
+			/* No memcg to reclaim from so bail out */
+			if (!shrink_zone(zone, &sc))
+				break;
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-- 
1.8.4.4



* Re: [RFC 0/4] memcg: Low-limit reclaim
  2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
                   ` (3 preceding siblings ...)
  2013-12-11 14:15 ` [RFC 4/4] mm, memcg: expedite OOM if no memcg is reclaimable Michal Hocko
@ 2014-01-24 11:07 ` Roman Gushchin
  2014-01-29 18:22   ` Michal Hocko
  2014-01-29 19:08 ` Greg Thelen
  5 siblings, 1 reply; 14+ messages in thread
From: Roman Gushchin @ 2014-01-24 11:07 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki
  Cc: LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

Hi, Michal!

As you may remember, I proposed introducing low limits about a year ago.

We had a small discussion at that time: http://marc.info/?t=136195226600004 .

Since then we have been using low limits intensively in our production
environment (on thousands of machines), so I'm very interested in
merging this functionality upstream.

In my experience, low limits also require some changes in the memcg page
accounting policy. For instance, an application in a protected cgroup should
have a guarantee that its file cache belongs to its cgroup and is therefore
protected by the low limit. If the file cache was created by another
application in a different cgroup, this may not be the case. I've solved this
problem by implementing optional page reaccounting on page faults and
reads/writes.

I can prepare my current version of the patchset if someone is interested.

Regards,
Roman

On 11.12.2013 18:15, Michal Hocko wrote:
> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun held during the last kernel summit.
>
> This patchset introduces such low limit that is functionally similar to a
> minimum guarantee. Memcgs which are under their lowlimit are not considered
> eligible for the reclaim (both global and hardlimit). The default value of
> the limit is 0 so all groups are eligible by default and an interested
> party has to explicitly set the limit.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill and start it again (or
> since checkpoint) rather than trash. This would be trivial with low
> limit set to unlimited and the OOM killer will handle the situation as
> required (e.g. kill and restart).
>
> The hierarchical behavior of the lowlimit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle situation when a no group is eligible because all groups are
> below low limit. This is not a big deal for hardlimit reclaim because
> we simply retry the reclaim few times and then trigger memcg OOM killer
> path. It would blow up in the global case when we would loop without
> doing any progress or trigger OOM killer. I would consider configuration
> leading to this state invalid but we should handle that gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch tries expedites OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of loops in the direct
> reclaim and lets kswapd sleep because it wouldn't do any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
>        memcg, mm: introduce lowlimit reclaim
>        mm, memcg: allow OOM if no memcg is eligible during direct reclaim
>        memcg: Allow setting low_limit
>        mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
>   include/linux/memcontrol.h  | 14 +++++++++++
>   include/linux/res_counter.h | 40 ++++++++++++++++++++++++++++++
>   kernel/res_counter.c        |  2 ++
>   mm/memcontrol.c             | 60 ++++++++++++++++++++++++++++++++++++++++++++-
>   mm/vmscan.c                 | 59 +++++++++++++++++++++++++++++++++++++++++---
>   5 files changed, 170 insertions(+), 5 deletions(-)
>



* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-01-24 11:07 ` [RFC 0/4] memcg: Low-limit reclaim Roman Gushchin
@ 2014-01-29 18:22   ` Michal Hocko
  2014-02-12 12:28     ` Roman Gushchin
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-01-29 18:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

On Fri 24-01-14 15:07:02, Roman Gushchin wrote:
> Hi, Michal!

Hi,

> As you can remember, I've proposed to introduce low limits about a year ago.
> 
> We had a small discussion at that time: http://marc.info/?t=136195226600004 .

Yes, I remember that discussion and vaguely remember the proposed
approach. I really wanted to avoid introducing a new knob, but things
have evolved differently than I planned since then and it turned out
that the new knob is unavoidable. That's why I came up with this
approach, which is quite different from yours AFAIR.
 
> Since that time we intensively use low limits in our production
> (on thousands of machines). So, I'm very interested to merge this
> functionality into upstream.

Have you tried to use this implementation? Would it work for you as
well? My very vague recollection of your patch is that it didn't cover
both global and target reclaim, and it didn't fit into the reclaim code
very naturally; it used its own scaling method. I will have to refresh
my memory though.

> In my experience, low limits also require some changes in memcg page accounting
> policy. For instance, an application in protected cgroup should have a guarantee
> that it's filecache belongs to it's cgroup and is protected by low limit
> therefore. If the filecache was created by another application in other cgroup,
> it can be not so. I've solved this problem by implementing optional page
> reaccouting on pagefaults and read/writes.

Memory sharing is a separate issue and we should discuss that
separately. 

> I can prepare my current version of patchset, if someone is interested.

Sure, having something to compare with is always valuable.

> Regards,
> Roman
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] memcg: Low-limit reclaim
  2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
                   ` (4 preceding siblings ...)
  2014-01-24 11:07 ` [RFC 0/4] memcg: Low-limit reclaim Roman Gushchin
@ 2014-01-29 19:08 ` Greg Thelen
  2014-01-30 12:30   ` Michal Hocko
  5 siblings, 1 reply; 14+ messages in thread
From: Greg Thelen @ 2014-01-29 19:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro,
	Tejun Heo

On Wed, Dec 11 2013, Michal Hocko wrote:

> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun held during the last kernel summit.
>
> This patchset introduces such low limit that is functionally similar to a
> minimum guarantee. Memcgs which are under their lowlimit are not considered
> eligible for the reclaim (both global and hardlimit). The default value of
> the limit is 0 so all groups are eligible by default and an interested
> party has to explicitly set the limit.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill and start it again (or
> since checkpoint) rather than trash. This would be trivial with low
> limit set to unlimited and the OOM killer will handle the situation as
> required (e.g. kill and restart).
>
> The hierarchical behavior of the lowlimit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle situation when a no group is eligible because all groups are
> below low limit. This is not a big deal for hardlimit reclaim because
> we simply retry the reclaim few times and then trigger memcg OOM killer
> path. It would blow up in the global case when we would loop without
> doing any progress or trigger OOM killer. I would consider configuration
> leading to this state invalid but we should handle that gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch tries expedites OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of loops in the direct
> reclaim and lets kswapd sleep because it wouldn't do any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
>       memcg, mm: introduce lowlimit reclaim
>       mm, memcg: allow OOM if no memcg is eligible during direct reclaim
>       memcg: Allow setting low_limit
>       mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
>  include/linux/memcontrol.h  | 14 +++++++++++
>  include/linux/res_counter.h | 40 ++++++++++++++++++++++++++++++
>  kernel/res_counter.c        |  2 ++
>  mm/memcontrol.c             | 60 ++++++++++++++++++++++++++++++++++++++++++++-
>  mm/vmscan.c                 | 59 +++++++++++++++++++++++++++++++++++++++++---
>  5 files changed, 170 insertions(+), 5 deletions(-)

The series looks useful.  We (Google) have been using something similar.
In practice such a low_limit (or memory guarantee) doesn't nest very
well.

Example:
  - parent_memcg: limit 500, low_limit 500, usage 500
    1 privately charged non-reclaimable page (e.g. mlock, slab)
  - child_memcg: limit 500, low_limit 500, usage 499

If a streaming file cache workload (e.g. sha1sum) starts gobbling up
page cache it will lead to an oom kill instead of reclaiming.  One could
argue that this is working as intended because child_memcg was promised
500 but can only get 499.  So child_memcg is oom killed rather than
being forced to operate below its promised low limit.

This has led to various internal workarounds like:
- don't charge any memory to interior tree nodes (e.g. parent_memcg);
  only charge memory to cgroup leafs.  This gets tricky when dealing
  with reparented memory inherited to parent from child during cgroup
  deletion.
- don't set low_limit on non leafs (e.g. do not set low limit on
  parent_memcg).  This constrains the cgroup layout a bit.  Some
  customers want to purchase $MEM and setup their workload with a few
  child cgroups.  A system daemon hands out $MEM by setting low_limit
  for top-level containers (e.g. parent_memcg).  Thereafter such
  customers are able to partition their workload with sub memcg below
  child_memcg.  Example:
     parent_memcg
         \
          child_memcg
            /     \
        server   backup
  Thereafter customers often want some weak isolation between server and
  backup.  To avoid undesired oom kills the server/backup isolation is
  provided with a softer memory guarantee (e.g. soft_limit).  The soft
  limit acts like the low_limit until priority becomes desperate.


* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-01-29 19:08 ` Greg Thelen
@ 2014-01-30 12:30   ` Michal Hocko
  2014-01-31  0:28     ` Greg Thelen
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-01-30 12:30 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro,
	Tejun Heo

On Wed 29-01-14 11:08:46, Greg Thelen wrote:
[...]
> The series looks useful.  We (Google) have been using something similar.
> In practice such a low_limit (or memory guarantee), doesn't nest very
> well.
> 
> Example:
>   - parent_memcg: limit 500, low_limit 500, usage 500
>     1 privately charged non-reclaimable page (e.g. mlock, slab)
>   - child_memcg: limit 500, low_limit 500, usage 499

I am not sure this is a good example. Your setup basically says that no
single page should be reclaimed. I can imagine this might be useful in
some cases and I would like to allow it, but it sounds too extreme (e.g.
a load which would start thrashing heavily once the reclaim starts and
it makes more sense to start it again rather than crawl - think about
some mathematical simulation which might diverge).
 
> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> page cache it will lead to an oom kill instead of reclaiming. 

Does it make any sense to protect all of such memory although it is
easily reclaimable?

> One could
> argue that this is working as intended because child_memcg was promised
> 500 but can only get 499.  So child_memcg is oom killed rather than
> being forced to operate below its promised low limit.
> 
> This has led to various internal workarounds like:
> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>   only charge memory to cgroup leafs.  This gets tricky when dealing
>   with reparented memory inherited to parent from child during cgroup
>   deletion.

Do those need any protection at all?

> - don't set low_limit on non leafs (e.g. do not set low limit on
>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>   customers want to purchase $MEM and setup their workload with a few
>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>   for top-level containers (e.g. parent_memcg).  Thereafter such
>   customers are able to partition their workload with sub memcg below
>   child_memcg.  Example:
>      parent_memcg
>          \
>           child_memcg
>             /     \
>         server   backup

I think that the low_limit makes sense where you actually want to
protect something from reclaim. And backup sounds like a bad fit for
that.

>   Thereafter customers often want some weak isolation between server and
>   backup.  To avoid undesired oom kills the server/backup isolation is
>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>   limit acts like the low_limit until priority becomes desperate.

Johannes was already suggesting that the low_limit should allow for a
weaker semantic as well. I am not very much inclined to that but I could
live with a knob which would say oom_on_lowlimit (on by default but
allowed to be set to 0). We would fall back to full reclaim if no groups
turn out to be reclaimable.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-01-30 12:30   ` Michal Hocko
@ 2014-01-31  0:28     ` Greg Thelen
  2014-02-03 14:43       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Thelen @ 2014-01-31  0:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro,
	Tejun Heo

On Thu, Jan 30 2014, Michal Hocko wrote:

> On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> [...]
>> The series looks useful.  We (Google) have been using something similar.
>> In practice such a low_limit (or memory guarantee), doesn't nest very
>> well.
>> 
>> Example:
>>   - parent_memcg: limit 500, low_limit 500, usage 500
>>     1 privately charged non-reclaimable page (e.g. mlock, slab)
>>   - child_memcg: limit 500, low_limit 500, usage 499
>
> I am not sure this is a good example. Your setup basically say that no
> single page should be reclaimed. I can imagine this might be useful in
> some cases and I would like to allow it but it sounds too extreme (e.g.
> a load which would start trashing heavily once the reclaim starts and it
> makes more sense to start it again rather than crowl - think about some
> mathematical simulation which might diverge).

Pages will still be reclaimed if usage_in_bytes exceeds
limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
reclaim my memory due to external pressure, but internal pressure is
different.

>> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> page cache it will lead to an oom kill instead of reclaiming. 
>
> Does it make any sense to protect all of such memory although it is
> easily reclaimable?

I think protection makes sense in this case.  If I know my workload
needs 500 to operate well, then I reserve 500 using low_limit.  My app
doesn't want to run with less than its reservation.

>> One could argue that this is working as intended because child_memcg
>> was promised 500 but can only get 499.  So child_memcg is oom killed
>> rather than being forced to operate below its promised low limit.
>> 
>> This has led to various internal workarounds like:
>> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>>   only charge memory to cgroup leafs.  This gets tricky when dealing
>>   with reparented memory inherited to parent from child during cgroup
>>   deletion.
>
> Do those need any protection at all?

Interior tree nodes don't need protection from their children.  But
children and interior nodes need protection from siblings and parents.

>> - don't set low_limit on non leafs (e.g. do not set low limit on
>>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>>   customers want to purchase $MEM and setup their workload with a few
>>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>>   for top-level containers (e.g. parent_memcg).  Thereafter such
>>   customers are able to partition their workload with sub memcg below
>>   child_memcg.  Example:
>>      parent_memcg
>>          \
>>           child_memcg
>>             /     \
>>         server   backup
>
> I think that the low_limit makes sense where you actually want to
> protect something from reclaim. And backup sounds like a bad fit for
> that.

The backup job would presumably have a small low_limit, but it may still
have a minimum working set required to make useful forward progress.

Example:
  parent_memcg
      \
       child_memcg limit 500, low_limit 500, usage 500
         /     \
         |   backup   limit 10, low_limit 10, usage 10
         |
      server limit 490, low_limit 490, usage 490

One could argue that problems appear when
server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
configuration is to leave some padding:
  server.low_limit + backup.low_limit + padding = child_memcg.limit
but this just defers the problem.  As memory is reparented into the
parent, the padding must grow.

>>   Thereafter customers often want some weak isolation between server and
>>   backup.  To avoid undesired oom kills the server/backup isolation is
>>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>>   limit acts like the low_limit until priority becomes desperate.
>
> Johannes was already suggesting that the low_limit should allow for a
> weaker semantic as well. I am not very much inclined to that but I can
> leave with a knob which would say oom_on_lowlimit (on by default but
> allowed to be set to 0). We would fallback to the full reclaim if
> no groups turn out to be reclaimable.

I like the strong semantic of your low_limit at least at level:1 cgroups
(direct children of root).  But I have also encountered situations where
a strict guarantee is too strict and a mere preference is desirable.
Perhaps the best plan is to continue with the proposed strict low_limit
and eventually provide an additional mechanism which provides weaker
guarantees (e.g. soft_limit or something else if soft_limit cannot be
altered).  These two would offer good support for a variety of use
cases.

I'm thinking of something like:

bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
		struct mem_cgroup *root,
		int priority)
{
	do {
		if (memcg == root)
			break;
		if (!res_counter_low_limit_excess(&memcg->res))
			return false;
		if ((priority >= DEF_PRIORITY - 2) &&
		    !res_counter_soft_limit_exceed(&memcg->res))
			return false;
	} while ((memcg = parent_mem_cgroup(memcg)));
	return true;
}

But this soft_limit,priority extension can be added later.


* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-01-31  0:28     ` Greg Thelen
@ 2014-02-03 14:43       ` Michal Hocko
  2014-02-04  1:33         ` Greg Thelen
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-02-03 14:43 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro,
	Tejun Heo

On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> On Thu, Jan 30 2014, Michal Hocko wrote:
> 
> > On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> > [...]
> >> The series looks useful.  We (Google) have been using something similar.
> >> In practice such a low_limit (or memory guarantee), doesn't nest very
> >> well.
> >> 
> >> Example:
> >>   - parent_memcg: limit 500, low_limit 500, usage 500
> >>     1 privately charged non-reclaimable page (e.g. mlock, slab)
> >>   - child_memcg: limit 500, low_limit 500, usage 499
> >
> > I am not sure this is a good example. Your setup basically say that no
> > single page should be reclaimed. I can imagine this might be useful in
> > some cases and I would like to allow it but it sounds too extreme (e.g.
> > a load which would start trashing heavily once the reclaim starts and it
> > makes more sense to start it again rather than crowl - think about some
> > mathematical simulation which might diverge).
> 
> Pages will still be reclaimed the usage_in_bytes is exceeds
> limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
> reclaim my memory due to external pressure, but internal pressure is
> different.

That sounds strange and very confusing to me. What if the internal
pressure comes from child memcgs? The low limit is intended for
protecting a group from reclaim and it shouldn't matter whether the
reclaim is a result of internal or external pressure.

> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
> >> page cache it will lead to an oom kill instead of reclaiming. 
> >
> > Does it make any sense to protect all of such memory although it is
> > easily reclaimable?
> 
> I think protection makes sense in this case.  If I know my workload
> needs 500 to operate well, then I reserve 500 using low_limit.  My app
> doesn't want to run with less than its reservation.
> 
> >> One could argue that this is working as intended because child_memcg
> >> was promised 500 but can only get 499.  So child_memcg is oom killed
> >> rather than being forced to operate below its promised low limit.
> >> 
> >> This has led to various internal workarounds like:
> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
> >>   only charge memory to cgroup leafs.  This gets tricky when dealing
> >>   with reparented memory inherited to parent from child during cgroup
> >>   deletion.
> >
> > Do those need any protection at all?
> 
> Interior tree nodes don't need protection from their children.  But
> children and interior nodes need protection from siblings and parents.

Why? They contain only reparented pages in the above case. Those would
be the #1 candidates for reclaim in most cases, no?

> >> - don't set low_limit on non leafs (e.g. do not set low limit on
> >>   parent_memcg).  This constrains the cgroup layout a bit.  Some
> >>   customers want to purchase $MEM and setup their workload with a few
> >>   child cgroups.  A system daemon hands out $MEM by setting low_limit
> >>   for top-level containers (e.g. parent_memcg).  Thereafter such
> >>   customers are able to partition their workload with sub memcg below
> >>   child_memcg.  Example:
> >>      parent_memcg
> >>          \
> >>           child_memcg
> >>             /     \
> >>         server   backup
> >
> > I think that the low_limit makes sense where you actually want to
> > protect something from reclaim. And backup sounds like a bad fit for
> > that.
> 
> The backup job would presumably have a small low_limit, but it may still
> have a minimum working set required to make useful forward progress.
> 
> Example:
>   parent_memcg
>       \
>        child_memcg limit 500, low_limit 500, usage 500
>          /     \
>          |   backup   limit 10, low_limit 10, usage 10
>          |
>       server limit 490, low_limit 490, usage 490
> 
> One could argue that problems appear when
> server.low_limit+backup.lower_limit=child_memcg.limit.  So the safer
> configuration is leave some padding:
>   server.low_limit + backup.low_limit + padding = child_memcg.limit
> but this just defers the problem.  As memory is reparented into parent,
> then padding must grow.

Which all sounds like a drawback of the internal vs. external pressure
semantic which you have mentioned above.

> >>   Thereafter customers often want some weak isolation between server and
> >>   backup.  To avoid undesired oom kills the server/backup isolation is
> >>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
> >>   limit acts like the low_limit until priority becomes desperate.
> >
> > Johannes was already suggesting that the low_limit should allow for a
> > weaker semantic as well. I am not very much inclined to that but I can
> > leave with a knob which would say oom_on_lowlimit (on by default but
> > allowed to be set to 0). We would fallback to the full reclaim if
> > no groups turn out to be reclaimable.
> 
> I like the strong semantic of your low_limit at least at level:1 cgroups
> (direct children of root).  But I have also encountered situations where
> a strict guarantee is too strict and a mere preference is desirable.
> Perhaps the best plan is to continue with the proposed strict low_limit
> and eventually provide an additional mechanism which provides weaker
> guarantees (e.g. soft_limit or something else if soft_limit cannot be
> altered).  These two would offer good support for a variety of use
> cases.
> 
> I thinking of something like:
> 
> bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> 		struct mem_cgroup *root,
> 		int priority)
> {
> 	do {
> 		if (memcg == root)
> 			break;
> 		if (!res_counter_low_limit_excess(&memcg->res))
> 			return false;
> 		if ((priority >= DEF_PRIORITY - 2) &&
> 		    !res_counter_soft_limit_exceed(&memcg->res))
> 			return false;
> 	} while ((memcg = parent_mem_cgroup(memcg)));
> 	return true;
> }

Mixing the soft limit into the picture is more than confusing because
it has its own meaning now and we shouldn't recycle it until it is
completely dead.
Another thing which seems more serious is that such reclaim logic would
inherently lead to potential over-reclaim, because two priority cycles
would be wasted with no progress and when we finally find somebody it
gets hammered even more at a lower priority.

What I would like much more is to fall back to ignoring low_limit if
nothing is reclaimable due to low_limit. That would be controlled at the
memcg level (something like memory.low_limit_fallback).

> But this soft_limit,priority extension can be added later.

Yes, I would like to have the strong semantic first and then deal with
a weaker form, either by a new limit or a flag.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-02-03 14:43       ` Michal Hocko
@ 2014-02-04  1:33         ` Greg Thelen
  0 siblings, 0 replies; 14+ messages in thread
From: Greg Thelen @ 2014-02-04  1:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, KOSAKI Motohiro,
	Tejun Heo

On Mon, Feb 03 2014, Michal Hocko wrote:

> On Thu 30-01-14 16:28:27, Greg Thelen wrote:
>> On Thu, Jan 30 2014, Michal Hocko wrote:
>> 
>> > On Wed 29-01-14 11:08:46, Greg Thelen wrote:
>> > [...]
>> >> The series looks useful.  We (Google) have been using something similar.
>> >> In practice such a low_limit (or memory guarantee), doesn't nest very
>> >> well.
>> >> 
>> >> Example:
>> >>   - parent_memcg: limit 500, low_limit 500, usage 500
>> >>     1 privately charged non-reclaimable page (e.g. mlock, slab)
>> >>   - child_memcg: limit 500, low_limit 500, usage 499
>> >
>> > I am not sure this is a good example. Your setup basically say that no
>> > single page should be reclaimed. I can imagine this might be useful in
>> > some cases and I would like to allow it but it sounds too extreme (e.g.
>> > a load which would start trashing heavily once the reclaim starts and it
>> > makes more sense to start it again rather than crowl - think about some
>> > mathematical simulation which might diverge).
>> 
>> Pages will still be reclaimed the usage_in_bytes is exceeds
>> limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
>> reclaim my memory due to external pressure, but internal pressure is
>> different.
>
> That sounds strange and very confusing to me. What if the internal
> pressure comes from children memcgs? Lowlimit is intended for protecting
> a group from reclaim and it shouldn't matter whether the reclaim is a
> result of the internal or external pressure.
>
>> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> >> page cache it will lead to an oom kill instead of reclaiming. 
>> >
>> > Does it make any sense to protect all of such memory although it is
>> > easily reclaimable?
>> 
>> I think protection makes sense in this case.  If I know my workload
>> needs 500 to operate well, then I reserve 500 using low_limit.  My app
>> doesn't want to run with less than its reservation.
>> 
>> >> One could argue that this is working as intended because child_memcg
>> >> was promised 500 but can only get 499.  So child_memcg is oom killed
>> >> rather than being forced to operate below its promised low limit.
>> >> 
>> >> This has led to various internal workarounds like:
>> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>> >>   only charge memory to cgroup leafs.  This gets tricky when dealing
>> >>   with reparented memory inherited to parent from child during cgroup
>> >>   deletion.
>> >
>> > Do those need any protection at all?
>> 
>> Interior tree nodes don't need protection from their children.  But
>> children and interior nodes need protection from siblings and parents.
>
> Why? They contains only reparented pages in the above case. Those would
> be #1 candidate for reclaim in most cases, no?

I think we're on the same page.  My example interior node has reparented
pages and is a #1 candidate for reclaim induced by charges against
parent_memcg, but not a candidate for reclaim due to global memory
pressure induced by a sibling of parent_memcg.

>> >> - don't set low_limit on non leafs (e.g. do not set low limit on
>> >>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>> >>   customers want to purchase $MEM and setup their workload with a few
>> >>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>> >>   for top-level containers (e.g. parent_memcg).  Thereafter such
>> >>   customers are able to partition their workload with sub memcg below
>> >>   child_memcg.  Example:
>> >>      parent_memcg
>> >>          \
>> >>           child_memcg
>> >>             /     \
>> >>         server   backup
>> >
>> > I think that the low_limit makes sense where you actually want to
>> > protect something from reclaim. And backup sounds like a bad fit for
>> > that.
>> 
>> The backup job would presumably have a small low_limit, but it may still
>> have a minimum working set required to make useful forward progress.
>> 
>> Example:
>>   parent_memcg
>>       \
>>        child_memcg limit 500, low_limit 500, usage 500
>>          /     \
>>          |   backup   limit 10, low_limit 10, usage 10
>>          |
>>       server limit 490, low_limit 490, usage 490
>> 
>> One could argue that problems appear when
>> server.low_limit+backup.low_limit=child_memcg.limit.  So the safer
>> configuration is to leave some padding:
>>   server.low_limit + backup.low_limit + padding = child_memcg.limit
>> but this just defers the problem.  As memory is reparented into the
>> parent, the padding must grow.
>
> Which all sounds like a drawback of the internal vs. external pressure
> semantics which you have mentioned above.

Huh?  I probably confused matters with the internal vs external talk
above.  Forgetting about that, I'm happy with the following
configuration assuming low_limit_fallback (ll_fallback) is eventually
available.

   parent_memcg
       \
        child_memcg limit 500, low_limit 500, usage 500, ll_fallback 0
          /     \
          |   backup   limit 10, low_limit 10, usage 10, ll_fallback 1
          |
       server limit 490, low_limit 490, usage 490, ll_fallback 1

>> >>   Thereafter customers often want some weak isolation between server and
>> >>   backup.  To avoid undesired oom kills the server/backup isolation is
>> >>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>> >>   limit acts like the low_limit until priority becomes desperate.
>> >
>> > Johannes was already suggesting that the low_limit should allow for a
>> > weaker semantic as well. I am not very much inclined to that but I can
>> > live with a knob which would say oom_on_lowlimit (on by default but
>> > allowed to be set to 0). We would fall back to the full reclaim if
>> > no groups turn out to be reclaimable.
>> 
>> I like the strong semantic of your low_limit at least at level:1 cgroups
>> (direct children of root).  But I have also encountered situations where
>> a strict guarantee is too strict and a mere preference is desirable.
>> Perhaps the best plan is to continue with the proposed strict low_limit
>> and eventually provide an additional mechanism which provides weaker
>> guarantees (e.g. soft_limit or something else if soft_limit cannot be
>> altered).  These two would offer good support for a variety of use
>> cases.
>> 
>> I'm thinking of something like:
>> 
>> bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
>> 		struct mem_cgroup *root,
>> 		int priority)
>> {
>> 	do {
>> 		if (memcg == root)
>> 			break;
>> 		/* strictly protected while below the low limit */
>> 		if (!res_counter_low_limit_excess(&memcg->res))
>> 			return false;
>> 		/* soft limit protects too, until priority gets desperate */
>> 		if ((priority >= DEF_PRIORITY - 2) &&
>> 		    !res_counter_soft_limit_excess(&memcg->res))
>> 			return false;
>> 	} while ((memcg = parent_mem_cgroup(memcg)));
>> 	return true;
>> }
>
> Mixing soft limit into the picture is more than confusing because it
> has its own meaning now and we shouldn't recycle it until it is dead
> completely.
> Another thing which seems more serious is that such a reclaim
> logic would inherently lead to potential overreclaim because 2
> priority cycles would be wasted with no progress, and when we finally
> find somebody it gets hammered harder at the lower priority.
>
> What I would like much more is to fallback to ignore low_limit if
> nothing is reclaimable due to low_limit. That would be controlled on a
> memcg level (something like memory.low_limit_fallback).

Sure, but that would require a sweep through the candidate memcgs to
confirm that all cgroups are operating below their low limit.  I suppose
we could have an optimization where the number of children above
low_limit is recorded in the parent.  Then reclaim in the parent would
immediately determine if low_limit should be violated (if
memory.low_limit_fallback=1).  But this can be deferred to later
patches.
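
For illustration, here is a minimal userspace model (not kernel code and
not part of any posted patch) of the low_limit_fallback semantics being
discussed: a first pass skips every group below its low limit, and only
if that pass finds nothing reclaimable is the limit ignored for groups
that opted in via the fallback knob.  All names (memcg_model, eligible,
the field names) are invented for this sketch.

#include <stdbool.h>
#include <stdio.h>

struct memcg_model {
	unsigned long long usage;
	unsigned long long low_limit;
	bool low_limit_fallback;	/* models memory.low_limit_fallback */
};

static bool eligible(const struct memcg_model *m, bool ignore_low_limit)
{
	if (m->usage > m->low_limit)
		return true;
	/* Below the low limit: only reclaimable in fallback mode. */
	return ignore_low_limit && m->low_limit_fallback;
}

int main(void)
{
	struct memcg_model groups[] = {
		{ .usage = 490, .low_limit = 490, .low_limit_fallback = true },
		{ .usage = 10,  .low_limit = 10,  .low_limit_fallback = true },
	};
	size_t i, n = sizeof(groups) / sizeof(groups[0]);
	bool any_eligible = false;

	/* First sweep: is anybody above its low limit? */
	for (i = 0; i < n; i++)
		any_eligible |= eligible(&groups[i], false);

	/*
	 * Nobody was, so instead of going OOM the second sweep honours the
	 * per-group fallback flag and reclaims from groups that allow it.
	 */
	for (i = 0; i < n; i++)
		printf("group %zu eligible: %d\n", i,
		       eligible(&groups[i], !any_eligible));
	return 0;
}

With both example groups sitting exactly at their low limits, the first
sweep finds nothing, and the second sweep reclaims from them instead of
invoking the OOM killer.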

>> But this soft_limit,priority extension can be added later.
>
> Yes, I would like to have the strong semantic first and then deal with a
> weaker form. Either by a new limit or a flag.

Sounds good.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-01-29 18:22   ` Michal Hocko
@ 2014-02-12 12:28     ` Roman Gushchin
  2014-02-13 16:12       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Gushchin @ 2014-02-12 12:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, LKML, Ying Han, Hugh Dickins,
	Michel Lespinasse, Greg Thelen, KOSAKI Motohiro, Tejun Heo

Hi, Michal!

Sorry for the delayed reply.

At Wed, 29 Jan 2014 19:22:59 +0100,
Michal Hocko wrote:
> > As you may remember, I proposed introducing low limits about a year ago.
> > 
> > We had a small discussion at that time: http://marc.info/?t=136195226600004 .
> 
> yes I remember that discussion and vaguely remember the proposed
> approach. I really wanted to avoid introducing a new knob but
> things evolved differently than I planned since then and it turned out
> that the new knob is unavoidable. That's why I came up with this approach
> which is quite different from yours AFAIR.
>  
> > Since that time we have been using low limits intensively in our production
> > (on thousands of machines). So, I'm very interested in getting this
> > functionality merged upstream.
> 
> Have you tried to use this implementation? Would this work as well?
> My very vague recollection of your patch is that it didn't cover both
> global and target reclaims and it didn't fit into the reclaim very
> naturally since it used its own scaling method. I will have to refresh my
> memory though.

IMHO, the main problem of your implementation is the following:
the number of reclaimed pages is not limited at all once a cgroup
is over its low memory limit. So a significant number of pages can
be reclaimed even if the memory usage is only a bit (e.g. one page)
above the low limit.

In my case, this problem is solved by scaling the number of scanned pages.

I think an ideal solution is to limit the number of reclaimed pages by
the low limit excess value. This would allow discarding my scaling code
while preserving the strict semantics of the low limit under memory
pressure. The main problem here is how to balance scanning pressure
between cgroups and LRUs.

Maybe we should calculate the number of pages to scan in an LRU based on
the low limit excess value instead of the number of pages...
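
As a rough userspace sketch of that idea (this is not the patch below,
and the helper names are invented): cap the reclaim target for a group
at its excess over the low limit, so a group that is one page over the
limit gives back at most one page.

#include <stdio.h>

static unsigned long long low_limit_excess(unsigned long long usage,
					   unsigned long long low_limit)
{
	return usage > low_limit ? usage - low_limit : 0;
}

static unsigned long long capped_reclaim_target(unsigned long long want,
						unsigned long long usage,
						unsigned long long low_limit)
{
	unsigned long long excess = low_limit_excess(usage, low_limit);

	/* Never ask for more than the group is over its low limit. */
	return want < excess ? want : excess;
}

int main(void)
{
	/* A 32-page batch is requested, but the group is only 1 page over. */
	printf("reclaim at most %llu page(s)\n",
	       capped_reclaim_target(32, 501, 500));
	return 0;
}

Here the strict low-limit semantics are preserved: the caller wanted 32
pages but the group only yields its 1-page excess.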

> > In my experience, low limits also require some changes in memcg page accounting
> > policy. For instance, an application in a protected cgroup should have a guarantee
> > that its filecache belongs to its cgroup and is therefore protected by the low
> > limit. If the filecache was created by another application in another cgroup,
> > that may not be the case. I've solved this problem by implementing optional page
> > reaccounting on pagefaults and read/writes.
> 
> Memory sharing is a separate issue and we should discuss that
> separately. 
> 
> > I can prepare my current version of the patchset, if someone is interested.
> 
> Sure, having something to compare with is always valuable.

----
Subject: [PATCH] memcg: low limits for memory cgroups

Low limits for a memory cgroup can be used to limit memory pressure on it.
If the memory usage of a cgroup is under its low limit, it will not be
affected by global reclaim. If it reaches its low limit from above,
the reclaiming speed is scaled down exponentially.

Low limits don't affect soft reclaim.
Also, it's possible that a cgroup with memory usage under its low limit
will still be reclaimed slowly at very low scanning priorities.
---
 include/linux/memcontrol.h  |  7 ++++++
 include/linux/res_counter.h | 17 +++++++++++++
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 60 +++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                 |  9 +++++++
 5 files changed, 95 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index abd0113..3905e95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -231,6 +231,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -427,6 +429,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 201a697..7a16c2a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,6 +40,10 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the guaranteed minimal limit of the resource
+	 */
+	unsigned long long low_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -88,6 +92,7 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_LIMIT,
 };
 
 /*
@@ -224,4 +229,16 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_low_limit(struct res_counter *cnt,
+			   unsigned long long low_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_limit = low_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4aa8a30..c57daf9 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -135,6 +135,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_LIMIT:
+		return &counter->low_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd..d24b768 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1883,6 +1883,46 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			 NULL, "Memory cgroup out of memory");
 }
 
+/*
+ * If a cgroup is under its low limit or close enough to it,
+ * decrease the page scanning speed.
+ *
+ * mem_cgroup_low_limit_scale() returns a number
+ * from range [0, DEF_PRIORITY - 2], which is used
+ * in the reclaim code as a scanning priority modifier.
+ *
+ * If the low limit is not set, it returns 0.
+ *
+ * usage - low_limit > usage / 8  => 0
+ * usage - low_limit > usage / 16 => 1
+ * usage - low_limit > usage / 32 => 2
+ * ...
+ * usage - low_limit > usage / (1 << DEF_PRIORITY) => DEF_PRIORITY - 3
+ * usage < low_limit => DEF_PRIORITY - 2
+ *
+ */
+unsigned int mem_cgroup_low_limit_scale(struct mem_cgroup *memcg)
+{
+	unsigned long long low_limit;
+	unsigned long long usage;
+	unsigned int i;
+
+	low_limit = res_counter_read_u64(&memcg->res, RES_LOW_LIMIT);
+	if (!low_limit)
+		return 0;
+
+	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+
+	if (usage < low_limit)
+		return DEF_PRIORITY - 2;
+
+	for (i = 0; i < DEF_PRIORITY - 2; i++)
+		if (usage - low_limit > (usage >> (i + 3)))
+			break;
+
+	return i;
+}
+
 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 					gfp_t gfp_mask,
 					unsigned long flags)
@@ -5318,6 +5358,20 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		else
 			ret = -EINVAL;
 		break;
+	case RES_LOW_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, low limits (as with soft limits, see above)
+		 * are hard to implement in terms of semantics;
+		 * for now, we support low limits only for control without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_low_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -6243,6 +6297,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "low_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LOW_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
 		.name = "soft_limit_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
 		.write_string = mem_cgroup_write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..1d4eaac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,6 +83,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* If the memcg is under its low limit, do not scan it aggressively */
+	int low_limit_scale;
+
 	/*
 	 * The memory cgroup that hit its limit and as a result is the
 	 * primary target of this reclaim invocation.
@@ -2003,6 +2006,10 @@ out:
 			/* Look ma, no brain */
 			BUG();
 		}
+
+		if (sc->low_limit_scale)
+			scan >>= sc->low_limit_scale;
+
 		nr[lru] = scan;
 	}
 }
@@ -2206,6 +2213,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
+			sc->low_limit_scale = mem_cgroup_low_limit_scale(memcg);
 			shrink_lruvec(lruvec, sc);
 
 			/*
@@ -2640,6 +2648,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.may_swap = !noswap,
 		.order = 0,
 		.priority = 0,
+		.low_limit_scale = 0,
 		.target_mem_cgroup = memcg,
 	};
 	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-- 
1.8.5.3
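
For concreteness, a standalone userspace re-implementation of the mapping
computed by mem_cgroup_low_limit_scale() in the patch above, showing what
the scaling comment works out to for a few sample values.  The low limit
of 1000 pages and the usage values are arbitrary; DEF_PRIORITY is 12 in
the current kernel.

#include <stdio.h>

#define DEF_PRIORITY	12

static unsigned int low_limit_scale(unsigned long long usage,
				    unsigned long long low_limit)
{
	unsigned int i;

	if (!low_limit)
		return 0;
	if (usage < low_limit)
		return DEF_PRIORITY - 2;
	for (i = 0; i < DEF_PRIORITY - 2; i++)
		if (usage - low_limit > (usage >> (i + 3)))
			break;
	return i;
}

int main(void)
{
	const unsigned long long usages[] = { 900, 1001, 1100, 2000 };
	size_t i;

	for (i = 0; i < sizeof(usages) / sizeof(usages[0]); i++)
		printf("usage %4llu, low_limit 1000 -> scale %u\n",
		       usages[i], low_limit_scale(usages[i], 1000));
	return 0;
}

For these inputs the scale comes out to 10, 7, 1 and 0 respectively: the
further usage sits above the low limit, the less the per-LRU scan target
is shifted down.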


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC 0/4] memcg: Low-limit reclaim
  2014-02-12 12:28     ` Roman Gushchin
@ 2014-02-13 16:12       ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2014-02-13 16:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	LKML, Ying Han, Hugh Dickins, Michel Lespinasse, Greg Thelen,
	KOSAKI Motohiro, Tejun Heo

On Wed 12-02-14 16:28:36, Roman Gushchin wrote:
> Hi, Michal!
> 
> Sorry for the delayed reply.
> 
> At Wed, 29 Jan 2014 19:22:59 +0100,
> Michal Hocko wrote:
> > > As you may remember, I proposed introducing low limits about a year ago.
> > > 
> > > We had a small discussion at that time: http://marc.info/?t=136195226600004 .
> > 
> > yes I remember that discussion and vaguely remember the proposed
> > approach. I really wanted to avoid introducing a new knob but
> > things evolved differently than I planned since then and it turned out
> > that the new knob is unavoidable. That's why I came up with this approach
> > which is quite different from yours AFAIR.
> >  
> > > Since that time we have been using low limits intensively in our production
> > > (on thousands of machines). So, I'm very interested in getting this
> > > functionality merged upstream.
> > 
> > Have you tried to use this implementation? Would this work as well?
> > My very vague recollection of your patch is that it didn't cover both
> > global and target reclaims and it didn't fit into the reclaim very
> > naturally since it used its own scaling method. I will have to refresh my
> > memory though.
> 
> IMHO, the main problem of your implementation is the following:
> the number of reclaimed pages is not limited at all once a cgroup
> is over its low memory limit. So a significant number of pages can
> be reclaimed even if the memory usage is only a bit (e.g. one page)
> above the low limit.

Yes, but this is the same problem as with the regular reclaim.
We do not have any guarantee that we will reclaim only the required
amount of memory. As the reclaim priority drops we can overreclaim.
The global reclaim tries to avoid this problem by keeping the priority
as high as possible. And the target reclaim is not a big deal because we
limit the number of reclaimed pages to the swap cluster size.
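
As a standalone userspace illustration of why falling through priority
levels over-reclaims (not kernel code; the LRU size is just an example
value): the per-pass scan target is roughly lru_size >> priority, as the
scan_control comment in mm/vmscan.c notes, so every wasted priority cycle
doubles how hard the eventually chosen group is scanned.  DEF_PRIORITY is
12 in the current kernel.

#include <stdio.h>

#define DEF_PRIORITY	12

int main(void)
{
	unsigned long lru_size = 1UL << 20;	/* 1M pages, example only */
	int priority;

	for (priority = DEF_PRIORITY; priority >= DEF_PRIORITY - 3; priority--)
		printf("priority %2d -> scan ~%lu pages per pass\n",
		       priority, lru_size >> priority);
	return 0;
}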

I do not see this as a practical problem of the low_limit though,
because it protects those that are below the limit, not those above.
Small fluctuations around the limit should be tolerable.

> In my case, this problem is solved by scaling the number of scanned pages.
> 
> I think an ideal solution is to limit the number of reclaimed pages by
> the low limit excess value. This would allow discarding my scaling code
> while preserving the strict semantics of the low limit under memory
> pressure. The main problem here is how to balance scanning pressure
> between cgroups and LRUs.
> 
> Maybe we should calculate the number of pages to scan in an LRU based on
> the low limit excess value instead of the number of pages...

I do not like it much and I expect other mm people to feel similarly. We
already have scanning scaling based on the priority. Adding a new
variable into the picture will only make the whole thing more
complicated without a very good reason for it.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-02-13 16:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-11 14:15 [RFC 0/4] memcg: Low-limit reclaim Michal Hocko
2013-12-11 14:15 ` [RFC 1/4] memcg, mm: introduce lowlimit reclaim Michal Hocko
2013-12-11 14:15 ` [RFC 2/4] mm, memcg: allow OOM if no memcg is eligible during direct reclaim Michal Hocko
2013-12-11 14:15 ` [RFC 3/4] memcg: Allow setting low_limit Michal Hocko
2013-12-11 14:15 ` [RFC 4/4] mm, memcg: expedite OOM if no memcg is reclaimable Michal Hocko
2014-01-24 11:07 ` [RFC 0/4] memcg: Low-limit reclaim Roman Gushchin
2014-01-29 18:22   ` Michal Hocko
2014-02-12 12:28     ` Roman Gushchin
2014-02-13 16:12       ` Michal Hocko
2014-01-29 19:08 ` Greg Thelen
2014-01-30 12:30   ` Michal Hocko
2014-01-31  0:28     ` Greg Thelen
2014-02-03 14:43       ` Michal Hocko
2014-02-04  1:33         ` Greg Thelen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).