* [PATCH v2 0/4] memcg: Low-limit reclaim
@ 2014-04-28 12:26 ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-04-28 12:26 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, Roman Gushchin, LKML, linux-mm

Hi,
previous discussions have shown that soft limits cannot be reformed
(http://lwn.net/Articles/555249/). This series introduces an alternative
approach for protecting memory allocated to processes executing within
a memory cgroup controller. It is based on a new tunable that was
discussed with Johannes and Tejun during the kernel summit 2013 and
at LSF 2014.

This patchset introduces a low limit that is functionally similar
to a minimum guarantee. Memcgs which are under their low limit are not
considered eligible for reclaim (both global and hard limit) unless
all groups under the reclaimed hierarchy are below their low limits,
in which case all of them are considered eligible.
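
To illustrate the fallback, here is a minimal userspace sketch of the
two-pass idea. It is not the kernel code: struct group, reclaim_pass()
and reclaim() are hypothetical names for a flat model without any
hierarchy, and "scanned" only marks which groups a pass would visit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flat model of a group; not the kernel's struct mem_cgroup. */
struct group {
	unsigned long long usage;
	unsigned long long low_limit;
	bool scanned;		/* set when a reclaim pass visits the group */
};

/*
 * One reclaim pass over n groups. With honor_low_limit set, groups whose
 * usage does not exceed their low limit are skipped. Returns how many
 * groups were visited.
 */
static unsigned reclaim_pass(struct group *groups, size_t n,
			     bool honor_low_limit)
{
	unsigned visited = 0;
	size_t i;

	assert(groups != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (honor_low_limit && groups[i].usage <= groups[i].low_limit)
			continue;
		groups[i].scanned = true;
		visited++;
	}
	return visited;
}

/* Honor the low limit first; ignore it only if nothing was eligible. */
static unsigned reclaim(struct group *groups, size_t n)
{
	unsigned visited = reclaim_pass(groups, n, true);

	if (!visited)
		visited = reclaim_pass(groups, n, false);
	return visited;
}
```

With one group above its low limit and one below, only the former is
visited. If every group is at or below its low limit, the second pass
visits all of them instead of giving up.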

The previous version of the patchset, posted as an RFC
(http://marc.info/?l=linux-mm&m=138677140628677&w=2), suggested a
hard guarantee without any fallback. More discussions led me to
reconsider the default behavior and come up with a more relaxed one. The
hard requirement can be added later based on a use case which really
requires it. It would be controlled by a memory.reclaim_flags knob which
would specify whether to OOM or fall back (default) when all groups are
below their low limit.

The default value of the limit is 0 so all groups are eligible by
default and an interested party has to explicitly set the limit.

The primary use case is to protect an amount of memory allocated to a
workload without it being reclaimed by an unrelated activity. In some
cases this requirement can be fulfilled by mlock but it is not suitable
for many loads and generally requires application awareness, which
can be complex. mlock also effectively forbids the use of memory
overcommit as the application must explicitly manage memory
residency.
With the low limit, such workloads can be placed in a memcg with a low
limit that protects the estimated working set.

The hierarchical behavior of the low limit is described in the first
patch.
The second patch allows setting the low limit.
The last 2 patches clarify documentation about memcg reclaim in
general (3rd patch) and the low limit (4th patch).

There were some calls for using a different name but I couldn't come up
with something better, so if there are better proposals I am happy to
change this.

The series is based on top of the current mmotm tree. Once the series
gets accepted I will post a patch which will mark the soft limit as
deprecated with a note that it will eventually be dropped. Let me know
if you would prefer to have such a patch as part of the series.

Thoughts?

Short log says:
Michal Hocko (4):
      memcg, mm: introduce lowlimit reclaim
      memcg: Allow setting low_limit
      memcg, doc: clarify global vs. limit reclaims
      memcg: Document memory.low_limit_in_bytes

And diffstat says:
 Documentation/cgroups/memory.txt | 40 +++++++++++++++++++++-----------
 include/linux/memcontrol.h       |  9 ++++++++
 include/linux/res_counter.h      | 40 ++++++++++++++++++++++++++++++++
 kernel/res_counter.c             |  2 ++
 mm/memcontrol.c                  | 50 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                      | 34 ++++++++++++++++++++++++++-
 6 files changed, 159 insertions(+), 16 deletions(-)

* [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-28 12:26   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-04-28 12:26 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, Roman Gushchin, LKML, linux-mm

This patch introduces low limit reclaim. The low_limit acts as a reclaim
protection because groups which are under their low_limit are considered
ineligible for reclaim. While the hard limit protects from using more
memory than allowed, the low limit protects the group's usage from
getting below the memory assigned to it due to external memory pressure.

More precisely, a group is considered eligible for reclaim under a
specific hierarchy represented by its root only if the group is above
its low limit and the same applies to all parents up the hierarchy to
the root. Nevertheless, the limit might still be ignored if all groups
under the reclaimed hierarchy are under their low limits. In that case
avoiding an OOM takes precedence over protecting the memory.

Consider the following hierarchy with memory pressure coming from the
group A (hard limit reclaim; l = low_limit_in_bytes, u = usage_in_bytes,
h = limit_in_bytes):
		root_mem_cgroup
			.
		  _____/
		 /
		A (l = 80 u=90 h=90)
	       /
	      / \_________
	     /            \
	    B (l=0 u=50)   C (l=50 u=40)
	                    \
			     D (l=0 u=30)

A and B are reclaimable but C and D are not (D is protected by C).
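
The result above can be reproduced with a small userspace model of the
walk up the hierarchy. This is only a sketch: struct group and
reclaim_eligible() are hypothetical stand-ins for the memcg and the
check added by this patch, and root_mem_cgroup is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: only what the walk needs. */
struct group {
	const struct group *parent;	/* NULL above the topmost group */
	unsigned long long usage;
	unsigned long long low_limit;
};

/*
 * Eligible only if g and every ancestor up to and including root are
 * above their low limits. The walk stops at root, so groups above the
 * reclaim root are not consulted.
 */
static bool reclaim_eligible(const struct group *g, const struct group *root)
{
	assert(g != NULL);
	for (; g; g = g->parent) {
		if (g->usage <= g->low_limit)
			return false;
		if (g == root)
			break;
	}
	return true;
}
```

Filling in the example (A: u=90 l=80; B: u=50 l=0, child of A; C: u=40
l=50, child of A; D: u=30 l=0, child of C) and using A as the reclaim
root, the function returns true for A and B and false for C and D. D is
above its own low limit but is rejected when the walk reaches C.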

The low_limit is 0 by default so every group is eligible. This patch
doesn't provide a way to set the limit yet although the core
infrastructure is there already.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h  |  9 +++++++++
 include/linux/res_counter.h | 27 +++++++++++++++++++++++++++
 mm/memcontrol.c             | 23 +++++++++++++++++++++++
 mm/vmscan.c                 | 34 +++++++++++++++++++++++++++++++++-
 4 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1fa23244fe37..6c59056f4bc6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,9 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
+extern bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root);
+
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
@@ -288,6 +291,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root)
+{
+	return true;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 56b7bc32db4f..408724eeec71 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,6 +40,11 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
+	 * the limit under which the usage cannot be pushed
+	 * due to external pressure.
+	 */
+	unsigned long long low_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -175,6 +180,28 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+/**
+ * Get the difference between the usage and the low limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to low limit
+ * The difference between usage and low limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_low_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->low_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->low_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19d620b3d69c..40e517630138 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
+/**
+ * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
+ * reclaim
+ * @memcg: target memcg for the reclaim
+ * @root: root of the reclaim hierarchy (null for the global reclaim)
+ *
+ * The given group is reclaimable if it is above its low limit and the same
+ * applies for all parents up the hierarchy until root (including).
+ */
+bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+		struct mem_cgroup *root)
+{
+	do {
+		if (!res_counter_low_limit_excess(&memcg->res))
+			return false;
+		if (memcg == root)
+			break;
+
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	return true;
+}
+
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c1cd99a5074b..0f428158254e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2215,9 +2215,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
+		bool follow_low_limit)
 {
 	unsigned long nr_reclaimed, nr_scanned;
+	unsigned nr_scanned_groups = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2234,7 +2236,23 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		do {
 			struct lruvec *lruvec;
 
+			/*
+			 * Memcg might be under its low limit so we have to
+			 * skip it during the first reclaim round
+			 */
+			if (follow_low_limit &&
+					!mem_cgroup_reclaim_eligible(memcg, root)) {
+				/*
+				 * It would be more optimal to skip the memcg
+				 * subtree now but we do not have a memcg iter
+				 * helper for that. Anyone?
+				 */
+				memcg = mem_cgroup_iter(root, memcg, &reclaim);
+				continue;
+			}
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+			nr_scanned_groups++;
 
 			shrink_lruvec(lruvec, sc);
 
@@ -2262,6 +2280,20 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
+
+	return nr_scanned_groups;
+}
+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+	if (!__shrink_zone(zone, sc, true)) {
+		/*
+		 * First round of reclaim didn't find anything to reclaim
+		 * because of low limit protection so try again and ignore
+		 * the low limit this time.
+		 */
+		__shrink_zone(zone, sc, false);
+	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
-- 
2.0.0.rc0


* [PATCH 2/4] memcg: Allow setting low_limit
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-28 12:26   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-04-28 12:26 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, Roman Gushchin, LKML, linux-mm

Export memory.low_limit_in_bytes knob with the same rules as the hard
limit represented by limit_in_bytes knob (e.g. no limit to be set for
the root cgroup). There is no memsw alternative for low_limit_in_bytes
because the primary motivation behind this limit is to protect the
working set of the group and so considering swap doesn't make much
sense. There is also no kmem variant exported because we do not have any
easy way to protect kernel allocations now.

Please note that the low limit might exceed the hard limit, which
basically means that the group is not reclaimable as long as there is
another reclaim target in the hierarchy under pressure.
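
For completeness, a minimal sketch of how userspace might set the knob
from C. write_limit() is a hypothetical helper, and the path is whatever
the caller passes in; memory.low_limit_in_bytes only exists in the
group's directory with this series applied, and the location of the
memory controller mount is an assumption left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write a byte count to a cgroup control file such as
 * <memcg dir>/memory.low_limit_in_bytes. Returns 0 on success or a
 * negative errno value. The path is supplied by the caller.
 */
static int write_limit(const char *path, unsigned long long bytes)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%llu\n", bytes) < 0)
		ret = -EIO;
	/* Buffered data is flushed here, so a rejected value shows up now. */
	if (fclose(f) != 0 && !ret)
		ret = -errno;
	return ret;
}
```

With this patch a write to the root cgroup's file is expected to fail
with -EINVAL, in line with the rule that no limit can be set for the
root cgroup.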

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/res_counter.h | 13 +++++++++++++
 kernel/res_counter.c        |  2 ++
 mm/memcontrol.c             | 27 ++++++++++++++++++++++++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 408724eeec71..b810855024f9 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -93,6 +93,7 @@ enum {
 	RES_LIMIT,
 	RES_FAILCNT,
 	RES_SOFT_LIMIT,
+	RES_LOW_LIMIT,
 };
 
 /*
@@ -247,4 +248,16 @@ res_counter_set_soft_limit(struct res_counter *cnt,
 	return 0;
 }
 
+static inline int
+res_counter_set_low_limit(struct res_counter *cnt,
+				unsigned long long low_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->low_limit = low_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 51dbac6a3633..e851a9ad50bf 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -136,6 +136,8 @@ res_counter_member(struct res_counter *counter, int member)
 		return &counter->failcnt;
 	case RES_SOFT_LIMIT:
 		return &counter->soft_limit;
+	case RES_LOW_LIMIT:
+		return &counter->low_limit;
 	};
 
 	BUG();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40e517630138..53193fec8c50 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1695,8 +1695,9 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 
 	rcu_read_unlock();
 
-	pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n",
+	pr_info("memory: usage %llukB, low_limit %llukB limit %llukB, failcnt %llu\n",
 		res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
+		res_counter_read_u64(&memcg->res, RES_LOW_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->res, RES_FAILCNT));
 	pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n",
@@ -5134,6 +5135,24 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		else
 			return -EINVAL;
 		break;
+	case RES_LOW_LIMIT:
+		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
+			ret = -EINVAL;
+			break;
+		}
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		if (type == _MEM) {
+			ret = res_counter_set_low_limit(&memcg->res, val);
+			break;
+		}
+		/*
+		 * memsw low limit doesn't make any sense and kmem is not
+		 * implemented yet - if ever
+		 */
+		return -EINVAL;
+
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
 		if (ret)
@@ -6056,6 +6075,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "low_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LOW_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
 		.name = "soft_limit_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
 		.write_string = mem_cgroup_write,
-- 
2.0.0.rc0


* [PATCH 3/4] memcg, doc: clarify global vs. limit reclaims
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-28 12:26   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-04-28 12:26 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, Roman Gushchin, LKML, linux-mm

Be explicit about global and hard limit reclaims in our documentation.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4937e6fff9b4..add1be001416 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -236,23 +236,26 @@ it by cgroup.
 2.5 Reclaim
 
 Each cgroup maintains a per cgroup LRU which has the same structure as
-global VM. When a cgroup goes over its limit, we first try
-to reclaim memory from the cgroup so as to make space for the new
-pages that the cgroup has touched. If the reclaim is unsuccessful,
-an OOM routine is invoked to select and kill the bulkiest task in the
-cgroup. (See 10. OOM Control below.)
-
-The reclaim algorithm has not been modified for cgroups, except that
-pages that are selected for reclaiming come from the per-cgroup LRU
-list.
-
-NOTE: Reclaim does not work for the root cgroup, since we cannot set any
-limits on the root cgroup.
+global VM. Cgroups can be reclaimed under two conditions:
+ - under global memory pressure, when all cgroups are reclaimed
+   proportionally to their LRU size in a round-robin fashion
+ - when a cgroup or its hierarchical parent (see 6. Hierarchy support)
+   hits its hard limit. If the reclaim is unsuccessful, an OOM routine is
+   invoked to select and kill the bulkiest task in the cgroup. (See 10. OOM
+   Control below.)
+
+Global and hard-limit reclaims share the same code; the only difference
+is the objective of the reclaim. The global reclaim aims at balancing
+zones' watermarks while the limit reclaim frees some memory to allow new
+charges.
+
+NOTE: Hard limit reclaim does not work for the root cgroup, since we cannot set
+any limits on the root cgroup.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
-When oom event notifier is registered, event will be delivered.
-(See oom_control section)
+When an oom event notifier is registered, the event will be delivered to the
+root of the memory pressure which cannot be handled. (See oom_control section)
 
 2.6 Locking
 
-- 
2.0.0.rc0

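A minimal sketch of what "proportionally to LRU size" can mean in practice,
assuming every pass scans the same fraction of each group's LRU. The helper
scan_target() and the shift by a priority value are illustrative assumptions,
loosely modelled on how reclaim scales its scan counts; this is not code from
the patch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: number of pages to scan from one cgroup's LRU in a
 * single pass. Using the same priority for every group keeps each group's
 * share proportional to its LRU size.
 */
static unsigned long scan_target(unsigned long lru_pages, unsigned int priority)
{
	/* Shifting by the full width of the type would be undefined. */
	assert(priority < sizeof(unsigned long) * CHAR_BIT);
	return lru_pages >> priority;
}
```

With the same priority, a group whose LRU is twice as large gets twice the
scan target, which is the proportionality the text describes.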

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* [PATCH 4/4] memcg: Document memory.low_limit_in_bytes
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-28 12:26   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-04-28 12:26 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, Roman Gushchin, LKML, linux-mm

Describe low_limit_in_bytes and its effect.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index add1be001416..a52913fe96fb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -57,6 +57,7 @@ Brief summary of control files.
  memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
+ memory.low_limit_in_bytes	 # set/show low limit for memory reclaim
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
@@ -249,6 +250,14 @@ is the objective of the reclaim. The global reclaim aims at balancing
 zones' watermarks while the limit reclaim frees some memory to allow new
 charges.
 
+Groups can also be protected from both global and limit reclaim by the
+low_limit_in_bytes knob. If the limit is non-zero, the reclaim logic
+doesn't include groups (and their subgroups - see 6. Hierarchy support)
+which are below the low limit if there is another eligible cgroup in the
+reclaimed hierarchy. If all groups which participate in the reclaim are
+under their low limits, then all of them are reclaimed and the low limit
+is ignored.
+
 NOTE: Hard limit reclaim does not work for the root cgroup, since we cannot set
 any limits on the root cgroup.
 
-- 
2.0.0.rc0

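The rule documented above can be read as "a group is skipped when it, or any
of its ancestors up to the root of the reclaimed hierarchy, is below a
non-zero low limit". The following userspace sketch illustrates that reading
only. struct demo_group, below_low_limit() and protected_in_hierarchy() are
hypothetical names, and treating usage == low limit as unprotected is an
assumption; none of this is taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct demo_group {
	const struct demo_group *parent; /* NULL at the top of the hierarchy */
	unsigned long usage;             /* bytes charged, children included */
	unsigned long low_limit;         /* 0 means "no protection" */
};

/* Assumption: a group exactly at its low limit is not protected. */
static bool below_low_limit(const struct demo_group *g)
{
	return g->low_limit != 0 && g->usage < g->low_limit;
}

/*
 * True if g, or any ancestor up to and including reclaim_root, is below
 * its low limit. Ancestors above reclaim_root are not consulted.
 */
static bool protected_in_hierarchy(const struct demo_group *g,
				   const struct demo_group *reclaim_root)
{
	assert(g != NULL);
	for (; g != NULL; g = g->parent) {
		if (below_low_limit(g))
			return true;
		if (g == reclaim_root)
			break;
	}
	return false;
}
```

A reclaimer built on this would skip every group for which
protected_in_hierarchy() returns true, and drop the check only when no
group is left to reclaim from.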

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-28 15:46   ` Roman Gushchin
  -1 siblings, 0 replies; 196+ messages in thread
From: Roman Gushchin @ 2014-04-28 15:46 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, LKML, linux-mm

28.04.2014, 16:27, "Michal Hocko" <mhocko@suse.cz>:
> The series is based on top of the current mmotm tree. Once the series
> gets accepted I will post a patch which will mark the soft limit as
> deprecated with a note that it will be eventually dropped. Let me know
> if you would prefer to have such a patch a part of the series.
>
> Thoughts?


Looks good to me.

The only question is: are there any ideas how the hierarchy support
will be used in this case in practice?
Will someone set low limit for non-leaf cgroups? Why?

Thanks,
Roman

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-28 15:46   ` Roman Gushchin
@ 2014-04-29  7:42     ` Greg Thelen
  -1 siblings, 0 replies; 196+ messages in thread
From: Greg Thelen @ 2014-04-29  7:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Hugh Dickins,
	LKML, linux-mm


On Mon, Apr 28 2014, Roman Gushchin <klamm@yandex-team.ru> wrote:

> 28.04.2014, 16:27, "Michal Hocko" <mhocko@suse.cz>:
>> The series is based on top of the current mmotm tree. Once the series
>> gets accepted I will post a patch which will mark the soft limit as
>> deprecated with a note that it will be eventually dropped. Let me know
>> if you would prefer to have such a patch a part of the series.
>>
>> Thoughts?
>
>
> Looks good to me.
>
> The only question is: are there any ideas how the hierarchy support
> will be used in this case in practice?
> Will someone set low limit for non-leaf cgroups? Why?
>
> Thanks,
> Roman

I imagine that a hosting service may want to give X MB to a top level
memcg (/a) with sub-jobs (/a/b, /a/c) which may or may not have their own
low-limits.

Examples:

case_1) only set low limit on /a.  /a/b and /a/c may overcommit /a's
        memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes).

case_2) low limits on all memcg.  But not overcommitting low_limits
        (b.low_limit_in_bytes + c.low_limit_in_bytes <=
        a.low_limit_in_bytes).
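
As a rough illustration of the case_2 constraint, the check below tests
whether the children's low limits fit within the parent's. The helper name
low_limits_fit() and its types are made up for this sketch and are not part
of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check for case_2: returns true when the sum of the children's
 * low limits does not exceed the parent's low limit. Children with a low
 * limit of 0 (case_1) trivially fit.
 */
static bool low_limits_fit(unsigned long long parent_low,
			   const unsigned long long *child_low, size_t n)
{
	unsigned long long sum = 0;
	size_t i;

	assert(child_low != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* sum <= parent_low holds here, so this cannot wrap around. */
		if (child_low[i] > parent_low - sum)
			return false;
		sum += child_low[i];
	}
	return true;
}
```

For a 512 MB parent, two children at 256 MB each fit, while two children at
300 MB each overcommit the parent's low limit.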

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-29  7:42     ` Greg Thelen
@ 2014-04-29 10:50       ` Roman Gushchin
  -1 siblings, 0 replies; 196+ messages in thread
From: Roman Gushchin @ 2014-04-29 10:50 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Hugh Dickins,
	LKML, linux-mm

29.04.2014, 11:42, "Greg Thelen" <gthelen@google.com>:
> On Mon, Apr 28 2014, Roman Gushchin <klamm@yandex-team.ru> wrote:
>
>>  28.04.2014, 16:27, "Michal Hocko" <mhocko@suse.cz>:
>>>  The series is based on top of the current mmotm tree. Once the series
>>>  gets accepted I will post a patch which will mark the soft limit as
>>>  deprecated with a note that it will be eventually dropped. Let me know
>>>  if you would prefer to have such a patch a part of the series.
>>>
>>>  Thoughts?
>>  Looks good to me.
>>
>>  The only question is: are there any ideas how the hierarchy support
>>  will be used in this case in practice?
>>  Will someone set low limit for non-leaf cgroups? Why?
>>
>>  Thanks,
>>  Roman
>
> I imagine that a hosting service may want to give X MB to a top level
> memcg (/a) with sub-jobs (/a/b, /a/c) which may or may not have their own
> low-limits.
>
> Examples:
>
> case_1) only set low limit on /a.  /a/b and /a/c may overcommit /a's
>         memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes).
>
> case_2) low limits on all memcg.  But not overcommitting low_limits
>         (b.low_limit_in_bytes + c.low_limit_in_bytes <=
>         a.low_limit_in_bytes).

Thanks!

With use_hierarchy turned on it looks perfectly usable.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-29 10:50       ` Roman Gushchin
@ 2014-04-29 12:54         ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-04-29 12:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Greg Thelen, Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Hugh Dickins,
	LKML, linux-mm

On Tue 29-04-14 14:50:18, Roman Gushchin wrote:
> 29.04.2014, 11:42, "Greg Thelen" <gthelen@google.com>:
> > On Mon, Apr 28 2014, Roman Gushchin <klamm@yandex-team.ru> wrote:
> >
> >>  28.04.2014, 16:27, "Michal Hocko" <mhocko@suse.cz>:
> >>>  The series is based on top of the current mmotm tree. Once the series
> >>>  gets accepted I will post a patch which will mark the soft limit as
> >>>  deprecated with a note that it will be eventually dropped. Let me know
> >>>  if you would prefer to have such a patch a part of the series.
> >>>
> >>>  Thoughts?
> >>  Looks good to me.
> >>
> >>  The only question is: are there any ideas how the hierarchy support
> >>  will be used in this case in practice?
> >>  Will someone set low limit for non-leaf cgroups? Why?
> >>
> >>  Thanks,
> >>  Roman
> >
> > I imagine that a hosting service may want to give X MB to a top level
> > memcg (/a) with sub-jobs (/a/b, /a/c) which may or may not have their own
> > low-limits.

I would expect the limit to be set on leaf nodes most of the time,
because intermediate nodes have their charges intermixed with charges from
their children, so it is not entirely clear whom to protect.
On the other hand, I can imagine that a higher level node might be
given some portion of memory by an admin, without the admin having any way
to set the limit down the hierarchy for its user, as described by Greg.

> > Examples:
> >
> > case_1) only set low limit on /a.  /a/b and /a/c may overcommit /a's
> >         memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes).
> >
> > case_2) low limits on all memcg.  But not overcommitting low_limits
> >         (b.low_limit_in_bytes + c.low_limit_in_bytes <=
> >         a.low_limit_in_bytes).
> 
> Thanks!
> 
> With use_hierarchy turned on it looks perfectly usable.

use_hierarchy is becoming the default and we even complain about deeper
directory structures without it being enabled.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-30 21:52   ` Andrew Morton
  -1 siblings, 0 replies; 196+ messages in thread
From: Andrew Morton @ 2014-04-30 21:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, 28 Apr 2014 14:26:41 +0200 Michal Hocko <mhocko@suse.cz> wrote:

> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach for protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun held during the kernel summit 2013 and
> at LSF 2014.
> 
> This patchset introduces such low limit that is functionally similar
> to a minimum guarantee. Memcgs which are under their lowlimit are not
> considered eligible for the reclaim (both global and hardlimit) unless
> all groups under the reclaimed hierarchy are below the low limit when
> all of them are considered eligible.

Permitting containers to avoid global reclaim sounds rather worrisome. 

Fairness: won't it permit processes to completely protect their memory
while everything else in the system is getting utterly pounded?  We
need to consider global-vs-memcg fairness as well as memcg-vs-memcg.

Security: can this feature be used to DoS the machine?  Set up enough
hierarchies which are below their low limit and we risk memory
exhaustion and swap-thrashing and oom-killings for other processes.


All of that being said, your statement doesn't appear to be true ;)

> +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> +{
> +	if (!__shrink_zone(zone, sc, true)) {
> +		/*
> +		 * First round of reclaim didn't find anything to reclaim
> +		 * because of low limit protection so try again and ignore
> +		 * the low limit this time.
> +		 */
> +		__shrink_zone(zone, sc, false);
> +	}
>  }

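To spell out the control flow of the quoted function, here is a small
userspace model of the two-pass fallback. It only mirrors the shape of
shrink_zone() above: struct demo_group, shrink_pass() and shrink_all() are
hypothetical names, and "a pass reclaims from every unprotected group" is a
simplifying assumption, not the behaviour of __shrink_zone().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not a kernel structure. */
struct demo_group {
	unsigned long usage;      /* bytes charged */
	unsigned long low_limit;  /* 0 means "no protection" */
};

/*
 * One reclaim pass. scanned[i] is set for each group the pass would reclaim
 * from. Returns the number of such groups, so 0 means "nothing eligible".
 */
static size_t shrink_pass(const struct demo_group *groups, size_t n,
			  bool honor_low_limit, bool *scanned)
{
	size_t i, count = 0;

	assert((groups != NULL && scanned != NULL) || n == 0);
	for (i = 0; i < n; i++) {
		bool is_protected = honor_low_limit &&
				    groups[i].low_limit != 0 &&
				    groups[i].usage < groups[i].low_limit;

		scanned[i] = !is_protected;
		if (scanned[i])
			count++;
	}
	return count;
}

/* Same shape as the quoted code: retry with the low limit ignored. */
static size_t shrink_all(const struct demo_group *groups, size_t n,
			 bool *scanned)
{
	size_t count = shrink_pass(groups, n, true, scanned);

	if (count == 0)
		count = shrink_pass(groups, n, false, scanned);
	return count;
}
```

In this model the protection is absolute only while at least one other group
is eligible; once every group is under its low limit, the second pass
reclaims from all of them.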

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-28 12:26 ` Michal Hocko
@ 2014-04-30 21:59   ` Andrew Morton
  -1 siblings, 0 replies; 196+ messages in thread
From: Andrew Morton @ 2014-04-30 21:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, 28 Apr 2014 14:26:41 +0200 Michal Hocko <mhocko@suse.cz> wrote:

> The series is based on top of the current mmotm tree. Once the series
> gets accepted I will post a patch which will mark the soft limit as
> deprecated with a note that it will be eventually dropped. Let me know
> if you would prefer to have such a patch a part of the series.

Yes please, we may as well get it all in there.

> Thoughts?

I suspect it's a bit early for me to be grabbing these, but I did it anyway.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-30 21:52   ` Andrew Morton
@ 2014-04-30 22:49     ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-04-30 22:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed, Apr 30, 2014 at 02:52:38PM -0700, Andrew Morton wrote:
> On Mon, 28 Apr 2014 14:26:41 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Hi,
> > previous discussions have shown that soft limits cannot be reformed
> > (http://lwn.net/Articles/555249/). This series introduces an alternative
> > approach for protecting memory allocated to processes executing within
> > a memory cgroup controller. It is based on a new tunable that was
> > discussed with Johannes and Tejun held during the kernel summit 2013 and
> > at LSF 2014.
> > 
> > This patchset introduces such low limit that is functionally similar
> > to a minimum guarantee. Memcgs which are under their lowlimit are not
> > considered eligible for the reclaim (both global and hardlimit) unless
> > all groups under the reclaimed hierarchy are below the low limit when
> > all of them are considered eligible.
> 
> Permitting containers to avoid global reclaim sounds rather worrisome.
> 
> Fairness: won't it permit processes to completely protect their memory
> while everything else in the system is getting utterly pounded?  We
> need to consider global-vs-memcg fairness as well as memcg-vs-memcg.

Yes.

> Security: can this feature be used to DoS the machine?  Set up enough
> hierarchies which are below their low limit and we risk memory
> exhaustion and swap-thrashing and oom-killings for other processes.

And yes.

However, setting the low limit is a privileged operation, so I don't
see how you could do worse with it than with mlock, disabling swap,
etc.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-04-28 12:26   ` Michal Hocko
@ 2014-04-30 22:55     ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-04-30 22:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19d620b3d69c..40e517630138 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
>  	return mem_cgroup_from_id(id);
>  }
>  
> +/**
> + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> + * reclaim
> + * @memcg: target memcg for the reclaim
> + * @root: root of the reclaim hierarchy (null for the global reclaim)
> + *
> + * The given group is reclaimable if it is above its low limit and the same
> + * applies for all parents up the hierarchy until root (including).
> + */
> +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> +		struct mem_cgroup *root)

Could you please rename this to something that is more descriptive in
the reclaim callsite?  How about mem_cgroup_within_low_limit()?
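
For readers following the naming discussion, the rule stated in the quoted
kernel-doc comment can be modelled in user space roughly as below. This is
only a sketch of the described semantics, not the code from the patch:
struct memcg_model, its fields, and reclaim_eligible() are hypothetical
stand-ins, and "above its low limit" is assumed to mean strictly greater.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct mem_cgroup. */
struct memcg_model {
	struct memcg_model *parent;	/* NULL for the top-level group */
	unsigned long usage;		/* bytes currently charged */
	unsigned long low_limit;	/* 0 means no protection requested */
};

/*
 * A group is eligible for reclaim only if it, and every ancestor up to
 * and including @root, is above its low limit. @root == NULL models
 * global reclaim, so the walk continues to the top of the tree.
 */
static bool reclaim_eligible(const struct memcg_model *memcg,
			     const struct memcg_model *root)
{
	const struct memcg_model *iter;

	for (iter = memcg; iter != NULL; iter = iter->parent) {
		if (iter->usage <= iter->low_limit)
			return false;	/* this level protects the subtree */
		if (iter == root)
			break;		/* do not look above the reclaim root */
	}
	return true;
}
```

In this model a protected parent makes all of its children ineligible, which
is the hierarchical behaviour the comment describes.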

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c1cd99a5074b..0f428158254e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2215,9 +2215,11 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  	}
>  }
>  
> -static void shrink_zone(struct zone *zone, struct scan_control *sc)
> +static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> +		bool follow_low_limit)
>  {
>  	unsigned long nr_reclaimed, nr_scanned;
> +	unsigned nr_scanned_groups = 0;
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2234,7 +2236,23 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  		do {
>  			struct lruvec *lruvec;
>  
> +			/*
> +			 * Memcg might be under its low limit so we have to
> +			 * skip it during the first reclaim round
> +			 */
> +			if (follow_low_limit &&
> +					!mem_cgroup_reclaim_eligible(memcg, root)) {
> +				/*
> +				 * It would be more optimal to skip the memcg
> +				 * subtree now but we do not have a memcg iter
> +				 * helper for that. Anyone?
> +				 */
> +				memcg = mem_cgroup_iter(root, memcg, &reclaim);
> +				continue;
> +			}
> +
>  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +			nr_scanned_groups++;
>  
>  			shrink_lruvec(lruvec, sc);
>  
> @@ -2262,6 +2280,20 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  
>  	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
> +
> +	return nr_scanned_groups;
> +}
> +
> +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> +{
> +	if (!__shrink_zone(zone, sc, true)) {
> +		/*
> +		 * First round of reclaim didn't find anything to reclaim
> +		 * because of low limit protection so try again and ignore
> +		 * the low limit this time.
> +		 */
> +		__shrink_zone(zone, sc, false);
> +	}
>  }
>  
>  /* Returns true if compaction should go ahead for a high-order request */

I would actually prefer not having a second round here, and make the
low limit behave more like mlock memory.  If there is no reclaimable
memory, go OOM.
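
To make the two behaviours concrete, here is a small user-space sketch of the
difference. It is not the vmscan.c code: struct group_model, scan_groups(),
shrink_model() and the allow_fallback flag are hypothetical names, and a flat
array stands in for the memcg hierarchy walk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flat model of the groups visited by one reclaim pass. */
struct group_model {
	unsigned long usage;
	unsigned long low_limit;
	unsigned long scanned;	/* how many times this group was scanned */
};

/* One pass over all groups; returns how many groups were scanned. */
static unsigned scan_groups(struct group_model *groups, size_t n,
			    bool follow_low_limit)
{
	unsigned nr_scanned_groups = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (follow_low_limit &&
		    groups[i].usage <= groups[i].low_limit)
			continue;	/* protected, skip in this round */
		groups[i].scanned++;
		nr_scanned_groups++;
	}
	return nr_scanned_groups;
}

/*
 * allow_fallback == true models the posted patch: if the first round
 * scanned nothing, retry and ignore the low limit.
 * allow_fallback == false models the stricter variant: report that
 * nothing was reclaimable and leave it to the caller (OOM).
 */
static bool shrink_model(struct group_model *groups, size_t n,
			 bool allow_fallback)
{
	if (scan_groups(groups, n, true))
		return true;
	if (!allow_fallback)
		return false;
	return scan_groups(groups, n, false) > 0;
}
```

With every group under its low limit, shrink_model() with allow_fallback
false touches nothing and returns false, which is the mlock-like outcome.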

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 4/4] memcg: Document memory.low_limit_in_bytes
  2014-04-28 12:26   ` Michal Hocko
@ 2014-04-30 22:57     ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-04-30 22:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, Apr 28, 2014 at 02:26:45PM +0200, Michal Hocko wrote:
> Describe low_limit_in_bytes and its effect.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  Documentation/cgroups/memory.txt | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index add1be001416..a52913fe96fb 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -57,6 +57,7 @@ Brief summary of control files.
>   memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
>  				 (See 5.5 for details)
>   memory.limit_in_bytes		 # set/show limit of memory usage
> + memory.low_limit_in_bytes	 # set/show low limit for memory reclaim
>   memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
>   memory.failcnt			 # show the number of memory usage hits limits
>   memory.memsw.failcnt		 # show the number of memory+Swap hits limits
> @@ -249,6 +250,14 @@ is the objective of the reclaim. The global reclaim aims at balancing
>  zones' watermarks while the limit reclaim frees some memory to allow new
>  charges.
>  
> +Groups might be also protected from both global and limit reclaim by
> +low_limit_in_bytes knob. If the limit is non-zero the reclaim logic
> +doesn't include groups (and their subgroups - see 6. Hierarchy support)
> +which are bellow the low limit if there is other eligible cgroup in the

'below' :-) Although I really like that spello.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 3/4] memcg, doc: clarify global vs. limit reclaims
  2014-04-28 12:26   ` Michal Hocko
@ 2014-04-30 23:03     ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-04-30 23:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, Apr 28, 2014 at 02:26:44PM +0200, Michal Hocko wrote:
> Be explicit about global and hard limit reclaims in our documentation.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  Documentation/cgroups/memory.txt | 31 +++++++++++++++++--------------
>  1 file changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 4937e6fff9b4..add1be001416 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -236,23 +236,26 @@ it by cgroup.
>  2.5 Reclaim
>  
>  Each cgroup maintains a per cgroup LRU which has the same structure as
> -global VM. When a cgroup goes over its limit, we first try
> -to reclaim memory from the cgroup so as to make space for the new
> -pages that the cgroup has touched. If the reclaim is unsuccessful,
> -an OOM routine is invoked to select and kill the bulkiest task in the
> -cgroup. (See 10. OOM Control below.)
> -
> -The reclaim algorithm has not been modified for cgroups, except that
> -pages that are selected for reclaiming come from the per-cgroup LRU
> -list.
> -
> -NOTE: Reclaim does not work for the root cgroup, since we cannot set any
> -limits on the root cgroup.
> +global VM. Cgroups can get reclaimed basically under two conditions
> + - under global memory pressure when all cgroups are reclaimed
> +   proportionally wrt. their LRU size in a round robin fashion
> + - when a cgroup or its hierarchical parent (see 6. Hierarchical support)
> +   hits hard limit. If the reclaim is unsuccessful, an OOM routine is invoked
> +   to select and kill the bulkiest task in the cgroup. (See 10. OOM Control
> +   below.)

In the whole hierarchy, not just that cgroup.

> +Global and hard-limit reclaims share the same code the only difference
> +is the objective of the reclaim. The global reclaim aims at balancing
> +zones' watermarks while the limit reclaim frees some memory to allow new
> +charges.

This is a kswapd vs. direct reclaim issue, not global vs. memcg.
Memcg reclaim just happens to be direct reclaim.  Either way, I'd
rather not have such implementation details in the user documentation.

> +NOTE: Hard limit reclaim does not work for the root cgroup, since we cannot set
> +any limits on the root cgroup.

Not sure it's necessary to include this...

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-04-30 22:55     ` Johannes Weiner
@ 2014-05-02  9:36       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02  9:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 19d620b3d69c..40e517630138 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> >  	return mem_cgroup_from_id(id);
> >  }
> >  
> > +/**
> > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > + * reclaim
> > + * @memcg: target memcg for the reclaim
> > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > + *
> > + * The given group is reclaimable if it is above its low limit and the same
> > + * applies for all parents up the hierarchy until root (including).
> > + */
> > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > +		struct mem_cgroup *root)
> 
> Could you please rename this to something that is more descriptive in
> the reclaim callsite?  How about mem_cgroup_within_low_limit()?

I have intentionally used something that is not low_limit specific. The
generic reclaim code doesn't have to care about the reason why a memcg is
not reclaimable. I agree that having the follow_low_limit parameter explicit
and mem_cgroup_reclaim_eligible not is messy. So something should be
renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
but I do not have a strong preference.

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c1cd99a5074b..0f428158254e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
[...]
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	if (!__shrink_zone(zone, sc, true)) {
> > +		/*
> > +		 * First round of reclaim didn't find anything to reclaim
> > +		 * because of low limit protection so try again and ignore
> > +		 * the low limit this time.
> > +		 */
> > +		__shrink_zone(zone, sc, false);
> > +	}
> >  }
> >  
> >  /* Returns true if compaction should go ahead for a high-order request */
> 
> I would actually prefer not having a second round here, and make the
> low limit behave more like mlock memory.  If there is no reclaimable
> memory, go OOM.

This was done in my previous attempt and I prefer OOM myself but it is
also true that starting with a more relaxed limit and adding an
option for hard guarantee later when we have a clear usecase is a better
approach. Although I can see potential in go-oom-rather-than-reclaim
configurations, usecases I am primarily interested in won't overcommit on
low_limit.
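
As an illustration of what "overcommit on low_limit" could mean in practice,
the sketch below assumes it is simply the case where the low limits of
sibling groups add up to more than the memory available to them. The helper
name and its parameters are hypothetical, and nothing like it is part of this
series.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when the low limits (in bytes) of the given sibling
 * groups sum to more than @total_ram. The running subtraction avoids
 * overflowing the sum.
 */
static bool low_limit_overcommitted(const unsigned long long *low_limits,
				    size_t n, unsigned long long total_ram)
{
	unsigned long long remaining = total_ram;
	size_t i;

	for (i = 0; i < n; i++) {
		if (low_limits[i] > remaining)
			return true;	/* guarantees exceed what is left */
		remaining -= low_limits[i];
	}
	return false;
}
```

Under that assumption, a setup that never overcommits always leaves
something eligible for reclaim as soon as any memory is used beyond the
guarantees, so the fallback-or-OOM question does not arise for it.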

That being said, I like the idea of having the hard guarantee but I also
think it should be configurable. I can post those patches in this thread
but I feel it is too early as nobody has explicitly asked for this yet.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 3/4] memcg, doc: clarify global vs. limit reclaims
  2014-04-30 23:03     ` Johannes Weiner
@ 2014-05-02  9:43       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02  9:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed 30-04-14 19:03:50, Johannes Weiner wrote:
> On Mon, Apr 28, 2014 at 02:26:44PM +0200, Michal Hocko wrote:
> > Be explicit about global and hard limit reclaims in our documentation.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > ---
> >  Documentation/cgroups/memory.txt | 31 +++++++++++++++++--------------
> >  1 file changed, 17 insertions(+), 14 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index 4937e6fff9b4..add1be001416 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -236,23 +236,26 @@ it by cgroup.
> >  2.5 Reclaim
> >  
> >  Each cgroup maintains a per cgroup LRU which has the same structure as
> > -global VM. When a cgroup goes over its limit, we first try
> > -to reclaim memory from the cgroup so as to make space for the new
> > -pages that the cgroup has touched. If the reclaim is unsuccessful,
> > -an OOM routine is invoked to select and kill the bulkiest task in the
> > -cgroup. (See 10. OOM Control below.)
> > -
> > -The reclaim algorithm has not been modified for cgroups, except that
> > -pages that are selected for reclaiming come from the per-cgroup LRU
> > -list.
> > -
> > -NOTE: Reclaim does not work for the root cgroup, since we cannot set any
> > -limits on the root cgroup.
> > +global VM. Cgroups can get reclaimed basically under two conditions
> > + - under global memory pressure when all cgroups are reclaimed
> > +   proportionally wrt. their LRU size in a round robin fashion
> > + - when a cgroup or its hierarchical parent (see 6. Hierarchical support)
> > +   hits hard limit. If the reclaim is unsuccessful, an OOM routine is invoked
> > +   to select and kill the bulkiest task in the cgroup. (See 10. OOM Control
> > +   below.)
> 
> In the whole hierarchy, not just that cgroup.

Right. Fixed
 
> > +Global and hard-limit reclaims share the same code the only difference
> > +is the objective of the reclaim. The global reclaim aims at balancing
> > +zones' watermarks while the limit reclaim frees some memory to allow new
> > +charges.
> 
> This is a kswapd vs. direct reclaim issue, not global vs. memcg.
> Memcg reclaim just happens to be direct reclaim.  Either way, I'd
> rather not have such implementation details in the user documentation.

OK, removed
 
> > +NOTE: Hard limit reclaim does not work for the root cgroup, since we cannot set
> > +any limits on the root cgroup.
> 
> Not sure it's necessary to include this...

removed as well.

Incremental patch on top:
---
From 30b9505169e574cdb553226e1a361cc527ed492b Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 2 May 2014 11:42:35 +0200
Subject: [PATCH] mmotm: memcg-doc-clarify-global-vs-limit-reclaims-fix.patch

update doc as per Johannes

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index add1be001416..2cde96787ceb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -241,17 +241,9 @@ global VM. Cgroups can get reclaimed basically under two conditions
    proportionally wrt. their LRU size in a round robin fashion
  - when a cgroup or its hierarchical parent (see 6. Hierarchical support)
    hits hard limit. If the reclaim is unsuccessful, an OOM routine is invoked
-   to select and kill the bulkiest task in the cgroup. (See 10. OOM Control
+   to select and kill the bulkiest task in the hierarchy. (See 10. OOM Control
    below.)
 
-Global and hard-limit reclaims share the same code the only difference
-is the objective of the reclaim. The global reclaim aims at balancing
-zones' watermarks while the limit reclaim frees some memory to allow new
-charges.
-
-NOTE: Hard limit reclaim does not work for the root cgroup, since we cannot set
-any limits on the root cgroup.
-
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
 When oom event notifier is registered, event will be delivered to the root
-- 
2.0.0.rc0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 4/4] memcg: Document memory.low_limit_in_bytes
  2014-04-30 22:57     ` Johannes Weiner
@ 2014-05-02  9:46       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02  9:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed 30-04-14 18:57:48, Johannes Weiner wrote:
> On Mon, Apr 28, 2014 at 02:26:45PM +0200, Michal Hocko wrote:
> > Describe low_limit_in_bytes and its effect.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > ---
> >  Documentation/cgroups/memory.txt | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index add1be001416..a52913fe96fb 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -57,6 +57,7 @@ Brief summary of control files.
> >   memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
> >  				 (See 5.5 for details)
> >   memory.limit_in_bytes		 # set/show limit of memory usage
> > + memory.low_limit_in_bytes	 # set/show low limit for memory reclaim
> >   memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
> >   memory.failcnt			 # show the number of memory usage hits limits
> >   memory.memsw.failcnt		 # show the number of memory+Swap hits limits
> > @@ -249,6 +250,14 @@ is the objective of the reclaim. The global reclaim aims at balancing
> >  zones' watermarks while the limit reclaim frees some memory to allow new
> >  charges.
> >  
> > +Groups might be also protected from both global and limit reclaim by
> > +low_limit_in_bytes knob. If the limit is non-zero the reclaim logic
> > +doesn't include groups (and their subgroups - see 6. Hierarchy support)
> > +which are bellow the low limit if there is other eligible cgroup in the
> 
> 'below' :-) Although I really like that spello.

ups
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-30 21:59   ` Andrew Morton
@ 2014-05-02 11:22     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 11:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed 30-04-14 14:59:10, Andrew Morton wrote:
> On Mon, 28 Apr 2014 14:26:41 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > The series is based on top of the current mmotm tree. Once the series
> > gets accepted I will post a patch which will mark the soft limit as
> > deprecated with a note that it will be eventually dropped. Let me know
> > if you would prefer to have such a patch a part of the series.
> 
> Yes please, we may as well get it all in there.
---
From 01d74f2034b871e591b238ce1652c9f5ef7c1bd2 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 2 May 2014 13:16:57 +0200
Subject: [PATCH] memcg: mark soft limit as deprecated

Now that we have low limit reclaim, the soft limit reclaim should be
deprecated and no longer used. Print a notice about this into the
log and mention it in the memcg documentation as well.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 7 +++++++
 mm/memcontrol.c                  | 9 +++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 9bc3325e99ce..7f3a7414bdf2 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -652,6 +652,13 @@ heavily contended for, memory is allocated based on the soft limit
 hints/setup. Currently soft limit based reclaim is set up such that
 it gets invoked from balance_pgdat (kswapd).
 
+NOTE:
+Soft limit is considered deprecated and will be removed later.
+low_limit_in_bytes should be used instead. The main difference between
+the two is that the soft limit hammers groups over the selected limit to
+prevent reclaiming from other groups while low_limit excludes groups under
+the limit from reclaim and reclaims the others.
+
 7.1 Interface
 
 Soft limits can be setup by using the following commands (in this example we
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53193fec8c50..7a276c0d141e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5162,10 +5162,15 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		 * of semantics, for now, we support soft limits for
 		 * control without swap
 		 */
-		if (type == _MEM)
+		if (type == _MEM) {
+			pr_info("soft_limit_in_bytes is deprecated and should ");
+			pr_cont("be replaced by low_limit_in_bytes. Please let ");
+			pr_cont("us know if that doesn't work for you at ");
+			pr_cont("linux-mm@kvack.org\n");
 			ret = res_counter_set_soft_limit(&memcg->res, val);
-		else
+		} else {
 			ret = -EINVAL;
+		}
 		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
-- 
2.0.0.rc0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-30 21:52   ` Andrew Morton
@ 2014-05-02 12:03     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 12:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed 30-04-14 14:52:38, Andrew Morton wrote:
> On Mon, 28 Apr 2014 14:26:41 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Hi,
> > previous discussions have shown that soft limits cannot be reformed
> > (http://lwn.net/Articles/555249/). This series introduces an alternative
> > approach for protecting memory allocated to processes executing within
> > a memory cgroup controller. It is based on a new tunable that was
> > discussed with Johannes and Tejun held during the kernel summit 2013 and
> > at LSF 2014.
> > 
> > This patchset introduces such low limit that is functionally similar
> > to a minimum guarantee. Memcgs which are under their lowlimit are not
> > considered eligible for the reclaim (both global and hardlimit) unless
> > all groups under the reclaimed hierarchy are below the low limit when
> > all of them are considered eligible.
> 
> Permitting containers to avoid global reclaim sounds rather worrisome. 
> 
> Fairness: won't it permit processes to completely protect their memory
> while everything else in the system is getting utterly pounded?  We
> need to consider global-vs-memcg fairness as well as memcg-vs-memcg.
> 
> Security: can this feature be used to DoS the machine?  Set up enough
> hierarchies which are below their low limit and we risk memory
> exhaustion and swap-thrashing and oom-killings for other processes.

Johannes has already pointed out that setting the low limit is really
supposed to be a privileged operation. And, in principle, this is not any
different from any other guarantee.
 
> All of that being said, your statement doesn't appear to be true ;)

"
Memcgs which are under their lowlimit are ignored during the reclaim
(both global and hardlimit) unless all groups under the reclaimed
hierarchy are below the low limit. Low limit will be ignored in this
case for all groups in the hierarchy.
"

Better?

> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > +	if (!__shrink_zone(zone, sc, true)) {
> > +		/*
> > +		 * First round of reclaim didn't find anything to reclaim
> > +		 * because of low limit protection so try again and ignore
> > +		 * the low limit this time.
> > +		 */
> > +		__shrink_zone(zone, sc, false);
> > +	}
> >  }
> `
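
The fallback described above can be illustrated with a small user-space
sketch. It is not the kernel code: struct group, within_guarantee(),
shrink_pass() and shrink() are hypothetical names, the groups are a flat
array rather than a hierarchy, and "protected" is assumed to mean
usage <= low_limit. The first pass skips protected groups; only if it
scanned no group at all does the second pass ignore the low limit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct group {
	unsigned long usage;      /* currently charged bytes */
	unsigned long low_limit;  /* 0 means no protection was configured */
	unsigned long reclaimed;  /* bytes taken from the group by this sketch */
};

/* Assumption: a group is protected while its usage is at or below the limit. */
static bool within_guarantee(const struct group *g)
{
	return g->usage <= g->low_limit;
}

/* One reclaim pass; returns the number of groups that were scanned. */
static unsigned shrink_pass(struct group *groups, size_t n,
			    unsigned long want, bool honor_guarantee)
{
	unsigned scanned = 0;

	for (size_t i = 0; i < n; i++) {
		struct group *g = &groups[i];
		unsigned long take;

		/* Skip protected groups only while the guarantee is honored. */
		if (honor_guarantee && within_guarantee(g))
			continue;

		take = g->usage < want ? g->usage : want;
		g->usage -= take;
		g->reclaimed += take;
		scanned++;
	}
	return scanned;
}

/* Two rounds: honor the low limit first, ignore it if nothing was eligible. */
static void shrink(struct group *groups, size_t n, unsigned long want)
{
	if (!shrink_pass(groups, n, want, true))
		shrink_pass(groups, n, want, false);
}
```

With one protected and one unprotected group, only the unprotected one
loses memory. If every group is at or below its low limit, the second
pass takes memory from all of them.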

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02  9:36       ` Michal Hocko
@ 2014-05-02 12:07         ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 12:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 11:36:28, Michal Hocko wrote:
> On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 19d620b3d69c..40e517630138 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > >  	return mem_cgroup_from_id(id);
> > >  }
> > >  
> > > +/**
> > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > + * reclaim
> > > + * @memcg: target memcg for the reclaim
> > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > + *
> > > + * The given group is reclaimable if it is above its low limit and the same
> > > + * applies for all parents up the hierarchy until root (including).
> > > + */
> > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > +		struct mem_cgroup *root)
> > 
> > Could you please rename this to something that is more descriptive in
> > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> 
> I have intentionally used something that is not low_limit specific. The
> generic reclaim code doesn't have to care about the reason why a memcg is
> not reclaimable. I agree that having follow_low_limit parameter explicit
> and mem_cgroup_reclaim_eligible not is messy. So something should be
> renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> but I do not have a strong preference.

What about this?
---
From cbe72efdf89b89d60004c84b359fc3d95db61983 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 2 May 2014 14:03:49 +0200
Subject: [PATCH] mmotm: memcg-mm-introduce-lowlimit-reclaim-fix.patch

Use reclaim eligibility rather than low_limit. Generic code doesn't
have to be aware of the reason why a group is not eligible.
---
 mm/vmscan.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ecf323a1c81..f195a0db5fbb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2215,8 +2215,15 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
+/**
+ * __shrink_zone - shrinks a given zone
+ *
+ * @zone: zone to shrink
+ * @sc: scan control with additional reclaim parameters
+ * @check_memcg_eligible: check each memcg whether it is eligible for reclaim
+ */
 static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool follow_low_limit)
+		bool check_memcg_eligible)
 {
 	unsigned long nr_reclaimed, nr_scanned;
 	unsigned nr_scanned_groups = 0;
@@ -2237,10 +2244,10 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 			struct lruvec *lruvec;
 
 			/*
-			 * Memcg might be under its low limit so we have to
-			 * skip it during the first reclaim round
+			 * Memcg might be protected from the reclaim so we have
+			 * to skip it during the first reclaim round
 			 */
-			if (follow_low_limit &&
+			if (check_memcg_eligible &&
 					!mem_cgroup_reclaim_eligible(memcg, root)) {
 				/*
 				 * It would be more optimal to skip the memcg
@@ -2289,8 +2296,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 	if (!__shrink_zone(zone, sc, true)) {
 		/*
 		 * First round of reclaim didn't find anything to reclaim
-		 * because of low limit protection so try again and ignore
-		 * the low limit this time.
+		 * because of the reclaim protection so try again and ignore
+		 * reclaim eligibility of memcgs.
 		 */
 		__shrink_zone(zone, sc, false);
 	}
-- 
2.0.0.rc0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 12:07         ` Michal Hocko
@ 2014-05-02 13:01           ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-02 13:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri, May 02, 2014 at 02:07:15PM +0200, Michal Hocko wrote:
> On Fri 02-05-14 11:36:28, Michal Hocko wrote:
> > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 19d620b3d69c..40e517630138 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > > >  	return mem_cgroup_from_id(id);
> > > >  }
> > > >  
> > > > +/**
> > > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > > + * reclaim
> > > > + * @memcg: target memcg for the reclaim
> > > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > > + *
> > > > + * The given group is reclaimable if it is above its low limit and the same
> > > > + * applies for all parents up the hierarchy until root (including).
> > > > + */
> > > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > > +		struct mem_cgroup *root)
> > > 
> > > Could you please rename this to something that is more descriptive in
> > > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> > 
> > > I have intentionally used something that is not low_limit specific. The
> > > generic reclaim code doesn't have to care about the reason why a memcg is
> > > not reclaimable. I agree that having follow_low_limit parameter explicit
> > and mem_cgroup_reclaim_eligible not is messy. So something should be
> > renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> > but I do not have a strong preference.
> 
> What about this?

I really don't like it.

Yes, we should be hiding implementation details, but we should stop
treating memcg like an alien in this code.  The VM code obviously
doesn't have to know HOW the guarantees are exactly implemented, but
it's a perfectly fine *concept* that can be known outside of memcg:

shrink_zone:
for each memcg in system:
  if mem_cgroup_within_guarantee(memcg):
    continue
  reclaim(memcg-zone)

is perfectly understandable and makes it easier to reason about the
behavior of the reclaim code.  If I just see !mem_cgroup_eligible(), I
don't know if this affects the scenario I'm thinking about at all.

It's obscuring useful information for absolutely no benefit.  If you
burden the reclaim code with a callback, you better explain what you
are doing.  You owe it to the reader.
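
For illustration only, the predicate named in the pseudocode above could
look roughly like the following user-space sketch. struct node and
node_within_guarantee() are hypothetical names, not kernel API. The
sketch assumes that a group counts as within its guarantee when it, or
any ancestor up to and including the reclaim root, has usage at or below
its low limit, which is the inverse of the eligibility rule in the
quoted kerneldoc.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg in a hierarchy. */
struct node {
	struct node *parent;      /* NULL at the top of the hierarchy */
	unsigned long usage;      /* charged bytes, including descendants */
	unsigned long low_limit;  /* 0 means no protection was configured */
};

/*
 * Walk from the group towards the reclaim root (NULL root means the walk
 * goes all the way to the top). The group is within its guarantee as soon
 * as any group on that path is at or below its low limit.
 */
static bool node_within_guarantee(const struct node *g, const struct node *root)
{
	for (; g; g = g->parent) {
		if (g->usage <= g->low_limit)
			return true;
		if (g == root)
			break;
	}
	return false;
}
```

A reclaim loop would then read "skip the group if
node_within_guarantee() is true", which keeps the reason for skipping
visible at the call site.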

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 13:01           ` Johannes Weiner
@ 2014-05-02 14:15             ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 14:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 09:01:18, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 02:07:15PM +0200, Michal Hocko wrote:
> > On Fri 02-05-14 11:36:28, Michal Hocko wrote:
> > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index 19d620b3d69c..40e517630138 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > > > >  	return mem_cgroup_from_id(id);
> > > > >  }
> > > > >  
> > > > > +/**
> > > > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > > > + * reclaim
> > > > > + * @memcg: target memcg for the reclaim
> > > > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > > > + *
> > > > > + * The given group is reclaimable if it is above its low limit and the same
> > > > > + * applies for all parents up the hierarchy until root (including).
> > > > > + */
> > > > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > > > +		struct mem_cgroup *root)
> > > > 
> > > > Could you please rename this to something that is more descriptive in
> > > > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> > > 
> > > > I have intentionally used something that is not low_limit specific. The
> > > > generic reclaim code doesn't have to care about the reason why a memcg is
> > > > not reclaimable. I agree that having follow_low_limit parameter explicit
> > > and mem_cgroup_reclaim_eligible not is messy. So something should be
> > > renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> > > but I do not have a strong preference.
> > 
> > What about this?
> 
> I really don't like it.
> 
> Yes, we should be hiding implementation details, but we should stop
> treating memcg like an alien in this code.  The VM code obviously
> doesn't have to know HOW the guarantees are exactly implemented, but
> it's a perfectly fine *concept* that can be known outside of memcg:
> 
> shrink_zone:
> for each memcg in system:
>   if mem_cgroup_within_guarantee(memcg):
>     continue
>   reclaim(memcg-zone)
> 
> is perfectly understandable and makes it easier to reason about the
> behavior of the reclaim code.  If I just see !mem_cgroup_eligible(), I
> don't know if this affects the scenario I'm thinking about at all.
> 
> It's obscuring useful information for absolutely no benefit.  If you
> burden the reclaim code with a callback, you better explain what you
> are doing.  You owe it to the reader.

OK fair enough, what about the following?
---
From 4e0404fa2888d04de80f33fcb76712b0fbd44e1c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 2 May 2014 16:12:41 +0200
Subject: [PATCH] mmotm: memcg-mm-introduce-lowlimit-reclaim-fix.patch

mem_cgroup_reclaim_eligible -> mem_cgroup_within_guarantee as suggested
by Johannes.
---
 include/linux/memcontrol.h |  6 +++---
 mm/memcontrol.c            | 15 ++++++++-------
 mm/vmscan.c                | 25 ++++++++++++++++---------
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c59056f4bc6..c00ccc5f70b9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,7 +92,7 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
-extern bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
@@ -291,10 +291,10 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
-	return true;
+	return false;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a276c0d141e..58982d18f6ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2810,26 +2810,27 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 }
 
 /**
- * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
- * reclaim
+ * mem_cgroup_within_guarantee - checks whether given memcg is within its
+ * memory guarantee
  * @memcg: target memcg for the reclaim
  * @root: root of the reclaim hierarchy (null for the global reclaim)
  *
- * The given group is reclaimable if it is above its low limit and the same
- * applies for all parents up the hierarchy until root (including).
+ * The given group is within its reclaim guarantee if it is below its low limit
+ * or the same applies for any parent up the hierarchy until root (including).
+ * Such a group might be excluded from the reclaim.
  */
-bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
 	do {
 		if (!res_counter_low_limit_excess(&memcg->res))
-			return false;
+			return true;
 		if (memcg == root)
 			break;
 
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
-	return true;
+	return false;
 }
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f428158254e..20ca95fbaebb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2215,8 +2215,18 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
+/**
+ * __shrink_zone - shrinks a given zone
+ *
+ * @zone: zone to shrink
+ * @sc: scan control with additional reclaim parameters
+ * @force_memcg_guarantee: do not reclaim memcgs which are within their memory
+ * guarantee
+ *
+ * Returns the number of reclaimed memcgs.
+ */
 static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool follow_low_limit)
+		bool force_memcg_guarantee)
 {
 	unsigned long nr_reclaimed, nr_scanned;
 	unsigned nr_scanned_groups = 0;
@@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			struct lruvec *lruvec;
 
-			/*
-			 * Memcg might be under its low limit so we have to
-			 * skip it during the first reclaim round
-			 */
-			if (follow_low_limit &&
-					!mem_cgroup_reclaim_eligible(memcg, root)) {
+			/* Memcg might be protected from the reclaim */
+			if (force_memcg_guarantee &&
+					mem_cgroup_within_guarantee(memcg, root)) {
 				/*
 				 * It would be more optimal to skip the memcg
 				 * subtree now but we do not have a memcg iter
@@ -2289,8 +2296,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 	if (!__shrink_zone(zone, sc, true)) {
 		/*
 		 * First round of reclaim didn't find anything to reclaim
-		 * because of low limit protection so try again and ignore
-		 * the low limit this time.
+		 * because of the memory guarantees for all memcgs in the
+		 * reclaim target so try again and ignore guarantees this time.
 		 */
 		__shrink_zone(zone, sc, false);
 	}
-- 
2.0.0.rc0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread
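
For readers following the rename, the hierarchy walk in
mem_cgroup_within_guarantee() can be modelled in userspace. This is only a
sketch under stated assumptions: struct group and low_limit_excess() are
hypothetical stand-ins for struct mem_cgroup and
res_counter_low_limit_excess(), and the excess helper is assumed to return
how far usage exceeds the low limit (0 when usage is at or below it).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel layout. */
struct group {
	unsigned long usage;      /* pages currently charged */
	unsigned long low_limit;  /* guarantee in pages, 0 means none */
	struct group *parent;     /* NULL at the top of the hierarchy */
};

/*
 * Assumed semantics of res_counter_low_limit_excess(): how far usage is
 * above the low limit, or 0 when usage is at or below it.
 */
static unsigned long low_limit_excess(const struct group *g)
{
	return g->usage > g->low_limit ? g->usage - g->low_limit : 0;
}

/*
 * Mirrors the loop in the patch: a group is within its guarantee if it,
 * or any ancestor up to and including root, has no low limit excess.
 * root == NULL models global reclaim and walks to the top.
 */
static bool within_guarantee(const struct group *g, const struct group *root)
{
	do {
		if (!low_limit_excess(g))
			return true;
		if (g == root)
			break;
	} while ((g = g->parent));

	return false;
}
```

Note that a child above its own low limit is still reported as protected
when one of its ancestors is below its limit, which is what the "or the
same applies for any parent" wording in the kernel-doc describes.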

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 14:15             ` Michal Hocko
@ 2014-05-02 15:04               ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-02 15:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri, May 02, 2014 at 04:15:15PM +0200, Michal Hocko wrote:
> On Fri 02-05-14 09:01:18, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 02:07:15PM +0200, Michal Hocko wrote:
> > > On Fri 02-05-14 11:36:28, Michal Hocko wrote:
> > > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 19d620b3d69c..40e517630138 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > > > > >  	return mem_cgroup_from_id(id);
> > > > > >  }
> > > > > >  
> > > > > > +/**
> > > > > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > > > > + * reclaim
> > > > > > + * @memcg: target memcg for the reclaim
> > > > > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > > > > + *
> > > > > > + * The given group is reclaimable if it is above its low limit and the same
> > > > > > + * applies for all parents up the hierarchy until root (including).
> > > > > > + */
> > > > > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > > > > +		struct mem_cgroup *root)
> > > > > 
> > > > > Could you please rename this to something that is more descriptive in
> > > > > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> > > > 
> > > > > I have intentionally used something that is not low_limit specific. The
> > > > > generic reclaim code doesn't have to care about the reason why a memcg is
> > > > > not reclaimable. I agree that having follow_low_limit parameter explicit
> > > > and mem_cgroup_reclaim_eligible not is messy. So something should be
> > > > renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> > > > but I do not have a strong preference.
> > > 
> > > What about this?
> > 
> > I really don't like it.
> > 
> > Yes, we should be hiding implementation details, but we should stop
> > treating memcg like an alien in this code.  The VM code obviously
> > doesn't have to know HOW the guarantees are exactly implemented, but
> > it's a perfectly fine *concept* that can be known outside of memcg:
> > 
> > shrink_zone:
> > for each memcg in system:
> >   if mem_cgroup_within_guarantee(memcg):
> >     continue
> >   reclaim(memcg-zone)
> > 
> > is perfectly understandable and makes it easier to reason about the
> > behavior of the reclaim code.  If I just see !mem_cgroup_eligible(), I
> > don't know if this affects the scenario I'm thinking about at all.
> > 
> > It's obscuring useful information for absolutely no benefit.  If you
> > burden the reclaim code with a callback, you better explain what you
> > are doing.  You owe it to the reader.
> 
> OK fair enough, what about the following?

Thanks, that's much better IMO.

> @@ -2215,8 +2215,18 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  	}
>  }
>  
> +/**
> + * __shrink_zone - shrinks a given zone
> + *
> + * @zone: zone to shrink
> + * @sc: scan control with additional reclaim parameters
> + * @force_memcg_guarantee: do not reclaim memcgs which are within their memory
> + * guarantee
> + *
> + * Returns the number of reclaimed memcgs.
> + */
>  static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> -		bool follow_low_limit)
> +		bool force_memcg_guarantee)
>  {
>  	unsigned long nr_reclaimed, nr_scanned;
>  	unsigned nr_scanned_groups = 0;
> @@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  		do {
>  			struct lruvec *lruvec;
>  
> -			/*
> -			 * Memcg might be under its low limit so we have to
> -			 * skip it during the first reclaim round
> -			 */
> -			if (follow_low_limit &&
> -					!mem_cgroup_reclaim_eligible(memcg, root)) {
> +			/* Memcg might be protected from the reclaim */
> +			if (force_memcg_guarantee &&

respect_?  consider_?

force sounds like something the second round would do -- force reclaim
despite guarantees...  But then again, I'm still for removing that 2nd
force cycle, so I don't care too strongly about that name (yet) :-)

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 15:04               ` Johannes Weiner
@ 2014-05-02 15:11                 ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 15:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 11:04:34, Johannes Weiner wrote:
[...]
> > @@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> >  		do {
> >  			struct lruvec *lruvec;
> >  
> > -			/*
> > -			 * Memcg might be under its low limit so we have to
> > -			 * skip it during the first reclaim round
> > -			 */
> > -			if (follow_low_limit &&
> > -					!mem_cgroup_reclaim_eligible(memcg, root)) {
> > +			/* Memcg might be protected from the reclaim */
> > +			if (force_memcg_guarantee &&
> 
> respect_?  consider_?

enforce_ ?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 15:11                 ` Michal Hocko
@ 2014-05-02 15:34                   ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-02 15:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri, May 02, 2014 at 05:11:20PM +0200, Michal Hocko wrote:
> On Fri 02-05-14 11:04:34, Johannes Weiner wrote:
> [...]
> > > @@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> > >  		do {
> > >  			struct lruvec *lruvec;
> > >  
> > > -			/*
> > > -			 * Memcg might be under its low limit so we have to
> > > -			 * skip it during the first reclaim round
> > > -			 */
> > > -			if (follow_low_limit &&
> > > -					!mem_cgroup_reclaim_eligible(memcg, root)) {
> > > +			/* Memcg might be protected from the reclaim */
> > > +			if (force_memcg_guarantee &&
> > 
> > respect_?  consider_?
> 
> enforce_ ?

A native speaker might be better at this, but to me it seems weird to
enforce a promise.  honor_memcg_guarantee?

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 15:34                   ` Johannes Weiner
@ 2014-05-02 15:48                     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 15:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 11:34:51, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 05:11:20PM +0200, Michal Hocko wrote:
> > On Fri 02-05-14 11:04:34, Johannes Weiner wrote:
> > [...]
> > > > @@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> > > >  		do {
> > > >  			struct lruvec *lruvec;
> > > >  
> > > > -			/*
> > > > -			 * Memcg might be under its low limit so we have to
> > > > -			 * skip it during the first reclaim round
> > > > -			 */
> > > > -			if (follow_low_limit &&
> > > > -					!mem_cgroup_reclaim_eligible(memcg, root)) {
> > > > +			/* Memcg might be protected from the reclaim */
> > > > +			if (force_memcg_guarantee &&
> > > 
> > > respect_?  consider_?
> > 
> > enforce_ ?
> 
> A native speaker might be better at this, but to me it seems weird to
> enforce a promise.  honor_memcg_guarantee?

OK, will go with honor. Thanks!
---
From 3101ce41cc8c0c9691d98054e8811c66a77cd079 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 2 May 2014 17:47:32 +0200
Subject: [PATCH] mmotm: memcg-mm-introduce-lowlimit-reclaim-fix.patch

mem_cgroup_reclaim_eligible -> mem_cgroup_within_guarantee
follow_low_limit -> honor_memcg_guarantee
as suggested by Johannes.
---
 include/linux/memcontrol.h |  6 +++---
 mm/memcontrol.c            | 15 ++++++++-------
 mm/vmscan.c                | 25 ++++++++++++++++---------
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c59056f4bc6..c00ccc5f70b9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,7 +92,7 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
-extern bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
@@ -291,10 +291,10 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
-	return true;
+	return false;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a276c0d141e..58982d18f6ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2810,26 +2810,27 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 }
 
 /**
- * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
- * reclaim
+ * mem_cgroup_within_guarantee - checks whether given memcg is within its
+ * memory guarantee
  * @memcg: target memcg for the reclaim
  * @root: root of the reclaim hierarchy (null for the global reclaim)
  *
- * The given group is reclaimable if it is above its low limit and the same
- * applies for all parents up the hierarchy until root (including).
+ * The given group is within its reclaim guarantee if it is below its low limit
+ * or the same applies for any parent up the hierarchy until root (including).
+ * Such a group might be excluded from the reclaim.
  */
-bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
+bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
 {
 	do {
 		if (!res_counter_low_limit_excess(&memcg->res))
-			return false;
+			return true;
 		if (memcg == root)
 			break;
 
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
-	return true;
+	return false;
 }
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f428158254e..5f923999bb79 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2215,8 +2215,18 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
+/**
+ * __shrink_zone - shrinks a given zone
+ *
+ * @zone: zone to shrink
+ * @sc: scan control with additional reclaim parameters
+ * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
+ * guarantee
+ *
+ * Returns the number of reclaimed memcgs.
+ */
 static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool follow_low_limit)
+		bool honor_memcg_guarantee)
 {
 	unsigned long nr_reclaimed, nr_scanned;
 	unsigned nr_scanned_groups = 0;
@@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			struct lruvec *lruvec;
 
-			/*
-			 * Memcg might be under its low limit so we have to
-			 * skip it during the first reclaim round
-			 */
-			if (follow_low_limit &&
-					!mem_cgroup_reclaim_eligible(memcg, root)) {
+			/* Memcg might be protected from the reclaim */
+			if (honor_memcg_guarantee &&
+					mem_cgroup_within_guarantee(memcg, root)) {
 				/*
 				 * It would be more optimal to skip the memcg
 				 * subtree now but we do not have a memcg iter
@@ -2289,8 +2296,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 	if (!__shrink_zone(zone, sc, true)) {
 		/*
 		 * First round of reclaim didn't find anything to reclaim
-		 * because of low limit protection so try again and ignore
-		 * the low limit this time.
+		 * because of the memory guarantees for all memcgs in the
+		 * reclaim target so try again and ignore guarantees this time.
 		 */
 		__shrink_zone(zone, sc, false);
 	}
-- 
2.0.0.rc0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02  9:36       ` Michal Hocko
@ 2014-05-02 15:58         ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-02 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 19d620b3d69c..40e517630138 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > >  	return mem_cgroup_from_id(id);
> > >  }
> > >  
> > > +/**
> > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > + * reclaim
> > > + * @memcg: target memcg for the reclaim
> > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > + *
> > > + * The given group is reclaimable if it is above its low limit and the same
> > > + * applies for all parents up the hierarchy until root (including).
> > > + */
> > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > +		struct mem_cgroup *root)
> > 
> > Could you please rename this to something that is more descriptive in
> > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> 
> I have intentionally used something that is not low_limit specific. The
> generic reclaim code doesn't have to care about the reason why a memcg is
> not reclaimable. I agree that having follow_low_limit parameter explicit
> and mem_cgroup_reclaim_eligible not is messy. So something should be
> renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> but I do not have a strong preference.
> 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index c1cd99a5074b..0f428158254e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> [...]
> > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > +{
> > > +	if (!__shrink_zone(zone, sc, true)) {
> > > +		/*
> > > +		 * First round of reclaim didn't find anything to reclaim
> > > +		 * because of low limit protection so try again and ignore
> > > +		 * the low limit this time.
> > > +		 */
> > > +		__shrink_zone(zone, sc, false);
> > > +	}

So I don't think this can work as it is, because we are not actually
changing priority levels yet.  It will give up on the guarantees of
bigger groups way before smaller groups are even seriously looked at.

> > I would actually prefer not having a second round here, and make the
> > low limit behave more like mlock memory.  If there is no reclaimable
> > memory, go OOM.
> 
> This was done in my previous attempt and I prefer OOM myself but it is
> also true that starting with a more relaxed limit and adding an
> option for hard guarantee later when we have a clear usecase is a better
> approach. Although I can see potential in go-oom-rather-than-reclaim
> configurations, usecases I am primarily interested in won't overcommit on
> low_limit.
> 
> That being said, I like the idea of having the hard guarantee but I also
> think it should be configurable. I can post those patches in this thread
> but I feel it is too early as nobody has explicitly asked for this yet.

As per above, this makes the semantics so much more fishy.  When
exactly do we stop honoring the guarantees in the process?

This is not even guarantees anymore, but rather another reclaim
prioritization scheme with best-effort semantics.  That went over
horribly with soft limits, and I don't want to repeat this.

Overcommitting on guarantees makes no sense, and you even agree you
are not interested in it.  We also agree that we can always add a knob
later on to change semantics when an actual usecase presents itself,
so why not start with the clear and simple semantics, and the simpler
implementation?

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 15:58         ` Johannes Weiner
@ 2014-05-02 16:49           ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-02 16:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 19d620b3d69c..40e517630138 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > > >  	return mem_cgroup_from_id(id);
> > > >  }
> > > >  
> > > > +/**
> > > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > > + * reclaim
> > > > + * @memcg: target memcg for the reclaim
> > > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > > + *
> > > > + * The given group is reclaimable if it is above its low limit and the same
> > > > + * applies for all parents up the hierarchy until root (including).
> > > > + */
> > > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > > +		struct mem_cgroup *root)
> > > 
> > > Could you please rename this to something that is more descriptive in
> > > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> > 
> > I have intentionally used something that is not low_limit specific. The
> > generic reclaim code doesn't have to care about the reason why a memcg is
> > not reclaimable. I agree that having follow_low_limit parameter explicit
> > and mem_cgroup_reclaim_eligible not is messy. So something should be
> > renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> > but I do not have a strong preference.
> > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index c1cd99a5074b..0f428158254e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > [...]
> > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > +{
> > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > +		/*
> > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > +		 * because of low limit protection so try again and ignore
> > > > +		 * the low limit this time.
> > > > +		 */
> > > > +		__shrink_zone(zone, sc, false);
> > > > +	}
> 
> So I don't think this can work as it is, because we are not actually
> changing priority levels yet. 

__shrink_zone returns with 0 only if the whole hierarchy is under the low
limit. This means that the groups are over-committed and it doesn't make
much sense to play with priority. Low limit reclaimability is independent
of the priority.

> It will give up on the guarantees of bigger groups way before smaller
> groups are even seriously looked at.

How would that happen? Those (smaller) groups would get reclaimed and we
wouldn't fall back. Or am I missing your point?

> > > I would actually prefer not having a second round here, and make the
> > > low limit behave more like mlock memory.  If there is no reclaimable
> > > memory, go OOM.
> > 
> > This was done in my previous attempt and I prefer OOM myself but it is
> > also true that starting with a more relaxed limit and adding an
> > option for hard guarantee later when we have a clear usecase is a better
> > approach. Although I can see potential in go-oom-rather-than-reclaim
> > configurations, usecases I am primarily interested in won't overcommit on
> > low_limit.
> > 
> > That being said, I like the idea of having the hard guarantee but I also
> > think it should be configurable. I can post those patches in this thread
> > but I feel it is too early as nobody has explicitly asked for this yet.
> 
> As per above, this makes the semantics so much more fishy.  When
> exactly do we stop honoring the guarantees in the process?

When the whole reclaimed hierarchy is below low_limit. In other words,
when we would go OOM without the fallback.

> This is not even guarantees anymore, but rather another reclaim
> prioritization scheme with best-effort semantics.  That went over
> horribly with soft limits, and I don't want to repeat this.
> 
> Overcommitting on guarantees makes no sense, and you even agree you
> are not interested in it.  We also agree that we can always add a knob
> later on to change semantics when an actual usecase presents itself,
> so why not start with the clear and simple semantics, and the simpler
> implementation?

So you really prefer an OOM instead? That was the original
implementation posted at the end of last year and some people
had concerns about it. This is the primary reason I came up with a
weaker version which falls back rather than going OOM.
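
For readers following the argument, here is a minimal user-space sketch
of the two-round behaviour discussed above. It is not the kernel code:
struct group, reclaim_eligible(), scan_pass() and shrink() are
hypothetical stand-ins, and "scanning" a group is reduced to counting it.
The eligibility walk follows the quoted kernel-doc (a group is
reclaimable only if it and every ancestor up to and including root is
above its low limit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct mem_cgroup. */
struct group {
	unsigned long usage;      /* pages charged to the group */
	unsigned long low_limit;  /* 0 means "no protection" */
	struct group *parent;     /* NULL at the top of the tree */
};

/*
 * Eligible only if the group and every ancestor up to and including
 * root is above its low limit.
 */
static bool reclaim_eligible(const struct group *g, const struct group *root)
{
	for (; g; g = g->parent) {
		if (g->usage <= g->low_limit)
			return false;
		if (g == root)
			break;
	}
	return true;
}

/* One pass over the hierarchy; returns how many groups were scanned. */
static unsigned scan_pass(struct group **groups, size_t n,
			  const struct group *root, bool honor_low_limit)
{
	unsigned scanned = 0;

	for (size_t i = 0; i < n; i++) {
		if (honor_low_limit && !reclaim_eligible(groups[i], root))
			continue;
		scanned++;	/* real code would shrink the LRU lists here */
	}
	return scanned;
}

/* Returns true if the low limit had to be ignored (the fallback case). */
static bool shrink(struct group **groups, size_t n, const struct group *root)
{
	if (scan_pass(groups, n, root, true))
		return false;
	scan_pass(groups, n, root, false);
	return true;
}
```

In this model the fallback triggers only when no group in the hierarchy
is eligible, i.e. when the low limits are over-committed. Note that
raising the low limit of the root of the reclaimed hierarchy above its
usage protects every child as well, because of the ancestor walk.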

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 16:49           ` Michal Hocko
@ 2014-05-02 22:00             ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-02 22:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index 19d620b3d69c..40e517630138 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -2808,6 +2808,29 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > > > >  	return mem_cgroup_from_id(id);
> > > > >  }
> > > > >  
> > > > > +/**
> > > > > + * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> > > > > + * reclaim
> > > > > + * @memcg: target memcg for the reclaim
> > > > > + * @root: root of the reclaim hierarchy (null for the global reclaim)
> > > > > + *
> > > > > + * The given group is reclaimable if it is above its low limit and the same
> > > > > + * applies for all parents up the hierarchy until root (including).
> > > > > + */
> > > > > +bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> > > > > +		struct mem_cgroup *root)
> > > > 
> > > > Could you please rename this to something that is more descriptive in
> > > > the reclaim callsite?  How about mem_cgroup_within_low_limit()?
> > > 
> > > I have intentionally used something that is not low_limit specific. The
> > > generic reclaim code doesn't have to care about the reason why a memcg is
> > > not reclaimable. I agree that having follow_low_limit parameter explicit
> > > and mem_cgroup_reclaim_eligible not is messy. So something should be
> > > renamed. I would probably go with s@follow_low_limit@check_reclaim_eligible@
> > > but I do not have a strong preference.
> > > 
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index c1cd99a5074b..0f428158254e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > [...]
> > > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > > +{
> > > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > > +		/*
> > > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > > +		 * because of low limit protection so try again and ignore
> > > > > +		 * the low limit this time.
> > > > > +		 */
> > > > > +		__shrink_zone(zone, sc, false);
> > > > > +	}
> > 
> > So I don't think this can work as it is, because we are not actually
> > changing priority levels yet. 
> 
> __shrink_zone returns with 0 only if the whole hierarchy is under the low
> limit. This means that the groups are over-committed and it doesn't make
> much sense to play with priority. Low limit reclaimability is independent
> of the priority.
> 
> > It will give up on the guarantees of bigger groups way before smaller
> > groups are even seriously looked at.
> 
> How would that happen? Those (smaller) groups would get reclaimed and we
> wouldn't fall back. Or am I missing your point?

Lol, I hadn't updated my brain to a394cb8ee632 ("memcg,vmscan: do not
break out targeted reclaim without reclaimed pages") yet...  Yes, you
are right.

> > > > I would actually prefer not having a second round here, and make the
> > > > low limit behave more like mlock memory.  If there is no reclaimable
> > > > memory, go OOM.
> > > 
> > > This was done in my previous attempt and I prefer OOM myself but it is
> > > also true that starting with a more relaxed limit and adding an
> > > option for hard guarantee later when we have a clear usecase is a better
> > > approach. Although I can see potential in go-oom-rather-than-reclaim
> > > configurations, usecases I am primarily interested in won't overcommit on
> > > low_limit.
> > > 
> > > That being said, I like the idea of having the hard guarantee but I also
> > > think it should be configurable. I can post those patches in this thread
> > > but I feel it is too early as nobody has explicitly asked for this yet.
> > 
> > As per above, this makes the semantics so much more fishy.  When
> > exactly do we stop honoring the guarantees in the process?
> 
> When the whole reclaimed hierarchy is below low_limit. In other words,
> when we would go OOM without the fallback.
>
> > This is not even guarantees anymore, but rather another reclaim
> > prioritization scheme with best-effort semantics.  That went over
> > horribly with soft limits, and I don't want to repeat this.
> > 
> > Overcommitting on guarantees makes no sense, and you even agree you
> > are not interested in it.  We also agree that we can always add a knob
> > later on to change semantics when an actual usecase presents itself,
> > so why not start with the clear and simple semantics, and the simpler
> > implementation?
> 
> So you really prefer an OOM instead? That was the original
> implementation posted at the end of last year and some people
> had concerns about it. This is the primary reason I came up with a
> weaker version which falls back rather than going OOM.

I'll dig through the archives on this then, thanks.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 22:00             ` Johannes Weiner
@ 2014-05-05 14:21               ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-05 14:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 18:00:56, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
[...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index c1cd99a5074b..0f428158254e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > [...]
> > > > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > > > +{
> > > > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > > > +		/*
> > > > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > > > +		 * because of low limit protection so try again and ignore
> > > > > > +		 * the low limit this time.
> > > > > > +		 */
> > > > > > +		__shrink_zone(zone, sc, false);
> > > > > > +	}
> > > 
> > > So I don't think this can work as it is, because we are not actually
> > > changing priority levels yet. 
> > 
> > __shrink_zone returns with 0 only if the whole hierarchy is under the low
> > limit. This means that the groups are over-committed and it doesn't make
> > much sense to play with priority. Low limit reclaimability is independent
> > of the priority.
> > 
> > > It will give up on the guarantees of bigger groups way before smaller
> > > groups are even seriously looked at.
> > 
> > How would that happen? Those (smaller) groups would get reclaimed and we
> > wouldn't fall back. Or am I missing your point?
> 
> Lol, I hadn't updated my brain to a394cb8ee632 ("memcg,vmscan: do not
> break out targeted reclaim without reclaimed pages") yet...  Yes, you
> are right.

You made me think about this more and you are right ;).
The code as is doesn't cope with many racing reclaimers: some threads
can fall back to ignoring the lowlimit even though there are groups to
scan in the hierarchy, because those groups were visited by other
reclaimers. The patch below should help with that. What do you think?
I am also wondering whether we want to add a fallback counter to
memory.stat.
---
From e997b8b4ac724aa29bdeff998d2186ee3c0a97d8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 5 May 2014 15:12:18 +0200
Subject: [PATCH] vmscan: memcg: check whether the low limit should be ignored

Low-limit (aka guarantee) is ignored when there is no group scanned
during the first round of __shrink_zone. This approach doesn't work when
multiple reclaimers race and reclaim the same hierarchy (e.g. kswapd
vs. direct reclaim or multiple tasks hitting the hard limit) because
memcg iterator makes sure that multiple reclaimers are interleaved
in the hierarchy. This means that some reclaimers can see 0 scanned
groups although there are groups which are above the low-limit and they
were reclaimed on behalf of other reclaimers. This leads to a premature
low-limit break.

This patch adds mem_cgroup_all_within_guarantee() which will check
whether all the groups in the reclaimed hierarchy are within their low
limit and shrink_zone will allow the fallback reclaim only when that is
true. This alone is still not sufficient, however, because it would lead
to another problem. If a reclaimer constantly fails to scan anything
because it sees only groups within their guarantees while others do the
reclaim, then the reclaim priority would drop down very quickly.
shrink_zone has to be careful to preserve the scan-at-least-one-group
semantic, so __shrink_zone has to be retried until at least one group
is scanned.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |  5 +++++
 mm/memcontrol.c            | 13 +++++++++++++
 mm/vmscan.c                | 17 ++++++++++++-----
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c00ccc5f70b9..077a777bd9ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,6 +94,7 @@ bool task_in_mem_cgroup(struct task_struct *task,
 
 extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
+extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -296,6 +297,10 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 {
 	return false;
 }
+static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
+{
+	return false;
+}
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 58982d18f6ea..4fd4784d1548 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2833,6 +2833,19 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 	return false;
 }
 
+bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
+{
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, root)
+		if (!mem_cgroup_within_guarantee(iter, root)) {
+			mem_cgroup_iter_break(root, iter);
+			return false;
+		}
+
+	return true;
+}
+
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f923999bb79..2686e47f04cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2293,13 +2293,20 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	if (!__shrink_zone(zone, sc, true)) {
+	bool honor_guarantee = true;
+
+	while (!__shrink_zone(zone, sc, honor_guarantee)) {
 		/*
-		 * First round of reclaim didn't find anything to reclaim
-		 * because of the memory guantees for all memcgs in the
-		 * reclaim target so try again and ignore guarantees this time.
+		 * The previous round of reclaim didn't find anything to scan
+		 * because
+		 * a) the whole reclaimed hierarchy is within guarantee so
+		 *    we fall back to ignoring the guarantee because the other
+		 *    option would be the OOM
+		 * b) multiple reclaimers are racing and so the first round
+		 *    should be retried
 		 */
-		__shrink_zone(zone, sc, false);
+		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+			honor_guarantee = false;
 	}
 }
 
-- 
2.0.0.rc0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
@ 2014-05-05 14:21               ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-05 14:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri 02-05-14 18:00:56, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
[...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index c1cd99a5074b..0f428158254e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > [...]
> > > > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > > > +{
> > > > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > > > +		/*
> > > > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > > > +		 * because of low limit protection so try again and ignore
> > > > > > +		 * the low limit this time.
> > > > > > +		 */
> > > > > > +		__shrink_zone(zone, sc, false);
> > > > > > +	}
> > > 
> > > So I don't think this can work as it is, because we are not actually
> > > changing priority levels yet. 
> > 
> > __shrink_zone returns with 0 only if the whole hierarchy is under low
> > limit. This means that they are over-committed and it doesn't make much
> > sense to play with priority. Low limit reclaimability is independent of
> > the priority.
> > 
> > > It will give up on the guarantees of bigger groups way before smaller
> > > groups are even seriously looked at.
> > 
> > How would that happen? Those (smaller) groups would get reclaimed and we
> > wouldn't fallback. Or am I missing your point?
> 
> Lol, I hadn't updated my brain to a394cb8ee632 ("memcg,vmscan: do not
> break out targeted reclaim without reclaimed pages") yet...  Yes, you
> are right.

You made me think about this more and you are right ;).
The code as is doesn't cope with many racing reclaimers: some threads
can fall back to ignoring the lowlimit although there are groups to
scan in the hierarchy, because those groups were visited by other
reclaimers.
The patch below should help with that. What do you think?
I am also thinking we might want to add a fallback counter to memory.stat?
---

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 22:00             ` Johannes Weiner
  (?)
  (?)
@ 2014-05-06 13:29             ` Johannes Weiner
  2014-05-06 14:32                 ` Michal Hocko
  -1 siblings, 1 reply; 196+ messages in thread
From: Johannes Weiner @ 2014-05-06 13:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Fri, May 02, 2014 at 06:00:56PM -0400, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > This is not even guarantees anymore, but rather another reclaim
> > > prioritization scheme with best-effort semantics.  That went over
> > > horribly with soft limits, and I don't want to repeat this.
> > > 
> > > Overcommitting on guarantees makes no sense, and you even agree you
> > > are not interested in it.  We also agree that we can always add a knob
> > > later on to change semantics when an actual usecase presents itself,
> > > so why not start with the clear and simple semantics, and the simpler
> > > implementation?
> > 
> > So you are really preferring an OOM instead? That was the original
> > implementation posted at the end of last year and some people
> > had concerns about it. This is the primary reason I came up with a
> > weaker version which fallbacks rather than OOM.
> 
> I'll dig through the archives on this then, thanks.

The most recent discussion on this I could find was between you and
Greg, where the final outcome was (excerpt):

---

From: Greg Thelen <gthelen@google.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: linux-mm@kvack.org,  Johannes Weiner <hannes@cmpxchg.org>,  Andrew Morton <akpm@linux-foundation.org>,  KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,  LKML <linux-kernel@vger.kernel.org>,  Ying Han <yinghan@google.com>,  Hugh Dickins <hughd@google.com>,  Michel Lespinasse <walken@google.com>,  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,  Tejun Heo <tj@kernel.org>
Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
	<xr93sis6obb5.fsf@gthelen.mtv.corp.google.com>
	<20140130123044.GB13509@dhcp22.suse.cz>
	<xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
	<20140203144341.GI2495@dhcp22.suse.cz>
Date: Mon, 03 Feb 2014 17:33:13 -0800
Message-ID: <xr93zjm7br1i.fsf@gthelen.mtv.corp.google.com>
List-ID: <linux-mm.kvack.org>

On Mon, Feb 03 2014, Michal Hocko wrote:

> On Thu 30-01-14 16:28:27, Greg Thelen wrote:
>> But this soft_limit,priority extension can be added later.
>
> Yes, I would like to have the strong semantic first and then deal with a
> weaker form. Either by a new limit or a flag.

Sounds good.

---

So I think everybody involved in the discussions so far prefers
a hard guarantee, and then later, if needed, either adding a knob to
make it a soft guarantee or actually implementing a usable soft limit.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-06 13:29             ` Johannes Weiner
@ 2014-05-06 14:32                 ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-06 14:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Tue 06-05-14 09:29:32, Johannes Weiner wrote:
> On Fri, May 02, 2014 at 06:00:56PM -0400, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > This is not even guarantees anymore, but rather another reclaim
> > > > prioritization scheme with best-effort semantics.  That went over
> > > > horribly with soft limits, and I don't want to repeat this.
> > > > 
> > > > Overcommitting on guarantees makes no sense, and you even agree you
> > > > are not interested in it.  We also agree that we can always add a knob
> > > > later on to change semantics when an actual usecase presents itself,
> > > > so why not start with the clear and simple semantics, and the simpler
> > > > implementation?
> > > 
> > > So you are really preferring an OOM instead? That was the original
> > > implementation posted at the end of last year and some people
> > > had concerns about it. This is the primary reason I came up with a
> > > weaker version which fallbacks rather than OOM.
> > 
> > I'll dig through the archives on this then, thanks.
> 
> The most recent discussion on this I could find was between you and
> Greg, where the final outcome was (excerpt):
> 
> ---
> 
> From: Greg Thelen <gthelen@google.com>
> To: Michal Hocko <mhocko@suse.cz>
> Cc: linux-mm@kvack.org,  Johannes Weiner <hannes@cmpxchg.org>,  Andrew Morton <akpm@linux-foundation.org>,  KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,  LKML <linux-kernel@vger.kernel.org>,  Ying Han <yinghan@google.com>,  Hugh Dickins <hughd@google.com>,  Michel Lespinasse <walken@google.com>,  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,  Tejun Heo <tj@kernel.org>
> Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
> References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
> 	<xr93sis6obb5.fsf@gthelen.mtv.corp.google.com>
> 	<20140130123044.GB13509@dhcp22.suse.cz>
> 	<xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
> 	<20140203144341.GI2495@dhcp22.suse.cz>
> Date: Mon, 03 Feb 2014 17:33:13 -0800
> Message-ID: <xr93zjm7br1i.fsf@gthelen.mtv.corp.google.com>
> List-ID: <linux-mm.kvack.org>
> 
> On Mon, Feb 03 2014, Michal Hocko wrote:
> 
> > On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> >> But this soft_limit,priority extension can be added later.
> >
> > Yes, I would like to have the strong semantic first and then deal with a
> > weaker form. Either by a new limit or a flag.
> 
> Sounds good.
> 
> ---
> 
> So I think everybody involved in the discussions so far are preferring
> a hard guarantee, and then later, if needed, to either add a knob to
> make it a soft guarantee or to actually implement a usable soft limit.

I am afraid most of that discussion happened off-list :( Sadly, not
much of it happened on the list.
Sorry, I should have been specific and mentioned that the discussions
happened at LSF and partly at the KS.

The strongest point was made by Rik when he claimed that memcg is not
aware of memory zones and so one memcg with a lowlimit larger than the
size of a zone can eat up that zone without any way to free it. This
can cause additional trouble (permanent reclaim on that zone and OOM in
extreme situations).

That convinced me that having the default OOM semantic from the very
beginning is too much and starting with something more relaxed (fallback
rather than OOM), but still usable, is a better choice.
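
As a rough illustration of the zone concern, consider the following
user-space sketch. It is a model only: struct memcg_model, its per-zone
page counts and the helper names are all hypothetical, and it assumes a
hard guarantee under which none of a group's pages are reclaimed while its
total usage is at or below its lowlimit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one memcg spread over several zones. */
struct memcg_model {
	unsigned long low;		 /* lowlimit in pages, accounted memcg-wide */
	const unsigned long *zone_pages; /* pages this memcg holds in each zone */
	size_t nr_zones;
};

/* Total usage across all zones; the lowlimit is compared against this. */
static unsigned long memcg_usage(const struct memcg_model *m)
{
	unsigned long sum = 0;

	for (size_t z = 0; z < m->nr_zones; z++)
		sum += m->zone_pages[z];
	return sum;
}

/* Pages in zone z that reclaim would have to skip under a hard guarantee. */
static unsigned long protected_in_zone(const struct memcg_model *m, size_t z)
{
	assert(z < m->nr_zones);
	/* The check is memcg-wide, so it says nothing about a single zone. */
	return memcg_usage(m) <= m->low ? m->zone_pages[z] : 0;
}

/* True if the whole zone is covered by protected pages of this memcg. */
static bool zone_pinned(const struct memcg_model *m, size_t z,
			unsigned long zone_size)
{
	return protected_in_zone(m, z) >= zone_size;
}
```

With a 1000-page zone completely filled by a group whose total usage is
1200 pages, a lowlimit of 1500 pages leaves that zone with nothing to
reclaim in this model, while a lowlimit of 1000 pages does not.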
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-06 14:32                 ` Michal Hocko
@ 2014-05-06 15:21                   ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-06 15:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
> On Tue 06-05-14 09:29:32, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 06:00:56PM -0400, Johannes Weiner wrote:
> > > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > > This is not even guarantees anymore, but rather another reclaim
> > > > > prioritization scheme with best-effort semantics.  That went over
> > > > > horribly with soft limits, and I don't want to repeat this.
> > > > > 
> > > > > Overcommitting on guarantees makes no sense, and you even agree you
> > > > > are not interested in it.  We also agree that we can always add a knob
> > > > > later on to change semantics when an actual usecase presents itself,
> > > > > so why not start with the clear and simple semantics, and the simpler
> > > > > implementation?
> > > > 
> > > > So you are really preferring an OOM instead? That was the original
> > > > implementation posted at the end of last year and some people
> > > > had concerns about it. This is the primary reason I came up with a
> > > > weaker version which fallbacks rather than OOM.
> > > 
> > > I'll dig through the archives on this then, thanks.
> > 
> > The most recent discussion on this I could find was between you and
> > Greg, where the final outcome was (excerpt):
> > 
> > ---
> > 
> > From: Greg Thelen <gthelen@google.com>
> > To: Michal Hocko <mhocko@suse.cz>
> > Cc: linux-mm@kvack.org,  Johannes Weiner <hannes@cmpxchg.org>,  Andrew Morton <akpm@linux-foundation.org>,  KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,  LKML <linux-kernel@vger.kernel.org>,  Ying Han <yinghan@google.com>,  Hugh Dickins <hughd@google.com>,  Michel Lespinasse <walken@google.com>,  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,  Tejun Heo <tj@kernel.org>
> > Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
> > References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
> > 	<xr93sis6obb5.fsf@gthelen.mtv.corp.google.com>
> > 	<20140130123044.GB13509@dhcp22.suse.cz>
> > 	<xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
> > 	<20140203144341.GI2495@dhcp22.suse.cz>
> > Date: Mon, 03 Feb 2014 17:33:13 -0800
> > Message-ID: <xr93zjm7br1i.fsf@gthelen.mtv.corp.google.com>
> > List-ID: <linux-mm.kvack.org>
> > 
> > On Mon, Feb 03 2014, Michal Hocko wrote:
> > 
> > > On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> > >> But this soft_limit,priority extension can be added later.
> > >
> > > Yes, I would like to have the strong semantic first and then deal with a
> > > weaker form. Either by a new limit or a flag.
> > 
> > Sounds good.
> > 
> > ---
> > 
> > So I think everybody involved in the discussions so far are preferring
> > a hard guarantee, and then later, if needed, to either add a knob to
> > make it a soft guarantee or to actually implement a usable soft limit.
> 
> I am afraid the most of that discussion happened off-list :( Sadly not
> much of a discussion happened on the list.

Time to do it now, then :)

> Sorry I should have been specific and mention that the discussions
> happened at LSF and partly at the KS.
> 
> The strongest point was made by Rik when he claimed that memcg is not
> aware of memory zones and so one memcg with lowlimit larger than the
> size of a zone can eat up that zone without any way to free it.

But who actually cares if an individual zone can be reclaimed?

Userspace allocations can fall back to any other zone.  Unless there
are hard bindings, but hopefully nobody binds a memcg to a node that
is smaller than that memcg's guarantee.  And while the pages are not
reclaimable, they are still movable, so the NUMA balancer is free to
correct any allocation mistakes later on.

As to kernel allocations, watermarks and lowmem protection prevent any
single zone from filling up with userspace pages, regardless of their
reclaimability.

> This can cause additional troubles (permanent reclaim on that zone
> and OOM in an extreme situations).

We have protection against wasting CPU cycles on unreclaimable zones.

So how is it different than anonymous/shared memory without swap?  Or
mlocked memory?

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-06 15:21                   ` Johannes Weiner
@ 2014-05-06 16:12                     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-06 16:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

I am adding Rik to CC (sorry to put you in the middle of a thread -
we started here: https://lkml.org/lkml/2014/4/28/237). You were
stressing the risks of using lowlimit as a hard guarantee at LSF. Could
you repeat your concerns here as well, please?

Short summary:
We are basically discussing how to handle a lowlimit overcommit situation,
when no group is reclaimable because it either doesn't have any pages on
the LRU or it is below its lowlimit (aka guaranteed memory).

The solution proposed in this series is to fall back and reclaim
everybody rather than OOM, with a note that if somebody really needs an
OOM then we can add a per-memcg knob which tells whether to fall back or OOM.

Previously I was suggesting OOM as a default but I realized that this
might be too risky for the default behavior, although I can see some
point in that behavior as well (it would allow having a group which
would never reclaim memory and would rather go OOM, where the memory
demand can be handled more specifically). I do not have any call for such
a hard guarantee usecase now and it would be quite trivial to build
it on top of the more relaxed implementation, so I am more inclined to
the fallback default now.
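
The choice described above could be sketched roughly as follows. This is
only an illustration of the decision, not code from the series: the enum
names and overcommit_action() are hypothetical, and no such per-memcg knob
exists yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values for a per-memcg overcommit knob. */
enum overcommit_policy {
	LOWLIMIT_FALLBACK,	/* default proposed here: reclaim everybody */
	LOWLIMIT_OOM,		/* hard guarantee: never reclaim, OOM instead */
};

enum reclaim_action {
	RECLAIM_HONOR_LOWLIMIT,		/* skip groups below their lowlimit */
	RECLAIM_IGNORE_LOWLIMIT,	/* fallback round, everybody is eligible */
	RECLAIM_TRIGGER_OOM,		/* nothing reclaimable, hand over to OOM */
};

/*
 * Decide what reclaim does once a round scanned nothing.
 * all_within_lowlimit is true only in the overcommit case, i.e. when
 * every group in the reclaimed hierarchy is below its lowlimit.
 */
static enum reclaim_action
overcommit_action(bool all_within_lowlimit, enum overcommit_policy policy)
{
	assert(policy == LOWLIMIT_FALLBACK || policy == LOWLIMIT_OOM);

	if (!all_within_lowlimit)
		return RECLAIM_HONOR_LOWLIMIT;
	if (policy == LOWLIMIT_OOM)
		return RECLAIM_TRIGGER_OOM;
	return RECLAIM_IGNORE_LOWLIMIT;
}
```

The point of the sketch is that the stricter OOM behavior is a single extra
branch on top of the relaxed default, which is why building it later looks
trivial.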

More comments inlined below.

On Tue 06-05-14 11:21:12, Johannes Weiner wrote:
> On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
> > On Tue 06-05-14 09:29:32, Johannes Weiner wrote:
> > > On Fri, May 02, 2014 at 06:00:56PM -0400, Johannes Weiner wrote:
> > > > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > > > This is not even guarantees anymore, but rather another reclaim
> > > > > > prioritization scheme with best-effort semantics.  That went over
> > > > > > horribly with soft limits, and I don't want to repeat this.
> > > > > > 
> > > > > > Overcommitting on guarantees makes no sense, and you even agree you
> > > > > > are not interested in it.  We also agree that we can always add a knob
> > > > > > later on to change semantics when an actual usecase presents itself,
> > > > > > so why not start with the clear and simple semantics, and the simpler
> > > > > > implementation?
> > > > > 
> > > > > So you are really preferring an OOM instead? That was the original
> > > > > implementation posted at the end of last year and some people
> > > > > had concerns about it. This is the primary reason I came up with a
> > > > > weaker version which fallbacks rather than OOM.
> > > > 
> > > > I'll dig through the archives on this then, thanks.
> > > 
> > > The most recent discussion on this I could find was between you and
> > > Greg, where the final outcome was (excerpt):
> > > 
> > > ---
> > > 
> > > From: Greg Thelen <gthelen@google.com>
> > > To: Michal Hocko <mhocko@suse.cz>
> > > Cc: linux-mm@kvack.org,  Johannes Weiner <hannes@cmpxchg.org>,  Andrew Morton <akpm@linux-foundation.org>,  KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,  LKML <linux-kernel@vger.kernel.org>,  Ying Han <yinghan@google.com>,  Hugh Dickins <hughd@google.com>,  Michel Lespinasse <walken@google.com>,  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,  Tejun Heo <tj@kernel.org>
> > > Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
> > > References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
> > > 	<xr93sis6obb5.fsf@gthelen.mtv.corp.google.com>
> > > 	<20140130123044.GB13509@dhcp22.suse.cz>
> > > 	<xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
> > > 	<20140203144341.GI2495@dhcp22.suse.cz>
> > > Date: Mon, 03 Feb 2014 17:33:13 -0800
> > > Message-ID: <xr93zjm7br1i.fsf@gthelen.mtv.corp.google.com>
> > > List-ID: <linux-mm.kvack.org>
> > > 
> > > On Mon, Feb 03 2014, Michal Hocko wrote:
> > > 
> > > > On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> > > >> But this soft_limit,priority extension can be added later.
> > > >
> > > > Yes, I would like to have the strong semantic first and then deal with a
> > > > weaker form. Either by a new limit or a flag.
> > > 
> > > Sounds good.
> > > 
> > > ---
> > > 
> > > So I think everybody involved in the discussions so far are preferring
> > > a hard guarantee, and then later, if needed, to either add a knob to
> > > make it a soft guarantee or to actually implement a usable soft limit.
> > 
> > I am afraid the most of that discussion happened off-list :( Sadly not
> > much of a discussion happened on the list.
> 
> Time to do it now, then :)
> 
> > Sorry I should have been specific and mention that the discussions
> > happened at LSF and partly at the KS.
> > 
> > The strongest point was made by Rik when he claimed that memcg is not
> > aware of memory zones and so one memcg with lowlimit larger than the
> > size of a zone can eat up that zone without any way to free it.
> 
> But who actually cares if an individual zone can be reclaimed?
> 
> Userspace allocations can fall back to any other zone.  Unless there
> are hard bindings, but hopefully nobody binds a memcg to a node that
> is smaller than that memcg's guarantee. 

The protected group might spill over onto another group's node and eat
it up, so that the other group would simply be pushed out of the node it
is bound to.

> And while the pages are not
> reclaimable, they are still movable, so the NUMA balancer is free to
> correct any allocation mistakes later on.

Do we want to depend on NUMA balancer, though?

> As to kernel allocations, watermarks and lowmem protection prevent any
> single zone from filling up with userspace pages, regardless of their
> reclaimability.

Yes but that protects kernel allocations so it wouldn't help with
competing userspace in different memcgs.

> > This can cause additional troubles (permanent reclaim on that zone
> > and OOM in an extreme situations).
> 
> We have protection against wasting CPU cycles on unreclaimable zones.
> 
> So how is it different than anonymous/shared memory without swap?  Or
> mlocked memory?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
@ 2014-05-06 16:12                     ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-06 16:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

I am adding Rik to CC (sorry to put you in the middle of a thread -
we started here: https://lkml.org/lkml/2014/4/28/237). You were
stressing the risks of using lowlimit as a hard guarantee at LSF. Could
you repeat your concerns here as well, please?

Short summary:
We are basically discussing how to handle a lowlimit overcommit situation,
when no group is reclaimable because it either doesn't have any pages on
the LRU or it is below its lowlimit (aka guaranteed memory).

The solution proposed in this series is to fall back and reclaim
everybody rather than OOM, with a note that if somebody really needs an
OOM then we can add a per-memcg knob which tells whether to fall back or OOM.

Previously I was suggesting OOM as a default but I realized that this
might be too risky for the default behavior, although I can see some
point in that behavior as well (it would allow having a group which
would never reclaim memory and would rather go OOM, where the memory
demand can be handled more specifically). I do not have any call for such
a hard guarantee usecase now and it would be quite trivial to build
it on top of the more relaxed implementation, so I am more inclined to
the fallback default now.

More comments inlined below.

On Tue 06-05-14 11:21:12, Johannes Weiner wrote:
> On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
> > On Tue 06-05-14 09:29:32, Johannes Weiner wrote:
> > > On Fri, May 02, 2014 at 06:00:56PM -0400, Johannes Weiner wrote:
> > > > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > > > This is not even guarantees anymore, but rather another reclaim
> > > > > > prioritization scheme with best-effort semantics.  That went over
> > > > > > horribly with soft limits, and I don't want to repeat this.
> > > > > > 
> > > > > > Overcommitting on guarantees makes no sense, and you even agree you
> > > > > > are not interested in it.  We also agree that we can always add a knob
> > > > > > later on to change semantics when an actual usecase presents itself,
> > > > > > so why not start with the clear and simple semantics, and the simpler
> > > > > > implementation?
> > > > > 
> > > > > So you are really preferring an OOM instead? That was the original
> > > > > implementation posted at the end of last year and some people
> > > > > had concerns about it. This is the primary reason I came up with a
> > > > > weaker version which fallbacks rather than OOM.
> > > > 
> > > > I'll dig through the archives on this then, thanks.
> > > 
> > > The most recent discussion on this I could find was between you and
> > > Greg, where the final outcome was (excerpt):
> > > 
> > > ---
> > > 
> > > From: Greg Thelen <gthelen@google.com>
> > > To: Michal Hocko <mhocko@suse.cz>
> > > Cc: linux-mm@kvack.org,  Johannes Weiner <hannes@cmpxchg.org>,  Andrew Morton <akpm@linux-foundation.org>,  KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,  LKML <linux-kernel@vger.kernel.org>,  Ying Han <yinghan@google.com>,  Hugh Dickins <hughd@google.com>,  Michel Lespinasse <walken@google.com>,  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,  Tejun Heo <tj@kernel.org>
> > > Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
> > > References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
> > > 	<xr93sis6obb5.fsf@gthelen.mtv.corp.google.com>
> > > 	<20140130123044.GB13509@dhcp22.suse.cz>
> > > 	<xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
> > > 	<20140203144341.GI2495@dhcp22.suse.cz>
> > > Date: Mon, 03 Feb 2014 17:33:13 -0800
> > > Message-ID: <xr93zjm7br1i.fsf@gthelen.mtv.corp.google.com>
> > > List-ID: <linux-mm.kvack.org>
> > > 
> > > On Mon, Feb 03 2014, Michal Hocko wrote:
> > > 
> > > > On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> > > >> But this soft_limit,priority extension can be added later.
> > > >
> > > > Yes, I would like to have the strong semantic first and then deal with a
> > > > weaker form. Either by a new limit or a flag.
> > > 
> > > Sounds good.
> > > 
> > > ---
> > > 
> > > So I think everybody involved in the discussions so far are preferring
> > > a hard guarantee, and then later, if needed, to either add a knob to
> > > make it a soft guarantee or to actually implement a usable soft limit.
> > 
> > I am afraid the most of that discussion happened off-list :( Sadly not
> > much of a discussion happened on the list.
> 
> Time to do it now, then :)
> 
> > Sorry I should have been specific and mention that the discussions
> > happened at LSF and partly at the KS.
> > 
> > The strongest point was made by Rik when he claimed that memcg is not
> > aware of memory zones and so one memcg with lowlimit larger than the
> > size of a zone can eat up that zone without any way to free it.
> 
> But who actually cares if an individual zone can be reclaimed?
> 
> Userspace allocations can fall back to any other zone.  Unless there
> are hard bindings, but hopefully nobody binds a memcg to a node that
> is smaller than that memcg's guarantee. 

The protected group might spill over to another group and eat it when
another group would be simply pushed out from the node it is bound to.

> And while the pages are not
> reclaimable, they are still movable, so the NUMA balancer is free to
> correct any allocation mistakes later on.

Do we want to depend on NUMA balancer, though?

> As to kernel allocations, watermarks and lowmem protection prevent any
> single zone from filling up with userspace pages, regardless of their
> reclaimability.

Yes but that protects kernel allocations so it wouldn't help with
competing userspace in different memcgs.

> > This can cause additional troubles (permanent reclaim on that zone
> > and OOM in an extreme situations).
> 
> We have protection against wasting CPU cycles on unreclaimable zones.
> 
> So how is it different than anonymous/shared memory without swap?  Or
> mlocked memory?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-06 16:12                     ` Michal Hocko
@ 2014-05-06 16:51                       ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-06 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Tue, May 06, 2014 at 06:12:56PM +0200, Michal Hocko wrote:
> I am adding Rik to CC (sorry to put you in the middle of a thread -
> we have started here: https://lkml.org/lkml/2014/4/28/237). You were
> stressing out risks of using lowlimit as a hard guarantee at LSF. Could
> you repeat your concerns here as well, please?
> 
> Short summary:
> We are basically discussing how to handle lowlimit overcommit situation,
> when no group is reclaimable because it either doesn't have any pages on
> the LRU or it is below its lowlimit (aka guaranteed memory).
> 
> The solution proposed in this series is to fallback and reclaim
> everybody rather than OOM with a note that if somebody really needs an
> OOM then we can add a per-memcg knob which tells whether to fallback or oom.
> 
> Previously I was suggesting OOM as a default but I realized that this
> might be too risky for the default behavior although I can see some
> point in that behavior as well (it would allow to have a group which
> would never reclaim memory and rather go OOM where the memory demand can
> be handled more specifically). I do not have any call for such a hard
> guarantee requirement usecase now and it would be quite trivial to build
> it on top of the more relaxed implementation so I am more inclined to
> the fallback default now.
> 
> More comments inlined below.
> 
> On Tue 06-05-14 11:21:12, Johannes Weiner wrote:
> > On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
> > > On Tue 06-05-14 09:29:32, Johannes Weiner wrote:
> > > > On Fri, May 02, 2014 at 06:00:56PM -0400, Johannes Weiner wrote:
> > > > > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > > > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > > > > This is not even guarantees anymore, but rather another reclaim
> > > > > > > prioritization scheme with best-effort semantics.  That went over
> > > > > > > horribly with soft limits, and I don't want to repeat this.
> > > > > > > 
> > > > > > > Overcommitting on guarantees makes no sense, and you even agree you
> > > > > > > are not interested in it.  We also agree that we can always add a knob
> > > > > > > later on to change semantics when an actual usecase presents itself,
> > > > > > > so why not start with the clear and simple semantics, and the simpler
> > > > > > > implementation?
> > > > > > 
> > > > > > So you are really preferring an OOM instead? That was the original
> > > > > > implementation posted at the end of last year and some people
> > > > > > had concerns about it. This is the primary reason I came up with a
> > > > > > weaker version which fallbacks rather than OOM.
> > > > > 
> > > > > I'll dig through the archives on this then, thanks.
> > > > 
> > > > The most recent discussion on this I could find was between you and
> > > > Greg, where the final outcome was (excerpt):
> > > > 
> > > > ---
> > > > 
> > > > From: Greg Thelen <gthelen@google.com>
> > > > To: Michal Hocko <mhocko@suse.cz>
> > > > Cc: linux-mm@kvack.org,  Johannes Weiner <hannes@cmpxchg.org>,  Andrew Morton <akpm@linux-foundation.org>,  KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,  LKML <linux-kernel@vger.kernel.org>,  Ying Han <yinghan@google.com>,  Hugh Dickins <hughd@google.com>,  Michel Lespinasse <walken@google.com>,  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,  Tejun Heo <tj@kernel.org>
> > > > Subject: Re: [RFC 0/4] memcg: Low-limit reclaim
> > > > References: <1386771355-21805-1-git-send-email-mhocko@suse.cz>
> > > > 	<xr93sis6obb5.fsf@gthelen.mtv.corp.google.com>
> > > > 	<20140130123044.GB13509@dhcp22.suse.cz>
> > > > 	<xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
> > > > 	<20140203144341.GI2495@dhcp22.suse.cz>
> > > > Date: Mon, 03 Feb 2014 17:33:13 -0800
> > > > Message-ID: <xr93zjm7br1i.fsf@gthelen.mtv.corp.google.com>
> > > > List-ID: <linux-mm.kvack.org>
> > > > 
> > > > On Mon, Feb 03 2014, Michal Hocko wrote:
> > > > 
> > > > > On Thu 30-01-14 16:28:27, Greg Thelen wrote:
> > > > >> But this soft_limit,priority extension can be added later.
> > > > >
> > > > > Yes, I would like to have the strong semantic first and then deal with a
> > > > > weaker form. Either by a new limit or a flag.
> > > > 
> > > > Sounds good.
> > > > 
> > > > ---
> > > > 
> > > > So I think everybody involved in the discussions so far are preferring
> > > > a hard guarantee, and then later, if needed, to either add a knob to
> > > > make it a soft guarantee or to actually implement a usable soft limit.
> > > 
> > > I am afraid the most of that discussion happened off-list :( Sadly not
> > > much of a discussion happened on the list.
> > 
> > Time to do it now, then :)
> > 
> > > Sorry I should have been specific and mention that the discussions
> > > happened at LSF and partly at the KS.
> > > 
> > > The strongest point was made by Rik when he claimed that memcg is not
> > > aware of memory zones and so one memcg with lowlimit larger than the
> > > size of a zone can eat up that zone without any way to free it.
> > 
> > But who actually cares if an individual zone can be reclaimed?
> > 
> > Userspace allocations can fall back to any other zone.  Unless there
> > are hard bindings, but hopefully nobody binds a memcg to a node that
> > is smaller than that memcg's guarantee. 
> 
> The protected group might spill over to another group and eat it when
> another group would be simply pushed out from the node it is bound to.

I don't really understand the point you're trying to make.

> > And while the pages are not
> > reclaimable, they are still movable, so the NUMA balancer is free to
> > correct any allocation mistakes later on.
> 
> Do we want to depend on NUMA balancer, though?

You're missing my point.

This is about which functionality of the system is actually impeded by
having large portions of a zone unreclaimable.  Freeing pages in a
zone is a means to an end, not an end in itself.

We wouldn't depend on the NUMA balancer to "free" a zone, I'm just
saying that the NUMA balancer would be unaffected by a zone full of
unreclaimable pages, as long as they are movable.

So who exactly cares about the ability to reclaim individual zones and
how is it a new type of problem compared to existing unreclaimable but
movable memory?

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-06 16:51                       ` Johannes Weiner
@ 2014-05-06 18:30                         ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-06 18:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Tue 06-05-14 12:51:50, Johannes Weiner wrote:
> On Tue, May 06, 2014 at 06:12:56PM +0200, Michal Hocko wrote:
> > On Tue 06-05-14 11:21:12, Johannes Weiner wrote:
> > > On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
[...]
> > > > The strongest point was made by Rik when he claimed that memcg is not
> > > > aware of memory zones and so one memcg with lowlimit larger than the
> > > > size of a zone can eat up that zone without any way to free it.
> > > 
> > > But who actually cares if an individual zone can be reclaimed?
> > > 
> > > Userspace allocations can fall back to any other zone.  Unless there
> > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > is smaller than that memcg's guarantee. 
> > 
> > The protected group might spill over to another group and eat it when
> > another group would be simply pushed out from the node it is bound to.
> 
> I don't really understand the point you're trying to make.

I was just trying to show a case where an individual zone matters. To
make it more specific, consider two groups: A (with a low limit of 60%
of RAM) and B (say with a low limit of 10% of RAM), where B is bound to
a node X (25% of RAM). Having 70% of RAM reserved for guarantees makes
some sense, right? B is not over-committing the node it is bound to.
Yet A's allocations might put pressure on X even though the system as a
whole is still doing fine. This can lead to a situation where X gets
depleted and nothing on it is reclaimable, leading to an OOM condition.
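
The percentages above can be checked with a toy model. This is only an
illustrative sketch, not kernel code: RAM is counted in abstract units
(100 in total), the second node Y and the concrete placement of A's
pages are assumptions, and a group counts as protected while its usage
is below its low limit.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    low_limit: int                              # guaranteed units of RAM
    pages: dict = field(default_factory=dict)   # node name -> units charged

    @property
    def usage(self) -> int:
        return sum(self.pages.values())

    def protected(self) -> bool:
        # Below the low limit: reclaim skips this group (hard guarantee).
        return self.usage < self.low_limit


def reclaimable_on(node, groups):
    """Units on `node` charged to groups that reclaim may touch."""
    return sum(g.pages.get(node, 0) for g in groups if not g.protected())


# Assumed layout: node X holds 25 units, a hypothetical node Y the other 75.
NODE_SIZE = {"X": 25, "Y": 75}

# A is below its guarantee (55 < 60) but 25 of its units landed on X.
a = Group("A", low_limit=60, pages={"X": 25, "Y": 30})
# B is bound to X and has not allocated anything yet.
b = Group("B", low_limit=10)
groups = [a, b]

guaranteed = sum(g.low_limit for g in groups)
free_on_x = NODE_SIZE["X"] - sum(g.pages.get("X", 0) for g in groups)

print("guarantees fit in RAM:", guaranteed <= sum(NODE_SIZE.values()))
print("B's guarantee fits on X:", b.low_limit <= NODE_SIZE["X"])
print("free on X:", free_on_x)
print("reclaimable on X:", reclaimable_on("X", groups))
```

In this layout both sanity checks pass, yet X has neither free nor
reclaimable memory, so an allocation by B can only succeed by breaking
A's low limit, or fail with an OOM.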

I can imagine that most people would rather see the lowlimit break than
OOM. And if there is somebody who really wants OOM even under such
condition then why not, I would be happy to add a knob which would allow
that. But I feel that the default behavior should be the least explosive
one...

> > > And while the pages are not
> > > reclaimable, they are still movable, so the NUMA balancer is free to
> > > correct any allocation mistakes later on.
> > 
> > Do we want to depend on NUMA balancer, though?
> 
> You're missing my point.
> 
> This is about which functionality of the system is actually impeded by
> having large portions of a zone unreclaimable.  Freeing pages in a
> zone is a means to an end, not an end in itself.
> 
> We wouldn't depend on the NUMA balancer to "free" a zone, I'm just
> saying that the NUMA balancer would be unaffected by a zone full of
> unreclaimable pages, as long as they are movable.

Agreed. I wasn't objecting to that part. I was merely noting that we
do not want to depend on the NUMA balancer to fix up placements later
just because the pages are unreclaimable due to restrictions defined
outside of the NUMA scope.

> So who exactly cares about the ability to reclaim individual zones and
> how is it a new type of problem compared to existing unreclaimable but
> movable memory?

The low limit makes the current situation different. The page allocator
simply cannot make the best placement decisions because it has no idea
which group the page gets charged to and therefore whether it gets
protected or not. NUMA balancing can help to reduce this issue but I do
not think it can handle the problem by itself.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-06 18:30                         ` Michal Hocko
@ 2014-05-06 19:55                           ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-06 19:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Tue, May 06, 2014 at 08:30:01PM +0200, Michal Hocko wrote:
> On Tue 06-05-14 12:51:50, Johannes Weiner wrote:
> > On Tue, May 06, 2014 at 06:12:56PM +0200, Michal Hocko wrote:
> > > On Tue 06-05-14 11:21:12, Johannes Weiner wrote:
> > > > On Tue, May 06, 2014 at 04:32:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > The strongest point was made by Rik when he claimed that memcg is not
> > > > > aware of memory zones and so one memcg with lowlimit larger than the
> > > > > size of a zone can eat up that zone without any way to free it.
> > > > 
> > > > But who actually cares if an individual zone can be reclaimed?
> > > > 
> > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > is smaller than that memcg's guarantee. 
> > > 
> > > The protected group might spill over to another group and eat it when
> > > another group would be simply pushed out from the node it is bound to.
> > 
> > I don't really understand the point you're trying to make.
> 
> I was just trying to show a case where an individual zone matters. To
> make it more specific, consider two groups: A (with a low limit of 60%
> of RAM) and B (say with a low limit of 10% of RAM), where B is bound to
> a node X (25% of RAM). Having 70% of RAM reserved for guarantees makes
> some sense, right? B is not over-committing the node it is bound to.
> Yet A's allocations might put pressure on X even though the system as a
> whole is still doing fine. This can lead to a situation where X gets
> depleted and nothing on it is reclaimable, leading to an OOM condition.

Once you assume control of memory *placement* in the system like this,
you cannot also pretend to be clueless and have unreclaimable memory
of this magnitude spread around into nodes used by other bound tasks.

If we were to actively support such configurations, we should be doing
direct NUMA balancing and migrate these pages out of node X when B
needs to allocate.  That would fix the problem for all unevictable
memory, not just memcg guarantees, and would prefer node-offloading
over swapping in cases where swap is available.
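
As a rough illustration of the node-offloading idea, the sketch below
frees space on a node by moving pages of groups that are not bound to
it onto other nodes with spare capacity. It is a toy model rather than
the kernel's migration code; the function name make_room, the data
layout, and the numbers are all assumptions.

```python
def make_room(target, need, usage, capacity, bound):
    """Free `need` units on node `target` by migrating pages away from it.

    usage:    {group: {node: units}}
    capacity: {node: units}
    bound:    {group: node} for groups restricted to a single node
    Returns the number of units actually freed on `target`.
    """
    def free(node):
        used = sum(per_node.get(node, 0) for per_node in usage.values())
        return capacity[node] - used

    freed = 0
    for group, per_node in usage.items():
        if bound.get(group) == target:
            continue  # pages of a group bound to `target` cannot move away
        for dest in capacity:
            if dest == target or freed >= need:
                continue
            # Move as much as is present, fits on `dest`, and is still needed.
            move = min(per_node.get(target, 0), free(dest), need - freed)
            if move > 0:
                per_node[target] -= move
                per_node[dest] = per_node.get(dest, 0) + move
                freed += move
    return freed


# Hypothetical layout: A has spilled onto X, B is bound to X and needs 10.
capacity = {"X": 25, "Y": 75}
usage = {"A": {"X": 25, "Y": 30}, "B": {}}
bound = {"B": "X"}

freed = make_room("X", 10, usage, capacity, bound)
print(freed, usage["A"])
```

A keeps all of its memory in this model; only the placement changes, so
no guarantee has to be broken and nothing is swapped out.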

But really, this whole scenario sounds contrived to me.  And there is
nothing specific about memcg guarantees in there.

> I can imagine that most people would rather see the lowlimit break than
> OOM. And if there is somebody who really wants OOM even under such
> condition then why not, I would be happy to add a knob which would allow
> that. But I feel that the default behavior should be the least explosive
> one...

Memcgs being node-agnostic is a reason *for* doing hard guarantees,
not against it.  If I set up guarantees on a NUMA system balanced by
the kernel, I want them to be honored, and not have my guaranteed
memory reclaimed randomly due to kernel-internal placement decisions.

> > > > And while the pages are not
> > > > reclaimable, they are still movable, so the NUMA balancer is free to
> > > > correct any allocation mistakes later on.
> > > 
> > > Do we want to depend on NUMA balancer, though?
> > 
> > You're missing my point.
> > 
> > This is about which functionality of the system is actually impeded by
> > having large portions of a zone unreclaimable.  Freeing pages in a
> > zone is a means to an end, not an end in itself.
> > 
> > We wouldn't depend on the NUMA balancer to "free" a zone, I'm just
> > saying that the NUMA balancer would be unaffected by a zone full of
> > unreclaimable pages, as long as they are movable.
> 
> Agreed. I wasn't objecting to that part. I was merely noting that we
> do not want to depend on the NUMA balancer to fix up placements later
> just because the pages are unreclaimable due to restrictions defined
> outside of the NUMA scope.

Again, this is not a new problem.  Solve it if you want to, but don't
design a new userspace ABI around a limitation in NUMA node reclaim.

> > So who exactly cares about the ability to reclaim individual zones and
> > how is it a new type of problem compared to existing unreclaimable but
> > movable memory?
> 
> The low limit makes the current situation different. The page allocator
> simply cannot make the best placement decisions because it has no idea
> which group the page gets charged to and therefore whether it gets
> protected or not. NUMA balancing can help to reduce this issue but I do
> not think it can handle the problem by itself.

It depends on the task, not on the group.

You can turn your argument upside down: if you fail guarantees just
because a single zone is otherwise unreclaimable, then page allocator
placement ends up dictating which page is guaranteed memory and which
is not.  This really makes no sense to me.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 3/4] memcg, doc: clarify global vs. limit reclaims
  2014-05-02  9:43       ` Michal Hocko
@ 2014-05-06 19:56         ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-06 19:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

Andrew, could you queue this one please?

On Fri 02-05-14 11:43:51, Michal Hocko wrote:
[...]
> From 30b9505169e574cdb553226e1a361cc527ed492b Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Fri, 2 May 2014 11:42:35 +0200
> Subject: [PATCH] mmotm: memcg-doc-clarify-global-vs-limit-reclaims-fix.patch
> 
> update doc as per Johannes
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  Documentation/cgroups/memory.txt | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index add1be001416..2cde96787ceb 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -241,17 +241,9 @@ global VM. Cgroups can get reclaimed basically under two conditions
>     proportionally wrt. their LRU size in a round robin fashion
>   - when a cgroup or its hierarchical parent (see 6. Hierarchical support)
>     hits hard limit. If the reclaim is unsuccessful, an OOM routine is invoked
> -   to select and kill the bulkiest task in the cgroup. (See 10. OOM Control
> +   to select and kill the bulkiest task in the hierarchy. (See 10. OOM Control
>     below.)
>  
> -Global and hard-limit reclaims share the same code the only difference
> -is the objective of the reclaim. The global reclaim aims at balancing
> -zones' watermarks while the limit reclaim frees some memory to allow new
> -charges.
> -
> -NOTE: Hard limit reclaim does not work for the root cgroup, since we cannot set
> -any limits on the root cgroup.
> -
>  Note2: When panic_on_oom is set to "2", the whole system will panic.
>  
>  When oom event notifier is registered, event will be delivered to the root
> -- 
> 2.0.0.rc0
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-02 15:48                     ` Michal Hocko
@ 2014-05-06 19:58                       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-06 19:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

Andrew, could you queue/fold this one, please?

On Fri 02-05-14 17:48:52, Michal Hocko wrote:
[...]
> From 3101ce41cc8c0c9691d98054e8811c66a77cd079 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Fri, 2 May 2014 17:47:32 +0200
> Subject: [PATCH] mmotm: memcg-mm-introduce-lowlimit-reclaim-fix.patch
> 
> Rename mem_cgroup_reclaim_eligible -> mem_cgroup_within_guarantee
> and follow_low_limit -> honor_memcg_guarantee, as suggested by Johannes.
> ---
>  include/linux/memcontrol.h |  6 +++---
>  mm/memcontrol.c            | 15 ++++++++-------
>  mm/vmscan.c                | 25 ++++++++++++++++---------
>  3 files changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c59056f4bc6..c00ccc5f70b9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -92,7 +92,7 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>  bool task_in_mem_cgroup(struct task_struct *task,
>  			const struct mem_cgroup *memcg);
>  
> -extern bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> +extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
> @@ -291,10 +291,10 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>  	return &zone->lruvec;
>  }
>  
> -static inline bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> +static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root)
>  {
> -	return true;
> +	return false;
>  }
>  
>  static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7a276c0d141e..58982d18f6ea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2810,26 +2810,27 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
>  }
>  
>  /**
> - * mem_cgroup_reclaim_eligible - checks whether given memcg is eligible for the
> - * reclaim
> + * mem_cgroup_within_guarantee - checks whether given memcg is within its
> + * memory guarantee
>   * @memcg: target memcg for the reclaim
>   * @root: root of the reclaim hierarchy (null for the global reclaim)
>   *
> - * The given group is reclaimable if it is above its low limit and the same
> - * applies for all parents up the hierarchy until root (including).
> + * The given group is within its reclaim guarantee if it is below its low limit
> + * or the same applies for any parent up the hierarchy until root (including).
> + * Such a group might be excluded from the reclaim.
>   */
> -bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
> +bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root)
>  {
>  	do {
>  		if (!res_counter_low_limit_excess(&memcg->res))
> -			return false;
> +			return true;
>  		if (memcg == root)
>  			break;
>  
>  	} while ((memcg = parent_mem_cgroup(memcg)));
>  
> -	return true;
> +	return false;
>  }
>  
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0f428158254e..5f923999bb79 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2215,8 +2215,18 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  	}
>  }
>  
> +/**
> + * __shrink_zone - shrinks a given zone
> + *
> + * @zone: zone to shrink
> + * @sc: scan control with additional reclaim parameters
> + * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
> + * guarantee
> + *
> + * Returns the number of reclaimed memcgs.
> + */
>  static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> -		bool follow_low_limit)
> +		bool honor_memcg_guarantee)
>  {
>  	unsigned long nr_reclaimed, nr_scanned;
>  	unsigned nr_scanned_groups = 0;
> @@ -2236,12 +2246,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  		do {
>  			struct lruvec *lruvec;
>  
> -			/*
> -			 * Memcg might be under its low limit so we have to
> -			 * skip it during the first reclaim round
> -			 */
> -			if (follow_low_limit &&
> -					!mem_cgroup_reclaim_eligible(memcg, root)) {
> +			/* Memcg might be protected from the reclaim */
> +			if (honor_memcg_guarantee &&
> +					mem_cgroup_within_guarantee(memcg, root)) {
>  				/*
>  				 * It would be more optimal to skip the memcg
>  				 * subtree now but we do not have a memcg iter
> @@ -2289,8 +2296,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  	if (!__shrink_zone(zone, sc, true)) {
>  		/*
>  		 * First round of reclaim didn't find anything to reclaim
> -		 * because of low limit protection so try again and ignore
> -		 * the low limit this time.
> +		 * because of the memory guarantees for all memcgs in the
> +		 * reclaim target so try again and ignore guarantees this time.
>  		 */
>  		__shrink_zone(zone, sc, false);
>  	}
> -- 
> 2.0.0.rc0
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread
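
The renamed helper in the patch above walks from the given memcg up to the reclaim root and reports protection as soon as any group on the way is not above its low limit. What follows is a minimal userspace sketch of that walk, not kernel code. It assumes that res_counter_low_limit_excess() returns usage minus the low limit, or 0 when usage does not exceed the limit. The names struct group, low_limit_excess() and within_guarantee() are hypothetical stand-ins introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; sizes are in pages. */
struct group {
	const char *name;
	unsigned long usage;
	unsigned long low_limit;
	struct group *parent;	/* NULL for the top of the hierarchy */
};

/*
 * Assumed model of res_counter_low_limit_excess(): how far usage is
 * above the low limit, or 0 if it is at or below the limit.
 */
static unsigned long low_limit_excess(const struct group *g)
{
	return g->usage > g->low_limit ? g->usage - g->low_limit : 0;
}

/*
 * Same walk as in the patch: the group is within its guarantee if it,
 * or any ancestor up to and including root, is not above its low limit.
 * A NULL root models global reclaim and walks to the top of the tree.
 */
static bool within_guarantee(const struct group *g, const struct group *root)
{
	assert(g != NULL);

	do {
		if (!low_limit_excess(g))
			return true;
		if (g == root)
			break;
	} while ((g = g->parent));

	return false;
}
```

In this model a child that is above its own low limit still counts as protected when its parent is within the parent's guarantee, which is the "any parent" wording in the updated kerneldoc comment.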

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-05 14:21               ` Michal Hocko
@ 2014-05-19 16:18                 ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-19 16:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

Andrew, it seems this one got lost as well.

On Mon 05-05-14 16:21:00, Michal Hocko wrote:
> On Fri 02-05-14 18:00:56, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > > > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > index c1cd99a5074b..0f428158254e 100644
> > > > > > > --- a/mm/vmscan.c
> > > > > > > +++ b/mm/vmscan.c
> > > > > [...]
> > > > > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > > > > +{
> > > > > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > > > > +		/*
> > > > > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > > > > +		 * because of low limit protection so try again and ignore
> > > > > > > +		 * the low limit this time.
> > > > > > > +		 */
> > > > > > > +		__shrink_zone(zone, sc, false);
> > > > > > > +	}
> > > > 
> > > > So I don't think this can work as it is, because we are not actually
> > > > changing priority levels yet. 
> > > 
> > > > __shrink_zone returns 0 only if the whole hierarchy is under its low
> > > > limit. This means that the groups are over-committed and it doesn't make
> > > > much sense to play with priority. Low limit reclaimability is independent
> > > > of the priority.
> > > 
> > > > It will give up on the guarantees of bigger groups way before smaller
> > > > groups are even seriously looked at.
> > > 
> > > How would that happen? Those (smaller) groups would get reclaimed and we
> > > wouldn't fallback. Or am I missing your point?
> > 
> > Lol, I hadn't updated my brain to a394cb8ee632 ("memcg,vmscan: do not
> > break out targeted reclaim without reclaimed pages") yet...  Yes, you
> > are right.
> 
> You made me think about this more and you are right ;).
> The code as is doesn't cope with many racing reclaimers: some threads
> can fall back to ignoring the lowlimit although there are still groups
> to scan in the hierarchy, because those groups were visited by other
> reclaimers. The patch below should help with that. What do you think?
> I am also thinking we want to add a fallback counter in memory.stat?
> ---
> From e997b8b4ac724aa29bdeff998d2186ee3c0a97d8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 5 May 2014 15:12:18 +0200
> Subject: [PATCH] vmscan: memcg: check whether the low limit should be ignored
> 
> Low-limit (aka guarantee) is ignored when there is no group scanned
> during the first round of __shrink_zone. This approach doesn't work when
> multiple reclaimers race and reclaim the same hierarchy (e.g. kswapd
> vs. direct reclaim or multiple tasks hitting the hard limit) because
> memcg iterator makes sure that multiple reclaimers are interleaved
> in the hierarchy. This means that some reclaimers can see 0 scanned
> groups although there are groups which are above the low-limit and they
> were reclaimed on behalf of other reclaimers. This leads to a premature
> low-limit break.
> 
> This patch adds mem_cgroup_all_within_guarantee() which will check
> whether all the groups in the reclaimed hierarchy are within their low
> limit and shrink_zone will allow the fallback reclaim only when that is
> true. This alone is still not sufficient however because it would lead
> to another problem. If a reclaimer constantly fails to scan anything
> because it sees only groups within their guarantees while others do the
> reclaim then the reclaim priority would drop down very quickly.
> shrink_zone has to be careful to preserve the scan-at-least-one-group
> semantic, so __shrink_zone has to be retried until at least one group
> is scanned.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/memcontrol.h |  5 +++++
>  mm/memcontrol.c            | 13 +++++++++++++
>  mm/vmscan.c                | 17 ++++++++++++-----
>  3 files changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c00ccc5f70b9..077a777bd9ff 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -94,6 +94,7 @@ bool task_in_mem_cgroup(struct task_struct *task,
>  
>  extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root);
> +extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -296,6 +297,10 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  {
>  	return false;
>  }
> +static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> +{
> +	return false;
> +}
>  
>  static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 58982d18f6ea..4fd4784d1548 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2833,6 +2833,19 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  	return false;
>  }
>  
> +bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> +{
> +	struct mem_cgroup *iter;
> +
> +	for_each_mem_cgroup_tree(iter, root)
> +		if (!mem_cgroup_within_guarantee(iter, root)) {
> +			mem_cgroup_iter_break(root, iter);
> +			return false;
> +		}
> +
> +	return true;
> +}
> +
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
>  	struct mem_cgroup *memcg = NULL;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5f923999bb79..2686e47f04cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2293,13 +2293,20 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	if (!__shrink_zone(zone, sc, true)) {
> +	bool honor_guarantee = true;
> +
> +	while (!__shrink_zone(zone, sc, honor_guarantee)) {
>  		/*
> -		 * First round of reclaim didn't find anything to reclaim
> -		 * because of the memory guarantees for all memcgs in the
> -		 * reclaim target so try again and ignore guarantees this time.
> +		 * The previous round of reclaim didn't find anything to scan
> +		 * because
> +		 * a) the whole reclaimed hierarchy is within guarantee so
> +		 *    we fallback to ignore the guarantee because other option
> +		 *    would be the OOM
> +		 * b) multiple reclaimers are racing and so the first round
> +		 *    should be retried
>  		 */
> -		__shrink_zone(zone, sc, false);
> +		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> +			honor_guarantee = false;
>  	}
>  }
>  
> -- 
> 2.0.0.rc0
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread
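
The retry logic described in the patch above can be illustrated with a small userspace model. This is only a sketch under loud assumptions: the hierarchy is flattened into an array, the interleaving done by the memcg iterator is modelled by a shared position plus a per-pass batch size, and "reclaim" just decrements a usage counter. All names here (struct hierarchy, shrink_pass(), shrink_hierarchy()) are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
	unsigned long usage;		/* pages charged */
	unsigned long low_limit;	/* guarantee in pages */
};

struct hierarchy {
	struct group *groups;
	size_t nr;
	size_t shared_pos;	/* iterator position shared by all reclaimers */
};

static bool within_guarantee(const struct group *g)
{
	return g->usage <= g->low_limit;
}

/* Model of mem_cgroup_all_within_guarantee(): true only if no group is above its limit. */
static bool all_within_guarantee(const struct hierarchy *h)
{
	for (size_t i = 0; i < h->nr; i++)
		if (!within_guarantee(&h->groups[i]))
			return false;
	return true;
}

/*
 * One reclaim pass that only sees `batch` groups starting at the shared
 * position, as if other reclaimers had visited the rest.
 * Returns the number of groups scanned.
 */
static unsigned shrink_pass(struct hierarchy *h, size_t batch, bool honor_guarantee)
{
	unsigned scanned = 0;

	for (size_t i = 0; i < batch; i++) {
		struct group *g = &h->groups[h->shared_pos];

		h->shared_pos = (h->shared_pos + 1) % h->nr;
		if (honor_guarantee && within_guarantee(g))
			continue;	/* protected, skip it */
		if (g->usage > 0)
			g->usage--;	/* "reclaim" one page */
		scanned++;
	}
	return scanned;
}

/*
 * Retry until at least one group is scanned; drop the guarantee only
 * when every group is within it. Returns true if it was dropped.
 */
static bool shrink_hierarchy(struct hierarchy *h, size_t batch)
{
	bool honor_guarantee = true;

	assert(h->nr > 0 && batch > 0);

	while (!shrink_pass(h, batch, honor_guarantee)) {
		if (all_within_guarantee(h))
			honor_guarantee = false;
	}
	return !honor_guarantee;
}
```

With two protected groups followed by one group above its limit and a batch size of 1, the first two passes scan nothing, yet the guarantee is kept because the third group is still reclaimable. The earlier two-pass version would have fallen back after the first empty pass.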

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-04-28 12:26 ` Michal Hocko
@ 2014-05-28 12:10   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-28 12:10 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Tejun Heo,
	Hugh Dickins, Roman Gushchin, LKML, linux-mm, Rik van Riel

Hi Andrew, Johannes,

On Mon 28-04-14 14:26:41, Michal Hocko wrote:
> This patchset introduces a low limit that is functionally similar
> to a minimum guarantee. Memcgs which are under their lowlimit are not
> considered eligible for the reclaim (both global and hardlimit) unless
> all groups under the reclaimed hierarchy are below the low limit, in
> which case all of them are considered eligible.
> 
> The previous version of the patchset, posted as an RFC
> (http://marc.info/?l=linux-mm&m=138677140628677&w=2), suggested a
> hard guarantee without any fallback. Further discussion led me to
> reconsider the default behavior and come up with a more relaxed one.
> The hard requirement can be added later based on a use case which
> really requires it. It would be controlled by a memory.reclaim_flags
> knob which would specify whether to OOM or fall back (default) when
> all groups are below the low limit.

It seems that we are not in full agreement about the default behavior
yet. Johannes seems to favor a hard guarantee, while I would like to
see the weaker approach first and move to the stronger model later.
Johannes, is this an absolute no-go for you? Do you think it seriously
handicaps the semantics of the new knob?

My main motivation for the weaker model is that it is hard to see all
the corner cases right now, and once we hit them I would like to see a
graceful fallback rather than a fatal action like the OOM killer.
Besides that, the use cases I am mostly interested in are OK with a
fallback when the alternative would be the OOM killer. I also feel that
introducing a knob with weaker semantics which can be made stronger
later is a sensible way to go.
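
For illustration only (this is not kernel code and not part of any patch
in this thread), the weaker model can be sketched as a small userspace
simulation. It assumes a flat set of groups with no hierarchy and treats a
group as protected when its usage is below its low limit. struct group,
scan_pass() and reclaim_weak() are made-up names for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a memcg; not the kernel struct. */
struct group {
	unsigned long usage;     /* pages currently charged */
	unsigned long low_limit; /* protected amount, 0 means no protection */
	unsigned long breached;  /* times reclaimed despite the guarantee */
};

/* Assumption: a group is within its guarantee while usage < low_limit. */
static bool within_guarantee(const struct group *g)
{
	return g->usage < g->low_limit;
}

/* One pass over all groups; returns the number of groups scanned. */
static unsigned scan_pass(struct group *groups, size_t n,
			  unsigned long nr_to_reclaim, bool honor_guarantee)
{
	unsigned scanned = 0;

	for (size_t i = 0; i < n; i++) {
		struct group *g = &groups[i];
		bool guarded = within_guarantee(g);
		unsigned long take;

		if (honor_guarantee && guarded)
			continue;	/* protected group, skip it */
		if (guarded)
			g->breached++;	/* guarantee ignored in fallback */

		take = g->usage < nr_to_reclaim ? g->usage : nr_to_reclaim;
		g->usage -= take;
		scanned++;
	}
	return scanned;
}

/* Weak semantics: honor guarantees first, fall back if nothing scanned. */
static unsigned reclaim_weak(struct group *groups, size_t n,
			     unsigned long nr_to_reclaim)
{
	unsigned scanned;

	assert(groups != NULL && n > 0);
	scanned = scan_pass(groups, n, nr_to_reclaim, true);
	if (!scanned)
		scanned = scan_pass(groups, n, nr_to_reclaim, false);
	return scanned;
}
```

With one protected and one unprotected group only the unprotected one is
touched. When every group is protected, the second pass reclaims from all
of them and bumps each group's breached count, which is the event the
counter proposed below is meant to expose.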

It would be helpful to have a counter which would tell us how many times
the lowlimit was breached if we go with the weaker semantics.  I guess we
have touched on that already but I haven't posted any patch yet.  So here
it goes.
---
From 109fbc272b120e70a5d9217abf33a181eb1024f4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 26 May 2014 10:46:10 +0200
Subject: [PATCH] memcg, vmscan: count how many times low limit has been
 breached

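Count every reclaim of a memcg which is within its low limit, i.e. every
time the fallback ignores the guarantee. The counter is displayed in the
memory.stat file as low_limit_breached.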

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 6 +++++-
 include/linux/memcontrol.h       | 5 +++++
 mm/memcontrol.c                  | 7 +++++++
 mm/vmscan.c                      | 8 ++++++--
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7f3a7414bdf2..ad0f31402d84 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -58,6 +58,9 @@ Brief summary of control files.
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.low_limit_in_bytes	 # set/show low limit for memory reclaim
+ memory.low_limit_breached	 # number of times low_limit has been
+				 # ignored and the cgroup reclaimed even
+				 # when it was below the limit
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
@@ -251,7 +254,8 @@ doesn't include groups (and their subgroups - see 6. Hierarchy support)
 which are below the low limit if there is other eligible cgroup in the
 reclaimed hierarchy. If all groups which participate reclaim are under
 their low limits then all of them are reclaimed and the low limit is
-ignored.
+ignored. The low_limit_breached counter in the memory.stat file can be
+checked to see how many times such an event occurred.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 077a777bd9ff..5e2ca2163b12 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,6 +94,8 @@ bool task_in_mem_cgroup(struct task_struct *task,
 
 extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
+
+extern void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg);
 extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
@@ -297,6 +299,9 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 {
 	return false;
 }
+static inline void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg)
+{
+}
 static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4fd4784d1548..4af05d5f59bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -102,6 +102,7 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
 	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
+	MEM_CGROUP_EVENTS_LOW_LIMIT_FALLBACK, /* # of times low limit was breached */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 
@@ -110,6 +111,7 @@ static const char * const mem_cgroup_events_names[] = {
 	"pgpgout",
 	"pgfault",
 	"pgmajfault",
+	"low_limit_breached",
 };
 
 static const char * const mem_cgroup_lru_names[] = {
@@ -2833,6 +2835,11 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 	return false;
 }
 
+void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg)
+{
+	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_LOW_LIMIT_FALLBACK]);
+}
+
 bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
 {
 	struct mem_cgroup *iter;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2686e47f04cc..8041b0667673 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2245,10 +2245,11 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		memcg = mem_cgroup_iter(root, NULL, &reclaim);
 		do {
 			struct lruvec *lruvec;
+			bool within_guarantee;
 
 			/* Memcg might be protected from the reclaim */
-			if (honor_memcg_guarantee &&
-					mem_cgroup_within_guarantee(memcg, root)) {
+			within_guarantee = mem_cgroup_within_guarantee(memcg, root);
+			if (honor_memcg_guarantee && within_guarantee) {
 				/*
 				 * It would be more optimal to skip the memcg
 				 * subtree now but we do not have a memcg iter
@@ -2258,6 +2259,9 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 				continue;
 			}
 
+			if (within_guarantee)
+				mem_cgroup_guarantee_breached(memcg);
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			nr_scanned_groups++;
 
-- 
2.0.0.rc4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 12:10   ` Michal Hocko
@ 2014-05-28 13:49     ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-28 13:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> Hi Andrew, Johannes,
> 
> On Mon 28-04-14 14:26:41, Michal Hocko wrote:
> > This patchset introduces a low limit that is functionally similar
> > to a minimum guarantee. Memcgs which are under their lowlimit are not
> > considered eligible for the reclaim (both global and hardlimit) unless
> > all groups under the reclaimed hierarchy are below the low limit, in
> > which case all of them are considered eligible.
> > 
> > The previous version of the patchset, posted as an RFC
> > (http://marc.info/?l=linux-mm&m=138677140628677&w=2), suggested a
> > hard guarantee without any fallback. Further discussion led me to
> > reconsider the default behavior and come up with a more relaxed one.
> > The hard requirement can be added later based on a use case which
> > really requires it. It would be controlled by a memory.reclaim_flags
> > knob which would specify whether to OOM or fall back (default) when
> > all groups are below the low limit.
> 
> It seems that we are not in full agreement about the default behavior
> yet. Johannes seems to favor a hard guarantee, while I would like to
> see the weaker approach first and move to the stronger model later.
> Johannes, is this an absolute no-go for you? Do you think it seriously
> handicaps the semantics of the new knob?

Well, we certainly can't start OOMing where we previously didn't;
that's called a regression and automatically limits our options.

Any unexpected OOMs will be much more acceptable from a new feature
than from a configuration that previously "worked" and then stopped.

> My main motivation for the weaker model is that it is hard to see all
> the corner cases right now, and once we hit them I would like to see a
> graceful fallback rather than a fatal action like the OOM killer.
> Besides that, the use cases I am mostly interested in are OK with a
> fallback when the alternative would be the OOM killer. I also feel that
> introducing a knob with weaker semantics which can be made stronger
> later is a sensible way to go.

We can't make it stronger, but we can make it weaker.  Stronger is the
simpler definition, it's simpler code, your usecases are fine with it,
Greg and I prefer it too.  I don't even know what we are arguing about
here.
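
As a rough illustration of the stronger definition (a userspace sketch
only, not the patch below), protected groups are never scanned and a
reclaim attempt that frees nothing is reported as OOM instead of falling
back. The sketch assumes a flat set of groups and the usage < low_limit
test for "within guarantee". struct group, enum reclaim_result and
reclaim_hard() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a memcg; not the kernel struct. */
struct group {
	unsigned long usage;     /* pages currently charged */
	unsigned long low_limit; /* guaranteed amount, 0 means none */
};

enum reclaim_result {
	RECLAIM_PROGRESS,	/* at least one page was reclaimed */
	RECLAIM_OOM,		/* nothing reclaimable without a breach */
};

/* Hard semantics: guaranteed memory is treated as unreclaimable. */
static enum reclaim_result reclaim_hard(struct group *groups, size_t n,
					unsigned long nr_to_reclaim)
{
	unsigned long reclaimed = 0;

	assert(groups != NULL);
	for (size_t i = 0; i < n; i++) {
		struct group *g = &groups[i];
		unsigned long take;

		/* Assumption: protected while usage is below the low limit */
		if (g->usage < g->low_limit)
			continue;

		take = g->usage < nr_to_reclaim ? g->usage : nr_to_reclaim;
		g->usage -= take;
		reclaimed += take;
	}
	return reclaimed ? RECLAIM_PROGRESS : RECLAIM_OOM;
}
```

There is no second pass and no extra state to carry between passes, which
is where the "simpler code" argument comes from. The cost is that a setup
in which every group sits below its low limit ends in RECLAIM_OOM.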

Patch applies on top of mmots.

---
From ced6ac70bb274cdaa4c5d78b53420d84fb803dd7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 28 May 2014 09:37:05 -0400
Subject: [patch] mm: vmscan: treat memcg low limit as hard guarantee

Don't hide low limit configuration problems behind weak semantics and
quietly breach the set-up guarantees.

Make it simple: memcg guarantees are equivalent to mlocked memory,
anonymous memory without swap, kernel memory, pinned memory etc. -
unreclaimable.  If no memory can be reclaimed without otherwise
breaching guarantees, it's a real problem, so let the machine OOM and
dump the memory state in that situation.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  5 -----
 mm/memcontrol.c            | 15 ---------------
 mm/vmscan.c                | 41 +++++------------------------------------
 3 files changed, 5 insertions(+), 56 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a5cf853129ec..c3a53cbb88eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,7 +94,6 @@ bool task_in_mem_cgroup(struct task_struct *task,
 
 extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root);
-extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -297,10 +296,6 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 {
 	return false;
 }
-static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
-{
-	return false;
-}
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4df733e13727..85fdef53fcf1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2788,7 +2788,6 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
  *
  * The given group is within its reclaim gurantee if it is below its low limit
  * or the same applies for any parent up the hierarchy until root (including).
- * Such a group might be excluded from the reclaim.
  */
 bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 		struct mem_cgroup *root)
@@ -2801,25 +2800,11 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 			return true;
 		if (memcg == root)
 			break;
-
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return false;
 }
 
-bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
-{
-	struct mem_cgroup *iter;
-
-	for_each_mem_cgroup_tree(iter, root)
-		if (!mem_cgroup_within_guarantee(iter, root)) {
-			mem_cgroup_iter_break(root, iter);
-			return false;
-		}
-
-	return true;
-}
-
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8ffe4e616fe..c72493e8fb53 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2244,20 +2244,14 @@ static inline bool should_continue_reclaim(struct zone *zone,
 }
 
 /**
- * __shrink_zone - shrinks a given zone
+ * shrink_zone - shrinks a given zone
  *
  * @zone: zone to shrink
  * @sc: scan control with additional reclaim parameters
- * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
- * guarantee
- *
- * Returns the number of reclaimed memcgs.
  */
-static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool honor_memcg_guarantee)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long nr_reclaimed, nr_scanned;
-	unsigned nr_scanned_groups = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2274,20 +2268,16 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			struct lruvec *lruvec;
 
-			/* Memcg might be protected from the reclaim */
-			if (honor_memcg_guarantee &&
-					mem_cgroup_within_guarantee(memcg, root)) {
+			/* Don't reclaim guaranteed memory */
+			if (mem_cgroup_within_guarantee(memcg, root)) {
 				/*
-				 * It would be more optimal to skip the memcg
-				 * subtree now but we do not have a memcg iter
-				 * helper for that. Anyone?
+				 * XXX: skip the entire subtree here
 				 */
 				memcg = mem_cgroup_iter(root, memcg, &reclaim);
 				continue;
 			}
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-			nr_scanned_groups++;
 
 			sc->swappiness = mem_cgroup_swappiness(memcg);
 			shrink_lruvec(lruvec, sc);
@@ -2316,27 +2306,6 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
-
-	return nr_scanned_groups;
-}
-
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
-{
-	bool honor_guarantee = true;
-
-	while (!__shrink_zone(zone, sc, honor_guarantee)) {
-		/*
-		 * The previous round of reclaim didn't find anything to scan
-		 * because
-		 * a) the whole reclaimed hierarchy is within guarantee so
-		 *    we fallback to ignore the guarantee because other option
-		 *    would be the OOM
-		 * b) multiple reclaimers are racing and so the first round
-		 *    should be retried
-		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
-			honor_guarantee = false;
-	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 13:49     ` Johannes Weiner
@ 2014-05-28 14:21       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-28 14:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> > Hi Andrew, Johannes,
> > 
> > On Mon 28-04-14 14:26:41, Michal Hocko wrote:
> > > This patchset introduces a low limit that is functionally similar
> > > to a minimum guarantee. Memcgs which are under their lowlimit are not
> > > considered eligible for the reclaim (both global and hardlimit) unless
> > > all groups under the reclaimed hierarchy are below the low limit, in
> > > which case all of them are considered eligible.
> > > 
> > > The previous version of the patchset, posted as an RFC
> > > (http://marc.info/?l=linux-mm&m=138677140628677&w=2), suggested a
> > > hard guarantee without any fallback. Further discussion led me to
> > > reconsider the default behavior and come up with a more relaxed one.
> > > The hard requirement can be added later based on a use case which
> > > really requires it. It would be controlled by a memory.reclaim_flags
> > > knob which would specify whether to OOM or fall back (default) when
> > > all groups are below the low limit.
> > 
> > It seems that we are not in full agreement about the default behavior
> > yet. Johannes seems to favor a hard guarantee, while I would like to
> > see the weaker approach first and move to the stronger model later.
> > Johannes, is this an absolute no-go for you? Do you think it seriously
> > handicaps the semantics of the new knob?
> 
> Well, we certainly can't start OOMing where we previously didn't;
> that's called a regression and automatically limits our options.
> 
> Any unexpected OOMs will be much more acceptable from a new feature
> than from a configuration that previously "worked" and then stopped.

Yes, and we are not talking about regressions, are we?

> > My main motivation for the weaker model is that it is hard to see all
> > the corner cases right now, and once we hit them I would like to see a
> > graceful fallback rather than a fatal action like the OOM killer.
> > Besides that, the use cases I am mostly interested in are OK with a
> > fallback when the alternative would be the OOM killer. I also feel that
> > introducing a knob with weaker semantics which can be made stronger
> > later is a sensible way to go.
> 
> We can't make it stronger, but we can make it weaker. 

Why can't we make it stronger via a knob/configuration option?

> Stronger is the simpler definition, it's simpler code,

The code is not really that much simpler. The one you have posted will
not work, I am afraid. I haven't tested it yet, but I remember I had to
make some tweaks to the reclaim path to avoid ending up in an endless
loop in direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2
and http://marc.info/?l=linux-mm&m=138677141328682&w=2).

> your usecases are fine with it,

My usecases do not overcommit low_limit relative to the available
memory, so far so good. But once we hit a corner case where the limits
are set properly and we still end up not being able to reclaim anybody
in a zone, OOM sounds too brutal.
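
One way such a corner case could look, as a userspace sketch under
assumptions that are not taken from the patches: groups are flat, each
group's pages are spread over a fixed number of zones, and a group is
protected when its total usage is below its low limit. NR_ZONES,
struct group and zone_reclaimable() are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 2	/* arbitrary zone count for the sketch */

/* Hypothetical userspace stand-in for a memcg; not the kernel struct. */
struct group {
	unsigned long low_limit;       /* guaranteed amount in pages */
	unsigned long pages[NR_ZONES]; /* charged pages per zone */
};

static unsigned long total_usage(const struct group *g)
{
	unsigned long sum = 0;

	for (int z = 0; z < NR_ZONES; z++)
		sum += g->pages[z];
	return sum;
}

/* Assumption: protected while total usage is below the low limit. */
static bool within_guarantee(const struct group *g)
{
	return total_usage(g) < g->low_limit;
}

/* Pages in @zone owned by groups that are outside their guarantee. */
static unsigned long zone_reclaimable(const struct group *groups, size_t n,
				      int zone)
{
	unsigned long sum = 0;

	assert(zone >= 0 && zone < NR_ZONES);
	for (size_t i = 0; i < n; i++)
		if (!within_guarantee(&groups[i]))
			sum += groups[i].pages[zone];
	return sum;
}
```

Take a group with low_limit 100 and 80 pages, all in zone 0, next to an
unprotected group with 500 pages, all in zone 1. The low limits add up to
far less than the memory in use, yet zone_reclaimable() is 0 for zone 0.
A reclaimer restricted to that zone has nothing to scan under the hard
semantics even though the configuration is not overcommitted.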

> Greg and I prefer it too.  I don't even know what we are arguing about
> here.
> 
> Patch applies on top of mmots.
> 
> ---
> From ced6ac70bb274cdaa4c5d78b53420d84fb803dd7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 28 May 2014 09:37:05 -0400
> Subject: [patch] mm: vmscan: treat memcg low limit as hard guarantee
> 
> Don't hide low limit configuration problems behind weak semantics and
> quietly breach the set-up guarantees.
> 
> Make it simple: memcg guarantees are equivalent to mlocked memory,
> anonymous memory without swap, kernel memory, pinned memory etc. -
> unreclaimable.  If no memory can be reclaimed without otherwise
> breaching guarantees, it's a real problem, so let the machine OOM and
> dump the memory state in that situation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  5 -----
>  mm/memcontrol.c            | 15 ---------------
>  mm/vmscan.c                | 41 +++++------------------------------------
>  3 files changed, 5 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a5cf853129ec..c3a53cbb88eb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -94,7 +94,6 @@ bool task_in_mem_cgroup(struct task_struct *task,
>  
>  extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root);
> -extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -297,10 +296,6 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  {
>  	return false;
>  }
> -static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> -{
> -	return false;
> -}
>  
>  static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4df733e13727..85fdef53fcf1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2788,7 +2788,6 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
>   *
>   * The given group is within its reclaim gurantee if it is below its low limit
>   * or the same applies for any parent up the hierarchy until root (including).
> - * Such a group might be excluded from the reclaim.
>   */
>  bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root)
> @@ -2801,25 +2800,11 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  			return true;
>  		if (memcg == root)
>  			break;
> -
>  	} while ((memcg = parent_mem_cgroup(memcg)));
>  
>  	return false;
>  }
>  
> -bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> -{
> -	struct mem_cgroup *iter;
> -
> -	for_each_mem_cgroup_tree(iter, root)
> -		if (!mem_cgroup_within_guarantee(iter, root)) {
> -			mem_cgroup_iter_break(root, iter);
> -			return false;
> -		}
> -
> -	return true;
> -}
> -
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
>  	struct mem_cgroup *memcg = NULL;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a8ffe4e616fe..c72493e8fb53 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2244,20 +2244,14 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  }
>  
>  /**
> - * __shrink_zone - shrinks a given zone
> + * shrink_zone - shrinks a given zone
>   *
>   * @zone: zone to shrink
>   * @sc: scan control with additional reclaim parameters
> - * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
> - * guarantee
> - *
> - * Returns the number of reclaimed memcgs.
>   */
> -static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> -		bool honor_memcg_guarantee)
> +static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
>  	unsigned long nr_reclaimed, nr_scanned;
> -	unsigned nr_scanned_groups = 0;
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2274,20 +2268,16 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  		do {
>  			struct lruvec *lruvec;
>  
> -			/* Memcg might be protected from the reclaim */
> -			if (honor_memcg_guarantee &&
> -					mem_cgroup_within_guarantee(memcg, root)) {
> +			/* Don't reclaim guaranteed memory */
> +			if (mem_cgroup_within_guarantee(memcg, root)) {
>  				/*
> -				 * It would be more optimal to skip the memcg
> -				 * subtree now but we do not have a memcg iter
> -				 * helper for that. Anyone?
> +				 * XXX: skip the entire subtree here
>  				 */
>  				memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  				continue;
>  			}
>  
>  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> -			nr_scanned_groups++;
>  
>  			sc->swappiness = mem_cgroup_swappiness(memcg);
>  			shrink_lruvec(lruvec, sc);
> @@ -2316,27 +2306,6 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>  	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
> -
> -	return nr_scanned_groups;
> -}
> -
> -static void shrink_zone(struct zone *zone, struct scan_control *sc)
> -{
> -	bool honor_guarantee = true;
> -
> -	while (!__shrink_zone(zone, sc, honor_guarantee)) {
> -		/*
> -		 * The previous round of reclaim didn't find anything to scan
> -		 * because
> -		 * a) the whole reclaimed hierarchy is within guarantee so
> -		 *    we fallback to ignore the guarantee because other option
> -		 *    would be the OOM
> -		 * b) multiple reclaimers are racing and so the first round
> -		 *    should be retried
> -		 */
> -		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> -			honor_guarantee = false;
> -	}
>  }
>  
>  /* Returns true if compaction should go ahead for a high-order request */
> -- 
> 1.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 14:21       ` Michal Hocko
@ 2014-05-28 15:28         ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-28 15:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
> On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> > > Hi Andrew, Johannes,
> > > 
> > > On Mon 28-04-14 14:26:41, Michal Hocko wrote:
> > > > This patchset introduces such low limit that is functionally similar
> > > > to a minimum guarantee. Memcgs which are under their lowlimit are not
> > > > considered eligible for the reclaim (both global and hardlimit) unless
> > > > all groups under the reclaimed hierarchy are below the low limit when
> > > > all of them are considered eligible.
> > > > 
> > > > The previous version of the patchset posted as a RFC
> > > > (http://marc.info/?l=linux-mm&m=138677140628677&w=2) suggested a
> > > > hard guarantee without any fallback. More discussions led me to
> > > > > reconsider the default behavior and come up with a more relaxed one. The
> > > > > hard requirement can be added later based on a use case which really
> > > > > requires it. It would be controlled by a memory.reclaim_flags knob which
> > > > > would specify whether to OOM or fall back (default) when all groups are
> > > > > below the low limit.
> > > 
> > > It seems that we are not in a full agreement about the default behavior
> > > yet. Johannes seems to be more for hard guarantee while I would like to
> > > see the weaker approach first and move to the stronger model later.
> > > Johannes, is this absolutely no-go for you? Do you think it is seriously
> > > handicapping the semantic of the new knob?
> > 
> > Well we certainly can't start OOMing where we previously didn't,
> > that's called a regression and automatically limits our options.
> > 
> > Any unexpected OOMs will be much more acceptable from a new feature
> > than from configuration that previously "worked" and then stopped.
> 
> Yes and we are not talking about regressions, are we?
> 
> > > My main motivation for the weaker model is that it is hard to see all
> > > > the corner cases right now and once we hit them I would like to see a
> > > > graceful fallback rather than a fatal action like the OOM killer. Besides that
> > > > the use cases I am mostly interested in are OK with fallback when the
> > > alternative would be OOM killer. I also feel that introducing a knob
> > > with a weaker semantic which can be made stronger later is a sensible
> > > way to go.
> > 
> > We can't make it stronger, but we can make it weaker. 
> 
> Why cannot we make it stronger by a knob/configuration option?

Why can't we make it weaker by a knob?  Why should we design the
default for unforeseeable cornercases rather than make the default
make sense for existing cases and give cornercases a fallback once
they show up?

> > Stronger is the simpler definition, it's simpler code,
> 
> The code is not really that much simpler. The one you have posted will
> not work I am afraid. I haven't tested it yet but I remember I had to do
> some tweaks to the reclaim path to not end up in an endless loop in the
> direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> http://marc.info/?l=linux-mm&m=138677141328682&w=2).

That's just a result of do_try_to_free_pages being stupid and using
its own zonelist loop to check reclaimability by duplicating all the
checks instead of properly using the returned state of shrink_zones().
Something that would be worth fixing regardless of memcg guarantees.

Or maybe we could add the guaranteed lru pages to sc->nr_scanned.
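
A minimal userspace sketch of that second idea, under stated assumptions:
the structures, the acct_* names and the give-up threshold are all
hypothetical and only model the accounting, not the real reclaim path.
It assumes a group is protected while usage <= low_limit. If protected
groups are skipped silently, nr_scanned never grows and the caller has
no signal to stop retrying; counting their LRU pages as scanned lets an
ordinary "scanned too much for nothing" check trip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary illustration value, not taken from the thread. */
#define ACCT_GIVE_UP_FACTOR 6

/* Hypothetical stand-in for a memcg; not the kernel structure. */
struct acct_group {
    unsigned long usage;      /* pages currently charged */
    unsigned long low_limit;  /* protected amount in pages */
    unsigned long lru_pages;  /* pages this group has on the LRU lists */
};

/* Hypothetical stand-in for the two scan_control counters used here. */
struct acct_scan {
    unsigned long nr_scanned;
    unsigned long nr_reclaimed;
};

/* Assumption: a group is within its guarantee while usage <= low_limit. */
static bool acct_within_guarantee(const struct acct_group *g)
{
    return g->usage <= g->low_limit;
}

/* One reclaim round over all groups. */
static void acct_shrink(struct acct_group *groups, size_t n,
                        struct acct_scan *sc, bool count_guaranteed)
{
    assert(sc != NULL);
    for (size_t i = 0; i < n; i++) {
        struct acct_group *g = &groups[i];
        unsigned long take;

        if (acct_within_guarantee(g)) {
            /* Skipped, but optionally still accounted as scanned. */
            if (count_guaranteed)
                sc->nr_scanned += g->lru_pages;
            continue;
        }
        /* Toy model: every LRU page of an unprotected group is freed. */
        take = g->lru_pages < g->usage ? g->lru_pages : g->usage;
        sc->nr_scanned += g->lru_pages;
        sc->nr_reclaimed += take;
        g->usage -= take;
        g->lru_pages -= take;
    }
}

/* The caller stops retrying once it has scanned "enough" pages. */
static bool acct_should_give_up(const struct acct_scan *sc,
                                unsigned long reclaimable_pages)
{
    return sc->nr_scanned >= reclaimable_pages * ACCT_GIVE_UP_FACTOR;
}
```

With every group protected, the silent variant leaves nr_scanned at zero
no matter how many rounds run, while the counting variant reaches the
threshold after a bounded number of rounds.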

> > your usecases are fine with it,
> 
> my usecases do not overcommit low_limit on the available memory, so far
> so good, but once we hit a corner case where limits are set properly but
> we end up not being able to reclaim anybody in a zone then OOM sounds
> too brutal.

What cornercases?

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 15:28         ` Johannes Weiner
@ 2014-05-28 15:54           ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-05-28 15:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed 28-05-14 11:28:54, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
> > On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> > > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
[...]
> > > > My main motivation for the weaker model is that it is hard to see all
> > > > > the corner cases right now and once we hit them I would like to see a
> > > > > graceful fallback rather than a fatal action like the OOM killer. Besides that
> > > > > the use cases I am mostly interested in are OK with fallback when the
> > > > alternative would be OOM killer. I also feel that introducing a knob
> > > > with a weaker semantic which can be made stronger later is a sensible
> > > > way to go.
> > > 
> > > We can't make it stronger, but we can make it weaker. 
> > 
> > Why cannot we make it stronger by a knob/configuration option?
> 
> Why can't we make it weaker by a knob?

I haven't said we couldn't.

> Why should we design the default for unforeseeable cornercases
> rather than make the default make sense for existing cases and give
> cornercases a fallback once they show up?

Sure we can do that but it would be a little bit lame IMO. We are
promising something and once we find out it doesn't work we will make
it weaker to work around that.

Besides that the default should reflect the usecases, no? Do we have any
use case for the hard guarantee?

> > > Stronger is the simpler definition, it's simpler code,
> > 
> > The code is not really that much simpler. The one you have posted will
> > not work I am afraid. I haven't tested it yet but I remember I had to do
> > some tweaks to the reclaim path to not end up in an endless loop in the
> > direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> > http://marc.info/?l=linux-mm&m=138677141328682&w=2).
> 
> That's just a result of do_try_to_free_pages being stupid and using
> its own zonelist loop to check reclaimability by duplicating all the
> checks instead of properly using returned state of shrink_zones().
> Something that would be worth fixing regardless of memcg guarantees.
> 
> Or maybe we could add the guaranteed lru pages to sc->nr_scanned.

Fixes might be different than what I was proposing previously. I was
merely pointing out that removing the retry loop is not sufficient.

> > > your usecases are fine with it,
> > 
> > my usecases do not overcommit low_limit on the available memory, so far
> > > so good, but once we hit a corner case where limits are set properly but
> > we end up not being able to reclaim anybody in a zone then OOM sounds
> > too brutal.
> 
> What cornercases?

I have mentioned a case where NUMA placement and specific node bindings
interfering with other allocators can end up in unreclaimable zones.
While you might disagree about the setup I have seen different things
done out there.

Besides that the reclaim logic is complex enough and history taught me
that little buggers are hidden at places where you do not expect them.

So call me a chicken but I would sleep calmer if we start weaker and add
additional guarantees later when somebody really insists on seeing
an OOM rather than getting reclaimed.
The proposed counter can tell us more about how good we are at not touching
groups with the limit and we can eventually debug those corner cases
without affecting the loads too much.
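
A minimal userspace sketch of the weaker model argued for here, under
stated assumptions: the toy_* names and the low_limit_breached counter
are hypothetical, and a group is assumed to be protected while
usage <= low_limit. The first pass honors the guarantee; only if it
finds no group to scan does a second pass ignore the guarantee, and the
counter records that this happened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a memcg; not the kernel structure. */
struct toy_group {
    unsigned long usage;      /* pages currently charged */
    unsigned long low_limit;  /* protected amount in pages, 0 = none */
};

/* Hypothetical event counter: how often the guarantee was ignored. */
static unsigned long low_limit_breached;

/* Assumption: a group is within its guarantee while usage <= low_limit. */
static bool toy_within_guarantee(const struct toy_group *g)
{
    return g->usage <= g->low_limit;
}

/*
 * One pass over all groups: take up to 'want' pages from each eligible
 * group. Returns the number of groups that were actually scanned.
 */
static unsigned toy_shrink_pass(struct toy_group *groups, size_t n,
                                unsigned long want, bool honor_guarantee,
                                unsigned long *reclaimed)
{
    unsigned scanned_groups = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_group *g = &groups[i];
        unsigned long take;

        if (honor_guarantee && toy_within_guarantee(g))
            continue;
        take = g->usage < want ? g->usage : want;
        g->usage -= take;
        *reclaimed += take;
        scanned_groups++;
    }
    return scanned_groups;
}

/* Weak semantics: fall back to ignoring the guarantee instead of OOM. */
static unsigned long toy_shrink_soft(struct toy_group *groups, size_t n,
                                     unsigned long want)
{
    unsigned long reclaimed = 0;

    assert(groups != NULL && n > 0);
    if (!toy_shrink_pass(groups, n, want, true, &reclaimed)) {
        low_limit_breached++;
        toy_shrink_pass(groups, n, want, false, &reclaimed);
    }
    return reclaimed;
}
```

A counter that stays at zero under load would suggest the limits are
being honored; a growing one would point at the corner cases to debug.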
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 15:28         ` Johannes Weiner
@ 2014-05-28 16:17           ` Greg Thelen
  -1 siblings, 0 replies; 196+ messages in thread
From: Greg Thelen @ 2014-05-28 16:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel


On Wed, May 28 2014, Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
>> On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
>> > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
>> > > Hi Andrew, Johannes,
>> > > 
>> > > On Mon 28-04-14 14:26:41, Michal Hocko wrote:
>> > > > This patchset introduces such low limit that is functionally similar
>> > > > to a minimum guarantee. Memcgs which are under their lowlimit are not
>> > > > considered eligible for the reclaim (both global and hardlimit) unless
>> > > > all groups under the reclaimed hierarchy are below the low limit when
>> > > > all of them are considered eligible.
>> > > > 
>> > > > The previous version of the patchset posted as a RFC
>> > > > (http://marc.info/?l=linux-mm&m=138677140628677&w=2) suggested a
>> > > > hard guarantee without any fallback. More discussions led me to
>> > > > reconsider the default behavior and come up with a more relaxed one. The
>> > > > hard requirement can be added later based on a use case which really
>> > > > requires it. It would be controlled by a memory.reclaim_flags knob which
>> > > > would specify whether to OOM or fall back (default) when all groups are
>> > > > below the low limit.
>> > > 
>> > > It seems that we are not in a full agreement about the default behavior
>> > > yet. Johannes seems to be more for hard guarantee while I would like to
>> > > see the weaker approach first and move to the stronger model later.
>> > > Johannes, is this absolutely no-go for you? Do you think it is seriously
>> > > handicapping the semantic of the new knob?
>> > 
>> > Well we certainly can't start OOMing where we previously didn't,
>> > that's called a regression and automatically limits our options.
>> > 
>> > Any unexpected OOMs will be much more acceptable from a new feature
>> > than from configuration that previously "worked" and then stopped.
>> 
>> Yes and we are not talking about regressions, are we?
>> 
>> > > My main motivation for the weaker model is that it is hard to see all
>> > > the corner cases right now and once we hit them I would like to see a
>> > > graceful fallback rather than fatal action like OOM killer. Besides that
>> > > the use cases I am mostly interested in are OK with fallback when the
>> > > alternative would be OOM killer. I also feel that introducing a knob
>> > > with a weaker semantic which can be made stronger later is a sensible
>> > > way to go.
>> > 
>> > We can't make it stronger, but we can make it weaker. 
>> 
>> Why cannot we make it stronger by a knob/configuration option?
>
> Why can't we make it weaker by a knob?  Why should we design the
> default for unforeseeable cornercases rather than make the default
> make sense for existing cases and give cornercases a fallback once
> they show up?

My 2c...  The following works for my use cases:
1) introduce memory.low_limit_in_bytes (default=0 thus no default change
   from older kernels)
2) interested users will set low_limit_in_bytes to non-zero value.
   Memory protected by low limit should be as migratable/reclaimable as
   mlock memory.  If a zone full of mlock memory causes oom kills, then
   so should the low limit.

If we find corner cases where low_limit_in_bytes is too strict, then we
could discuss a new knob to relax it.  But I think we should start with
a strict low-limit.  If the oom killer gets tied in knots due to low
limit, then I'd like to explore fixing the oom killer before relaxing
low limit.

Disclaimer: new use cases will certainly appear with various
requirements.  But an oom-killing low_limit_in_bytes seems like a
generic opt-in feature, so I think it's worthwhile.
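
To make the two positions in this thread concrete, here is a minimal C
sketch of the decision being argued over. It is an illustration only, not
code from the patchset: struct group, group_protected() and pick_outcome()
are hypothetical names, and the comparison usage < low_limit is an
assumption about what "under the low limit" means. With strict set, a
hierarchy in which every group is protected ends in OOM; without it,
reclaim falls back to treating all groups as eligible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memcg; NOT the kernel's struct mem_cgroup. */
struct group {
	unsigned long usage;     /* bytes currently charged */
	unsigned long low_limit; /* bytes protected; 0 means no protection */
};

enum reclaim_outcome { RECLAIM_OK, RECLAIM_FALLBACK, RECLAIM_OOM };

/* Assumption: a group is protected while it is strictly under its limit,
 * so the default low_limit of 0 never protects anything. */
static bool group_protected(const struct group *g)
{
	return g->usage < g->low_limit;
}

/*
 * Decide what a reclaim pass over the hierarchy can do.
 * strict == true  models the hard guarantee (OOM when nothing is eligible).
 * strict == false models the fallback (ignore the low limit and retry).
 */
static enum reclaim_outcome pick_outcome(const struct group *groups,
					 size_t n, bool strict)
{
	assert(groups != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (!group_protected(&groups[i]))
			return RECLAIM_OK; /* at least one eligible group */

	return strict ? RECLAIM_OOM : RECLAIM_FALLBACK;
}
```

Both models agree whenever at least one group is above its low limit; they
differ only in the case where every group is protected.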

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 15:54           ` Michal Hocko
@ 2014-05-28 16:33             ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-05-28 16:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed, May 28, 2014 at 05:54:14PM +0200, Michal Hocko wrote:
> On Wed 28-05-14 11:28:54, Johannes Weiner wrote:
> > On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
> > > On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> > > > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> [...]
> > > > > My main motivation for the weaker model is that it is hard to see all
> > > > > > the corner cases right now and once we hit them I would like to see a
> > > > > > graceful fallback rather than fatal action like OOM killer. Besides that
> > > > > > the use cases I am mostly interested in are OK with fallback when the
> > > > > alternative would be OOM killer. I also feel that introducing a knob
> > > > > with a weaker semantic which can be made stronger later is a sensible
> > > > > way to go.
> > > > 
> > > > We can't make it stronger, but we can make it weaker. 
> > > 
> > > Why cannot we make it stronger by a knob/configuration option?
> > 
> > Why can't we make it weaker by a knob?
> 
> I haven't said we couldn't.
> 
> > Why should we design the default for unforeseeable cornercases
> > rather than make the default make sense for existing cases and give
> > cornercases a fallback once they show up?
> 
> Sure we can do that but it would be a little bit lame IMO. We are
> promising something and once we find out it doesn't work we will make
> it weaker to work around that.
>
> Besides that the default should reflect the usecases, no? Do we have any
> use case for the hard guarantee?

You're adding an extra layer of complexity so the burden of proof is
on you.  Do we have any usecases that require a graceful fallback?

> > > > Stronger is the simpler definition, it's simpler code,
> > > 
> > > The code is not really that much simpler. The one you have posted will
> > > not work I am afraid. I haven't tested it yet but I remember I had to do
> > > some tweaks to the reclaim path to not end up in an endless loop in the
> > > direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> > > http://marc.info/?l=linux-mm&m=138677141328682&w=2).
> > 
> > That's just a result of do_try_to_free_pages being stupid and using
> > its own zonelist loop to check reclaimability by duplicating all the
> > checks instead of properly using returned state of shrink_zones().
> > Something that would be worth fixing regardless of memcg guarantees.
> > 
> > Or maybe we could add the guaranteed lru pages to sc->nr_scanned.
> 
> Fixes might be different than what I was proposing previously. I was
> merely pointing out that removing the retry loop is not sufficient.

No, you were claiming that the hard limit implementation is not
simpler.  It is.

> > > > your usecases are fine with it,
> > > 
> > > my usecases do not overcommit low_limit on the available memory, so far
> > > so good, but once we hit a corner cases when limits are set properly but
> > > we end up not being able to reclaim anybody in a zone then OOM sounds
> > > too brutal.
> > 
> > What cornercases?
> 
> I have mentioned a case where NUMA placement and specific node bindings
> interfering with other allocators can end up in unreclaimable zones.
> While you might disagree about the setup I have seen different things
> done out there.

If you have real usecases that might depend on weak guarantees, please
make a rational argument for them and don't just handwave.  I know
that there is every conceivable configuration out there, but it's
unreasonable to design new features around the requirement of setups
that are questionable to begin with.

> Besides that the reclaim logic is complex enough and history taught me
> that little buggers are hidden at places where you do not expect them.

So we introduce user interfaces designed around the fact that we don't
trust our own code anymore?

There is being prudent and then there is cargo cult programming.

> So call me a chicken but I would sleep calmer if we start weaker and add
> additional guarantees later when somebody really insists on seeing
> an OOM rather than get reclaimed.
> The proposed counter can tell us more how good we are at not touching
> groups with the limit and we can eventually debug those corner cases
> without affecting the loads too much.

More realistically, potential bugs are never reported with a silent
counter, which further widens the gap between our assumptions on how
the VM behaves and what happens in production.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 16:33             ` Johannes Weiner
@ 2014-06-03 11:07               ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-03 11:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed 28-05-14 12:33:35, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 05:54:14PM +0200, Michal Hocko wrote:
> > On Wed 28-05-14 11:28:54, Johannes Weiner wrote:
> > > On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
> > > > On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> > > > > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> > [...]
> > > > > > My main motivation for the weaker model is that it is hard to see all
> > > > > > > the corner cases right now and once we hit them I would like to see a
> > > > > > > graceful fallback rather than fatal action like OOM killer. Besides that
> > > > > > > the use cases I am mostly interested in are OK with fallback when the
> > > > > > alternative would be OOM killer. I also feel that introducing a knob
> > > > > > with a weaker semantic which can be made stronger later is a sensible
> > > > > > way to go.
> > > > > 
> > > > > We can't make it stronger, but we can make it weaker. 
> > > > 
> > > > Why cannot we make it stronger by a knob/configuration option?
> > > 
> > > Why can't we make it weaker by a knob?
> > 
> > I haven't said we couldn't.
> > 
> > > Why should we design the default for unforeseeable cornercases
> > > rather than make the default make sense for existing cases and give
> > > cornercases a fallback once they show up?
> > 
> > Sure we can do that but it would be a little bit lame IMO. We are
> > promising something and once we find out it doesn't work we will make
> > it weaker to work around that.
> >
> > Besides that the default should reflect the usecases, no? Do we have any
> > use case for the hard guarantee?
> 
> You're adding an extra layer of complexity so the burden of proof is
> on you.  Do we have any usecases that require a graceful fallback?

As far as I am aware nobody (except for google) really loves OOM
killer because there is nothing you can do once it strikes (in the
global/cpuset memory case). You have no choice for clean up etc...

If we consider that memcg and its limits are not zone aware while the
page allocator and reclaim are zone oriented then I can see a problem
of unexpected reclaim failure although there is no over commit on the
low_limit globally. And we do not have in-kernel effective measures to
mitigate this inherent problem. At least not now and I am afraid it is
a long route to have something that would work reasonably well in such
cases.
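
The mismatch described above can be shown with a toy model. This is only
a sketch under stated assumptions: struct group, NR_NODES,
node_reclaimable() and total_low_limit() are hypothetical names, one "node"
stands in for a zone, and a group counts as protected when its total usage
is strictly below its low limit. The point is that the limit is checked
group-wide while reclaim looks at one node at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 2

/* Hypothetical model; NOT the kernel's data structures. */
struct group {
	unsigned long low_limit;       /* protected pages, group-wide */
	unsigned long pages[NR_NODES]; /* pages charged on each node */
};

/* Total pages charged to the group across all nodes. */
static unsigned long group_usage(const struct group *g)
{
	unsigned long sum = 0;

	for (int node = 0; node < NR_NODES; node++)
		sum += g->pages[node];
	return sum;
}

/* Sum of all low limits, to compare against total memory. */
static unsigned long total_low_limit(const struct group *groups, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += groups[i].low_limit;
	return sum;
}

/* A node is reclaimable only if some unprotected group has pages on it. */
static bool node_reclaimable(const struct group *groups, size_t n, int node)
{
	assert(node >= 0 && node < NR_NODES);

	for (size_t i = 0; i < n; i++) {
		bool is_protected =
			group_usage(&groups[i]) < groups[i].low_limit;

		if (!is_protected && groups[i].pages[node] > 0)
			return true;
	}
	return false;
}
```

For instance, take a group with a low limit of 600 pages and 500 pages all
on node 0, and an unprotected group with 300 pages all on node 1. With
1024 pages in total the low limit is not overcommitted, yet
node_reclaimable() is false for node 0 and true for node 1.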

So to me it sounds more responsible to promise only as much as we can
handle. I think that fallback mode is not crippling the semantic of
the knob as it triggers only for limit overcommit or strange corner
cases. We have agreed that we do not care about the first one and
handling the latter one by a potentially fatal action doesn't sound very
user friendly to me.

For example, if we get back to the NUMA case then a graceful fallback
allows migrating offending tasks off the node and reducing reclaim on the
protected group. This can be done simply by watching the breach counter
and acting upon it. On the other hand if the default policy is OOM then
the possible actions are much more limited (any action would have to be
proactive, in the hope that it is faster than the OOM killer).
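
A userspace monitor reacting to the breach counter could be as small as
the following sketch. Everything here is an assumption for illustration:
the thread does not specify how the counter is exported, so the function
takes two already-read samples, and should_migrate() and its threshold
parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical policy: given two successive samples of a low-limit breach
 * counter, report whether enough new breaches happened in the interval to
 * justify migrating tasks off the node.
 */
static bool should_migrate(unsigned long prev, unsigned long cur,
			   unsigned long threshold)
{
	/* A zero threshold would fire on every sample, so reject it. */
	assert(threshold > 0);

	/* Unsigned subtraction stays correct if the counter wraps once. */
	unsigned long delta = cur - prev;

	return delta >= threshold;
}
```

The caller would sample the counter periodically and pass the previous and
current values; what "migrate" means (for example, rebinding tasks to
another node) is left to the administrator's tooling.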

Yet if somebody really wants to overcommit on the low_limit and get OOM
rather than get reclaimed then I can see some sense in it and would be
willing to add a knob to allow this behavior. But that is a
different situation because the configuration will be explicit and
the administrator is aware of the consequences and willing to accept them.

> > > > > Stronger is the simpler definition, it's simpler code,
> > > > 
> > > > The code is not really that much simpler. The one you have posted will
> > > > not work I am afraid. I haven't tested it yet but I remember I had to do
> > > > some tweaks to the reclaim path to not end up in an endless loop in the
> > > > direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> > > > http://marc.info/?l=linux-mm&m=138677141328682&w=2).
> > > 
> > > That's just a result of do_try_to_free_pages being stupid and using
> > > its own zonelist loop to check reclaimability by duplicating all the
> > > checks instead of properly using returned state of shrink_zones().
> > > Something that would be worth fixing regardless of memcg guarantees.
> > > 
> > > Or maybe we could add the guaranteed lru pages to sc->nr_scanned.
> > 
> > Fixes might be different than what I was proposing previously. I was
> > merely pointing out that removing the retry loop is not sufficient.
> 
> No, you were claiming that the hard limit implementation is not
> simpler.  It is.

Well, there are things you have to check anyway - short loops due to
racing reclaimers and quick priority drop down or even premature OOM
in direct reclaim paths. kswapd shouldn't loop endlessly if it cannot
balance the zone because all groups are within their limit on the node.
So I fail to see it as that much simpler.

Anyway, the complexity of the retry&ignore loop doesn't seem to be
significant enough to dictate the default behavior. We should go with
the one which makes the most sense for users.

> > > > > your usecases are fine with it,
> > > > 
> > > > my usecases do not overcommit low_limit on the available memory, so far
> > > > > so good, but once we hit a corner case where limits are set properly but
> > > > we end up not being able to reclaim anybody in a zone then OOM sounds
> > > > too brutal.
> > > 
> > > What cornercases?
> > 
> > I have mentioned a case where NUMA placement and specific node bindings
> > interfering with other allocators can end up in unreclaimable zones.
> > While you might disagree about the setup I have seen different things
> > done out there.
> 
> If you have real usecases that might depend on weak guarantees, please
> make a rational argument for them and don't just handwave. 

As I've said above. Usecases I am interested in do not overcommit on
low_limit. The limit is used to protect group(s) from memory pressure
from other loads which are running on the same machine. Primarily
because the working set is quite expensive to build up. If we really
hit a corner case and OOM would trigger then the whole state has to be
rebuilt and that is much more expensive than ephemeral reclaim.

> I know that there is every conceivable configuration out there, but
> it's unreasonable to design new features around the requirement of
> setups that are questionable to begin with.

I do agree but on the other hand I think we shouldn't ignore inherent
problems which might lead to the issues mentioned above, and we should
provide an interface which doesn't cause unexpected behavior.

> > Besides that the reclaim logic is complex enough and history taught me
> > that little buggers are hidden at places where you do not expect them.
> 
> So we introduce user interfaces designed around the fact that we don't
> trust our own code anymore?

No, we are talking about inherent problems here. And my experience
taught me to be careful because corner cases tend to show up in real
life situations.

> There is being prudent and then there is cargo cult programming.
> 
> > So call me a chicken but I would sleep calmer if we start weaker and add
> > additional guarantees later when somebody really insists on seeing
> > an OOM rather than get reclaimed.
> > The proposed counter can tell us more how good we are at not touching
> > groups with the limit and we can eventually debug those corner cases
> > without affecting the loads too much.
> 
> More realistically, potential bugs are never reported with a silent
> counter, which further widens the gap between our assumptions on how
> the VM behaves and what happens in production.

On the other hand, OOM-driven reports are arguably worse and come without
an easy workaround.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
@ 2014-06-03 11:07               ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-03 11:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed 28-05-14 12:33:35, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 05:54:14PM +0200, Michal Hocko wrote:
> > On Wed 28-05-14 11:28:54, Johannes Weiner wrote:
> > > On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
> > > > On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> > > > > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> > [...]
> > > > > > My main motivation for the weaker model is that it is hard to see all
> > > > > > the corner case right now and once we hit them I would like to see a
> > > > > > graceful fallback rather than fatal action like OOM killer. Besides that
> > > > > > the usaceses I am mostly interested in are OK with fallback when the
> > > > > > alternative would be OOM killer. I also feel that introducing a knob
> > > > > > with a weaker semantic which can be made stronger later is a sensible
> > > > > > way to go.
> > > > > 
> > > > > We can't make it stronger, but we can make it weaker. 
> > > > 
> > > > Why cannot we make it stronger by a knob/configuration option?
> > > 
> > > Why can't we make it weaker by a knob?
> > 
> > I haven't said we couldn't.
> > 
> > > Why should we design the default for unforeseeable cornercases
> > > rather than make the default make sense for existing cases and give
> > > cornercases a fallback once they show up?
> > 
> > Sure we can do that but it would be little bit lame IMO. We are
> > promising something and once we find out it doesn't work we will make
> > it weaker to workaround that.
> >
> > Besides that the default should reflect the usecases, no? Do we have any
> > use case for the hard guarantee?
> 
> You're adding an extra layer of complexity so the burden of proof is
> on you.  Do we have any usecases that require a graceful fallback?

As far as I am aware nobody (except for google) really loves OOM
killer because there is nothing you can do once it strikes (in the
global/cpuset memory case). You have no choice for clean up etc...

If we consider that memcg and its limits are not zone aware while the
page allocator and reclaim are zone oriented then I can see a problem
of unexpected reclaim failure although there is no over commit on the
low_limit globally. And we do not have in-kernel effective measures to
mitigate this inherent problem. At least not now and I am afraid it is
a long route to have something that would work reasonably well in such
cases.

So to me it sounds more responsible to promise only as much as we can
handle. I think that fallback mode is not crippling the semantic of
the knob as it triggers only for limit overcommit or strange corner
cases. We have agreed that we do not care about the first one and
handling the later one by potentially fatal action doesn't sounds very
user friendly to me.

For example, if we get back to the NUMA case then a graceful fallback
allows to migrate offending tasks off the node and reduce reclaim on the
protected group. This can be done simply by watching the breach counter
and act upon it. On the other hand if the default policy is OOM then
the possible actions are much more reduced (action would have to be
pro-active with hopes that they are faster than OOM).

Yet if somebody really wants to overcommit on the low_limit and get OOM
rather than get reclaimed then I can see some sense in it and would be
willing to add a knob to set allow this behavior. But that is a
different situation because the configuration will be explicit and
administrator aware of the consequences and is willing to accept them.

> > > > > Stronger is the simpler definition, it's simpler code,
> > > > 
> > > > The code is not really that much simpler. The one you have posted will
> > > > not work I am afraid. I haven't tested it yet but I remember I had to do
> > > > some tweaks to the reclaim path to not end up in an endless loop in the
> > > > direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> > > > http://marc.info/?l=linux-mm&m=138677141328682&w=2).
> > > 
> > > That's just a result of do_try_to_free_pages being stupid and using
> > > its own zonelist loop to check reclaimability by duplicating all the
> > > checks instead of properly using returned state of shrink_zones().
> > > Something that would be worth fixing regardless of memcg guarantees.
> > > 
> > > Or maybe we could add the guaranteed lru pages to sc->nr_scanned.
> > 
> > Fixes might be different than what I was proposing previously. I was
> > merely pointing out that removing the retry loop is not sufficient.
> 
> No, you were claiming that the hard limit implementation is not
> simpler.  It is.

Well, there are things you have to check anyway - short loops due to
racing reclaimers and quick priority drop down or even pre-mature OOM
in direct reclaim paths. kswapd shoudn't loop endlessly if it cannot
balance the zone because all groups are withing limit on the node.
So I fail to see it as that much simpler.

Anyway, the complexity of the retry&ignore loop doesn't seem to be
significant enough to dictate the default behavior. We should go with
the one which makes the most sense for users.

> > > > > your usecases are fine with it,
> > > > 
> > > > my usecases do not overcommit low_limit on the available memory, so far
> > > > so good, but once we hit a corner cases when limits are set properly but
> > > > we end up not being able to reclaim anybody in a zone then OOM sounds
> > > > too brutal.
> > > 
> > > What cornercases?
> > 
> > I have mentioned a case where NUMA placement and specific node bindings
> > interfering with other allocators can end up in unreclaimable zones.
> > While you might disagree about the setup I have seen different things
> > done out there.
> 
> If you have real usecases that might depend on weak guarantees, please
> make a rational argument for them and don't just handwave. 

As I've said above. Usecases I am interested in do not overcommit on
low_limit. The limit is used to protect group(s) from memory pressure
from other loads which are running on the same machine. Primarily
because the working set is quite expensive to build up. If we really
hit a corner case and OOM would trigger then the whole state has to be
rebuilt and that is much more expensive than ephemeral reclaim.

> I know that there is every conceivable configuration out there, but
> it's unreasonable to design new features around the requirement of
> setups that are questionable to begin with.

I do agree but on the other hand I think we shouldn't ignore inherent
problems which might lead to problems mentioned above and provide an
interface which doesn't cause an unexpected behavior.

> > Besides that the reclaim logic is complex enough and history taught me
> > that little buggers are hidden at places where you do not expect them.
> 
> So we introduce user interfaces designed around the fact that we don't
> trust our own code anymore?

No, we are talking about inherent problems here. And my experience
taught me to be careful, because corner cases tend to show up in
real-life situations.

> There is being prudent and then there is cargo cult programming.
> 
> > So call me a chicken but I would sleep calmer if we start weaker and add
> > additional guarantees later when somebody really insists on seeing
> > an OOM rather than get reclaimed.
> > The proposed counter can tell us more how good we are at not touching
> > groups with the limit and we can eventually debug those corner cases
> > without affecting the loads too much.
> 
> More realistically, potential bugs are never reported with a silent
> counter, which further widens the gap between our assumptions on how
> the VM behaves and what happens in production.

OOM-driven reports, on the other hand, are arguably worse and have no
easy workaround.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-05-28 16:17           ` Greg Thelen
@ 2014-06-03 11:09             ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-03 11:09 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Hugh Dickins,
	Roman Gushchin, LKML, linux-mm, Rik van Riel

On Wed 28-05-14 09:17:13, Greg Thelen wrote:
[...]
> My 2c...  The following works for my use cases:
> 1) introduce memory.low_limit_in_bytes (default=0 thus no default change
>    from older kernels)
> 2) interested users will set low_limit_in_bytes to non-zero value.
>    Memory protected by low limit should be as migratable/reclaimable as
>    mlock memory.  If a zone full of mlock memory causes oom kills, then
>    so should the low limit.

Would fallback mode in overcommit or the corner case situation break
your usecase?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-03 11:09             ` Michal Hocko
  (?)
@ 2014-06-03 14:01             ` Greg Thelen
  2014-06-03 14:44                 ` Michal Hocko
  -1 siblings, 1 reply; 196+ messages in thread
From: Greg Thelen @ 2014-06-03 14:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, KAMEZAWA Hiroyuki, Tejun Heo, linux-mm,
	Johannes Weiner, Hugh Dickins, KOSAKI Motohiro, Rik van Riel,
	LKML, Andrew Morton, Michel Lespinasse

On Jun 3, 2014 4:10 AM, "Michal Hocko" <mhocko@suse.cz> wrote:
>
> On Wed 28-05-14 09:17:13, Greg Thelen wrote:
> [...]
> > My 2c...  The following works for my use cases:
> > 1) introduce memory.low_limit_in_bytes (default=0 thus no default change
> >    from older kernels)
> > 2) interested users will set low_limit_in_bytes to non-zero value.
> >    Memory protected by low limit should be as migratable/reclaimable as
> >    mlock memory.  If a zone full of mlock memory causes oom kills, then
> >    so should the low limit.
>
> Would fallback mode in overcommit or the corner case situation break
> your usecase?

Yes.  Fallback mode would break my use cases.  What is the corner case
situation?  NUMA conflicts?  Low limit is a substitute for users mlocking
memory.  So if mlocked memory has the same NUMA conflicts, then I see no
problem with low limit having the same behavior.

From a user API perspective, I'm not clear on the difference between
non-ooming (fallback) low limit and the existing soft limit interface.  If
low limit is a "soft" (non ooming) limit then why not rework the existing
soft limit interface and save the low limit for strict (ooming) behavior?

Of course, Google can continue to tweak the soft limit or new low limit to
provide an ooming guarantee rather than violating the limit.

PS: I currently have very limited connectivity so my responses will be
delayed.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-03 11:07               ` Michal Hocko
@ 2014-06-03 14:22                 ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-03 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> On Wed 28-05-14 12:33:35, Johannes Weiner wrote:
> > On Wed, May 28, 2014 at 05:54:14PM +0200, Michal Hocko wrote:
> > > On Wed 28-05-14 11:28:54, Johannes Weiner wrote:
> > > > On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
> > > > > On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
> > > > > > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > > > My main motivation for the weaker model is that it is hard to see all
> > > > > > > the corner cases right now and once we hit them I would like to see a
> > > > > > > graceful fallback rather than fatal action like OOM killer. Besides that
> > > > > > > the usecases I am mostly interested in are OK with fallback when the
> > > > > > > alternative would be OOM killer. I also feel that introducing a knob
> > > > > > > with a weaker semantic which can be made stronger later is a sensible
> > > > > > > way to go.
> > > > > > 
> > > > > > We can't make it stronger, but we can make it weaker. 
> > > > > 
> > > > > Why cannot we make it stronger by a knob/configuration option?
> > > > 
> > > > Why can't we make it weaker by a knob?
> > > 
> > > I haven't said we couldn't.
> > > 
> > > > Why should we design the default for unforeseeable cornercases
> > > > rather than make the default make sense for existing cases and give
> > > > cornercases a fallback once they show up?
> > > 
> > > Sure we can do that but it would be a little bit lame IMO. We are
> > > promising something and once we find out it doesn't work we will make
> > > it weaker to work around that.
> > >
> > > Besides that the default should reflect the usecases, no? Do we have any
> > > use case for the hard guarantee?
> > 
> > You're adding an extra layer of complexity so the burden of proof is
> > on you.  Do we have any usecases that require a graceful fallback?
> 
> As far as I am aware nobody (except for Google) really loves OOM
> killer because there is nothing you can do once it strikes (in the
> global/cpuset memory case). You have no choice for clean up etc...
> 
> If we consider that memcg and its limits are not zone aware while the
> page allocator and reclaim are zone oriented then I can see a problem
> of unexpected reclaim failure although there is no overcommit on the
> low_limit globally. And we do not have effective in-kernel measures to
> mitigate this inherent problem. At least not now and I am afraid it is
> a long route to have something that would work reasonably well in such
> cases.

Which "inherent problem"?

> So to me it sounds more responsible to promise only as much as we can
> handle. I think that fallback mode is not crippling the semantic of
> the knob as it triggers only for limit overcommit or strange corner
> cases. We have agreed that we do not care about the first one and
> handling the latter one by a potentially fatal action doesn't sound very
> user friendly to me.

It *absolutely* cripples the semantics.  Think about the security use
cases of mlock for example, where certain memory may never hit the
platter.  This wouldn't be possible with your watered down guarantees.

And it's the user who makes the promise, not us.  I'd rather have the
responsibility with the user.  Provide mechanism, not policy.

> For example, if we get back to the NUMA case then a graceful fallback
> allows migrating offending tasks off the node and reducing reclaim on the
> protected group. This can be done simply by watching the breach counter
> and acting upon it. On the other hand if the default policy is OOM then
> the possible actions are much more limited (any action would have to be
> proactive, in the hope that it is faster than the OOM killer).

It's really frustrating that you just repeat arguments to which I
already responded.

Again, how is this different from mlock?

And again, if this really is a problem (which I doubt), we should fix
it at the root and implement direct migration, rather than design an
interface around it.

> > > > > > Stronger is the simpler definition, it's simpler code,
> > > > > 
> > > > > The code is not really that much simpler. The one you have posted will
> > > > > not work I am afraid. I haven't tested it yet but I remember I had to do
> > > > > some tweaks to the reclaim path to not end up in an endless loop in the
> > > > > direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> > > > > http://marc.info/?l=linux-mm&m=138677141328682&w=2).
> > > > 
> > > > That's just a result of do_try_to_free_pages being stupid and using
> > > > its own zonelist loop to check reclaimability by duplicating all the
> > > > checks instead of properly using returned state of shrink_zones().
> > > > Something that would be worth fixing regardless of memcg guarantees.
> > > > 
> > > > Or maybe we could add the guaranteed lru pages to sc->nr_scanned.
> > > 
> > > Fixes might be different than what I was proposing previously. I was
> > > merely pointing out that removing the retry loop is not sufficient.
> > 
> > No, you were claiming that the hard limit implementation is not
> > simpler.  It is.
> 
> Well, there are things you have to check anyway - short loops due to
> racing reclaimers and quick priority drop down or even premature OOM
> in direct reclaim paths. kswapd shouldn't loop endlessly if it cannot
> balance the zone because all groups are within limit on the node.
> So I fail to see it as that much simpler.

Could you please stop with the handwaving?  If there are bugs, we have
to fix them.  These pages are unreclaimable, plain and simple, like
anon without swap and mlocked pages.  None of this is new.

> Anyway, the complexity of the retry&ignore loop doesn't seem to be
> significant enough to dictate the default behavior. We should go with
> the one which makes the most sense for users.

The point is that you are adding complexity to weaken the semantics
and usefulness of this feature, with the only justification being
potential misconfigurations and a fear of unearthing kernel bugs.

This makes little sense for users, and even less sense for us.

> > > > > > your usecases are fine with it,
> > > > > 
> > > > > my usecases do not overcommit low_limit on the available memory, so far
> > > > > so good, but once we hit a corner case where limits are set properly but
> > > > > we end up not being able to reclaim anybody in a zone then OOM sounds
> > > > > too brutal.
> > > > 
> > > > What cornercases?
> > > 
> > > I have mentioned a case where NUMA placement and specific node bindings
> > > interfering with other allocators can end up in unreclaimable zones.
> > > While you might disagree about the setup I have seen different things
> > > done out there.
> > 
> > If you have real usecases that might depend on weak guarantees, please
> > make a rational argument for them and don't just handwave. 
> 
> As I've said above. Usecases I am interested in do not overcommit on
> low_limit. The limit is used to protect group(s) from memory pressure
> from other loads which are running on the same machine. Primarily
> because the working set is quite expensive to build up. If we really
> hit a corner case and OOM would trigger then the whole state has to be
> rebuilt and that is much more expensive than ephemeral reclaim.

What corner cases?

> > I know that there is every conceivable configuration out there, but
> > it's unreasonable to design new features around the requirement of
> > setups that are questionable to begin with.
> 
> I do agree but on the other hand I think we shouldn't ignore inherent
> problems which might lead to problems mentioned above and provide an
> interface which doesn't cause an unexpected behavior.

What inherent problems?

> > > Besides that the reclaim logic is complex enough and history taught me
> > > that little buggers are hidden at places where you do not expect them.
> > 
> > So we introduce user interfaces designed around the fact that we don't
> > trust our own code anymore?
> 
> No, we are talking about inherent problems here. And my experience
> taught me to be careful, because corner cases tend to show up in
> real-life situations.

I'm not willing to base an interface on this level of vagueness.

> > There is being prudent and then there is cargo cult programming.
> > 
> > > So call me a chicken but I would sleep calmer if we start weaker and add
> > > additional guarantees later when somebody really insists on seeing
> > > an OOM rather than get reclaimed.
> > > The proposed counter can tell us more how good we are at not touching
> > > groups with the limit and we can eventually debug those corner cases
> > > without affecting the loads too much.
> > 
> > More realistically, potential bugs are never reported with a silent
> > counter, which further widens the gap between our assumptions on how
> > the VM behaves and what happens in production.
> 
> OOM-driven reports, on the other hand, are arguably worse and have no
> easy workaround.

The workaround is obviously to lower the guarantees and/or fix the
NUMA bindings in such cases.

I really don't think you have a point here, because there is not a
single concrete example backing up your arguments.

Please remove the fallback code from your changes.  They weaken the
feature and add more complexity without reasonable justification - at
least you didn't convince anybody else involved in the discussion.

Because this is user-visible ABI that we are stuck with once released,
the patches should not be merged until we agree on the behavior.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-03 14:01             ` Greg Thelen
@ 2014-06-03 14:44                 ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-03 14:44 UTC (permalink / raw)
  To: Greg Thelen, Johannes Weiner
  Cc: Roman Gushchin, KAMEZAWA Hiroyuki, Tejun Heo, linux-mm,
	Hugh Dickins, KOSAKI Motohiro, Rik van Riel, LKML, Andrew Morton,
	Michel Lespinasse

On Tue 03-06-14 07:01:20, Greg Thelen wrote:
> On Jun 3, 2014 4:10 AM, "Michal Hocko" <mhocko@suse.cz> wrote:
> >
> > On Wed 28-05-14 09:17:13, Greg Thelen wrote:
> > [...]
> > > My 2c...  The following works for my use cases:
> > > 1) introduce memory.low_limit_in_bytes (default=0 thus no default change
> > >    from older kernels)
> > > 2) interested users will set low_limit_in_bytes to non-zero value.
> > >    Memory protected by low limit should be as migratable/reclaimable as
> > >    mlock memory.  If a zone full of mlock memory causes oom kills, then
> > >    so should the low limit.
> >
> > Would fallback mode in overcommit or the corner case situation break
> > your usecase?
> 
> Yes.  Fallback mode would break my use cases.  What is the corner case
> situation?  NUMA conflicts? 

Described here http://marc.info/?l=linux-mm&m=139940101124396&w=2

> Low limit is a substitute for users mlocking memory.  So if mlocked
> memory has the same NUMA conflicts, then I see no problem with low
> limit having the same behavior.

In principle they are similar - at least from the reclaim POV. The usage
will however be quite different IMO.
mlock is the explicit way to keep memory resident. The application
writer knows_what_he_is_doing, right?
Lowlimit is an administrative tool. The administrator of a potentially
complex application is tuning the said application to get the best
performance out of it.
Now both of them know that the thing might blow up if they overcommit on
the locked memory. So the application writer can check the system state
before he asks for mlock and he knows about previous mlocks.
The admin doesn't have that possibility because the memory distribution
of the memcg is not easy to find out.
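
As a rough illustration of the kind of check open to an application
writer, the sketch below compares a requested size against the soft
RLIMIT_MEMLOCK before calling mlock(). The helper names
fits_memlock_limit() and lock_if_allowed() are made up for this example,
and the check ignores memory the process has already locked, so it is
only a first approximation:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: return 1 if `bytes` fits under the soft
 * RLIMIT_MEMLOCK, 0 if it does not, -1 if the limit cannot be read.
 * Memory already locked by the process is not accounted for here.
 */
int fits_memlock_limit(size_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return (rlim_t)bytes <= rl.rlim_cur ? 1 : 0;
}

/*
 * Hypothetical helper: lock the buffer only when the limit check
 * passes. Returns 0 on success and -1 otherwise.
 */
int lock_if_allowed(void *buf, size_t bytes)
{
	if (fits_memlock_limit(bytes) != 1)
		return -1;
	return mlock(buf, bytes);
}
```

A caller would run lock_if_allowed() on its working buffer and handle a
-1 return itself. The admin setting a low limit has no equally direct
check, which is the asymmetry described above.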

> From a user API perspective, I'm not clear on the difference between
> non-ooming (fallback) low limit and the existing soft limit interface.  If
> low limit is a "soft" (non ooming) limit then why not rework the existing
> soft limit interface and save the low limit for strict (ooming) behavior?

No, not that path again. Pretty please! We've been there and it didn't
work out. We've been told to not flip defaults and potentially break
userspace. Softlimit with its weird semantics should die and stay as a
colorful example of a bad design decision.

> Of course, Google can continue to tweak the soft limit or new low
> limit to provide an ooming guarantee rather than violating the limit.

If you have the use case for the hard guarantee then we can add a knob
as I've said repeatedly. I just wanted to hear the use case. If you have
one, great. I just wanted to start with something which is more usable
in general.

Your setup is quite specific and known to love OOM killers so you are
very well prepared for that. On the other hand my users would end up in
a surprise if they saw an OOM while the setup was seemingly correct
because lowlimit was not overcommitted.

I can come up with a patch on top of what is in mm tree now. It would
add a knob (configurable to default to either fallback or OOM).

What do you think about this? Would that work for you and Johannes?
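
The two behaviors under discussion can be sketched as a small userspace
model. This is not the kernel implementation: struct group,
reclaim_pass() and reclaim() are hypothetical names, usage is counted in
abstract pages, and zones, hierarchy and reclaim priorities are left out.
Groups at or below their low limit are skipped. With LOW_FALLBACK a
second pass ignores the limit when the first pass reclaimed nothing;
with LOW_STRICT the caller gets 0 back and would have to OOM:

```c
#include <stddef.h>

/* Hypothetical model of a memory cgroup; not a kernel structure. */
struct group {
	unsigned long usage;     /* pages currently charged */
	unsigned long low_limit; /* protected amount, 0 = no protection */
};

enum low_policy { LOW_FALLBACK, LOW_STRICT };

/*
 * One pass over all groups, taking up to `want` pages in total.
 * When honor_low is set, each group keeps at least low_limit pages,
 * so groups at or below their limit are skipped entirely.
 */
static unsigned long reclaim_pass(struct group *g, size_t n,
				  unsigned long want, int honor_low)
{
	unsigned long got = 0;
	size_t i;

	for (i = 0; i < n && got < want; i++) {
		unsigned long keep = honor_low ? g[i].low_limit : 0;
		unsigned long take;

		if (g[i].usage <= keep)
			continue;
		take = g[i].usage - keep;
		if (take > want - got)
			take = want - got;
		g[i].usage -= take;
		got += take;
	}
	return got;
}

/*
 * Returns the number of pages reclaimed. A return of 0 means nothing
 * was reclaimable under the chosen policy, which is where a strict
 * guarantee would lead to the OOM killer.
 */
unsigned long reclaim(struct group *g, size_t n, unsigned long want,
		      enum low_policy policy)
{
	unsigned long got = reclaim_pass(g, n, want, 1);

	if (got == 0 && policy == LOW_FALLBACK)
		got = reclaim_pass(g, n, want, 0);
	return got;
}
```

With two groups that are both under their low limit, reclaim() returns 0
under LOW_STRICT and leaves usage untouched, while LOW_FALLBACK takes the
requested pages from the first group it visits.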

> PS: I currently have very limited connectivity so my responses will be
> delayed.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
@ 2014-06-03 14:44                 ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-03 14:44 UTC (permalink / raw)
  To: Greg Thelen, Johannes Weiner
  Cc: Roman Gushchin, KAMEZAWA Hiroyuki, Tejun Heo, linux-mm,
	Hugh Dickins, KOSAKI Motohiro, Rik van Riel, LKML, Andrew Morton,
	Michel Lespinasse

On Tue 03-06-14 07:01:20, Greg Thelen wrote:
> On Jun 3, 2014 4:10 AM, "Michal Hocko" <mhocko@suse.cz> wrote:
> >
> > On Wed 28-05-14 09:17:13, Greg Thelen wrote:
> > [...]
> > > My 2c...  The following works for my use cases:
> > > 1) introduce memory.low_limit_in_bytes (default=0 thus no default change
> > >    from older kernels)
> > > 2) interested users will set low_limit_in_bytes to non-zero value.
> > >    Memory protected by low limit should be as migratable/reclaimable as
> > >    mlock memory.  If a zone full of mlock memory causes oom kills, then
> > >    so should the low limit.
> >
> > Would fallback mode in overcommit or the corner case situation break
> > your usecase?
> 
> Yes.  Fallback mode would break my use cases.  What is the corner case
> situation?  NUMA conflicts? 

Described here http://marc.info/?l=linux-mm&m=139940101124396&w=2

> Low limit is a substitute for users mlocking memory.  So if mlocked
> memory has the same NUMA conflicts, then I see no problem with low
> limit having the same behavior.

In principal they are similar - at least from the reclaim POV. The usage
will be however quite different IMO.
mlock is the explicit way to keep memory resident. The application
writer knows_what_he_is_doing, right?
Lowlimit is an administrative tool. Administrator of a potentially complex
application is tuning the said application to beat the best performance
out of it.
Now both of them know that the thing might blow up if they overcommit on
the locked memory. So the application writer can check the system state
before he asks for mlock and he knows about previous mlocks.
Admin doesn't have that possibility because the memory distribution of
the memcg is not easy to find out.

> From a user API perspective, I'm not clear on the difference between
> non-ooming (fallback) low limit and the existing soft limit interface.  If
> low limit is a "soft" (non ooming) limit then why not rework the existing
> soft limit interface and save the low limit for strict (ooming) behavior?

No, not that path again. Pretty please! We've been there and it didn't
work out. We've been told to not flip defaults and potentially break
userspace. Softlimit with its weird semantics should die and stay as a
colorful example of a bad design decision.

> Of course, Google can continue to tweak the soft limit or new low
> limit to provide an ooming guarantee rather than violating the limit.

If you have the use case for the hard guarantee then we can add a knob
as I've said repeatedly. I just wanted to hear the use case. If you have
one, great. I just wanted to start with something which is more usable
in general.

Your setup is quite specific and known to love OOM killers so you are
very well prepared for that. On the other hand my users would end up in
a surprise if they saw an OOM while the setup was seemingly correct
because lowlimit was not overcommitted.

I can come up with a patch on top of what is in the mm tree now. It would
add a knob (with a configurable default of either fallback or OOM).
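
The two behaviours under discussion can be put side by side in a small
model. This is only a sketch, not code from the patchset: `Group`,
`reclaim_candidates` and the `fallback` flag are hypothetical names, and it
assumes a group counts as protected while its usage is strictly below its
low limit.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    usage: int      # bytes currently charged to the group
    low_limit: int  # hypothetical low limit; 0 means no protection


def reclaim_candidates(groups, fallback=True):
    """Return the groups reclaim may scan, or raise if none are eligible."""
    # First pass: skip every group that is still within its low limit.
    eligible = [g for g in groups if g.usage >= g.low_limit]
    if eligible:
        return eligible
    # All groups in the hierarchy are within their low limit.
    if fallback:
        # Fallback mode: ignore the low limit and reclaim from everybody.
        return list(groups)
    # Strict mode: nothing is reclaimable, which stands in for an OOM here.
    raise MemoryError("all groups are within their low limit")
```

With the default limit of 0 every group stays eligible, so the two modes
only differ once every group in the reclaimed hierarchy is within its limit.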

What do you think about this? Would that work for you and Johannes?

> PS: I currently have very limited connectivity so my responses will be
> delayed.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-03 14:22                 ` Johannes Weiner
@ 2014-06-04 14:46                   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-04 14:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
[...]
> > If we consider that memcg and its limits are not zone aware while the
> > page allocator and reclaim are zone oriented then I can see a problem
> > of unexpected reclaim failure although there is no over commit on the
> > low_limit globally. And we do not have in-kernel effective measures to
> > mitigate this inherent problem. At least not now and I am afraid it is
> > a long route to have something that would work reasonably well in such
> > cases.
> 
> Which "inherent problem"?

zone unawareness of the limit vs. allocation/reclaim which are zone
oriented.
 
> > So to me it sounds more responsible to promise only as much as we can
> > handle. I think that fallback mode is not crippling the semantic of
> > the knob as it triggers only for limit overcommit or strange corner
> > cases. We have agreed that we do not care about the first one and
> > handling the latter one by potentially fatal action doesn't sound very
> > user friendly to me.
> 
> It *absolutely* cripples the semantics.  Think about the security use
> cases of mlock for example, where certain memory may never hit the
> platter.  This wouldn't be possible with your watered down guarantees.

Is this really a use case? It sounds like a weak one to me. Any
sudden memory consumption above the limit can reclaim the page you want
to protect, so it will hit the platter and you cannot do anything about
it. So yeah, this is not mlock.

> And it's the user who makes the promise, not us.  I'd rather have the
> responsibility with the user.  Provide mechanism, not policy.

And that user is the application writer, not its administrator. And
memcg is more of an admin interface than a development API.

> > For example, if we get back to the NUMA case then a graceful fallback
> > allows to migrate offending tasks off the node and reduce reclaim on the
> > protected group. This can be done simply by watching the breach counter
> > and act upon it. On the other hand if the default policy is OOM then
> > the possible actions are much more reduced (action would have to be
> > pro-active with hopes that they are faster than OOM).
> 
> It's really frustrating that you just repeat arguments to which I
> already responded.

No you haven't responded. You are dismissing the issue in the first
place. Can you guarantee that there is no OOM when low_limits do not
overcommit the machine and node bound tasks live in a group which
doesn't overcommit the node?

> Again, how is this different from mlock?

Sigh. The first thing is that this is not mlock. You are operating on
a per-group basis. You are running a load which can make its own decisions
on the NUMA placement etc... With mlock you are explicit about which
memory is locked (potentially even the placement). So the situation is
very much different I would say.

> And again, if this really is a problem (which I doubt), we should fix
> it at the root and implement direct migration, rather than design an
> interface around it.

Why would we do something like that in the kernel when we have tools to
migrate tasks from userspace?

> > > > > > > Stronger is the simpler definition, it's simpler code,
> > > > > > 
> > > > > > The code is not really that much simpler. The one you have posted will
> > > > > > not work I am afraid. I haven't tested it yet but I remember I had to do
> > > > > > some tweaks to the reclaim path to not end up in an endless loop in the
> > > > > > direct reclaim (http://marc.info/?l=linux-mm&m=138677140828678&w=2 and
> > > > > > http://marc.info/?l=linux-mm&m=138677141328682&w=2).
> > > > > 
> > > > > That's just a result of do_try_to_free_pages being stupid and using
> > > > > its own zonelist loop to check reclaimability by duplicating all the
> > > > > checks instead of properly using returned state of shrink_zones().
> > > > > Something that would be worth fixing regardless of memcg guarantees.
> > > > > 
> > > > > Or maybe we could add the guaranteed lru pages to sc->nr_scanned.
> > > > 
> > > > Fixes might be different than what I was proposing previously. I was
> > > > merely pointing out that removing the retry loop is not sufficient.
> > > 
> > > No, you were claiming that the hard limit implementation is not
> > > simpler.  It is.
> > 
> > Well, there are things you have to check anyway - short loops due to
> > racing reclaimers and quick priority drop down or even pre-mature OOM
> > in direct reclaim paths. kswapd shouldn't loop endlessly if it cannot
> > balance the zone because all groups are within the limit on the node.
> > So I fail to see it as that much simpler.
> 
> Could you please stop with the handwaving?  If there are bugs, we have
> to fix them.  These pages are unreclaimable, plain and simple, like
> anon without swap and mlocked pages.  None of this is new.

There is no handwaving. The above two patches describe what I mean.
You have just thrown in a patch to remove the retry loop, claiming that the code
is easier that way and I've tried to explain to you that it is not that
simple. Full stop.

> > Anyway, the complexity of the retry&ignore loop doesn't seem to be
> > significant enough to dictate the default behavior. We should go with
> > the one which makes the most sense for users.
> 
> The point is that you are adding complexity to weaken the semantics
> and usefulness of this feature, with the only justification being
> potential misconfigurations and a

Which misconfiguration are you talking about?

> fear of unearthing kernel bugs.

This is not right! I didn't say I am afraid of bugs. I said that the
code is complicated enough that seeing all the potential corner cases is
really hard and so starting with weaker semantics makes some sense.

> This makes little sense for users, and even less sense for us.
> 
> > > > > > > your usecases are fine with it,
> > > > > > 
> > > > > > my usecases do not overcommit low_limit on the available memory, so far
> > > > > > so good, but once we hit a corner cases when limits are set properly but
> > > > > > we end up not being able to reclaim anybody in a zone then OOM sounds
> > > > > > too brutal.
> > > > > 
> > > > > What cornercases?
> > > > 
> > > > I have mentioned a case where NUMA placement and specific node bindings
> > > > interfering with other allocators can end up in unreclaimable zones.
> > > > While you might disagree about the setup I have seen different things
> > > > done out there.
> > > 
> > > If you have real usecases that might depend on weak guarantees, please
> > > make a rational argument for them and don't just handwave. 
> > 
> > As I've said above. Usecases I am interested in do not overcommit on
> > low_limit. The limit is used to protect group(s) from memory pressure
> > from other loads which are running on the same machine. Primarily
> > because the working set is quite expensive to build up. If we really
> > hit a corner case and OOM would trigger then the whole state has to be
> > rebuilt and that is much more expensive than ephemeral reclaim.
> 
> What corner cases?

Seriously? Come on Johannes, try to be a little bit constructive.

> > > I know that there is every conceivable configuration out there, but
> > > it's unreasonable to design new features around the requirement of
> > > setups that are questionable to begin with.
> > 
> > I do agree but on the other hand I think we shouldn't ignore inherent
> > problems which might lead to problems mentioned above and provide an
> > interface which doesn't cause an unexpected behavior.
> 
> What inherent problems?
> 
> > > > Besides that the reclaim logic is complex enough and history taught me
> > > > that little buggers are hidden at places where you do not expect them.
> > > 
> > > So we introduce user interfaces designed around the fact that we don't
> > > trust our own code anymore?
> > 
> > No, we are talking about inherent problems here. And my experience
> > taught me to be careful and corner cases tend to show up in the real
> > life situations.
> 
> I'm not willing to base an interface on this level of vagueness.
> 
> > > There is being prudent and then there is cargo cult programming.
> > > 
> > > > So call me a chicken but I would sleep calmer if we start weaker and add
> > > > additional guarantees later when somebody really insists on seeing
> > > > an OOM rather than get reclaimed.
> > > > The proposed counter can tell us more how good we are at not touching
> > > > groups with the limit and we can eventually debug those corner cases
> > > > without affecting the loads too much.
> > > 
> > > More realistically, potential bugs are never reported with a silent
> > > counter, which further widens the gap between our assumptions on how
> > > the VM behaves and what happens in production.
> > 
> > OOM driven reports are arguably worse and without easy workaround on the
> > other hand.
> 
> The workaround is obviously to lower the guarantees and/or fix the
> NUMA bindings in such cases.

How? Do not use low_limit on node bound loads? Use cumulative low_limit
smaller than any node which has bindings? How is the feature still
useful?

> I really don't think you have a point here, because there is not a
> single concrete example backing up your arguments.
> 
> Please remove the fallback code from your changes.  They weaken the
> feature and add more complexity without reasonable justification - at
> least you didn't convince anybody else involved in the discussion.

OK, so you are simply ignoring the usecase I've provided to you and then
claim the usefulness of the OOM default without providing any usecases
(we are still talking about setups which do not overcommit low_limit).

> Because this is user-visible ABI that we are stuck with once released,
> the patches should not be merged until we agree on the behavior.

In the other email I suggested adding a knob with a configurable
default. Would you be OK with that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-04 14:46                   ` Michal Hocko
@ 2014-06-04 15:44                     ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-04 15:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> [...]
> > > If we consider that memcg and its limits are not zone aware while the
> > > page allocator and reclaim are zone oriented then I can see a problem
> > > of unexpected reclaim failure although there is no over commit on the
> > > low_limit globally. And we do not have in-kernel effective measures to
> > > mitigate this inherent problem. At least not now and I am afraid it is
> > > a long route to have something that would work reasonably well in such
> > > cases.
> > 
> > Which "inherent problem"?
> 
> zone unawareness of the limit vs. allocation/reclaim which are zone
> oriented.

This is a quote from another subthread where you haven't responded:

---

> > > > But who actually cares if an individual zone can be reclaimed?
> > > > 
> > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > is smaller than that memcg's guarantee. 
> > > 
> > > The protected group might spill over to another group and eat it when
> > > another group would be simply pushed out from the node it is bound to.
> > 
> > I don't really understand the point you're trying to make.
> 
> I was just trying to show a case where individual zone matters. To make
> it more specific consider 2 groups A (with low-limit 60% RAM) and B
> (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> having 70% of RAM reserved for guarantee makes some sense, right? B is
> not over-committing the node it is bound to. Yet the A's allocations
> might make pressure on X regardless that the whole system is still doing
> good. This can lead to a situation where X gets depleted and nothing
> would be reclaimable leading to an OOM condition.

Once you assume control of memory *placement* in the system like this,
you can not also pretend to be clueless and have unreclaimable memory
of this magnitude spread around into nodes used by other bound tasks.

If we were to actively support such configurations, we should be doing
direct NUMA balancing and migrate these pages out of node X when B
needs to allocate.  That would fix the problem for all unevictable
memory, not just memcg guarantees, and would prefer node-offloading
over swapping in cases where swap is available.

But really, this whole scenario sounds contrived to me.  And there is
nothing specific about memcg guarantees in there.

---
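
The arithmetic of the quoted scenario can be checked with a short sketch.
Sizes are in percent of RAM as in the example; the function names are made
up for illustration, and the worst case assumes that all of A's protected
memory may end up on node X.

```python
def overcommitted(total_ram, low_limits):
    """True if the low limits together exceed the memory of the machine."""
    return sum(low_limits) > total_ram


def worst_case_unpinned(node_size, foreign_low_limits):
    """Node memory left unpinned if other groups' protected pages land here."""
    pinned = min(node_size, sum(foreign_low_limits))
    return node_size - pinned


TOTAL_RAM = 100  # whole machine, in percent
NODE_X = 25      # node that group B is bound to
A_LOW = 60       # low limit of the unbound group A
B_LOW = 10       # low limit of group B

# Globally the guarantees fit: 70% of RAM is reserved.
print(overcommitted(TOTAL_RAM, [A_LOW, B_LOW]))  # False
# Yet A's protected pages alone can cover all of node X.
print(worst_case_unpinned(NODE_X, [A_LOW]))      # 0
```

Whether that worst case is a realistic configuration or a contrived one is
exactly what the two sides of this thread disagree on.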

> > > So to me it sounds more responsible to promise only as much as we can
> > > handle. I think that fallback mode is not crippling the semantic of
> > > the knob as it triggers only for limit overcommit or strange corner
> > > cases. We have agreed that we do not care about the first one and
> > > handling the latter one by potentially fatal action doesn't sound very
> > > user friendly to me.
> > 
> > It *absolutely* cripples the semantics.  Think about the security use
> > cases of mlock for example, where certain memory may never hit the
> > platter.  This wouldn't be possible with your watered down guarantees.
> 
> Is this really a use case? It sounds like a weak one to me. Because
> any sudden memory consumption above the limit can reclaim your
> to-protect-page it will hit the platter and you cannot do anything about
> this. So yeah, this is not mlock.

You are right, that is a weak usecase.

It doesn't change the fact that it does severely weaken the semantics
and turns it into another best-effort mechanism that the user can't
count on.  This sucks.  It sucked with soft limits and it will suck
again.  The irony is that Greg even pointed out you should be doing
soft limits if you want this sort of behavior.

> > And it's the user who makes the promise, not us.  I'd rather have the
> > responsibility with the user.  Provide mechanism, not policy.
> 
> And that user is the application writer, not its administrator. And
> memcg is more of an admin interface than a development API.
> 
> > > For example, if we get back to the NUMA case then a graceful fallback
> > > allows to migrate offending tasks off the node and reduce reclaim on the
> > > protected group. This can be done simply by watching the breach counter
> > > and act upon it. On the other hand if the default policy is OOM then
> > > the possible actions are much more reduced (action would have to be
> > > pro-active with hopes that they are faster than OOM).
> > 
> > It's really frustrating that you just repeat arguments to which I
> > already responded.
> 
> No you haven't responded. You are dismissing the issue in the first
> place. Can you guarantee that there is no OOM when low_limits do not
> overcommit the machine and node bound tasks live in a group which
> doesn't overcommit the node?

I was referring to the quote on NUMA configurations above.

You can not make this guarantee even *without* the low limit set
because of other sources of unreclaimable memory.

> > Again, how is this different from mlock?
> 
> Sigh. The first thing is that this is not mlock. You are operating on
> per-group basis. You are running a load which can make its own decisions
> on the NUMA placement etc... With mlock you are explicit about which
> memory is locked (potentially even the placement). So the situation is
> very much different I would say.

It is still unreclaimable memory, just like mlock and anon without
swap.  I still don't see how it's different, and making unfounded
claims that it is won't change that.  Provide an example.

> > And again, if this really is a problem (which I doubt), we should fix
> > it at the root and implement direct migration, rather than design an
> > interface around it.
> 
> Why would we do something like that in the kernel when we have tools to
> migrate tasks from the userspace?

This is again in reference to partial NUMA bindings.  We can not
reclaim guaranteed memory, but we could direct-migrate unbound
unreclaimable memory to another node at allocation time as part of the
reclaim cycle.

If unbound unreclaimable memory spilling into random nodes making them
unreclaimable and causing OOMs for node-bound tasks truly is a
problem, then a) it's not a new one because of mlock and swapless anon
and b) it shouldn't be solved by weakening guarantee semantics, but by
something like direct migrate.

> > > > > So call me a chicken but I would sleep calmer if we start weaker and add
> > > > > additional guarantees later when somebody really insists on seeing
> > > > > an OOM rather than get reclaimed.
> > > > > The proposed counter can tell us more how good we are at not touching
> > > > > groups with the limit and we can eventually debug those corner cases
> > > > > without affecting the loads too much.
> > > > 
> > > > More realistically, potential bugs are never reported with a silent
> > > > counter, which further widens the gap between our assumptions on how
> > > > the VM behaves and what happens in production.
> > > 
> > > OOM driven reports are arguably worse and without easy workaround on the
> > > other hand.
> > 
> > The workaround is obviously to lower the guarantees and/or fix the
> > NUMA bindings in such cases.
> 
> How? Do not use low_limit on node bound loads? Use cumulative low_limit
> smaller than any node which has bindings? How is the feature still
> useful?

You completely lost me.  Why can I not use the low limit in
combination with hard bindings?  And yes, the low limit of a group of
tasks would have to be smaller than the nodes that group is bound to.

> > I really don't think you have a point here, because there is not a
> > single concrete example backing up your arguments.
> > 
> > Please remove the fallback code from your changes.  They weaken the
> > feature and add more complexity without reasonable justification - at
> > least you didn't convince anybody else involved in the discussion.
> 
> OK, so you are simply ignoring the usecase I've provided to you and then
> claim the usefulness of the OOM default without providing any usecases
> (we are still talking about setups which do not overcommit low_limit).

I'm not ignoring it, I think I addressed all usecases that you
mentioned.  If I'm missing one, please point it out to me.

> > Because this is user-visible ABI that we are stuck with once released,
> > the patches should not be merged until we agree on the behavior.
> 
> In the other email I have suggested to add a knob with the configurable
> default. Would you be OK with that?

No, I want to agree on whether we need that fallback code or not.  I'm
not interested in merging code that you can't convince anybody else is
needed.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-04 15:44                     ` Johannes Weiner
@ 2014-06-04 19:18                       ` Hugh Dickins
  -1 siblings, 0 replies; 196+ messages in thread
From: Hugh Dickins @ 2014-06-04 19:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Hugh Dickins,
	Roman Gushchin, LKML, linux-mm, Rik van Riel

On Wed, 4 Jun 2014, Johannes Weiner wrote:
> On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > 
> > In the other email I have suggested to add a knob with the configurable
> > default. Would you be OK with that?
> 
> No, I want to agree on whether we need that fallback code or not.  I'm
> not interested in merging code that you can't convince anybody else is
> needed.

I for one would welcome such a knob as Michal is proposing.

I thought it was long ago agreed that the low limit was going to fall back
when it couldn't be satisfied.  But you seem implacably opposed to that
as default, and I can well believe that Google is so accustomed to OOMing
that it is more comfortable with OOMing as the default.  Okay.  But I
would expect there to be many who want the attempt towards isolation that
low limit offers, without a collapse to OOM at the first misjudgement.

Hugh

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-04 19:18                       ` Hugh Dickins
@ 2014-06-04 21:45                         ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-04 21:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed, Jun 04, 2014 at 12:18:59PM -0700, Hugh Dickins wrote:
> On Wed, 4 Jun 2014, Johannes Weiner wrote:
> > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > 
> > > In the other email I have suggested to add a knob with the configurable
> > > default. Would you be OK with that?
> > 
> > No, I want to agree on whether we need that fallback code or not.  I'm
> > not interested in merging code that you can't convince anybody else is
> > needed.
> 
> I for one would welcome such a knob as Michal is proposing.

Now we have a tie :-)

> I thought it was long ago agreed that the low limit was going to fallback
> when it couldn't be satisfied.  But you seem implacably opposed to that
> as default, and I can well believe that Google is so accustomed to OOMing
> that it is more comfortable with OOMing as the default.  Okay.  But I
> would expect there to be many who want the attempt towards isolation that
> low limit offers, without a collapse to OOM at the first misjudgement.

At the same time, I only see users like Google pushing the limits of
the machine to a point where guarantees cover north of 90% of memory.
I would expect more casual users to work with much smaller guarantees,
and a good chunk of slack on top - otherwise they already had better
be set up for the occasional OOM.  Is this an unreasonable assumption
to make?

I'm not opposed to this feature per se, but I'm really opposed to
merging it for the partial hard bindings argument and for papering
over deficiencies in our reclaim code, because I don't want any of
that in the changelog, in the documentation, or in what we otherwise
tell users about it.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-04 15:44                     ` Johannes Weiner
@ 2014-06-05 14:32                       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-05 14:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed 04-06-14 11:44:08, Johannes Weiner wrote:
> On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> > [...]
> > > > If we consider that memcg and its limits are not zone aware while the
> > > > page allocator and reclaim are zone oriented then I can see a problem
> > > > of unexpected reclaim failure although there is no over commit on the
> > > > low_limit globally. And we do not have in-kernel effective measures to
> > > > mitigate this inherent problem. At least not now and I am afraid it is
> > > > a long route to have something that would work reasonably well in such
> > > > cases.
> > > 
> > > Which "inherent problem"?
> > 
> > zone unawareness of the limit vs. allocation/reclaim which are zone
> > oriented.
> 
> This is a quote from another subthread where you haven't responded:
> 
> ---
> 
> > > > > But who actually cares if an individual zone can be reclaimed?
> > > > > 
> > > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > > is smaller than that memcg's guarantee. 
> > > > 
> > > > The protected group might spill over to another group and eat it when
> > > > another group would be simply pushed out from the node it is bound to.
> > > 
> > > I don't really understand the point you're trying to make.
> > 
> > I was just trying to show a case where individual zone matters. To make
> > it more specific consider 2 groups A (with low-limit 60% RAM) and B
> > (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> > having 70% of RAM reserved for guarantee makes some sense, right? B is
> > not over-committing the node it is bound to. Yet the A's allocations
> > might make pressure on X regardless that the whole system is still doing
> > good. This can lead to a situation where X gets depleted and nothing
> > would be reclaimable leading to an OOM condition.
> 
> Once you assume control of memory *placement* in the system like this,
> you can not also pretend to be clueless and have unreclaimable memory
> of this magnitude spread around into nodes used by other bound tasks.

You are still assuming that the administrator controls the placement.
The load running in a memcg might be a black box to the admin, e.g. a
container which pays $$ to get priority and to avoid reclaim where that
is possible. The admin can make sure that the cumulative low_limits for
containers are sane, but he has no control over what the loads inside
are doing, and a potential OOM when one tries to DoS the other is
definitely not welcome.
 
> If we were to actively support such configurations, we should be doing
> direct NUMA balancing and migrate these pages out of node X when B
> needs to allocate. 

Migration is certainly one way to reduce the risk. The question is
whether this should be done implicitly by the kernel or by the
administrator.

> That would fix the problem for all unevictable
> memory, not just memcg guarantees, and would prefer node-offloading
> over swapping in cases where swap is available.

That would certainly lower the risk. But there still might be unmovable
memory sitting on the node so this will never be 100%.

> But really, this whole scenario sounds contrived to me.  And there is
> nothing specific about memcg guarantees in there.
> 
> ---
> 
> > > > So to me it sounds more responsible to promise only as much as we can
> > > > handle. I think that fallback mode is not crippling the semantic of
> > > > the knob as it triggers only for limit overcommit or strange corner
> > > > cases. We have agreed that we do not care about the first one and
> > > > handling the later one by potentially fatal action doesn't sounds very
> > > > user friendly to me.
> > > 
> > > It *absolutely* cripples the semantics.  Think about the security use
> > > cases of mlock for example, where certain memory may never hit the
> > > platter.  This wouldn't be possible with your watered down guarantees.
> > 
> > Is this really a use case? It sounds like a weak one to me. Because
> > any sudden memory consumption above the limit can reclaim your
> > to-protect-page it will hit the platter and you cannot do anything about
> > this. So yeah, this is not mlock.
> 
> You are right, that is a weak usecase.
> 
> It doesn't change the fact that it does severely weaken the semantics
> and turns it into another best-effort mechanism that the user can't
> count on.  This sucks.  It sucked with soft limits and it will suck
> again.  The irony is that Greg even pointed out you should be doing
> soft limits if you want this sort of behavior.

The question is whether we really _need_ hard guarantees. I came up with
the low_limit as a replacement for soft_limit, which really sucks. But it
sucks not because you cannot count on it. What sucks is that it has the
opposite semantic, and the implementation of course. I have
tried to fix it and that route was a no-go.

I think the hard guarantee makes some sense when we allow overcommit of
the limit. Somebody might really want to set up lowlimit == hardlimit
because reclaim would be more harmful than a restart of the application.
I would however expect this to be an exception rather than regular use.
Most users I can think of will set low_limit to an effective working set
size to be isolated from other loads, and ephemeral reclaim will not hurt
them. OOM, on the other hand, would be really harmful.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-04 21:45                         ` Johannes Weiner
@ 2014-06-05 14:51                           ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-05 14:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Wed 04-06-14 17:45:53, Johannes Weiner wrote:
> On Wed, Jun 04, 2014 at 12:18:59PM -0700, Hugh Dickins wrote:
> > On Wed, 4 Jun 2014, Johannes Weiner wrote:
> > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > 
> > > > In the other email I have suggested to add a knob with the configurable
> > > > default. Would you be OK with that?
> > > 
> > > No, I want to agree on whether we need that fallback code or not.  I'm
> > > not interested in merging code that you can't convince anybody else is
> > > needed.
> > 
> > I for one would welcome such a knob as Michal is proposing.
> 
> Now we have a tie :-)
> 
> > I thought it was long ago agreed that the low limit was going to fallback
> > when it couldn't be satisfied.  But you seem implacably opposed to that
> > as default, and I can well believe that Google is so accustomed to OOMing
> > that it is more comfortable with OOMing as the default.  Okay.  But I
> > would expect there to be many who want the attempt towards isolation that
> > low limit offers, without a collapse to OOM at the first misjudgement.
> 
> At the same time, I only see users like Google pushing the limits of
> the machine to a point where guarantees cover north of 90% of memory.

I can think of in-memory database loads which would use a reclaim
protection that is quite high as well (say 80% of available memory).
Those would definitely like to see ephemeral reclaim rather than OOM.

> I would expect more casual users to work with much smaller guarantees,
> and a good chunk of slack on top - otherwise they already had better
> be set up for the occasional OOM.  Is this an unreasonable assumption
> to make?
> 
> I'm not opposed to this feature per se, but I'm really opposed to
> merging it for the partial hard bindings argument

This was just an example showing that even a setup which is not
overcommitting the limit might be caught in an unreclaimable position.
Sure, we can mitigate those issues to some extent, and that would surely
be welcome.

The more important part, however, is that not all use cases really
_require_ a hard guarantee. They are asking for reasonable memory
isolation, which they currently do not have. The risk of OOM would
be a no-go for them, so the feature wouldn't be useful to them.

I have repeatedly said that I can also see some use for the hard
guarantee, mainly to support overcommit on the limit. I haven't heard
about those use cases yet, but it seems that at least Google would like
to have really hard guarantees.

So I think the best way forward is to have a configurable default and
a per-memcg knob.
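
The two behaviors being debated can be put side by side in a short
sketch. This is an illustrative model only, not code from the patch
series: the `Group` class, the `pick_reclaim_targets` helper and the
`fallback` flag are hypothetical names, usage is in arbitrary units, and
treating "usage below low limit" as the protection test is an assumption.
It follows the cover letter's description: protected groups are skipped,
and when every group is protected the result is either fallback reclaim
from all of them or OOM.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    usage: int      # current charge, arbitrary units
    low_limit: int  # 0 means "no protection", the default in the series


def pick_reclaim_targets(groups, fallback=True):
    """Return the groups reclaim may scan, or raise MemoryError for OOM."""
    # Assumption: a group is protected while its usage is below its low limit.
    eligible = [g for g in groups if g.usage >= g.low_limit]
    if eligible:
        return eligible
    if fallback:
        # Soft behavior: every group is protected, so ignore the limits.
        return list(groups)
    # Hard behavior: protected memory is never reclaimed.
    raise MemoryError("all groups are within their low limit")
```

With a protected "db" group and an unprotected "batch" group, only
"batch" is returned; the `fallback` flag matters only once every group
is below its low limit.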

> and for papering over deficiencies in our reclaim code, because I
> don't want any of that in the changelog, in the documentation, or in
> what we otherwise tell users about it.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 14:32                       ` Michal Hocko
@ 2014-06-05 15:43                         ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-05 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu, Jun 05, 2014 at 04:32:35PM +0200, Michal Hocko wrote:
> On Wed 04-06-14 11:44:08, Johannes Weiner wrote:
> > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > > > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > If we consider that memcg and its limits are not zone aware while the
> > > > > page allocator and reclaim are zone oriented then I can see a problem
> > > > > of unexpected reclaim failure although there is no over commit on the
> > > > > low_limit globally. And we do not have in-kernel effective measures to
> > > > > mitigate this inherent problem. At least not now and I am afraid it is
> > > > > a long route to have something that would work reasonably well in such
> > > > > cases.
> > > > 
> > > > Which "inherent problem"?
> > > 
> > > zone unawareness of the limit vs. allocation/reclaim which are zone
> > > oriented.
> > 
> > This is a quote from another subthread where you haven't responded:
> > 
> > ---
> > 
> > > > > > But who actually cares if an individual zone can be reclaimed?
> > > > > > 
> > > > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > > > is smaller than that memcg's guarantee. 
> > > > > 
> > > > > The protected group might spill over to another group and eat it when
> > > > > another group would be simply pushed out from the node it is bound to.
> > > > 
> > > > I don't really understand the point you're trying to make.
> > > 
> > > I was just trying to show a case where individual zone matters. To make
> > > it more specific consider 2 groups A (with low-limit 60% RAM) and B
> > > (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> > > having 70% of RAM reserved for guarantee makes some sense, right? B is
> > > not over-committing the node it is bound to. Yet the A's allocations
> > > might make pressure on X regardless that the whole system is still doing
> > > good. This can lead to a situation where X gets depleted and nothing
> > > would be reclaimable leading to an OOM condition.
> > 
> > Once you assume control of memory *placement* in the system like this,
> > you can not also pretend to be clueless and have unreclaimable memory
> > of this magnitude spread around into nodes used by other bound tasks.
> 
> You are still assuming that the administrator controls the placement.
> The load running in your memcg might be a black box for admin. E.g. a
> container which pays $$ to get a priority and not get reclaimed if that
> is possible. Admin can make sure that the cumulative low_limits for
> containers are sane but he doesn't have any control over what the loads
> inside are doing and potential OOM when one tries to DOS the other is
> definitely not welcome.

This is completely backwards, though: if you pay for guaranteed
memory, you don't want to get reclaimed just because some other task
that might not even have guarantees starts allocating with a
restricted node mask.  This breaks isolation.

For one, this can be used maliciously by intentionally binding a
low-priority task to a node with guaranteed memory and starting to
allocate.  Even with a small hard limit, you can just plow through
files to push guaranteed cache of the other group out of memory.

But even if it's not malicious, in such a scenario I'd still prefer
OOMing the task with the more restrictive node mask over reclaiming
guaranteed memory.

Then, on top of that, we can use direct migration to mitigate OOMs in
these scenarios (should we sufficiently care about them), but I'd much
prefer OOMs over breaking isolation and the possible priority
inversion that is inherent in the fallback on NUMA setups.

> > If we were to actively support such configurations, we should be doing
> > direct NUMA balancing and migrate these pages out of node X when B
> > needs to allocate. 
> 
> Migration is certainly a way how to reduce the risk. It is a question
> whether this is something to be done by the kernel implicitly or by
> administrator.

As long as the kernel is responsible for *any* placement - i.e. unless
you bind everything - it might as well be the kernel that fixes it up.

> > That would fix the problem for all unevictable
> > memory, not just memcg guarantees, and would prefer node-offloading
> > over swapping in cases where swap is available.
> 
> That would certainly lower the risk. But there still might be unmovable
> memory sitting on the node so this will never be 100%.

Yes, and as per above, I think in most cases it's actually preferable
to kill the bound task (or direct-migrate) over violating guarantees
of another task.

> > > > > So to me it sounds more responsible to promise only as much as we can
> > > > > handle. I think that fallback mode is not crippling the semantic of
> > > > > the knob as it triggers only for limit overcommit or strange corner
> > > > > cases. We have agreed that we do not care about the first one and
> > > > > handling the latter one by a potentially fatal action doesn't sound very
> > > > > user-friendly to me.
> > > > 
> > > > It *absolutely* cripples the semantics.  Think about the security use
> > > > cases of mlock for example, where certain memory may never hit the
> > > > platter.  This wouldn't be possible with your watered down guarantees.
> > > 
> > > Is this really a use case? It sounds like a weak one to me. Because
> > > any sudden memory consumption above the limit can reclaim your
> > > to-protect-page it will hit the platter and you cannot do anything about
> > > this. So yeah, this is not mlock.
> > 
> > You are right, that is a weak usecase.
> > 
> > It doesn't change the fact that it does severely weaken the semantics
> > and turns it into another best-effort mechanism that the user can't
> > count on.  This sucks.  It sucked with soft limits and it will suck
> > again.  The irony is that Greg even pointed out you should be doing
> > soft limits if you want this sort of behavior.
> 
> The question is whether we really _need_ hard guarantees. I came with
> the low_limit as a replacement for soft_limit which really sucks. But it
> sucks not because you cannot count on it. It is the way how it has
> opposite semantic which sucks - and the implementation of course. I have
> tried to fix it and that route was a no-go.

We need hard guarantees for actual isolation.  Otherwise you can't
charge for guarantees.

> I think the hard guarantee makes some sense when we allow to overcommit
> the limit. Somebody might really want to setup lowlimit == hardlimit
> because reclaim would be more harmful than restart of the application.
> I would however expect that this would be more of an exception rather
> than regular use. Most users I can think of will set low_limit to an
> effective working set size to be isolated from other loads and ephemeral
> reclaim will not hurt them. OOM, on the other hand, would be really
> harmful.

It's not an either-or because OOM would happen to one group, and
guaranteed memory reclaim would happen to another.
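
The node-bound scenario quoted above (A with a 60% low limit, B with a
10% low limit and bound to node X, which holds 25% of RAM) can be worked
through numerically. The sketch below is a toy model, not kernel code:
`allocate_for_bound_task` and its parameters are hypothetical, sizes are
percentages of RAM, and it assumes A's pages have spilled onto node X
while both groups are still within their low limits.

```python
def allocate_for_bound_task(node_size, resident, protected, need, fallback):
    """Try to place `need` units on one node for a task bound to it.

    resident:  dict of group name -> units currently on the node
    protected: set of group names that are within their low limit
    Returns (outcome, reclaimed), where reclaimed maps group -> units taken.
    """
    free = node_size - sum(resident.values())
    if free >= need:
        return "ok", {}
    shortfall = need - free
    reclaimed = {}
    # First pass: reclaim only from groups that are not protected.
    for group, units in resident.items():
        if group in protected or shortfall == 0:
            continue
        take = min(units, shortfall)
        reclaimed[group] = take
        shortfall -= take
    if shortfall == 0:
        return "ok", reclaimed
    if not fallback:
        # Hard guarantee: the bound task fails instead.
        return "oom", reclaimed
    # Fallback: protected memory becomes reclaimable as a last resort.
    for group, units in resident.items():
        if group not in protected or shortfall == 0:
            continue
        take = min(units, shortfall)
        reclaimed[group] = reclaimed.get(group, 0) + take
        shortfall -= take
    return ("ok" if shortfall == 0 else "oom"), reclaimed
```

With node X full of protected pages (20 units from A, 5 from B) and B
asking for 3 more, fallback satisfies B by taking 3 units of A's
protected memory, while the hard variant leaves A untouched and B is the
one that fails.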

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 15:43                         ` Johannes Weiner
@ 2014-06-05 16:09                           ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-05 16:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu 05-06-14 11:43:28, Johannes Weiner wrote:
> On Thu, Jun 05, 2014 at 04:32:35PM +0200, Michal Hocko wrote:
> > On Wed 04-06-14 11:44:08, Johannes Weiner wrote:
> > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > > > > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > If we consider that memcg and its limits are not zone aware while the
> > > > > > page allocator and reclaim are zone oriented then I can see a problem
> > > > > > of unexpected reclaim failure although there is no over commit on the
> > > > > > low_limit globally. And we do not have in-kernel effective measures to
> > > > > > mitigate this inherent problem. At least not now and I am afraid it is
> > > > > > a long route to have something that would work reasonably well in such
> > > > > > cases.
> > > > > 
> > > > > Which "inherent problem"?
> > > > 
> > > > zone unawareness of the limit vs. allocation/reclaim which are zone
> > > > oriented.
> > > 
> > > This is a quote from another subthread where you haven't responded:
> > > 
> > > ---
> > > 
> > > > > > > But who actually cares if an individual zone can be reclaimed?
> > > > > > > 
> > > > > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > > > > is smaller than that memcg's guarantee. 
> > > > > > 
> > > > > > The protected group might spill over to another group and eat it when
> > > > > > another group would be simply pushed out from the node it is bound to.
> > > > > 
> > > > > I don't really understand the point you're trying to make.
> > > > 
> > > > I was just trying to show a case where individual zone matters. To make
> > > > it more specific consider 2 groups A (with low-limit 60% RAM) and B
> > > > (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> > > > having 70% of RAM reserved for guarantee makes some sense, right? B is
> > > > not over-committing the node it is bound to. Yet the A's allocations
> > > > might make pressure on X regardless that the whole system is still doing
> > > > good. This can lead to a situation where X gets depleted and nothing
> > > > would be reclaimable leading to an OOM condition.
> > > 
> > > Once you assume control of memory *placement* in the system like this,
> > > you can not also pretend to be clueless and have unreclaimable memory
> > > of this magnitude spread around into nodes used by other bound tasks.
> > 
> > You are still assuming that the administrator controls the placement.
> > The load running in your memcg might be a black box for admin. E.g. a
> > container which pays $$ to get a priority and not get reclaimed if that
> > is possible. Admin can make sure that the cumulative low_limits for
> > containers are sane but he doesn't have any control over what the loads
> > inside are doing and potential OOM when one tries to DOS the other is
> > definitely not welcome.
> 
> This is completely backwards, though: if you pay for guaranteed

I didn't say anything about a guarantee, though. You do not even need
anything as strong as a guarantee. You are paying for prioritization.

> memory, you don't want to get reclaimed just because some other task
> that might not even have guarantees starts allocating with a
> restricted node mask.  This breaks isolation.

If the other task doesn't have any limit set then its pages would get
reclaimed. This wouldn't be a situation where everybody is within the
low limit.

> For one, this can be used maliciously by intentionally binding a
> low-priority task to a node with guaranteed memory and starting to
> allocate.  Even with a small hard limit, you can just plow through
> files to push guaranteed cache of the other group out of memory.
>
> But even if it's not malicious, in such a scenario I'd still prefer
> OOMing the task with the more restrictive node mask over reclaiming
> guaranteed memory.

Why?

> Then, on top of that, we can use direct migration to mitigate OOMs in
> these scenarios (should we sufficiently care about them), but I'd much
> prefer OOMs over breaking isolation and the possible priority
> inversion that is inherent in the fallback on NUMA setups.

Could you be more specific about what you mean by priority inversion?

> > > If we were to actively support such configurations, we should be doing
> > > direct NUMA balancing and migrate these pages out of node X when B
> > > needs to allocate. 
> > 
> > Migration is certainly a way how to reduce the risk. It is a question
> > whether this is something to be done by the kernel implicitly or by
> > administrator.
> 
> As long as the kernel is responsible for *any* placement - i.e. unless
> you bind everything - it might as well be the kernel that fixes it up.
> 
> > > That would fix the problem for all unevictable
> > > memory, not just memcg guarantees, and would prefer node-offloading
> > > over swapping in cases where swap is available.
> > 
> > That would certainly lower the risk. But there still might be unmovable
> > memory sitting on the node so this will never be 100%.
> 
> Yes, and as per above, I think in most cases it's actually preferable
> to kill the bound task (or direct-migrate) over violating guarantees
> of another task.
> 
> > > > > > So to me it sounds more responsible to promise only as much as we can
> > > > > > handle. I think that fallback mode is not crippling the semantic of
> > > > > > the knob as it triggers only for limit overcommit or strange corner
> > > > > > cases. We have agreed that we do not care about the first one and
> > > > > > handling the latter one by a potentially fatal action doesn't sound very
> > > > > > user-friendly to me.
> > > > > 
> > > > > It *absolutely* cripples the semantics.  Think about the security use
> > > > > cases of mlock for example, where certain memory may never hit the
> > > > > platter.  This wouldn't be possible with your watered down guarantees.
> > > > 
> > > > Is this really a use case? It sounds like a weak one to me. Because
> > > > any sudden memory consumption above the limit can reclaim your
> > > > to-protect-page it will hit the platter and you cannot do anything about
> > > > this. So yeah, this is not mlock.
> > > 
> > > You are right, that is a weak usecase.
> > > 
> > > It doesn't change the fact that it does severely weaken the semantics
> > > and turns it into another best-effort mechanism that the user can't
> > > count on.  This sucks.  It sucked with soft limits and it will suck
> > > again.  The irony is that Greg even pointed out you should be doing
> > > soft limits if you want this sort of behavior.
> > 
> > The question is whether we really _need_ hard guarantees. I came with
> > the low_limit as a replacement for soft_limit which really sucks. But it
> > sucks not because you cannot count on it. It is the way how it has
> > opposite semantic which sucks - and the implementation of course. I have
> > tried to fix it and that route was a no-go.
> 
> We need hard guarantees for actual isolation.  Otherwise you can't
> charge for guarantees.

You can still charge for prioritization, which on its own is a valid use
case. You seem to be bound to the hard guarantee and overlook that there
is a class of use cases which do not need such behavior.

Please note that setting up a hard guarantee is a really non-trivial task,
especially if any downtime of the service which you want to protect is a
big deal. I wouldn't be surprised if the risk was big enough that using
the limit would be a no-go even though there would be a possibility of
performance improvement.

> > I think the hard guarantee makes some sense when we allow to overcommit
> > the limit. Somebody might really want to setup lowlimit == hardlimit
> > because reclaim would be more harmful than restart of the application.
> > I would however expect that this would be more of an exception rather
> > than regular use. Most users I can think of will set low_limit to an
> > effective working set size to be isolated from other loads and ephemeral
> > reclaim will not hurt them. OOM, on the other hand, would be really
> > harmful.
> 
> It's not an either-or because OOM would happen to one group, and
> guaranteed memory reclaim would happen to another.

I do not follow.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
@ 2014-06-05 16:09                           ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-05 16:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu 05-06-14 11:43:28, Johannes Weiner wrote:
> On Thu, Jun 05, 2014 at 04:32:35PM +0200, Michal Hocko wrote:
> > On Wed 04-06-14 11:44:08, Johannes Weiner wrote:
> > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > > > > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > If we consider that memcg and its limits are not zone aware while the
> > > > > > page allocator and reclaim are zone oriented then I can see a problem
> > > > > > of unexpected reclaim failure although there is no over commit on the
> > > > > > low_limit globally. And we do not have in-kernel effective measures to
> > > > > > mitigate this inherent problem. At least not now and I am afraid it is
> > > > > > a long route to have something that would work reasonably well in such
> > > > > > cases.
> > > > > 
> > > > > Which "inherent problem"?
> > > > 
> > > > zone unawareness of the limit vs. allocation/reclaim which are zone
> > > > oriented.
> > > 
> > > This is a quote from another subthread where you haven't responded:
> > > 
> > > ---
> > > 
> > > > > > > But who actually cares if an individual zone can be reclaimed?
> > > > > > > 
> > > > > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > > > > is smaller than that memcg's guarantee. 
> > > > > > 
> > > > > > The protected group might spill over to another group and eat it when
> > > > > > another group would be simply pushed out from the node it is bound to.
> > > > > 
> > > > > I don't really understand the point you're trying to make.
> > > > 
> > > > I was just trying to show a case where individual zone matters. To make
> > > > it more specific consider 2 groups A (with low-limit 60% RAM) and B
> > > > (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> > > > having 70% of RAM reserved for guarantee makes some sense, right? B is
> > > > not over-committing the node it is bound to. Yet the A's allocations
> > > > might make pressure on X regardless that the whole system is still doing
> > > > good. This can lead to a situation where X gets depleted and nothing
> > > > would be reclaimable leading to an OOM condition.
> > > 
> > > Once you assume control of memory *placement* in the system like this,
> > > you can not also pretend to be clueless and have unreclaimable memory
> > > of this magnitude spread around into nodes used by other bound tasks.
> > 
> > You are still assuming that the administrator controls the placement.
> > The load running in your memcg might be a black box for admin. E.g. a
> > container which pays $$ to get a priority and not get reclaimed if that
> > is possible. Admin can make sure that the cumulative low_limits for
> > containers are sane but he doesn't have any control over what the loads
> > inside are doing and potential OOM when one tries to DOS the other is
> > definitely not welcome.
> 
> This is completely backwards, though: if you pay for guaranteed

I didn't say anything about a guarantee, though. You do not even need
anything as strong as a guarantee. You are paying for prioritization.

> memory, you don't want to get reclaimed just because some other task
> that might not even have guarantees starts allocating with a
> restricted node mask.  This breaks isolation.

If the other task doesn't have any limit set then its pages would get
reclaimed. This wouldn't be an everybody-within-low-limit situation.

> For one, this can be used maliciously by intentionally binding a
> low-priority task to a node with guaranteed memory and starting to
> allocate.  Even with a small hard limit, you can just plow through
> files to push guaranteed cache of the other group out of memory.
>
> But even if it's not malicious, in such a scenario I'd still prefer
> OOMing the task with the more restrictive node mask over reclaiming
> guaranteed memory.

Why?

> Then, on top of that, we can use direct migration to mitigate OOMs in
> these scenarios (should we sufficiently care about them), but I'd much
> prefer OOMs over breaking isolation and the possible priority
> inversion that is inherent in the fallback on NUMA setups.

Could you be more specific about what you mean by priority inversion?

> > > If we were to actively support such configurations, we should be doing
> > > direct NUMA balancing and migrate these pages out of node X when B
> > > needs to allocate. 
> > 
> > Migration is certainly a way how to reduce the risk. It is a question
> > whether this is something to be done by the kernel implicitly or by
> > administrator.
> 
> As long as the kernel is responsible for *any* placement - i.e. unless
> you bind everything - it might as well be the kernel that fixes it up.
> 
> > > That would fix the problem for all unevictable
> > > memory, not just memcg guarantees, and would prefer node-offloading
> > > over swapping in cases where swap is available.
> > 
> > That would certainly lower the risk. But there still might be unmovable
> > memory sitting on the node so this will never be 100%.
> 
> Yes, and as per above, I think in most cases it's actually preferable
> to kill the bound task (or direct-migrate) over violating guarantees
> of another task.
> 
> > > > > > So to me it sounds more responsible to promise only as much as we can
> > > > > > handle. I think that fallback mode is not crippling the semantic of
> > > > > > the knob as it triggers only for limit overcommit or strange corner
> > > > > > cases. We have agreed that we do not care about the first one and
> > > > > > handling the later one by potentially fatal action doesn't sounds very
> > > > > > user friendly to me.
> > > > > 
> > > > > It *absolutely* cripples the semantics.  Think about the security use
> > > > > cases of mlock for example, where certain memory may never hit the
> > > > > platter.  This wouldn't be possible with your watered down guarantees.
> > > > 
> > > > Is this really a use case? It sounds like a weak one to me. Because
> > > > any sudden memory consumption above the limit can reclaim your
> > > > to-protect-page it will hit the platter and you cannot do anything about
> > > > this. So yeah, this is not mlock.
> > > 
> > > You are right, that is a weak usecase.
> > > 
> > > It doesn't change the fact that it does severely weaken the semantics
> > > and turns it into another best-effort mechanism that the user can't
> > > count on.  This sucks.  It sucked with soft limits and it will suck
> > > again.  The irony is that Greg even pointed out you should be doing
> > > soft limits if you want this sort of behavior.
> > 
> > The question is whether we really _need_ hard guarantees. I came with
> > the low_limit as a replacement for soft_limit which really sucks. But it
> > sucks not because you cannot count on it. It is the way how it has
> > opposite semantic which sucks - and the implementation of course. I have
> > tried to fix it and that route was a no-go.
> 
> We need hard guarantees for actual isolation.  Otherwise you can't
> charge for guarantees.

You can still charge for prioritization, which on its own is a valid use
case. You seem to be bound to hard guarantees and overlook that there is
a class of use cases which do not need such behavior.

Please note that setting up a hard guarantee is a really non-trivial task,
especially if any downtime of the service which you want to protect is a
big deal. I wouldn't be surprised if the risk was big enough that using
the limit would be a no-go even though there would be a possibility of
performance improvement.

> > I think the hard guarantee makes some sense when we allow to overcommit
> > the limit. Somebody might really want to setup lowlimit == hardlimit
> > because reclaim would be more harmful than restart of the application.
> > I would however expect that this would be more of an exception rather
> > than regular use. Most users I can think of will set low_limit to an
> > effective working set size to be isolated from other loads and ephemeral
> > reclaim will not hurt them. OOM would on other hand would be really
> > harmful.
> 
> It's not an either-or because OOM would happen to one group, and
> guaranteed memory reclaim would happen to another.

I do not follow.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 14:51                           ` Michal Hocko
@ 2014-06-05 16:10                             ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-05 16:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu, Jun 05, 2014 at 04:51:09PM +0200, Michal Hocko wrote:
> On Wed 04-06-14 17:45:53, Johannes Weiner wrote:
> > On Wed, Jun 04, 2014 at 12:18:59PM -0700, Hugh Dickins wrote:
> > > On Wed, 4 Jun 2014, Johannes Weiner wrote:
> > > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > > 
> > > > > In the other email I have suggested to add a knob with the configurable
> > > > > default. Would you be OK with that?
> > > > 
> > > > No, I want to agree on whether we need that fallback code or not.  I'm
> > > > not interested in merging code that you can't convince anybody else is
> > > > needed.
> > > 
> > > I for one would welcome such a knob as Michal is proposing.
> > 
> > Now we have a tie :-)
> > 
> > > I thought it was long ago agreed that the low limit was going to fallback
> > > when it couldn't be satisfied.  But you seem implacably opposed to that
> > > as default, and I can well believe that Google is so accustomed to OOMing
> > > that it is more comfortable with OOMing as the default.  Okay.  But I
> > > would expect there to be many who want the attempt towards isolation that
> > > low limit offers, without a collapse to OOM at the first misjudgement.
> > 
> > At the same time, I only see users like Google pushing the limits of
> > the machine to a point where guarantees cover north of 90% of memory.
> 
> I can think of in-memory database loads which would use the reclaim
> protection which is quite high as well (say 80% of available memory).
> Those would definitely like to see ephemeral reclaim rather than OOM.

The OOM wouldn't apply to the database workload, but to other stuff in
the system.

> > I would expect more casual users to work with much smaller guarantees,
> > and a good chunk of slack on top - otherwise they already had better
> > be set up for the occasional OOM.  Is this an unreasonable assumption
> > to make?
> > 
> > I'm not opposed to this feature per se, but I'm really opposed to
> > merging it for the partial hard bindings argument
> 
> This was just an example that even setup which is not overcomiting the
> limit might be caught in an unreclaimable position. Sure we can mitigate
> those issues to some point and that would be surely welcome.
> 
> The more important part, however, is that not all usecases really
> _require_ hard guarantee. 

It's not about whether hard guarantees are necessary, it's about
getting away without additional fallback semantics.  The default
position is still always simpler semantics, so we are looking for
reasons for the fallback here, not the other way around.

> They are asking for a reasonable memory isolation which they
> currently do not have. Having a risk of OOM would be a no-go for
> them so the feature wouldn't be useful for them.

Let's not go back to Handwaving Land on this, please.  What does
"reasonable memory isolation" mean?

It really boils down to the interaction with other workloads: do we
want other workloads to reclaim our guaranteed memory or OOM?  If you
prefer reclaiming workingset memory over OOM, why can't you set the
low limit more conservatively in the first place?

> I have repeatedly said that I can see also some use for the hard
> guarantee. Mainly to support overcommit on the limit. I didn't hear
> about those usecases yet but it seems that at least Google would like to
> have really hard guarantees.

Everybody who wants to charge for guaranteed memory can not afford to
have other workloads break isolation at will.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 16:10                             ` Johannes Weiner
@ 2014-06-05 16:43                               ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-05 16:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu 05-06-14 12:10:35, Johannes Weiner wrote:
> On Thu, Jun 05, 2014 at 04:51:09PM +0200, Michal Hocko wrote:
> > On Wed 04-06-14 17:45:53, Johannes Weiner wrote:
> > > On Wed, Jun 04, 2014 at 12:18:59PM -0700, Hugh Dickins wrote:
> > > > On Wed, 4 Jun 2014, Johannes Weiner wrote:
> > > > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > > > 
> > > > > > In the other email I have suggested to add a knob with the configurable
> > > > > > default. Would you be OK with that?
> > > > > 
> > > > > No, I want to agree on whether we need that fallback code or not.  I'm
> > > > > not interested in merging code that you can't convince anybody else is
> > > > > needed.
> > > > 
> > > > I for one would welcome such a knob as Michal is proposing.
> > > 
> > > Now we have a tie :-)
> > > 
> > > > I thought it was long ago agreed that the low limit was going to fallback
> > > > when it couldn't be satisfied.  But you seem implacably opposed to that
> > > > as default, and I can well believe that Google is so accustomed to OOMing
> > > > that it is more comfortable with OOMing as the default.  Okay.  But I
> > > > would expect there to be many who want the attempt towards isolation that
> > > > low limit offers, without a collapse to OOM at the first misjudgement.
> > > 
> > > At the same time, I only see users like Google pushing the limits of
> > > the machine to a point where guarantees cover north of 90% of memory.
> > 
> > I can think of in-memory database loads which would use the reclaim
> > protection which is quite high as well (say 80% of available memory).
> > Those would definitely like to see ephemeral reclaim rather than OOM.
> 
> The OOM wouldn't apply to the database workload, but to other stuff in
> the system.

Are we talking about the same thing? If nothing is reclaimable because
everybody is within limit then the database workload will be, of course,
a candidate for OOM. And quite a hot one as the memory consumption will
be dominating.

> > > I would expect more casual users to work with much smaller guarantees,
> > > and a good chunk of slack on top - otherwise they already had better
> > > be set up for the occasional OOM.  Is this an unreasonable assumption
> > > to make?
> > > 
> > > I'm not opposed to this feature per se, but I'm really opposed to
> > > merging it for the partial hard bindings argument
> > 
> > This was just an example that even setup which is not overcomiting the
> > limit might be caught in an unreclaimable position. Sure we can mitigate
> > those issues to some point and that would be surely welcome.
> > 
> > The more important part, however, is that not all usecases really
> > _require_ hard guarantee. 
> 
> It's not about whether hard guarantees are necessary, it's about
> getting away without additional fallback semantics.  The default
> position is still always simpler semantics, so we are looking for
> reasons for the fallback here, not the other way around.

This doesn't make much sense to me. So you are pushing for something
that is not even necessary. I have already mentioned that I am aware of
use cases which would prefer ephemeral reclaim rather than OOM and that
is a pretty darn good reason to have a fallback mode.

> > They are asking for a reasonable memory isolation which they
> > currently do not have. Having a risk of OOM would be a no-go for
> > them so the feature wouldn't be useful for them.
> 
> Let's not go back to Handwaving Land on this, please.  What does
> "reasonable memory isolation" mean?
 
To not reclaim unless everybody is within limit. This doesn't normally
happen when the limit is not overcommitted, but it still cannot be ruled
out for different reasons (non-user allocations, NUMA setups and who knows
what else).
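
A minimal sketch of that rule, assuming a flat set of groups under the
reclaimed hierarchy. It is an illustration only and not code from the
patch series: struct group, below_low_limit() and reclaim_eligible() are
hypothetical names, and the strict "usage < low_limit" comparison is an
assumption chosen so that the default limit of 0 protects nothing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; both values are in pages. */
struct group {
	unsigned long usage;
	unsigned long low_limit;
};

/* Assumption: a group is protected only while strictly below its low limit. */
static bool below_low_limit(const struct group *g)
{
	return g->usage < g->low_limit;
}

/*
 * Group i is eligible for reclaim if it is not protected, or, as the
 * fallback, if every group in the reclaimed hierarchy is protected
 * (in which case all of them become eligible).
 */
static bool reclaim_eligible(const struct group *groups, size_t n, size_t i)
{
	size_t j;

	if (!below_low_limit(&groups[i]))
		return true;

	for (j = 0; j < n; j++)
		if (!below_low_limit(&groups[j]))
			return false;	/* somebody else can be reclaimed first */

	return true;			/* everybody is protected: fall back */
}
```

With one group above its low limit and one below it, only the former is
eligible; once both are below, both become eligible again.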

> It really boils down to the interaction with other workloads: do we
> want other workloads to reclaim our guaranteed memory or OOM?  If you
> prefer reclaiming workingset memory over OOM, why can't you set the
> low limit more conservatively in the first place?

When do I do that? After OOM killed my load? That is too late, I am
afraid.

> > I have repeatedly said that I can see also some use for the hard
> > guarantee. Mainly to support overcommit on the limit. I didn't hear
> > about those usecases yet but it seems that at least Google would like to
> > have really hard guarantees.
> 
> Everybody who wants to charge for guaranteed memory can not afford to
> have other workloads break isolation at will.

I am getting tired of this discussion to be honest. You seem to be
locked onto the guarantee, ignoring that there are use cases which really
do not have such requirements.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 16:09                           ` Michal Hocko
@ 2014-06-05 16:46                             ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-05 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu, Jun 05, 2014 at 06:09:04PM +0200, Michal Hocko wrote:
> On Thu 05-06-14 11:43:28, Johannes Weiner wrote:
> > On Thu, Jun 05, 2014 at 04:32:35PM +0200, Michal Hocko wrote:
> > > On Wed 04-06-14 11:44:08, Johannes Weiner wrote:
> > > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > > On Tue 03-06-14 10:22:49, Johannes Weiner wrote:
> > > > > > On Tue, Jun 03, 2014 at 01:07:43PM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > If we consider that memcg and its limits are not zone aware while the
> > > > > > > page allocator and reclaim are zone oriented then I can see a problem
> > > > > > > of unexpected reclaim failure although there is no over commit on the
> > > > > > > low_limit globally. And we do not have in-kernel effective measures to
> > > > > > > mitigate this inherent problem. At least not now and I am afraid it is
> > > > > > > a long route to have something that would work reasonably well in such
> > > > > > > cases.
> > > > > > 
> > > > > > Which "inherent problem"?
> > > > > 
> > > > > zone unawareness of the limit vs. allocation/reclaim which are zone
> > > > > oriented.
> > > > 
> > > > This is a quote from another subthread where you haven't responded:
> > > > 
> > > > ---
> > > > 
> > > > > > > > But who actually cares if an individual zone can be reclaimed?
> > > > > > > > 
> > > > > > > > Userspace allocations can fall back to any other zone.  Unless there
> > > > > > > > are hard bindings, but hopefully nobody binds a memcg to a node that
> > > > > > > > is smaller than that memcg's guarantee. 
> > > > > > > 
> > > > > > > The protected group might spill over to another group and eat it when
> > > > > > > another group would be simply pushed out from the node it is bound to.
> > > > > > 
> > > > > > I don't really understand the point you're trying to make.
> > > > > 
> > > > > I was just trying to show a case where individual zone matters. To make
> > > > > it more specific consider 2 groups A (with low-limit 60% RAM) and B
> > > > > (say with low-limit 10% RAM) and bound to a node X (25% of RAM). Now
> > > > > having 70% of RAM reserved for guarantee makes some sense, right? B is
> > > > > not over-committing the node it is bound to. Yet the A's allocations
> > > > > might make pressure on X regardless that the whole system is still doing
> > > > > good. This can lead to a situation where X gets depleted and nothing
> > > > > would be reclaimable leading to an OOM condition.
> > > > 
> > > > Once you assume control of memory *placement* in the system like this,
> > > > you can not also pretend to be clueless and have unreclaimable memory
> > > > of this magnitude spread around into nodes used by other bound tasks.
> > > 
> > > You are still assuming that the administrator controls the placement.
> > > The load running in your memcg might be a black box for admin. E.g. a
> > > container which pays $$ to get a priority and not get reclaimed if that
> > > is possible. Admin can make sure that the cumulative low_limits for
> > > containers are sane but he doesn't have any control over what the loads
> > > inside are doing and potential OOM when one tries to DOS the other is
> > > definitely not welcome.
> > 
> > This is completely backwards, though: if you pay for guaranteed
> 
> I didn't say anything about guarantee, though. You even do not need
> anything as strong as guarantee. You are paying for prioritization.
>
> > memory, you don't want to get reclaimed just because some other task
> > that might not even have guarantees starts allocating with a
> > restricted node mask.  This breaks isolation.
> 
> If the other task doesn't have any limit set then its pages would get
> reclaimed. This wouldn't be everybody within low limit situation.

Ah, I messed up the consequences of the "all within low limit" clause.
The fallback implications on NUMA boggle my mind.

You can still break isolation when both have the same prioritization:
bind to a node that contains primarily guaranteed memory of another
group, then allocate a chunk that is within your limit to force equal
reclaim of all memory on that node, including the guaranteed memory.

As that other group refaults its pages on another node, you can just
follow it and do it again.
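
The scenario above can be made concrete with a small per-node model.
This is only a sketch under stated assumptions: node_pressure_outcome(),
the outcome values and the per-node page counts are hypothetical and do
not come from the patches, and unmovable or otherwise unreclaimable
memory is ignored. It shows that when every group with pages on the node
is below its low limit, the fallback reclaims from all of them, including
the group whose memory was meant to be protected, while a hard guarantee
would end in OOM instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group state; all sizes are in pages. */
struct group {
	unsigned long usage;		/* total charged pages */
	unsigned long low_limit;	/* protection threshold */
	unsigned long pages_on_node;	/* pages on the node under pressure */
};

enum outcome {
	OUTCOME_RECLAIM_UNPROTECTED,	/* some group on the node is not protected */
	OUTCOME_RECLAIM_EVERYBODY,	/* fallback: protected memory is reclaimed too */
	OUTCOME_OOM,			/* hard guarantee: nothing may be reclaimed */
};

/*
 * Decide what pressure on a single node leads to. Only groups that
 * actually have pages on that node can contribute reclaimable memory.
 */
static enum outcome node_pressure_outcome(const struct group *groups,
					  size_t n, bool fallback)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (groups[i].pages_on_node == 0)
			continue;	/* nothing of this group to reclaim here */
		if (groups[i].usage >= groups[i].low_limit)
			return OUTCOME_RECLAIM_UNPROTECTED;
	}

	return fallback ? OUTCOME_RECLAIM_EVERYBODY : OUTCOME_OOM;
}
```

For a node shared by a large protected group and a small bound group that
stays within its own limit, the fallback variant reports that everybody is
reclaimed, which is the isolation breakage described above.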

> > For one, this can be used maliciously by intentionally binding a
> > low-priority task to a node with guaranteed memory and starting to
> > allocate.  Even with a small hard limit, you can just plow through
> > files to push guaranteed cache of the other group out of memory.
> >
> > But even if it's not malicious, in such a scenario I'd still prefer
> > OOMing the task with the more restrictive node mask over reclaiming
> > guaranteed memory.
> 
> Why?

I think the locality preference of one group should not harm the
workingset size requirements that another group has to avoid IO.

> > Then, on top of that, we can use direct migration to mitigate OOMs in
> > these scenarios (should we sufficiently care about them), but I'd much
> > prefer OOMs over breaking isolation and the possible priority
> > inversion that is inherent in the fallback on NUMA setups.
> 
> Could you be more specific about what you mean by priority inversion?

As per above, it's not an inversion because of the "all within
guarantee" clause, but I think you can still DOS peers.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 16:43                               ` Michal Hocko
@ 2014-06-05 18:23                                 ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-05 18:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu, Jun 05, 2014 at 06:43:55PM +0200, Michal Hocko wrote:
> On Thu 05-06-14 12:10:35, Johannes Weiner wrote:
> > On Thu, Jun 05, 2014 at 04:51:09PM +0200, Michal Hocko wrote:
> > > On Wed 04-06-14 17:45:53, Johannes Weiner wrote:
> > > > On Wed, Jun 04, 2014 at 12:18:59PM -0700, Hugh Dickins wrote:
> > > > > On Wed, 4 Jun 2014, Johannes Weiner wrote:
> > > > > > On Wed, Jun 04, 2014 at 04:46:58PM +0200, Michal Hocko wrote:
> > > > > > > 
> > > > > > > In the other email I have suggested to add a knob with the configurable
> > > > > > > default. Would you be OK with that?
> > > > > > 
> > > > > > No, I want to agree on whether we need that fallback code or not.  I'm
> > > > > > not interested in merging code that you can't convince anybody else is
> > > > > > needed.
> > > > > 
> > > > > I for one would welcome such a knob as Michal is proposing.
> > > > 
> > > > Now we have a tie :-)
> > > > 
> > > > > I thought it was long ago agreed that the low limit was going to fallback
> > > > > when it couldn't be satisfied.  But you seem implacably opposed to that
> > > > > as default, and I can well believe that Google is so accustomed to OOMing
> > > > > that it is more comfortable with OOMing as the default.  Okay.  But I
> > > > > would expect there to be many who want the attempt towards isolation that
> > > > > low limit offers, without a collapse to OOM at the first misjudgement.
> > > > 
> > > > At the same time, I only see users like Google pushing the limits of
> > > > the machine to a point where guarantees cover north of 90% of memory.
> > > 
> > > I can think of in-memory database loads which would use the reclaim
> > > protection which is quite high as well (say 80% of available memory).
> > > Those would definitely like to see ephemeral reclaim rather than OOM.
> > 
> > The OOM wouldn't apply to the database workload, but to other stuff in
> > the system.
> 
> Are we talking about the same thing? If nothing is reclaimable because
> everybody is within limit then the database workload will be, of course,
> a candidate for OOM. And quite a hot one as the memory consumption will
> be dominating.

Yes, you're right.

> > > > I would expect more casual users to work with much smaller guarantees,
> > > > and a good chunk of slack on top - otherwise they already had better
> > > > be set up for the occasional OOM.  Is this an unreasonable assumption
> > > > to make?
> > > > 
> > > > I'm not opposed to this feature per se, but I'm really opposed to
> > > > merging it for the partial hard bindings argument
> > > 
> > > This was just an example that even a setup which is not overcommitting
> > > the limit might be caught in an unreclaimable position. Sure we can
> > > mitigate those issues to some point and that would be surely welcome.
> > > 
> > > The more important part, however, is that not all usecases really
> > > _require_ hard guarantee. 
> > 
> > It's not about whether hard guarantees are necessary, it's about
> > getting away without additional fallback semantics.  The default
> > position is still always simpler semantics, so we are looking for
> > reasons for the fallback here, not the other way around.
> 
> This doesn't make much sense to me. So you are pushing for something
> that is even not necessary. I have already mentioned that I am aware of
> usecases which would prefer ephemeral reclaim rather than OOM and that
> is pretty darn good reason to have a fallback mode.

I think it's quite clear that there is merit for both behaviors, but
because there are fewer "traps" in the hard guarantee semantics their
usefulness is much easier to assess.

So can we please explore the situations wherein fallbacks would happen
so that we can judge the applicability of both behaviors and pick a
reasonable default?

> > > They are asking for a reasonable memory isolation which they
> > > currently do not have. Having a risk of OOM would be a no-go for
> > > them so the feature wouldn't be useful for them.
> > 
> > Let's not go back to Handwaving Land on this, please.  What does
> > "reasonable memory isolation" mean?
>  
> To not reclaim unless everybody is within limit. This doesn't happen
> when the limit is not overcommitted normally but still cannot be ruled out
> for different reasons (non-user allocations, NUMA setups and who knows
> what else)

I'm desperately trying to gather a list of these corner cases to get a
feeling for when that "best effort" falls apart, because that is part
of the interface and something that we have to know for future kernel
development, and the user has to know in order to pick a behavior.

Your documentation doesn't mention any of those corner cases and
implies that you're basically guaranteed this memory unless you
overcommit the limit.

> > > I have repeatedly said that I can see also some use for the hard
> > > guarantee. Mainly to support overcommit on the limit. I didn't hear
> > > about those usecases yet but it seems that at least Google would like to
> > > have really hard guarantees.
> > 
> > Everybody who wants to charge for guaranteed memory can not afford to
> > have other workloads break isolation at will.
> 
> I am getting tired of this discussion to be honest. You seem to be
> locked up to guarantee ignoring that there are usecases which really do
> not have such requirements.

I already wrote Hugh that I'm not against the fallback per se, but I
really want to map out the usecases and match them up with the hard
and best-effort low limit, so that we know how to document this and
what the default behavior should be.

It's not the fallback that's bothering me, it's your unwillingness to
explore and document the vagueness that is inherent in the semantics.
I really don't want to merge more underdefined best-effort features
that will be useless to users and a hindrance in future development.

"I'm aware of usecases that would prefer reclaim over OOM" is just not
cutting it when it comes to designing an interface that we're going to
be stuck with indefinitely.  We need a concrete understanding of how
configurations would behave under common situations in the real world.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-04 19:18                       ` Hugh Dickins
@ 2014-06-05 19:36                         ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-05 19:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Michal Hocko, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Roman Gushchin,
	LKML, linux-mm, Rik van Riel

Hello, guys.

On Wed, Jun 04, 2014 at 12:18:59PM -0700, Hugh Dickins wrote:
> that it is more comfortable with OOMing as the default.  Okay.  But I
> would expect there to be many who want the attempt towards isolation that
> low limit offers, without a collapse to OOM at the first misjudgement.

Ignoring all the mm details about NUMA, zones, overcommit or whatnot,
solely focusing on the core feature provided by the interface, which
BTW is what should guide the interface design anyway, I'm curious
about how this would end up.

The first thing which strikes me is that it's not symmetric with the
limits.  We have two of them - soft and hard limits.  A hard limit
provides the absolute boundary which can't be exceeded while a soft
limit, in principle, gives a guideline on memory usage of the group to
the kernel so that its behavior can be better suited to the purpose of
the system.

We still seem to have a number of issues with the semantics of
softlimit but its principal role makes sense to me.  It allows
userland to provide additional information or hints on how resource
allocation should be managed across different groups and as such can
be used for rule-of-thumb, this-will-prolly-work kind of easy /
automatic configuration, which is a very useful thing.  Can be a lot
more widely useful than hard limits in the long term.

It seems that the main suggested use case for soft guarantee is
"reasonable" isolation, which seems symmetric to the role of soft
limit; however, if I think about it more, I'm not sure whether such
usage would make sense.

The thing with softlimit is that it doesn't affect resource allocation
until the limit is hit.  The system keeps memory distribution balanced
and useful through reclaiming less useful stuff and that part isn't
adversely affected by softlimit.  This means that softlimit
configuration is relaxed in both directions.  It doesn't have to be
completely accurate.  It can be good enough or "reasonable" and still
useful without affecting the whole system adversely.

However, soft guarantee isn't like that.  Because it disables reclaim
completely until the guaranteed amount is reached, which many workloads
wouldn't have trouble filling up over time, the configuration isn't
relaxed downwards.  Any configured value adversely affects the overall
efficiency of the system, which in turn breaks the "reasonable"
isolation use case because such usages require the configuration to be
not punishing by default.
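
To make the asymmetry concrete, here is a toy model (not kernel code;
the function names and the numbers are made up for illustration) of how
much of a group's memory stays available to reclaim as its usage grows,
under a soft limit versus under a guarantee:

```python
def reclaimable_with_soft_limit(usage, soft_limit):
    # Assumption: a soft limit is only a hint for prioritizing reclaim,
    # so it never takes memory out of the reclaimable pool.
    return usage


def reclaimable_with_guarantee(usage, guarantee):
    # Assumption: memory below the guarantee is skipped by reclaim,
    # so only the part above it can be reclaimed.
    return max(0, usage - guarantee)


# A workload that slowly fills up memory, with both knobs set to 50.
for usage in (10, 40, 80):
    print(usage,
          reclaimable_with_soft_limit(usage, 50),
          reclaimable_with_guarantee(usage, 50))
```

In this model the guarantee withholds memory from reclaim as soon as the
group has any usage at all, whereas the soft limit changes nothing about
what can be reclaimed.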

So, that's where I think the symmetry breaks.  The direction of the
pressure the guarantee exerts makes it punishing for the system by
default, which means that it can't be a generally useful thing which
can be widely configured with should-be-good-enough values.  It
inherently is something which requires a lot more attention and is thus
a lot less useful (at least its usefulness is a lot narrower).

As soft limit has been misused to provide guarantee, I think it's a
healthy thing to separate it out and implement it properly and I
probably am not very qualified to make calls on mm decisions; however,
I strongly think that we should think through the actual use cases
before deciding which features and interface we expose to userland.
They had better make sense and actually be useful.  I strongly believe
that failure to do so has been one of the main reasons why we got
burned so badly with cgroup in general.  Let's please not repeat that.
If this is useful, let's find out why and how and crystallize the
interface for those usages.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH v2 0/4] memcg: Low-limit reclaim
  2014-06-05 18:23                                 ` Johannes Weiner
@ 2014-06-06 14:44                                   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-06 14:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm, Rik van Riel

On Thu 05-06-14 14:23:36, Johannes Weiner wrote:
> On Thu, Jun 05, 2014 at 06:43:55PM +0200, Michal Hocko wrote:
> > On Thu 05-06-14 12:10:35, Johannes Weiner wrote:
[...]
> > > It's not about whether hard guarantees are necessary, it's about
> > > getting away without additional fallback semantics.  The default
> > > position is still always simpler semantics, so we are looking for
> > > reasons for the fallback here, not the other way around.
> > 
> > This doesn't make much sense to me. So you are pushing for something
> > that is even not necessary. I have already mentioned that I am aware of
> > usecases which would prefer ephemeral reclaim rather than OOM and that
> > is pretty darn good reason to have a fallback mode.
> 
> I think it's quite clear that there is merit for both behaviors, but

OK, I am glad we are moving forward

> because there are fewer "traps" in the hard guarantee semantics their
> usefulness is much easier to assess.
> So can we please explore the situations wherein fallbacks would happen
> so that we can judge the applicability of both behaviors and pick a
> reasonable default?

This is a hard question. NUMA interactions are the one thing that is
quite obvious. But there are other sources of allocations which are not
tracked by memcg (basically all kernel/driver allocations or hugetlbfs)
which might produce memory pressure that was not expected when the
limits were designed by the admin.
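
A minimal sketch of the two behaviors under discussion, assuming a flat
set of groups and ignoring hierarchy, zones and NUMA. This is a toy
model, not the patch: `Group`, `reclaim_targets` and the `fallback` flag
are hypothetical names, treating "usage above the low limit" as the
eligibility test is an assumption, and the numbers are percentages of
RAM chosen for illustration. It covers the case where pressure comes
from allocations memcg does not track while every group is still within
its low limit:

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    usage: int      # memory currently charged to the group
    low_limit: int  # 0 means no protection, which is the default


def reclaim_targets(groups, fallback=True):
    """Return the groups reclaim may scan, or [] if it has to give up."""
    # Groups above their low limit are always eligible.
    eligible = [g for g in groups if g.usage > g.low_limit]
    if eligible:
        return eligible
    # Everybody is within the low limit: either ignore the limit for
    # this round (best effort) or report nothing reclaimable (hard
    # guarantee, which would end in an OOM).
    return list(groups) if fallback else []


# A and B together are charged 63% of RAM and neither exceeds its low
# limit; the remaining pressure comes from untracked allocations.
groups = [Group("A", usage=55, low_limit=60),
          Group("B", usage=8, low_limit=10)]
print([g.name for g in reclaim_targets(groups, fallback=True)])
print([g.name for g in reclaim_targets(groups, fallback=False)])
```

With the fallback, both groups become reclaimable again; with the hard
guarantee nothing is, and the allocation would have to end in an OOM.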

> > > > They are asking for a reasonable memory isolation which they
> > > > currently do not have. Having a risk of OOM would be a no-go for
> > > > them so the feature wouldn't be useful for them.
> > > 
> > > Let's not go back to Handwaving Land on this, please.  What does
> > > "reasonable memory isolation" mean?
> >  
> > To not reclaim unless everybody is within limit. This doesn't happen
> > when the limit is not overcommitted normally but still cannot be ruled out
> > for different reasons (non-user allocations, NUMA setups and who knows
> > what else)
> 
> I'm desperately trying to gather a list of these corner cases to get a
> feeling for when that "best effort" falls apart, because that is part
> of the interface and something that we have to know for future kernel
> development, and the user has to know in order to pick a behavior.
>
> Your documentation doesn't mention any of those corner cases and leads
> on that you're basically guaranteed this memory unless you overcommit
> the limit.

Documentation can always be improved and I am open to suggestions. I
will send 2 patches to allow both modes + configuration as a reply to
this email so that we can discuss the direction further. And I can
definitely document at least the NUMA setup, which can help us in
future development, as you said.

> > > > I have repeatedly said that I can see also some use for the hard
> > > > guarantee. Mainly to support overcommit on the limit. I didn't hear
> > > > about those usecases yet but it seems that at least Google would like to
> > > > have really hard guarantees.
> > > 
> > > Everybody who wants to charge for guaranteed memory can not afford to
> > > have other workloads break isolation at will.
> > 
> > I am getting tired of this discussion to be honest. You seem to be
> > locked up to guarantee ignoring that there are usecases which really do
> > not have such requirements.
> 
> I already wrote Hugh that I'm not against the fall back per se, but I
> really want to map out the usecases and match them up with the hard
> and best-effort low limit, so that we know how to document this and
> what the default behavior should be.

OK

> It's not the fallback that's bothering me, it's your unwillingness to
> explore and document the vagueness that is inherent in the semantics.

Hmm, I do not remember any resistance to adding or improving
documentation. I've tried to be as precise as possible. I understand
that all the consequences might not be clear from it but this can
certainly be improved.

> I really don't want to merge more underdefined best-effort features
> that will be useless to users and a hindrance in future development.

I think we have agreed that both modes make sense depending on the use
case. I do not think either of them is underdefined.

> "I'm aware of usecases that would prefer reclaim over OOM" is just not
> cutting it when it comes to designing an interface that we're going to
> be stuck with indefinitely. 

Best effort is a well established approach and I am quite surprised that
it is under such a strong hammering from you.

I think that the database workload I have mentioned is a nice example of
best-effort semantics which make a lot of sense. Normally it would get
reclaimed all the time (which might lead to slowdowns). Now you can
offer them a certain protection/prioritization over other loads on the
same system. This is exactly the kind of load which should never get OOM
and ephemeral reclaim is much better than no protection at all.

> We need a concrete understanding of how configurations would behave
> under common situations in the real world.

-- 
Michal Hocko
SUSE Labs

* [PATCH 1/2] mm, memcg: allow OOM if no memcg is eligible during direct reclaim
  2014-06-06 14:44                                   ` Michal Hocko
@ 2014-06-06 14:46                                     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-06 14:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm

If no memcg is eligible for reclaim because all groups under the
reclaimed hierarchy are within their guarantee, then global direct
reclaim would end up in an endless loop: zones in the zonelists are
not considered unreclaimable (as per all_unreclaimable), so the OOM
killer would never fire and direct reclaim would keep being triggered
with no chance to reclaim anything.

This is not possible yet because reclaim falls back to ignoring low_limit
when nobody is eligible for reclaim. The following patch will allow
setting the fallback mode to a hard guarantee, though, so this is a
preparatory patch.

Memcg reclaim doesn't suffer from this because the OOM killer is
triggered after a few unsuccessful reclaim attempts.

Fix this by checking the number of scanned pages, which is obviously 0 if
nobody is eligible, and by also checking that the whole hierarchy is not
eligible; in that case tell the OOM killer it can go ahead.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
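For illustration only, here is a minimal user-space sketch of the check
added below. The names scan_model and reclaim_result are made up for this
example and are not kernel API; the booleans merely stand in for
mem_cgroup_all_within_guarantee(), global_reclaim() and
all_unreclaimable(). The earlier exits of the real function (pages
reclaimed, aborted reclaim) are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct scan_control. */
struct scan_model {
	unsigned long nr_scanned;   /* pages scanned during this reclaim run */
	bool all_within_guarantee;  /* models mem_cgroup_all_within_guarantee() */
	bool global_reclaim;        /* models global_reclaim(sc) */
	bool all_unreclaimable;     /* models all_unreclaimable(zonelist, sc) */
};

/*
 * Models the decision at the end of direct reclaim after this patch:
 * a non-zero return means "retry, do not OOM", zero lets the caller OOM.
 */
int reclaim_result(const struct scan_model *sc)
{
	assert(sc != NULL);

	/* Nothing was scanned because every group is protected: allow OOM. */
	if (!sc->nr_scanned && sc->all_within_guarantee)
		return 0;

	/* Zones still look reclaimable: ask the caller to retry. */
	if (sc->global_reclaim && !sc->all_unreclaimable)
		return 1;

	return 0;
}
```

Without the first check, the fully protected case would always take the
second branch under global reclaim, which is the endless loop described
in the changelog.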
 mm/vmscan.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8041b0667673..99137aecd95f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2570,6 +2570,13 @@ out:
 	if (aborted_reclaim)
 		return 1;
 
+	/*
+	 * If the target memcg is not eligible for reclaim then we have no option
+	 * but OOM
+	 */
+	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+		return 0;
+
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
-- 
2.0.0


* [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-06 14:46                                     ` Michal Hocko
@ 2014-06-06 14:46                                       ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-06 14:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm

Some users (e.g. Google) would like to have stronger semantics than the
low limit currently offers. The fallback mode is not desirable for them
and they prefer hitting the OOM killer rather than ignoring the low limit
for protected groups. There are other possible usecases which can benefit
from hard guarantees. I can imagine workloads where setting low_limit to
the same value as hard_limit to prevent any reclaim at all makes a lot of
sense because reclaim is much more disruptive than a restart of the load.

This patch adds a new per-memcg memory.reclaim_strategy knob which
tells what to do when memory reclaim cannot make any progress because
all groups in the reclaimed hierarchy are within their low_limit. There
are two options available:
	- low_limit_best_effort - the current mode, in which reclaim
	  falls back to even reclaim of all groups in the reclaimed
	  hierarchy
	- low_limit_guarantee - groups within low_limit are never
	  reclaimed and the OOM killer is triggered instead. The OOM
	  message will mention that the OOM was triggered due to
	  low_limit reclaim protection.

Root memcg's knob refers to the global memory reclaim and new memcgs
inherit the setting from their parents (or from the root memcg in a
!use_hierarchy setup). The initial value for the root memcg is defined
by the config (CONFIG_MEMCG_LOW_LIMIT_GUARANTEE or
CONFIG_MEMCG_LOW_LIMIT_BEST_EFFORT) and it can be changed later at
runtime.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
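To make the two modes easier to compare, here is a small user-space
sketch of the retry logic. It is a model under stated assumptions, not
the kernel code: group_model, within_guarantee, reclaim_pass and
shrink_model are hypothetical names, "usage below low_limit" is assumed
as the protection test, and reclaim is reduced to counting the groups
that would be touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one memcg in the reclaimed hierarchy. */
struct group_model {
	unsigned long usage;      /* charged pages */
	unsigned long low_limit;  /* pages protected by low_limit */
};

/* Assumption: a group is protected while its usage is below low_limit. */
bool within_guarantee(const struct group_model *g)
{
	return g->usage < g->low_limit;
}

/* One pass over the hierarchy; returns how many groups were reclaimed. */
size_t reclaim_pass(const struct group_model *groups, size_t n,
		    bool honor_guarantee)
{
	size_t reclaimed = 0;

	for (size_t i = 0; i < n; i++) {
		if (honor_guarantee && within_guarantee(&groups[i]))
			continue;	/* protected group is skipped */
		reclaimed++;
	}
	return reclaimed;
}

/*
 * Returns the number of groups reclaimed. Zero means no progress was
 * possible and the caller is expected to invoke the OOM killer.
 */
size_t shrink_model(const struct group_model *groups, size_t n,
		    bool hard_low_limit)
{
	size_t reclaimed;

	assert(groups != NULL || n == 0);

	/* First round always honors the low limits. */
	reclaimed = reclaim_pass(groups, n, true);
	if (reclaimed > 0)
		return reclaimed;

	/* Everybody is protected: a hard guarantee forbids the fallback. */
	if (hard_low_limit)
		return 0;

	/* Best effort: ignore low_limit and reclaim all groups. */
	return reclaim_pass(groups, n, false);
}
```

With two groups that are both under their low limit, the best-effort
mode reclaims both of them while the guarantee mode reports no progress.
As long as at least one group is above its limit, the two modes behave
the same.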
 Documentation/cgroups/memory.txt | 21 +++++++++--
 include/linux/memcontrol.h       |  5 +++
 init/Kconfig                     | 75 ++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c                  | 55 +++++++++++++++++++++++++++++
 mm/oom_kill.c                    |  6 ++--
 mm/vmscan.c                      |  5 ++-
 6 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bf895d7e1363..c6785d575b18 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -61,6 +61,8 @@ Brief summary of control files.
  memory.low_limit_breached	 # number of times low_limit has been
 				 # ignored and the cgroup reclaimed even
 				 # when it was above the limit
+ memory.reclaim_strategy	 # strategy when no progress can be made
+				 # because of low_limit reclaim protection
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
@@ -253,9 +255,22 @@ low_limit_in_bytes knob. If the limit is non-zero the reclaim logic
 doesn't include groups (and their subgroups - see 6. Hierarchy support)
 which are below the low limit if there is other eligible cgroup in the
 reclaimed hierarchy. If all groups which participate reclaim are under
-their low limits then all of them are reclaimed and the low limit is
-ignored. low_limit_breached counter in memory.stat file can be checked
-to see how many times such an event occurred.
+their low limits then reclaim cannot make any forward progress. Further
+behavior depends on the memory.reclaim_strategy configuration of the memory
+cgroup which is the target of the memory pressure. There are two possible
+modes available:
+	- low_limit_best_effort - low_limit value is ignored and all the
+	  groups are reclaimed evenly. low_limit_breached counter in
+	  memory.stat file of each cgroup can be checked to see how many
+	  times such an event occurred.
+	- low_limit_guarantee - no groups are reclaimed and OOM killer will
+	  be triggered to sort out the situation.
+
+memory.reclaim_strategy is inherited from parent cgroup but it can
+be changed down the hierarchy. Root cgroup's file refers to the
+global memory reclaim and it is defined according to config (either
+CONFIG_MEMCG_LOW_LIMIT_GUARANTEE or CONFIG_MEMCG_LOW_LIMIT_BEST_EFFORT)
+and can be changed at runtime as well.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e2ca2163b12..0b61da737521 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 
 extern void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg);
 extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
+extern bool mem_cgroup_hard_guarantee(struct mem_cgroup *memcg);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -306,6 +307,10 @@ static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
 {
 	return false;
 }
+static inline bool mem_cgroup_hard_guarantee(struct mem_cgroup *memcg)
+{
+	return false;
+}
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
diff --git a/init/Kconfig b/init/Kconfig
index 8a2d7394c75f..fe78f3f99265 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,6 +936,81 @@ config MEMCG
 	  this, you can set "cgroup_disable=memory" at your boot option to
 	  disable memory resource controller and you can avoid overheads.
 	  (and lose benefits of memory resource controller)
+choice
+	prompt "Memory Resource Controller reclaim protection"
+	depends on MEMCG
+	help
+	   Memory resource controller allows for memory protection by
+	   low_limit_in_bytes knob. If the memory consumption of all
+	   processes inside the group is less than the limit then
+	   the group is excluded from the memory reclaim and so the
+	   charged memory is protected from external memory pressure.
+	   This can be used for memory isolation of different loads
+	   running on the same machines by separating them to groups
+	   with appropriate low limits.
+
+	   Please note that the configuration of low limits has to be
+	   done carefully because inappropriate setup can render the machine
+	   unusable. A typical example would be a too large limit and
+	   so not enough memory available for the rest of the system
+	   resulting in memory thrashing or other misbehavior.
+
+	   If memory reclaim ends up in a position where all memory
+	   cgroups are within their limits then there is no way to proceed
+	   and free some memory. There are two possible ways to handle
+	   this situation. The reclaimer can either fall back to ignoring
+	   low_limits and reclaim all groups in a fair manner, or the
+	   out of memory killer is triggered to sort out the situation.
+
+	   This section provides a way to set up the default behavior which is
+	   then inherited by newly created memory cgroups. Each cgroup can
+	   redefine this default by memory.reclaim_strategy file and the
+	   behavior will apply to the memory pressure applied to it. Root
+	   memory cgroup controls behavior of the global memory pressure.
+
+config MEMCG_LOW_LIMIT_BEST_EFFORT
+	bool "Treat low_limit as a best effort"
+	help
+	   Memory reclaim (both global and triggered by hard limit) will
+	   fall back to the proportional reclaim when all memory cgroups
+	   of the reclaimed hierarchy are within their low_limits. This
+	   situation shouldn't happen if the cumulative low_limit setup
+	   doesn't overcommit available memory (available RAM for global
+	   reclaim resp. hard limit). User space memory, which is tracked
+	   by Memory Resource Controller, is not the only one on the system,
+	   though, and kernel has to use some memory as well and that is
+	   not a fixed amount. So sometimes it might be really hard to
+	   estimate the appropriate maximum for low_limits so they are still
+	   safe.
+
+	   If you need a reasonable memory isolation and the workload
+	   protected by the low_limit will handle ephemeral reclaim much
+	   better than a potential OOM killer then you should use this
+	   mode. memory.stat file and low_limit_breached counter will
+	   tell you how many times the limit has been ignored because
+	   the system couldn't make any progress due to low_limit setup.
+
+config MEMCG_LOW_LIMIT_GUARANTEE
+	bool "Treat low_limit as a hard guarantee"
+	help
+	   Memory reclaim (both global and triggered by hard limit) will
+	   trigger OOM killer (either global one or memcg depending on
+	   the reclaim context) to resolve the situation.
+
+	   Although this setup might sound too harsh it has a nice property
+	   that once the low_limit is set the group is guaranteed to be
+	   unreclaimable under all conditions. A process from the group
+	   still might get killed by OOM killer though. This is a really
+	   strong property and should be used with care. Administrator
+	   has to be careful to keep enough memory for the kernel, drivers
+	   and other memory which is not accounted to Memory Controller
+	   (e.g. hugetlbfs, slab allocations, page tables, etc...).
+
+	   Select this option if you need low_limit to behave as a guarantee
+	   and absolutely no reclaim is allowed while groups are within the
+	   limit.
+
+endchoice
 
 config MEMCG_SWAP
 	bool "Memory Resource Controller Swap Extension"
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f62b6533f60..302691dceb8c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -377,6 +377,8 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 
+	bool hard_low_limit;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -2856,6 +2858,22 @@ bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
 	return true;
 }
 
+/** mem_cgroup_hard_guarantee - Does the memcg require hard guarantee for memory
+ * @memcg: memcg to check
+ *
+ * The reclaimer is not allowed to reclaim the group if
+ * mem_cgroup_within_guarantee is true.
+ */
+bool mem_cgroup_hard_guarantee(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
+	return memcg->hard_low_limit;
+}
+
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -5203,6 +5221,33 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 	return ret;
 }
 
+static int mem_cgroup_write_reclaim_strategy(struct cgroup_subsys_state *css, struct cftype *cft,
+			    char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	int ret = 0;
+
+	if (!strncmp(buffer, "low_limit_guarantee",
+				sizeof("low_limit_guarantee"))) {
+		memcg->hard_low_limit = true;
+	} else if (!strncmp(buffer, "low_limit_best_effort",
+				sizeof("low_limit_best_effort"))) {
+		memcg->hard_low_limit = false;
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+
+static int mem_cgroup_read_reclaim_strategy(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	seq_printf(m, "%s\n", memcg->hard_low_limit ?
+			"low_limit_guarantee" : "low_limit_best_effort");
+
+	return 0;
+}
+
 static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
 		unsigned long long *mem_limit, unsigned long long *memsw_limit)
 {
@@ -6110,6 +6155,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "reclaim_strategy",
+		.write_string = mem_cgroup_write_reclaim_strategy,
+		.seq_show = mem_cgroup_read_reclaim_strategy
+	},
+	{
 		.name = "soft_limit_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
 		.write_string = mem_cgroup_write,
@@ -6375,6 +6425,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
+#ifdef CONFIG_MEMCG_LOW_LIMIT_GUARANTEE
+		memcg->hard_low_limit = true;
+#endif
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
 		res_counter_init(&memcg->kmem, NULL);
@@ -6418,6 +6471,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
 		res_counter_init(&memcg->kmem, &parent->kmem);
+		memcg->hard_low_limit = parent->hard_low_limit;
 
 		/*
 		 * No need to take a reference to the parent because cgroup
@@ -6434,6 +6488,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		 */
 		if (parent != root_mem_cgroup)
 			memory_cgrp_subsys.broken_hierarchy = true;
+		memcg->hard_low_limit = root_mem_cgroup->hard_low_limit;
 	}
 	mutex_unlock(&memcg_create_mutex);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3291e82d4352..80e5aafe7ade 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -392,9 +392,11 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_score_adj=%hd\n",
+		"oom_score_adj=%hd%s\n",
 		current->comm, gfp_mask, order,
-		current->signal->oom_score_adj);
+		current->signal->oom_score_adj,
+		mem_cgroup_all_within_guarantee(memcg) ?
+		" because all groups are within low_limit guarantee":"");
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99137aecd95f..11e841bb5d44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2309,8 +2309,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		 * b) multiple reclaimers are racing and so the first round
 		 *    should be retried
 		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup)) {
+			if (mem_cgroup_hard_guarantee(sc->target_mem_cgroup))
+				break;
 			honor_guarantee = false;
+		}
 	}
 }
 
-- 
2.0.0
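
As a quick way to reason about the memory.reclaim_strategy handlers and
the inheritance in the mm/memcontrol.c hunks above, the following
user-space sketch mirrors them. memcg_model and the three functions are
hypothetical names for this example only. strcmp() is used instead of
the patch's strncmp(..., sizeof(...)); for an exact match against a
NUL-terminated string the two should behave the same.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the one field the patch adds to struct mem_cgroup. */
struct memcg_model {
	bool hard_low_limit;
};

/* Models the write handler: returns 0 on success, -EINVAL on unknown input. */
int write_reclaim_strategy(struct memcg_model *memcg, const char *buffer)
{
	assert(memcg != NULL && buffer != NULL);

	if (strcmp(buffer, "low_limit_guarantee") == 0)
		memcg->hard_low_limit = true;
	else if (strcmp(buffer, "low_limit_best_effort") == 0)
		memcg->hard_low_limit = false;
	else
		return -EINVAL;	/* setting is left untouched */

	return 0;
}

/* Models the read handler: reports the mode as the string a user would see. */
const char *read_reclaim_strategy(const struct memcg_model *memcg)
{
	assert(memcg != NULL);
	return memcg->hard_low_limit ?
		"low_limit_guarantee" : "low_limit_best_effort";
}

/* New groups start with the parent's (or root's) setting. */
void inherit_reclaim_strategy(struct memcg_model *child,
			      const struct memcg_model *parent)
{
	assert(child != NULL && parent != NULL);
	child->hard_low_limit = parent->hard_low_limit;
}
```

A child created under a parent in guarantee mode starts in guarantee
mode and can later be switched back to best effort on its own; an
unknown string is rejected and leaves the current mode in place.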


^ permalink raw reply related	[flat|nested] 196+ messages in thread

* [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
@ 2014-06-06 14:46                                       ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-06 14:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Greg Thelen, Michel Lespinasse, Tejun Heo, Roman Gushchin, LKML,
	linux-mm

Some users (e.g. Google) would like to have stronger semantics than the
low limit currently offers. The fallback mode is not desirable for them
and they prefer hitting the OOM killer rather than ignoring the low limit
for protected groups. There are other possible usecases which can benefit
from hard guarantees. I can imagine workloads where setting low_limit to
the same value as hard_limit to prevent any reclaim at all makes a lot of
sense because reclaim is much more disruptive than a restart of the load.

This patch adds a new per-memcg memory.reclaim_strategy knob which
tells what to do when memory reclaim cannot make any progress because
all groups in the reclaimed hierarchy are within their low_limit. There
are two options available:
	- low_limit_best_effort - the current mode, in which reclaim
	  falls back to even reclaim of all groups in the reclaimed
	  hierarchy
	- low_limit_guarantee - groups within low_limit are never
	  reclaimed and the OOM killer is triggered instead. The OOM
	  message will mention that the OOM was triggered due to
	  low_limit reclaim protection.

Root memcg's knob refers to the global memory reclaim and new memcgs
inherit the setting from their parents (or from the root memcg in a
!use_hierarchy setup). The initial value for the root memcg is defined
by the config (CONFIG_MEMCG_LOW_LIMIT_GUARANTEE or
CONFIG_MEMCG_LOW_LIMIT_BEST_EFFORT) and it can be changed later at
runtime.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 21 +++++++++--
 include/linux/memcontrol.h       |  5 +++
 init/Kconfig                     | 75 ++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c                  | 55 +++++++++++++++++++++++++++++
 mm/oom_kill.c                    |  6 ++--
 mm/vmscan.c                      |  5 ++-
 6 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bf895d7e1363..c6785d575b18 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -61,6 +61,8 @@ Brief summary of control files.
  memory.low_limit_breached	 # number of times low_limit has been
 				 # ignored and the cgroup reclaimed even
 				 # when it was above the limit
+ memory.reclaim_strategy	 # strategy when no progress can be made
+				 # because of low_limit reclaim protection
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
@@ -253,9 +255,22 @@ low_limit_in_bytes knob. If the limit is non-zero the reclaim logic
 doesn't include groups (and their subgroups - see 6. Hierarchy support)
 which are below the low limit if there is other eligible cgroup in the
 reclaimed hierarchy. If all groups which participate reclaim are under
-their low limits then all of them are reclaimed and the low limit is
-ignored. low_limit_breached counter in memory.stat file can be checked
-to see how many times such an event occurred.
+their low limits then reclaim cannot make any forward progress. Further
+behavior depends on the memory.reclaim_strategy configuration of the memory
+cgroup which is the target of the memory pressure. There are two possible
+modes available:
+	- low_limit_best_effort - low_limit value is ignored and all the
+	  groups are reclaimed evenly. low_limit_breached counter in
+	  memory.stat file of each cgroup can be checked to see how many
+	  times such an event occurred.
+	- low_limit_guarantee - no groups are reclaimed and OOM killer will
+	  be triggered to sort out the situation.
+
+memory.reclaim_strategy is inherited from parent cgroup but it can
+be changed down the hierarchy. Root cgroup's file refers to the
+global memory reclaim and it is defined according to config (either
+CONFIG_MEMCG_LOW_LIMIT_GUARANTEE or CONFIG_MEMCG_LOW_LIMIT_BEST_EFFORT)
+and can be changed at runtime as well.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e2ca2163b12..0b61da737521 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 
 extern void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg);
 extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
+extern bool mem_cgroup_hard_guarantee(struct mem_cgroup *memcg);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -306,6 +307,10 @@ static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
 {
 	return false;
 }
+static inline bool mem_cgroup_hard_guarantee(struct mem_cgroup *memcg)
+{
+	return false;
+}
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
diff --git a/init/Kconfig b/init/Kconfig
index 8a2d7394c75f..fe78f3f99265 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,6 +936,81 @@ config MEMCG
 	  this, you can set "cgroup_disable=memory" at your boot option to
 	  disable memory resource controller and you can avoid overheads.
 	  (and lose benefits of memory resource controller)
+choice
+	prompt "Memory Resource Controller reclaim protection"
+	depends on MEMCG
+	help
+	   Memory resource controller allows for memory protection by
+	   low_limit_in_bytes knob. If the memory consumption of all
+	   processes inside the group is less than the limit then
+	   the group is excluded from the memory reclaim and so the
+	   charged memory is protected from external memory pressure.
+	   This can be used for memory isolation of different loads
+	   running on the same machines by separating them to groups
+	   with appropriate low limits.
+
+	   Please note that the configuration of low limits has to be
+	   done carefully because inappropriate setup can render the machine
+	   unusable. A typical example would be a too large limit and
+	   so not enough memory available for the rest of the system
+	   resulting in memory thrashing or other misbehavior.
+
+	   If memory reclaim ends up in a position where all memory
+	   cgroups are within their limits then there is no way to proceed
+	   and free some memory. There are two possible ways to handle
+	   this situation. The reclaimer can either fall back to ignoring
+	   low_limits and reclaim all groups in a fair manner, or the
+	   out of memory killer is triggered to sort out the situation.
+
+	   This section provides a way to set up the default behavior which is
+	   then inherited by newly created memory cgroups. Each cgroup can
+	   redefine this default by memory.reclaim_strategy file and the
+	   behavior will apply to the memory pressure applied to it. Root
+	   memory cgroup controls behavior of the global memory pressure.
+
+config MEMCG_LOW_LIMIT_BEST_EFFORT
+	bool "Treat low_limit as a best effort"
+	help
+	   Memory reclaim (both global and triggered by hard limit) will
+	   fall back to the proportional reclaim when all memory cgroups
+	   of the reclaimed hierarchy are within their low_limits. This
+	   situation shouldn't happen if the cumulative low_limit setup
+	   doesn't overcommit available memory (available RAM for global
+	   reclaim resp. hard limit). User space memory, which is tracked
+	   by Memory Resource Controller, is not the only one on the system,
+	   though; the kernel has to use some memory as well, and that is
+	   not a fixed amount. So it can sometimes be really hard to
+	   estimate an appropriate maximum for low_limits so that they are
+	   still safe.
+
+	   If you need reasonable memory isolation and the workload
+	   protected by the low_limit handles ephemeral reclaim much
+	   better than a potential OOM kill then you should use this
+	   mode. The low_limit_breached counter in the memory.stat file
+	   tells you how many times the limit has been ignored because
+	   the system couldn't make any progress due to low_limit setup.
+
+config MEMCG_LOW_LIMIT_GUARANTEE
+	bool "Treat low_limit as a hard guarantee"
+	help
+	   Memory reclaim (both global and triggered by hard limit) will
+	   trigger the OOM killer (global or memcg, depending on the reclaim
+	   context) when all groups are within their low_limits.
+
+	   Although this setup might sound too harsh it has a nice property
+	   that once the low_limit is set the group is guaranteed to be
+	   unreclaimable under all conditions. A process from the group
+	   still might get killed by OOM killer though. This is a really
+	   strong property and should be used with care. The administrator
+	   has to be careful to keep enough memory for the kernel, drivers
+	   and other memory which is not accounted to Memory Controller
+	   (e.g. hugetlbfs, slab allocations, page tables, etc...).
+
+	   Select this option if you need low_limit to behave as a guarantee
+	   and absolutely no reclaim is allowed while groups are within the
+	   limit.
+
+endchoice
 
 config MEMCG_SWAP
 	bool "Memory Resource Controller Swap Extension"
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f62b6533f60..302691dceb8c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -377,6 +377,8 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 
+	bool hard_low_limit;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -2856,6 +2858,22 @@ bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
 	return true;
 }
 
+/** mem_cgroup_hard_guarantee - Does the memcg require hard guarantee for memory
+ * @memcg: memcg to check
+ *
+ * Reclaimer is not allowed to reclaim the group if
+ * mem_cgroup_within_guarantee is true.
+ */
+bool mem_cgroup_hard_guarantee(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
+	return memcg->hard_low_limit;
+}
+
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -5203,6 +5221,33 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 	return ret;
 }
 
+static int mem_cgroup_write_reclaim_strategy(struct cgroup_subsys_state *css, struct cftype *cft,
+			    char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	int ret = 0;
+
+	if (!strncmp(buffer, "low_limit_guarantee",
+				sizeof("low_limit_guarantee"))) {
+		memcg->hard_low_limit = true;
+	} else if (!strncmp(buffer, "low_limit_best_effort",
+				sizeof("low_limit_best_effort"))) {
+		memcg->hard_low_limit = false;
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+
+static int mem_cgroup_read_reclaim_strategy(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	seq_printf(m, "%s\n", memcg->hard_low_limit ?
+			"low_limit_guarantee" : "low_limit_best_effort");
+
+	return 0;
+}
+
 static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
 		unsigned long long *mem_limit, unsigned long long *memsw_limit)
 {
@@ -6110,6 +6155,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "reclaim_strategy",
+		.write_string = mem_cgroup_write_reclaim_strategy,
+		.seq_show = mem_cgroup_read_reclaim_strategy
+	},
+	{
 		.name = "soft_limit_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
 		.write_string = mem_cgroup_write,
@@ -6375,6 +6425,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
+#ifdef CONFIG_MEMCG_LOW_LIMIT_GUARANTEE
+		memcg->hard_low_limit = true;
+#endif
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
 		res_counter_init(&memcg->kmem, NULL);
@@ -6418,6 +6471,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
 		res_counter_init(&memcg->kmem, &parent->kmem);
+		memcg->hard_low_limit = parent->hard_low_limit;
 
 		/*
 		 * No need to take a reference to the parent because cgroup
@@ -6434,6 +6488,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		 */
 		if (parent != root_mem_cgroup)
 			memory_cgrp_subsys.broken_hierarchy = true;
+		memcg->hard_low_limit = root_mem_cgroup->hard_low_limit;
 	}
 	mutex_unlock(&memcg_create_mutex);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3291e82d4352..80e5aafe7ade 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -392,9 +392,11 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_score_adj=%hd\n",
+		"oom_score_adj=%hd%s\n",
 		current->comm, gfp_mask, order,
-		current->signal->oom_score_adj);
+		current->signal->oom_score_adj,
+		mem_cgroup_all_within_guarantee(memcg) ?
+		" because all groups are within low_limit guarantee" : "");
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99137aecd95f..11e841bb5d44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2309,8 +2309,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		 * b) multiple reclaimers are racing and so the first round
 		 *    should be retried
 		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup)) {
+			if (mem_cgroup_hard_guarantee(sc->target_mem_cgroup))
+				break;
 			honor_guarantee = false;
+		}
 	}
 }
 
-- 
2.0.0
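
The fallback decision that the shrink_zone() hunk above adds can be
modelled in userspace roughly as follows. This is only a sketch under
stated assumptions, not kernel code: struct group, decide() and the
reclaim_outcome names are hypothetical, and "within guarantee" is taken
to mean usage below low_limit, as the Kconfig help text describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the reclaim decision; not kernel code. */
enum reclaim_outcome {
	RECLAIM_UNPROTECTED,	/* some group is not protected: reclaim from it */
	RECLAIM_FALLBACK,	/* best effort: ignore low limits, reclaim everyone */
	RECLAIM_OOM,		/* hard guarantee: stop reclaim, leave it to OOM */
};

struct group {
	unsigned long usage;		/* charged memory */
	unsigned long low_limit;	/* 0 means no protection */
};

/* A group is protected only while its usage is below its low limit. */
static bool all_within_guarantee(const struct group *groups, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (groups[i].usage >= groups[i].low_limit)
			return false;
	return true;
}

static enum reclaim_outcome decide(const struct group *groups, size_t n,
				   bool hard_low_limit)
{
	assert(groups != NULL || n == 0);

	if (!all_within_guarantee(groups, n))
		return RECLAIM_UNPROTECTED;
	/* Nothing is reclaimable without breaking a low limit. */
	return hard_low_limit ? RECLAIM_OOM : RECLAIM_FALLBACK;
}
```

With the default low_limit of 0 no group is ever below its limit, so
decide() returns RECLAIM_UNPROTECTED regardless of the strategy, which
matches the "all groups are eligible by default" behavior.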

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-06 14:46                                       ` Michal Hocko
@ 2014-06-06 15:29                                         ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-06 15:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Roman Gushchin,
	LKML, linux-mm

Hello, Michal.

On Fri, Jun 06, 2014 at 04:46:50PM +0200, Michal Hocko wrote:
> +choice
> +	prompt "Memory Resource Controller reclaim protection"
> +	depends on MEMCG
> +	help

Why is this necessary?

- This doesn't affect boot.

- memcg requires runtime config *anyway*.

- The config is inherited from the parent, so the default flipping
  isn't exactly difficult.

Please drop the kconfig option.

> +static int mem_cgroup_write_reclaim_strategy(struct cgroup_subsys_state *css, struct cftype *cft,
> +			    char *buffer)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	int ret = 0;
> +
> +	if (!strncmp(buffer, "low_limit_guarantee",
> +				sizeof("low_limit_guarantee"))) {
> +		memcg->hard_low_limit = true;
> +	} else if (!strncmp(buffer, "low_limit_best_effort",
> +				sizeof("low_limit_best_effort"))) {
> +		memcg->hard_low_limit = false;
> +	} else
> +		ret = -EINVAL;
> +
> +	return ret;
> +}

So, ummm, this raises a big red flag for me.  You're now implementing
two behaviors in a mostly symmetric manner to soft/hard limits but
choosing a completely different scheme in how they're configured
without any rationale.

* Are you sure soft and hard guarantees aren't useful when used in
  combination?  If so, why would that be the case?

* We have pressure monitoring interface which can be used for soft
  limit pressure monitoring.  How should breaching soft guarantee be
  factored into that?  There doesn't seem to be any way of notifying
  that at the moment?  Wouldn't we want that to be integrated into the
  same mechanism?

What scares me the most is that you don't even seem to have noticed
the asymmetry and are proposing userland-facing interface without
actually thinking things through.  This is exactly how we've been
getting into trouble.

For now, for everything.

 Nacked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-06 15:29                                         ` Tejun Heo
@ 2014-06-06 15:34                                           ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-06 15:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Roman Gushchin,
	LKML, linux-mm

A bit of addition.

Let's *please* think through how memcg should be configured and
different knobs / limits interact with each other and come up with a
consistent scheme before adding more shits on top.  This "oh I know
this use case and maybe that behavior is necessary too, let's add N
different and incompatible ways to mix and match them" doesn't fly.
Aren't we supposed to at least have learned that already?

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-06 15:29                                         ` Tejun Heo
@ 2014-06-09  8:30                                           ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-09  8:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Roman Gushchin,
	LKML, linux-mm

On Fri 06-06-14 11:29:14, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Jun 06, 2014 at 04:46:50PM +0200, Michal Hocko wrote:
> > +choice
> > +	prompt "Memory Resource Controller reclaim protection"
> > +	depends on MEMCG
> > +	help
> 
> Why is this necessary?

It allows user/admin to set the default behavior.

> - This doesn't affect boot.
> 
> - memcg requires runtime config *anyway*.
> 
> - The config is inherited from the parent, so the default flipping
>   isn't exactly difficult.
> 
> Please drop the kconfig option.

How do you propose to tell the default then? Only at the runtime?
I really do not insist on the kconfig. I find it useful for a)
documentation purpose b) easy way to configure the default.

> > +static int mem_cgroup_write_reclaim_strategy(struct cgroup_subsys_state *css, struct cftype *cft,
> > +			    char *buffer)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +	int ret = 0;
> > +
> > +	if (!strncmp(buffer, "low_limit_guarantee",
> > +				sizeof("low_limit_guarantee"))) {
> > +		memcg->hard_low_limit = true;
> > +	} else if (!strncmp(buffer, "low_limit_best_effort",
> > +				sizeof("low_limit_best_effort"))) {
> > +		memcg->hard_low_limit = false;
> > +	} else
> > +		ret = -EINVAL;
> > +
> > +	return ret;
> > +}
> 
> So, ummm, this raises a big red flag for me.  You're now implementing
> two behaviors in a mostly symmetric manner to soft/hard limits but
> choosing a completely different scheme in how they're configured
> without any rationale.

So what is your suggestion then? Using a global setting? Using a
separate knob? Something completely different?

> * Are you sure soft and hard guarantees aren't useful when used in
>   combination?  If so, why would that be the case?

This was a call from Google to have per-memcg setup AFAIR. Using
different reclaim protection on the global case vs. limit reclaim makes
a lot of sense to me. If this is a major obstacle then I am OK to drop
it and only have a global setting for now.

> * We have pressure monitoring interface which can be used for soft
>   limit pressure monitoring. 

Which one is that? I only know about oom_control triggered by the hard
limit pressure.

>   How should breaching soft guarantee be
>   factored into that?  There doesn't seem to be any way of notifying
>   that at the moment?  Wouldn't we want that to be integrated into the
>   same mechanism?

Yes, there is. We have a counter in memory.stat file which tells how
many times the limit has been breached.

> What scares me the most is that you don't even seem to have noticed
> the asymmetry and are proposing userland-facing interface without
> actually thinking things through.  This is exactly how we've been
> getting into trouble.

This has been discussed up and down for the last _two_ years. I have
considered other options how to provide a very _useful_ feature users
are calling for. There is even general consensus among developers that
the feature is desirable and that the two modes (soft/hard) memory
protection are needed. Yet I would _really_ like to hear any
suggestion to get unstuck. It is far from useful to come and Nack this
_again_ without providing any alternative suggestions.

> For now, for everything.
> 
>  Nacked-by: Tejun Heo <tj@kernel.org>
> 
> Thanks.
> 
> -- 
> tejun

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-09  8:30                                           ` Michal Hocko
@ 2014-06-09 13:54                                             ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-09 13:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Greg Thelen, Michel Lespinasse, Roman Gushchin,
	LKML, linux-mm

Hello,

On Mon, Jun 09, 2014 at 10:30:42AM +0200, Michal Hocko wrote:
> On Fri 06-06-14 11:29:14, Tejun Heo wrote:
> > Why is this necessary?
> 
> It allows user/admin to set the default behavior.

By recompiling the kernel for something which can be trivially
configured post-boot without any difference?  The only thing it'll
achieve is confusing the hell out of people why different kernels show
different behaviors without any userland differences while taxing the
already constrained kernel configuration process more for no gain
whatsoever.

> How do you propose to tell the default then? Only at the runtime?
> I really do not insist on the kconfig. I find it useful for a)
> documentation purpose b) easy way to configure the default.

Please don't ever add Kconfig options like this.  This is utterly
unnecessary and idiotic.  You don't add completely redundant Kconfig
option for documentation purposes.

> > * Are you sure soft and hard guarantees aren't useful when used in
> >   combination?  If so, why would that be the case?
> 
> This was a call from Google to have per-memcg setup AFAIR. Using
> different reclaim protection on the global case vs. limit reclaim makes
> a lot of sense to me. If this is a major obstacle then I am OK to drop
> it and only have a global setting for now.

Isn't it obvious that what needs to be investigated is why we're
trying to add an interface which is completely different for
guarantees as compared to limits?  Why wouldn't they have a symmetric
interface in the reverse direction as soft/hard limits?  If not, where
does the asymmetry come from?  These are the *first* questions which
should come to anyone's mind when [s]he is trying to add configs for a
different type of thresholds and something which must be explicitly
laid out as rationales for the design choices.

> > * We have pressure monitoring interface which can be used for soft
> >   limit pressure monitoring. 
> 
> Which one is that? I only know about oom_control triggered by the hard
> limit pressure.

Weren't you guys planning to use vmpressure notification to find out
about softlimit breach conditions?

> >   How should breaching soft guarantee be
> >   factored into that?  There doesn't seem to be any way of notifying
> >   that at the moment?  Wouldn't we want that to be integrated into the
> >   same mechanism?
> 
> Yes, there is. We have a counter in memory.stat file which tells how
> many times the limit has been breached.

How does the userland find out?  By polling the file every frigging
second?  Note that there is an actual asymmetry here which
makes breaching soft guarantee a much more significant event than
breaching soft limit - the former is violation of the configured
objective, the latter is not.  You *need* a way to notify the event.
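
To make the polling concern concrete: with only a counter in
memory.stat, a userland monitor would have to re-read and parse the
file periodically and compare samples. A minimal sketch, assuming the
usual one "key value" pair per line layout of memory.stat and the
low_limit_breached counter name from the patch; stat_lookup() and
breached_between() are hypothetical helpers, not an existing API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "key value" in a memory.stat style buffer (one pair per line).
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or its value cannot be parsed.
 */
static int stat_lookup(const char *buf, const char *key,
		       unsigned long long *val)
{
	size_t klen = strlen(key);
	const char *line = buf;

	assert(buf && key && val);

	while (line && *line) {
		/* The key must start the line and be followed by a space. */
		if (!strncmp(line, key, klen) && line[klen] == ' ')
			return sscanf(line + klen + 1, "%llu", val) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* True if the counter advanced between two samples of the file. */
static int breached_between(unsigned long long before,
			    unsigned long long after)
{
	return after > before;
}
```

A monitor built this way only learns about a breach at its next
sample, so the reaction latency is bounded by the polling interval
rather than by the event itself.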

> > What scares me the most is that you don't even seem to have noticed
> > the asymmetry and are proposing userland-facing interface without
> > actually thinking things through.  This is exactly how we've been
> > getting into trouble.
> 
> This has been discussed up and down for the last _two_ years. I have
> considered other options how to provide a very _useful_ feature users
> are calling for. There is even general consensus among developers that

AFAIR, there hasn't been much discussion about the details of the
interface and the proposed one is almost laughable.  How is this
acceptable as a userland visible API that we need to maintain for the
future?  It's broken on delivery.

> the feature is desirable and that the two modes (soft/hard) memory
> protection are needed. Yet I would _really_ like to hear any
> suggestion to get unstuck. It is far from useful to come and Nack this
> _again_ without providing any alternative suggestions.

I've pointed out two major points where the proposed interface is
evidently deficient and told you why they're so and it's not like the
said deficiencies are anything subtle.  If you can't figure out what
to do next from there on, I don't think I can help you.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-06 14:46                                       ` Michal Hocko
@ 2014-06-09 22:52                                         ` Greg Thelen
  -1 siblings, 0 replies; 196+ messages in thread
From: Greg Thelen @ 2014-06-09 22:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	LKML, linux-mm


On Fri, Jun 06 2014, Michal Hocko <mhocko@suse.cz> wrote:

> Some users (e.g. Google) would like to have stronger semantics than the
> low limit currently offers. The fallback mode is not desirable and they
> prefer hitting the OOM killer rather than ignoring the low limit for
> protected groups. There are other possible use cases which can benefit
> from hard guarantees. I can imagine workloads where setting low_limit to
> the same value as hard_limit to prevent any reclaim at all makes a lot
> of sense because reclaim is much more disruptive than a restart of the
> load.
>
> This patch adds a new per-memcg memory.reclaim_strategy knob which
> tells what to do in a situation when memory reclaim cannot make any
> progress because all groups in the reclaimed hierarchy are within their
> low_limit. There are two options available:
> 	- low_limit_best_effort - the current mode when reclaim falls
> 	  back to the even reclaim of all groups in the reclaimed
> 	  hierarchy
> 	- low_limit_guarantee - groups within low_limit are never
> 	  reclaimed and OOM killer is triggered instead. OOM message
> 	  will mention the fact that the OOM was triggered due to
> 	  low_limit reclaim protection.

To (a) be consistent with existing hard and soft limits APIs and (b)
allow use of both best effort and guarantee memory limits, I wonder if
it's best to offer three per memcg limits, rather than two limits (hard,
low_limit) and a related reclaim_strategy knob.  The three limits I'm
thinking about are:

1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
   change needed here.  This is an upper bound on a memcg hierarchy's
   memory consumption (assuming use_hierarchy=1).

2) best_effort_limit (aka desired working set).  This allows an
   application or administrator to provide a hint to the kernel about
   desired working set size.  Before oom'ing the kernel is allowed to
   reclaim below this limit.  I think the current soft_limit_in_bytes
   claims to provide this.  If we prefer to deprecate
   soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
   hopefully better named) API seems reasonable.

3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
   would prefer to be oom killed rather than operate below this
   threshold.  Default value is zero to preserve compatibility with
   existing apps.

Logically hard_limit >= best_effort_limit >= low_limit_guarantee.
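
The ordering and the reclaim semantics of the three limits above can be
sketched in C as follows.  This is only an illustration under stated
assumptions: struct memcg_limits, limits_ordered() and may_reclaim() are
hypothetical names, not existing kernel symbols, and treating usage equal
to a limit as still protected is an assumption the thread does not settle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three proposed per-memcg limits (bytes). */
struct memcg_limits {
	uint64_t hard_limit;          /* upper bound, limit_in_bytes today */
	uint64_t best_effort_limit;   /* desired working set size hint */
	uint64_t low_limit_guarantee; /* lower bound, defaults to 0 */
};

/* Check hard_limit >= best_effort_limit >= low_limit_guarantee. */
static bool limits_ordered(const struct memcg_limits *l)
{
	return l->hard_limit >= l->best_effort_limit &&
	       l->best_effort_limit >= l->low_limit_guarantee;
}

/*
 * May reclaim take pages from a group at the given usage?
 * oom_imminent means reclaim found nothing else and would OOM next.
 */
static bool may_reclaim(const struct memcg_limits *l, uint64_t usage,
			bool oom_imminent)
{
	if (usage <= l->low_limit_guarantee)
		return false;        /* guaranteed: OOM rather than reclaim */
	if (usage <= l->best_effort_limit)
		return oom_imminent; /* best effort: only as a last resort */
	return true;                 /* above the hint: always eligible */
}
```

With the default low_limit_guarantee of zero, no group is ever fully
protected, which matches the compatibility goal stated in 3).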

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-09 22:52                                         ` Greg Thelen
@ 2014-06-10 16:57                                           ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-10 16:57 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Michal Hocko, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	LKML, linux-mm

On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote:
> 
> On Fri, Jun 06 2014, Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Some users (e.g. Google) would like to have stronger semantics than the
> > low limit currently offers. The fallback mode is not desirable and they
> > prefer hitting the OOM killer rather than ignoring the low limit for
> > protected groups. There are other possible use cases which can benefit
> > from hard guarantees. I can imagine workloads where setting low_limit to
> > the same value as hard_limit to prevent any reclaim at all makes a lot
> > of sense because reclaim is much more disruptive than a restart of the
> > load.
> >
> > This patch adds a new per-memcg memory.reclaim_strategy knob which
> > tells what to do in a situation when memory reclaim cannot make any
> > progress because all groups in the reclaimed hierarchy are within their
> > low_limit. There are two options available:
> > 	- low_limit_best_effort - the current mode when reclaim falls
> > 	  back to the even reclaim of all groups in the reclaimed
> > 	  hierarchy
> > 	- low_limit_guarantee - groups within low_limit are never
> > 	  reclaimed and OOM killer is triggered instead. OOM message
> > 	  will mention the fact that the OOM was triggered due to
> > 	  low_limit reclaim protection.
> 
> To (a) be consistent with existing hard and soft limits APIs and (b)
> allow use of both best effort and guarantee memory limits, I wonder if
> it's best to offer three per memcg limits, rather than two limits (hard,
> low_limit) and a related reclaim_strategy knob.  The three limits I'm
> thinking about are:
> 
> 1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
>    change needed here.  This is an upper bound on a memcg hierarchy's
>    memory consumption (assuming use_hierarchy=1).

This creates internal pressure.  Outside reclaim is not affected by
it, but internal charges cannot exceed this limit.  It is set as a
hard limit on the maximum memory consumption of a group (max).

> 2) best_effort_limit (aka desired working set).  This allows an
>    application or administrator to provide a hint to the kernel about
>    desired working set size.  Before oom'ing the kernel is allowed to
>    reclaim below this limit.  I think the current soft_limit_in_bytes
>    claims to provide this.  If we prefer to deprecate
>    soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
>    hopefully better named) API seems reasonable.

This controls how external pressure applies to the group.

But it's conceivable that we'd like to have the equivalent of such a
soft limit for *internal* pressure.  Set below the hard limit, this
internal soft limit would have charges trigger direct reclaim in the
memcg but allow them to continue to the hard limit.  This would create
a situation wherein the allocating tasks are not killed, but throttled
under reclaim, which gives the administrator a window to detect the
situation with vmpressure and possibly intervene.  Because as it
stands, once the current hard limit is hit things can go down pretty
fast and the window for reacting to vmpressure readings is often too
small.  This would offer a more gradual deterioration.  It would be
set to the upper end of the working set size range (high).

I think for many users such an internal soft limit would actually be
preferred over the current hard limit, as they'd rather have some
reclaim throttling than an OOM kill when the group reaches its upper
bound.  The current hard limit would be reserved for more advanced or
paid cases, where the admin would rather see a memcg get OOM killed
than exceed a certain size.

Then, as you proposed, we'd have the soft limit for external pressure,
where the kernel only reclaims groups within that limit in order to
avoid OOM kills.  It would be set to the estimated lower end of the
working set size range (low).

> 3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
>    would prefer to be oom killed rather than operate below this
>    threshold.  Default value is zero to preserve compatibility with
>    existing apps.

And this would be the external pressure hard limit, which would be set
to the absolute minimum requirement of the group (min).

Either because it would be hopelessly thrashing without it, or because
this guaranteed memory is actually paid for.  Again, I would expect
many users to not even set this minimum guarantee but solely use the
external soft limit (low) instead.

> Logically hard_limit >= best_effort_limit >= low_limit_guarantee.

max >= high >= low >= min

I think we should be able to express all desired usecases with these
four limits, including the advanced configurations, while making it
easy for many users to set up groups without being a) dead certain
about their memory consumption or b) prepared for frequent OOM kills,
while still allowing them to properly utilize their machines.

What do you think?
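
The internal-pressure half of this model (high and max) can be sketched
in C as below.  It is a minimal illustration, not an implementation:
enum charge_result and classify_charge() are hypothetical names, and the
sketch ignores integer overflow of usage + nr.

```c
#include <stdint.h>

/* Hypothetical outcome of charging nr bytes to a group. */
enum charge_result {
	CHARGE_OK,        /* stays at or below high: no internal pressure */
	CHARGE_THROTTLED, /* above high but within max: the charger does
			   * direct reclaim, then the charge proceeds */
	CHARGE_OVER_MAX,  /* would exceed max: reclaim must make room,
			   * otherwise the group faces an OOM kill */
};

/* Classify a charge against the internal soft (high) and hard (max) limits. */
static enum charge_result classify_charge(uint64_t usage, uint64_t nr,
					  uint64_t high, uint64_t max)
{
	uint64_t new_usage = usage + nr; /* assumption: no overflow */

	if (new_usage > max)
		return CHARGE_OVER_MAX;
	if (new_usage > high)
		return CHARGE_THROTTLED;
	return CHARGE_OK;
}
```

The band between high and max is the window in which the allocating
tasks are slowed down rather than killed, giving an administrator time
to react to vmpressure.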

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-10 16:57                                           ` Johannes Weiner
@ 2014-06-10 22:16                                             ` Greg Thelen
  -1 siblings, 0 replies; 196+ messages in thread
From: Greg Thelen @ 2014-06-10 22:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	LKML, linux-mm


On Tue, Jun 10 2014, Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote:
>> 
>> On Fri, Jun 06 2014, Michal Hocko <mhocko@suse.cz> wrote:
>> 
>> > Some users (e.g. Google) would like to have stronger semantics than the
>> > low limit currently offers. The fallback mode is not desirable and they
>> > prefer hitting the OOM killer rather than ignoring the low limit for
>> > protected groups. There are other possible use cases which can benefit
>> > from hard guarantees. I can imagine workloads where setting low_limit to
>> > the same value as hard_limit to prevent any reclaim at all makes a lot
>> > of sense because reclaim is much more disruptive than a restart of the
>> > load.
>> >
>> > This patch adds a new per-memcg memory.reclaim_strategy knob which
>> > tells what to do in a situation when memory reclaim cannot make any
>> > progress because all groups in the reclaimed hierarchy are within their
>> > low_limit. There are two options available:
>> > 	- low_limit_best_effort - the current mode when reclaim falls
>> > 	  back to the even reclaim of all groups in the reclaimed
>> > 	  hierarchy
>> > 	- low_limit_guarantee - groups within low_limit are never
>> > 	  reclaimed and OOM killer is triggered instead. OOM message
>> > 	  will mention the fact that the OOM was triggered due to
>> > 	  low_limit reclaim protection.
>> 
>> To (a) be consistent with existing hard and soft limits APIs and (b)
>> allow use of both best effort and guarantee memory limits, I wonder if
>> it's best to offer three per memcg limits, rather than two limits (hard,
>> low_limit) and a related reclaim_strategy knob.  The three limits I'm
>> thinking about are:
>> 
>> 1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
>>    change needed here.  This is an upper bound on a memcg hierarchy's
>>    memory consumption (assuming use_hierarchy=1).
>
> This creates internal pressure.  Outside reclaim is not affected by
> it, but internal charges cannot exceed this limit.  It is set as a
> hard limit on the maximum memory consumption of a group (max).
>
>> 2) best_effort_limit (aka desired working set).  This allows an
>>    application or administrator to provide a hint to the kernel about
>>    desired working set size.  Before oom'ing the kernel is allowed to
>>    reclaim below this limit.  I think the current soft_limit_in_bytes
>>    claims to provide this.  If we prefer to deprecate
>>    soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
>>    hopefully better named) API seems reasonable.
>
> This controls how external pressure applies to the group.
>
> But it's conceivable that we'd like to have the equivalent of such a
> soft limit for *internal* pressure.  Set below the hard limit, this
> internal soft limit would have charges trigger direct reclaim in the
> memcg but allow them to continue to the hard limit.  This would create
> a situation wherein the allocating tasks are not killed, but throttled
> under reclaim, which gives the administrator a window to detect the
> situation with vmpressure and possibly intervene.  Because as it
> stands, once the current hard limit is hit things can go down pretty
> fast and the window for reacting to vmpressure readings is often too
> small.  This would offer a more gradual deterioration.  It would be
> set to the upper end of the working set size range (high).
>
> I think for many users such an internal soft limit would actually be
> preferred over the current hard limit, as they'd rather have some
> reclaim throttling than an OOM kill when the group reaches its upper
> bound.  The current hard limit would be reserved for more advanced or
> paid cases, where the admin would rather see a memcg get OOM killed
> than exceed a certain size.
>
> Then, as you proposed, we'd have the soft limit for external pressure,
> where the kernel only reclaims groups within that limit in order to
> avoid OOM kills.  It would be set to the estimated lower end of the
> working set size range (low).
>
>> 3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
>>    would prefer to be oom killed rather than operate below this
>>    threshold.  Default value is zero to preserve compatibility with
>>    existing apps.
>
> And this would be the external pressure hard limit, which would be set
> to the absolute minimum requirement of the group (min).
>
> Either because it would be hopelessly thrashing without it, or because
> this guaranteed memory is actually paid for.  Again, I would expect
> many users to not even set this minimum guarantee but solely use the
> external soft limit (low) instead.
>
>> Logically hard_limit >= best_effort_limit >= low_limit_guarantee.
>
> max >= high >= low >= min
>
> I think we should be able to express all desired usecases with these
> four limits, including the advanced configurations, while making it
> easy for many users to set up groups without being a) dead certain
> about their memory consumption or b) prepared for frequent OOM kills,
> while still allowing them to properly utilize their machines.
>
> What do you think?

Sounds good to me.

Recapping so I understand:
- max and high apply to internal pressure.
- low and min apply to external pressure.

The {max, high, low, min} names are better than mine.  Given that they
mimic the global watermarks, I keep wondering if a per-memcg kswapd would
someday use these new memcg watermarks.  But merely waking up some
futuristic per-memcg kswapd when usage crosses high would sacrifice the
throttling that gives vmpressure listeners time to respond.  So I think
what you've proposed is good for most use cases I have in mind.  Though
I'm not sure that I have an immediate use for the high wmark.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-10 16:57                                           ` Johannes Weiner
@ 2014-06-11  7:57                                             ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11  7:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	LKML, linux-mm

On Tue 10-06-14 12:57:56, Johannes Weiner wrote:
> On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote:
> > 
> > On Fri, Jun 06 2014, Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > Some users (e.g. Google) would like to have stronger semantics than the
> > > low limit currently offers. The fallback mode is not desirable and they
> > > prefer hitting the OOM killer rather than ignoring the low limit for
> > > protected groups. There are other possible use cases which can benefit
> > > from hard guarantees. I can imagine workloads where setting low_limit to
> > > the same value as hard_limit to prevent any reclaim at all makes a lot
> > > of sense because reclaim is much more disruptive than a restart of the
> > > load.
> > >
> > > This patch adds a new per-memcg memory.reclaim_strategy knob which
> > > tells what to do in a situation when memory reclaim cannot make any
> > > progress because all groups in the reclaimed hierarchy are within their
> > > low_limit. There are two options available:
> > > 	- low_limit_best_effort - the current mode when reclaim falls
> > > 	  back to the even reclaim of all groups in the reclaimed
> > > 	  hierarchy
> > > 	- low_limit_guarantee - groups within low_limit are never
> > > 	  reclaimed and OOM killer is triggered instead. OOM message
> > > 	  will mention the fact that the OOM was triggered due to
> > > 	  low_limit reclaim protection.
> > 
> > To (a) be consistent with existing hard and soft limits APIs and (b)
> > allow use of both best effort and guarantee memory limits, I wonder if
> > it's best to offer three per memcg limits, rather than two limits (hard,
> > low_limit) and a related reclaim_strategy knob.  The three limits I'm
> > thinking about are:
> > 
> > 1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
> >    change needed here.  This is an upper bound on a memcg hierarchy's
> >    memory consumption (assuming use_hierarchy=1).
> 
> > This creates internal pressure.  Outside reclaim is not affected by
> > it, but internal charges cannot exceed this limit.  It is set as a
> > hard limit on the maximum memory consumption of a group (max).
> 
> > 2) best_effort_limit (aka desired working set).  This allows an
> >    application or administrator to provide a hint to the kernel about
> >    desired working set size.  Before oom'ing the kernel is allowed to
> >    reclaim below this limit.  I think the current soft_limit_in_bytes
> >    claims to provide this.  If we prefer to deprecate
> >    soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
> >    hopefully better named) API seems reasonable.
> 
> This controls how external pressure applies to the group.
> 
> But it's conceivable that we'd like to have the equivalent of such a
> soft limit for *internal* pressure.  Set below the hard limit, this
> internal soft limit would have charges trigger direct reclaim in the
> memcg but allow them to continue to the hard limit.  This would create
> a situation wherein the allocating tasks are not killed, but throttled
> under reclaim, which gives the administrator a window to detect the
> situation with vmpressure and possibly intervene.  Because as it
> stands, once the current hard limit is hit things can go down pretty
> fast and the window for reacting to vmpressure readings is often too
> small.  This would offer a more gradual deterioration.  It would be
> set to the upper end of the working set size range (high).
> 
> I think for many users such an internal soft limit would actually be
> preferred over the current hard limit, as they'd rather have some
> reclaim throttling than an OOM kill when the group reaches its upper
> bound.  

Yes, this sounds useful. We have already discussed that and the
primary question is whether the high limit reclaim should be direct
or background. There are pros and cons for both. Direct reclaim is
much easier to implement but it is questionable whether it is too
heavy.  Background reclaim, on the other hand, is much trickier to
implement. Its obvious advantage would be closer convergence with the
global behavior while we still get the notification that something bad
is going on.  I assume that a dedicated workqueue would be doable but we
would definitely need an evaluation of what happens with zillions of
high_limit reclaimers.
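
To illustrate the direct variant, here is a rough userspace sketch of how
a charge could be classified against the two internal-pressure limits.
It is only a model under stated assumptions: struct memcg_limits, enum
charge_action and classify_charge() are made-up names for illustration,
none of them exist in the kernel, and overflow of usage + nr is ignored.

```c
#include <stdint.h>

/* Hypothetical model of the proposed internal-pressure limits. */
struct memcg_limits {
	uint64_t high;	/* internal soft limit: throttle above this */
	uint64_t max;	/* internal hard limit: must not be exceeded */
};

enum charge_action {
	CHARGE_OK,		/* stays within high: charge without reclaim */
	CHARGE_THROTTLE,	/* above high, within max: reclaim, then charge */
	CHARGE_RECLAIM_OR_OOM,	/* would exceed max: reclaim or fall back to OOM */
};

/* Decide what a charge of nr bytes would trigger at the given usage. */
static enum charge_action classify_charge(const struct memcg_limits *l,
					  uint64_t usage, uint64_t nr)
{
	uint64_t new_usage = usage + nr;

	if (new_usage > l->max)
		return CHARGE_RECLAIM_OR_OOM;
	if (new_usage > l->high)
		return CHARGE_THROTTLE;
	return CHARGE_OK;
}
```

In this model the window between high and max is where the allocating
task gets throttled rather than killed, which is the gradual
deterioration described in the quoted text.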

> The current hard limit would be reserved for more advanced or paid
> cases, where the admin would rather see a memcg get OOM killed than
> exceed a certain size.

So the hard_limit will not change, right? Still reclaim and fall back to
OOM if nothing can be reclaimed, as we do currently.

> Then, as you proposed, we'd have the soft limit for external pressure,
> where the kernel only reclaims groups within that limit in order to
> avoid OOM kills.  It would be set to the estimated lower end of the
> working set size range (low).

OK, that is how the current low_limit is implemented.

> > 3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
> >    would prefer to be oom killed rather than operate below this
> >    threshold.  Default value is zero to preserve compatibility with
> >    existing apps.
> 
> And this would be the external pressure hard limit, which would be set
> to the absolute minimum requirement of the group (min).
> 
> Either because it would be hopelessly thrashing without it, or because
> this guaranteed memory is actually paid for.  Again, I would expect
> many users to not even set this minimum guarantee but solely use the
> external soft limit (low) instead.
> 
> > Logically hard_limit >= best_effort_limit >= low_limit_guarantee.
> 
> max >= high >= low >= min

It might be a bit confusing for people familiar with the per-zone
watermarks, where the meaning is the opposite (hard reclaim if under min,
kswapd if between low and high). Nevertheless the names have a good
meaning in the memcg context so I would go with min, low and high as you
suggest.
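
For reference, the ordering and the external-pressure part of it can be
written down as a small userspace sketch. The struct and function names
below are hypothetical and only serve the illustration. The
"usage <= limit" comparison mirrors res_counter_min_limit_excess() from
the follow-up patch, which reports no excess when usage is at or below
the limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four proposed limits. */
struct memcg_watermarks {
	uint64_t min;	/* external hard guarantee */
	uint64_t low;	/* external best-effort protection */
	uint64_t high;	/* internal soft limit */
	uint64_t max;	/* internal hard limit */
};

/* The proposed invariant: max >= high >= low >= min. */
static bool memcg_watermarks_valid(const struct memcg_watermarks *w)
{
	return w->max >= w->high && w->high >= w->low && w->low >= w->min;
}

enum reclaim_protection {
	PROTECT_NONE,	/* above low: regular reclaim target */
	PROTECT_SOFT,	/* within low: reclaimed only as a fallback */
	PROTECT_HARD,	/* within min: never reclaimed, OOM instead */
};

/* How external reclaim would treat a group at the given usage. */
static enum reclaim_protection
classify_protection(const struct memcg_watermarks *w, uint64_t usage)
{
	if (usage <= w->min)
		return PROTECT_HARD;
	if (usage <= w->low)
		return PROTECT_SOFT;
	return PROTECT_NONE;
}
```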

> I think we should be able to express all desired usecases with these
> four limits, including the advanced configurations, while making it
> easy for many users to set up groups without being a) dead certain
> about their memory consumption or b) prepared for frequent OOM kills,
> while still allowing them to properly utilize their machines.
> 
> What do you think?

OK, I think this sounds viable. The low_limit part of it is already in
Andrew's tree and I will post follow-up patches for min_limit, which
are quite trivial, on top for further discussion.

Is this the kind of symmetry Tejun is asking for, and would it make him
change his Nack position? I am still not sure it satisfies his soft
guarantee objections from the other email.

I am also not sure whether high_limit necessarily has to be bundled with
{min,low}_limit in one patchset. I think we need it and we should
discuss what the best implementation is, but I do not see any reason to
postpone the memory protection part, which is quite independent of
pro-active memory reclaim.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* [PATCH 1/2] mm, memcg: allow OOM if no memcg is eligible during direct reclaim
  2014-06-11  7:57                                             ` Michal Hocko
@ 2014-06-11  8:00                                               ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11  8:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

If there is no memcg eligible for reclaim because all groups under the
reclaimed hierarchy are within their guarantee then the global direct
reclaim would end up in an endless loop because zones in the zonelists
are not considered unreclaimable (as per all_unreclaimable) and so the
OOM killer would never fire and direct reclaim would be triggered
with no chance to reclaim anything.

This is not possible yet because reclaim falls back to ignoring low_limit
when nobody is eligible for reclaim. The following patch will allow
setting the fallback mode to hard guarantee, though, so this is a
preparatory patch.

Memcg reclaim doesn't suffer from this because the OOM killer is
triggered after a few unsuccessful reclaim attempts.

Fix this by checking the number of scanned pages, which is obviously 0 if
nobody is eligible, and by also checking that the whole tree hierarchy is
not eligible, and then telling OOM it can go ahead.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
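The decision added at the end of the reclaim path can be modelled in
userspace roughly as follows. This is a simplified sketch, not the
kernel code: struct reclaim_result and should_retry_reclaim() are
hypothetical names, and the zones_reclaimable flag merely stands in for
the existing global_reclaim() && !all_unreclaimable() check.

```c
#include <stdbool.h>

/* Hypothetical summary of one direct reclaim attempt. */
struct reclaim_result {
	unsigned long nr_scanned;	/* pages scanned during the attempt */
	bool all_within_guarantee;	/* every group in the hierarchy is protected */
	bool zones_reclaimable;		/* stand-in for !all_unreclaimable() */
};

/*
 * Returns true when the caller should keep retrying reclaim and
 * false when it may go ahead and invoke the OOM killer.
 */
static bool should_retry_reclaim(const struct reclaim_result *r)
{
	/* Nothing was scanned and nobody is eligible: retrying cannot help. */
	if (!r->nr_scanned && r->all_within_guarantee)
		return false;

	/* Zones still look reclaimable, so do not OOM yet. */
	if (r->zones_reclaimable)
		return true;

	return false;
}
```

Without the first check, a hierarchy that is fully within its guarantee
would always take the second branch, which is the endless loop described
above.
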
 mm/vmscan.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8041b0667673..99137aecd95f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2570,6 +2570,13 @@ out:
 	if (aborted_reclaim)
 		return 1;
 
+	/*
+	 * If the target memcg is not eligible for reclaim then we have no option
+	 * but OOM
+	 */
+	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+		return 0;
+
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 196+ messages in thread

* [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-11  8:00                                               ` Michal Hocko
@ 2014-06-11  8:00                                                 ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11  8:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

Some users (e.g. Google) would like to have stronger semantics than the
low limit currently offers. The fallback mode is not desirable and they
prefer hitting the OOM killer rather than ignoring the low limit for
protected groups.

There are other possible usecases which can benefit from hard
guarantees. There are loads which will simply start thrashing if the
memory working set drops below a certain level and it is more appropriate
to simply kill and restart such a load if the required memory cannot
be provided. Another usecase would be hard memory isolation for
containers.

The min_limit is initialized to 0 and it has precedence over low_limit.
If the reclaim is not able to find any memcg in the reclaimed hierarchy
above min_limit then OOM killer is triggered to resolve the situation.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
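The resulting two-pass behaviour can be sketched in userspace as below.
This is a model under simplifying assumptions: it looks at a flat array
of groups and ignores the hierarchy walk and racing reclaimers, and
struct group, eligible_groups() and pick_reclaim_pass() are hypothetical
names used only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat model of a group: usage plus its two protections. */
struct group {
	uint64_t usage;
	uint64_t low;	/* soft (best-effort) protection */
	uint64_t min;	/* hard guarantee */
};

/* A group is protected if within min, or within low in the soft pass. */
static bool within_guarantee(const struct group *g, bool soft_guarantee)
{
	if (g->usage <= g->min)
		return true;
	return soft_guarantee && g->usage <= g->low;
}

/* Count the groups that a reclaim pass would be allowed to scan. */
static size_t eligible_groups(const struct group *gs, size_t n,
			      bool soft_guarantee)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (!within_guarantee(&gs[i], soft_guarantee))
			count++;
	return count;
}

enum reclaim_outcome {
	RECLAIM_SOFT_PASS,	/* somebody is above low: honour low and min */
	RECLAIM_HARD_PASS,	/* fallback: low is breached, min is honoured */
	RECLAIM_OOM,		/* everybody is within min: nothing to reclaim */
};

/* Pick the pass that finds something to reclaim, or give up and OOM. */
static enum reclaim_outcome pick_reclaim_pass(const struct group *gs, size_t n)
{
	if (eligible_groups(gs, n, true))
		return RECLAIM_SOFT_PASS;
	if (eligible_groups(gs, n, false))
		return RECLAIM_HARD_PASS;
	return RECLAIM_OOM;
}
```
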
 Documentation/cgroups/memory.txt | 26 ++++++++++++++++++--------
 include/linux/memcontrol.h       | 14 ++++++++------
 include/linux/res_counter.h      | 32 ++++++++++++++++++++++++++++++--
 mm/memcontrol.c                  | 18 +++++++++++-------
 mm/oom_kill.c                    |  6 ++++--
 mm/vmscan.c                      | 38 ++++++++++++++++++++++----------------
 6 files changed, 93 insertions(+), 41 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bf895d7e1363..6929a06c9e5d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -61,6 +61,7 @@ Brief summary of control files.
  memory.low_limit_breached	 # number of times low_limit has been
 				 # ignored and the cgroup reclaimed even
 				 # when it was above the limit
+ memory.min_limit_in_bytes	 # set/show min limit for memory reclaim
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
@@ -248,14 +249,23 @@ global VM. Cgroups can get reclaimed basically under two conditions
    to select and kill the bulkiest task in the hiearchy. (See 10. OOM Control
    below.)
 
-Groups might be also protected from both global and limit reclaim by
-low_limit_in_bytes knob. If the limit is non-zero the reclaim logic
-doesn't include groups (and their subgroups - see 6. Hierarchy support)
-which are below the low limit if there is other eligible cgroup in the
-reclaimed hierarchy. If all groups which participate reclaim are under
-their low limits then all of them are reclaimed and the low limit is
-ignored. low_limit_breached counter in memory.stat file can be checked
-to see how many times such an event occurred.
+Groups might be also protected from both global and limit reclaim
+by low_limit_in_bytes and min_limit_in_bytes knobs. The first one
+provides an optimistic reclaim protection while the latter one provides
+a hard memory reclaim protection guarantee. Both limits are 0 by default
+and the min limit always takes precedence over the low limit.
+
+If the low limit is non-zero the reclaim logic doesn't include
+groups (and their subgroups - see 6. Hierarchy support) which are
+below low_limit if there is other eligible cgroup in the reclaimed
+hierarchy. If all groups which participate reclaim are under their low
+limits then all of them are reclaimed and the low limit is ignored.
+low_limit_breached counter in memory.stat file can be checked to see how
+many times such an event occurred.
+
+If, however, all the groups under reclaimed hierarchy are under their min
+limits then no reclaim is done and OOM killer is triggered to resolve the
+situation. In other words min_limit is never breached by the reclaim.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e2ca2163b12..ddb96729a6b6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -93,10 +93,11 @@ bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
 extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root);
+		struct mem_cgroup *root, bool soft_guarantee);
 
-extern void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg);
-extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
+extern void mem_cgroup_soft_guarantee_breached(struct mem_cgroup *memcg);
+extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root,
+		bool soft_guarantee);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -295,14 +296,15 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 }
 
 static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
+		struct mem_cgroup *root, bool soft_guarantee)
 {
 	return false;
 }
-static inline  void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg)
+static inline  void mem_cgroup_soft_guarantee_breached(struct mem_cgroup *memcg)
 {
 }
-static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
+static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root,
+		bool soft_guarantee)
 {
 	return false;
 }
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index b810855024f9..21dff6507aa7 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,11 +40,17 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
-	 * the limit under which the usage cannot be pushed
-	 * due to external pressure.
+	 * the limit under which the usage shouldn't be pushed
+	 * due to external pressure if it is possible.
 	 */
 	unsigned long long low_limit;
 	/*
+	 * the limit under which the usage cannot be pushed
+	 * due to external pressure.
+	 */
+	unsigned long long min_limit;
+
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -203,6 +209,28 @@ res_counter_low_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+/**
+ * Get the difference between the usage and the min limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to min limit
+ * The difference between usage and min limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_min_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->min_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->min_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f62b6533f60..26f137175f1c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2816,19 +2816,23 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
  * memory guarantee
  * @memcg: target memcg for the reclaim
  * @root: root of the reclaim hierarchy (null for the global reclaim)
+ * @soft_guarantee: is the guarantee soft (allows fallback).
  *
- * The given group is within its reclaim gurantee if it is below its low limit
- * or the same applies for any parent up the hierarchy until root (including).
+ * The given group is within its reclaim guarantee if it is below its min
+ * limit or, if soft_guarantee is true, below its low limit, or if the same
+ * holds for any parent up the hierarchy, root included.
  * Such a group might be excluded from the reclaim.
  */
 bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
+		struct mem_cgroup *root, bool soft_guarantee)
 {
 	if (mem_cgroup_disabled())
 		return false;
 
 	do {
-		if (!res_counter_low_limit_excess(&memcg->res))
+		if (!res_counter_min_limit_excess(&memcg->res))
+			return true;
+		if (soft_guarantee && !res_counter_low_limit_excess(&memcg->res))
 			return true;
 		if (memcg == root)
 			break;
@@ -2838,17 +2842,17 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 	return false;
 }
 
-void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg)
+void mem_cgroup_soft_guarantee_breached(struct mem_cgroup *memcg)
 {
 	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_LOW_LIMIT_FALLBACK]);
 }
 
-bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
+bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root, bool soft_guarantee)
 {
 	struct mem_cgroup *iter;
 
 	for_each_mem_cgroup_tree(iter, root)
-		if (!mem_cgroup_within_guarantee(iter, root)) {
+		if (!mem_cgroup_within_guarantee(iter, root, soft_guarantee)) {
 			mem_cgroup_iter_break(root, iter);
 			return false;
 		}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3291e82d4352..e44b471af476 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -392,9 +392,11 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_score_adj=%hd\n",
+		"oom_score_adj=%hd%s\n",
 		current->comm, gfp_mask, order,
-		current->signal->oom_score_adj);
+		current->signal->oom_score_adj,
+		mem_cgroup_all_within_guarantee(memcg, false) ?
+		" because all groups are within min_limit guarantee":"");
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99137aecd95f..8e844bd42c51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2220,13 +2220,12 @@ static inline bool should_continue_reclaim(struct zone *zone,
  *
  * @zone: zone to shrink
  * @sc: scan control with additional reclaim parameters
- * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
- * guarantee
+ * @soft_guarantee: Use soft guarantee reclaim target for memcg reclaim.
  *
  * Returns the number of reclaimed memcgs.
  */
 static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool honor_memcg_guarantee)
+		bool soft_guarantee)
 {
 	unsigned long nr_reclaimed, nr_scanned;
 	unsigned nr_scanned_groups = 0;
@@ -2245,11 +2244,10 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		memcg = mem_cgroup_iter(root, NULL, &reclaim);
 		do {
 			struct lruvec *lruvec;
-			bool within_guarantee;
 
 			/* Memcg might be protected from the reclaim */
-			within_guarantee = mem_cgroup_within_guarantee(memcg, root);
-			if (honor_memcg_guarantee && within_guarantee) {
+			if (mem_cgroup_within_guarantee(memcg, root,
+						soft_guarantee)) {
 				/*
 				 * It would be more optimal to skip the memcg
 				 * subtree now but we do not have a memcg iter
@@ -2259,8 +2257,8 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 				continue;
 			}
 
-			if (within_guarantee)
-				mem_cgroup_guarantee_breached(memcg);
+			if (!soft_guarantee)
+				mem_cgroup_soft_guarantee_breached(memcg);
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			nr_scanned_groups++;
@@ -2297,20 +2295,27 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	bool honor_guarantee = true;
+	bool soft_guarantee = true;
 
-	while (!__shrink_zone(zone, sc, honor_guarantee)) {
+	while (!__shrink_zone(zone, sc, soft_guarantee)) {
 		/*
 		 * The previous round of reclaim didn't find anything to scan
 		 * because
-		 * a) the whole reclaimed hierarchy is within guarantee so
-		 *    we fallback to ignore the guarantee because other option
-		 *    would be the OOM
+		 * a) the whole reclaimed hierarchy is within soft guarantee so
+		 *    we are switching to the hard guarantee reclaim target
 		 * b) multiple reclaimers are racing and so the first round
 		 *    should be retried
 		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
-			honor_guarantee = false;
+		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup,
+					soft_guarantee)) {
+			/*
+			 * Nothing to reclaim even with hard guarantees so
+			 * we have to OOM
+			 */
+			if (!soft_guarantee)
+				break;
+			soft_guarantee = false;
+		}
 	}
 }
 
@@ -2574,7 +2579,8 @@ out:
 	 * If the target memcg is not eligible for reclaim then we have no option
 	 * but OOM
 	 */
-	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+	if (!sc->nr_scanned &&
+			mem_cgroup_all_within_guarantee(sc->target_mem_cgroup, false))
 		return 0;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 196+ messages in thread

* [PATCH 2/2] memcg: Allow guarantee reclaim
@ 2014-06-11  8:00                                                 ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11  8:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

Some users (e.g. Google) would like to have stronger semantics than the
low limit currently offers. The fallback mode is not desirable and they
prefer hitting the OOM killer rather than ignoring the low limit for
protected groups.

There are other possible usecases which can benefit from hard
guarantees. There are loads which will simply start thrashing if the
memory working set drops below a certain level and it is more appropriate
to simply kill and restart such a load if the required memory cannot
be provided. Another usecase would be hard memory isolation for
containers.

The min_limit is initialized to 0 and it has precedence over low_limit.
If the reclaim is not able to find any memcg in the reclaimed hierarchy
above min_limit then OOM killer is triggered to resolve the situation.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt | 26 ++++++++++++++++++--------
 include/linux/memcontrol.h       | 14 ++++++++------
 include/linux/res_counter.h      | 32 ++++++++++++++++++++++++++++++--
 mm/memcontrol.c                  | 18 +++++++++++-------
 mm/oom_kill.c                    |  6 ++++--
 mm/vmscan.c                      | 38 ++++++++++++++++++++++----------------
 6 files changed, 93 insertions(+), 41 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bf895d7e1363..6929a06c9e5d 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -61,6 +61,7 @@ Brief summary of control files.
  memory.low_limit_breached	 # number of times low_limit has been
 				 # ignored and the cgroup reclaimed even
 				 # when it was above the limit
+ memory.min_limit_in_bytes	 # set/show min limit for memory reclaim
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
@@ -248,14 +249,23 @@ global VM. Cgroups can get reclaimed basically under two conditions
    to select and kill the bulkiest task in the hiearchy. (See 10. OOM Control
    below.)
 
-Groups might be also protected from both global and limit reclaim by
-low_limit_in_bytes knob. If the limit is non-zero the reclaim logic
-doesn't include groups (and their subgroups - see 6. Hierarchy support)
-which are below the low limit if there is other eligible cgroup in the
-reclaimed hierarchy. If all groups which participate reclaim are under
-their low limits then all of them are reclaimed and the low limit is
-ignored. low_limit_breached counter in memory.stat file can be checked
-to see how many times such an event occurred.
+Groups might be also protected from both global and limit reclaim
+by low_limit_in_bytes and min_limit_in_bytes knobs. The first one
+provides an optimistic reclaim protection while the latter one provides
+hard memory reclaim protection guarantee. Both limits are 0 by default
+and min watermark always takes precedence over low watermark.
+
+If the low limit is non-zero the reclaim logic doesn't include
+groups (and their subgroups - see 6. Hierarchy support) which are
+below low_limit if there is other eligible cgroup in the reclaimed
+hierarchy. If all groups which participate reclaim are under their low
+limits then all of them are reclaimed and the low limit is ignored.
+low_limit_breached counter in memory.stat file can be checked to see how
+many times such an event occurred.
+
+If, however, all the groups under reclaimed hierarchy are under their min
+limits then no reclaim is done and OOM killer is triggered to resolve the
+situation. In other words min_limit is never breached by the reclaim.
 
 Note2: When panic_on_oom is set to "2", the whole system will panic.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e2ca2163b12..ddb96729a6b6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -93,10 +93,11 @@ bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
 extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root);
+		struct mem_cgroup *root, bool soft_guarantee);
 
-extern void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg);
-extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
+extern void mem_cgroup_soft_guarantee_breached(struct mem_cgroup *memcg);
+extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root,
+		bool soft_guarantee);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -295,14 +296,15 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 }
 
 static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
+		struct mem_cgroup *root, bool soft_guarantee)
 {
 	return false;
 }
-static inline  void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg)
+static inline  void mem_cgroup_soft_guarantee_breached(struct mem_cgroup *memcg)
 {
 }
-static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
+static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root,
+		bool soft_guarantee)
 {
 	return false;
 }
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index b810855024f9..21dff6507aa7 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -40,11 +40,17 @@ struct res_counter {
 	 */
 	unsigned long long soft_limit;
 	/*
-	 * the limit under which the usage cannot be pushed
-	 * due to external pressure.
+	 * the limit under which the usage shouldn't be pushed
+	 * due to external pressure if it is possible.
 	 */
 	unsigned long long low_limit;
 	/*
+	 * the limit under which the usage cannot be pushed
+	 * due to external pressure.
+	 */
+	unsigned long long min_limit;
+
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -203,6 +209,28 @@ res_counter_low_limit_excess(struct res_counter *cnt)
 	return excess;
 }
 
+/**
+ * Get the difference between the usage and the min limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to min limit
+ * The difference between usage and min limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_min_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->min_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->min_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f62b6533f60..26f137175f1c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2816,19 +2816,23 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
  * memory guarantee
  * @memcg: target memcg for the reclaim
  * @root: root of the reclaim hierarchy (null for the global reclaim)
+ * @soft_guarantee: is the guarantee soft (allows fallback).
  *
- * The given group is within its reclaim gurantee if it is below its low limit
- * or the same applies for any parent up the hierarchy until root (including).
+ * The given group is within its reclaim guarantee if it is below its min limit
+ * or, if soft_guarantee is true, below its low limit.
+ * The same applies for any parent up the hierarchy until root (including).
  * Such a group might be excluded from the reclaim.
  */
 bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
+		struct mem_cgroup *root, bool soft_guarantee)
 {
 	if (mem_cgroup_disabled())
 		return false;
 
 	do {
-		if (!res_counter_low_limit_excess(&memcg->res))
+		if (!res_counter_min_limit_excess(&memcg->res))
+			return true;
+		if (soft_guarantee && !res_counter_low_limit_excess(&memcg->res))
 			return true;
 		if (memcg == root)
 			break;
@@ -2838,17 +2842,17 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
 	return false;
 }
 
-void mem_cgroup_guarantee_breached(struct mem_cgroup *memcg)
+void mem_cgroup_soft_guarantee_breached(struct mem_cgroup *memcg)
 {
 	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_LOW_LIMIT_FALLBACK]);
 }
 
-bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
+bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root, bool soft_guarantee)
 {
 	struct mem_cgroup *iter;
 
 	for_each_mem_cgroup_tree(iter, root)
-		if (!mem_cgroup_within_guarantee(iter, root)) {
+		if (!mem_cgroup_within_guarantee(iter, root, soft_guarantee)) {
 			mem_cgroup_iter_break(root, iter);
 			return false;
 		}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3291e82d4352..e44b471af476 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -392,9 +392,11 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 {
 	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_score_adj=%hd\n",
+		"oom_score_adj=%hd%s\n",
 		current->comm, gfp_mask, order,
-		current->signal->oom_score_adj);
+		current->signal->oom_score_adj,
+		mem_cgroup_all_within_guarantee(memcg, false) ?
+		" because all groups are within min_limit guarantee":"");
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
 	dump_stack();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99137aecd95f..8e844bd42c51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2220,13 +2220,12 @@ static inline bool should_continue_reclaim(struct zone *zone,
  *
  * @zone: zone to shrink
  * @sc: scan control with additional reclaim parameters
- * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
- * guarantee
+ * @soft_guarantee: Use soft guarantee reclaim target for memcg reclaim.
  *
  * Returns the number of reclaimed memcgs.
  */
 static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool honor_memcg_guarantee)
+		bool soft_guarantee)
 {
 	unsigned long nr_reclaimed, nr_scanned;
 	unsigned nr_scanned_groups = 0;
@@ -2245,11 +2244,10 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		memcg = mem_cgroup_iter(root, NULL, &reclaim);
 		do {
 			struct lruvec *lruvec;
-			bool within_guarantee;
 
 			/* Memcg might be protected from the reclaim */
-			within_guarantee = mem_cgroup_within_guarantee(memcg, root);
-			if (honor_memcg_guarantee && within_guarantee) {
+			if (mem_cgroup_within_guarantee(memcg, root,
+						soft_guarantee)) {
 				/*
 				 * It would be more optimal to skip the memcg
 				 * subtree now but we do not have a memcg iter
@@ -2259,8 +2257,8 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 				continue;
 			}
 
-			if (within_guarantee)
-				mem_cgroup_guarantee_breached(memcg);
+			if (!soft_guarantee)
+				mem_cgroup_soft_guarantee_breached(memcg);
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			nr_scanned_groups++;
@@ -2297,20 +2295,27 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	bool honor_guarantee = true;
+	bool soft_guarantee = true;
 
-	while (!__shrink_zone(zone, sc, honor_guarantee)) {
+	while (!__shrink_zone(zone, sc, soft_guarantee)) {
 		/*
 		 * The previous round of reclaim didn't find anything to scan
 		 * because
-		 * a) the whole reclaimed hierarchy is within guarantee so
-		 *    we fallback to ignore the guarantee because other option
-		 *    would be the OOM
+		 * a) the whole reclaimed hierarchy is within soft guarantee so
+		 *    we are switching to the hard guarantee reclaim target
 		 * b) multiple reclaimers are racing and so the first round
 		 *    should be retried
 		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
-			honor_guarantee = false;
+		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup,
+					soft_guarantee)) {
+			/*
+			 * Nothing to reclaim even with hard guarantees so
+			 * we have to OOM
+			 */
+			if (!soft_guarantee)
+				break;
+			soft_guarantee = false;
+		}
 	}
 }
 
@@ -2574,7 +2579,8 @@ out:
 	 * If the target memcg is not eligible for reclaim then we have no option
 	 * but OOM
 	 */
-	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
+	if (!sc->nr_scanned &&
+			mem_cgroup_all_within_guarantee(sc->target_mem_cgroup, false))
 		return 0;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-11  7:57                                             ` Michal Hocko
@ 2014-06-11 12:31                                               ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-11 12:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, LKML, linux-mm

Hello, Michal.

On Wed, Jun 11, 2014 at 09:57:29AM +0200, Michal Hocko wrote:
> Is this the kind of symmetry Tejun is asking for and that would make
> change his Nack position? I am still not sure it satisfies his soft

Yes, pretty much.  What primarily bothered me was the soft/hard
guarantees being chosen by a toggle switch while the soft/hard limits
can be configured separately and combined.

> guarantee objections from other email.

I was wondering about the usefulness of "low" itself in isolation and
I still think it'd be less useful than "high", but as there seem to be
use cases which can be served with that and especially as a part of a
consistent control scheme, I have no objection.

"low" definitely requires a notification mechanism tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-11 12:31                                               ` Tejun Heo
@ 2014-06-11 14:11                                                 ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11 14:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, LKML, linux-mm

On Wed 11-06-14 08:31:09, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Jun 11, 2014 at 09:57:29AM +0200, Michal Hocko wrote:
> > Is this the kind of symmetry Tejun is asking for and that would make
> > change his Nack position? I am still not sure it satisfies his soft
> 
> Yes, pretty much.  What primarily bothered me was the soft/hard
> guarantees being chosen by a toggle switch while the soft/hard limits
> can be configured separately and combined.

The last consensus at LSF was that there would be a knob which will
distinguish hard/best effort behavior. The weaker semantic has strong
usecases IMHO so I wanted to start with it and add a knob for the hard
guarantee later when explicitly asked for.

Going with min, low, high and hard makes more sense to me of course.

> > guarantee objections from other email.
> 
> I was wondering about the usefulness of "low" itself in isolation and

I think it has more use cases than "min" from a purely practical POV. OOM
means a potential service downtime and that is a no go. Optimistic
isolation on the other hand offers the advantages of isolation most of
the time while not falling completely flat in an exceptional situation
(be it a misconfiguration or a corner case like the ones mentioned
during the discussion).

That doesn't mean "min" is not useful. It definitely is, the category
of usecases will be more specific though.

> I still think it'd be less useful than "high", but as there seem to be
> use cases which can be served with that and especially as a part of a
> consistent control scheme, I have no objection.
> 
> "low" definitely requires a notification mechanism tho.

Would vmpressure notification be sufficient? That one is in place for
any memcg which is reclaimed.

Or are you thinking about something more like oom_control?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-05-05 14:21               ` Michal Hocko
@ 2014-06-11 15:15                 ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-11 15:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, May 05, 2014 at 04:21:00PM +0200, Michal Hocko wrote:
> On Fri 02-05-14 18:00:56, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > > > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > index c1cd99a5074b..0f428158254e 100644
> > > > > > > --- a/mm/vmscan.c
> > > > > > > +++ b/mm/vmscan.c
> > > > > [...]
> > > > > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > > > > +{
> > > > > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > > > > +		/*
> > > > > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > > > > +		 * because of low limit protection so try again and ignore
> > > > > > > +		 * the low limit this time.
> > > > > > > +		 */
> > > > > > > +		__shrink_zone(zone, sc, false);
> > > > > > > +	}
> > > > 
> > > > So I don't think this can work as it is, because we are not actually
> > > > changing priority levels yet. 
> > > 
> > > __shrink_zone returns with 0 only if the whole hierarchy is under low
> > > limit. This means that they are over-committed and it doesn't make much
> > > sense to play with priority. Low limit reclaimability is independent on
> > > the priority.
> > > 
> > > > It will give up on the guarantees of bigger groups way before smaller
> > > > groups are even seriously looked at.
> > > 
> > > How would that happen? Those (smaller) groups would get reclaimed and we
> > > wouldn't fallback. Or am I missing your point?
> > 
> > Lol, I hadn't updated my brain to a394cb8ee632 ("memcg,vmscan: do not
> > break out targeted reclaim without reclaimed pages") yet...  Yes, you
> > are right.
> 
> You made me think about this more and you are right ;).
> The code as is doesn't cope with many racing reclaimers: some
> threads can fall back to ignoring the lowlimit although there are groups
> to scan in the hierarchy, because those were visited by other reclaimers.
> The patch below should help with that. What do you think?
> I am also thinking we want to add a fallback counter in memory.stat?
> ---
> >From e997b8b4ac724aa29bdeff998d2186ee3c0a97d8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 5 May 2014 15:12:18 +0200
> Subject: [PATCH] vmscan: memcg: check whether the low limit should be ignored
> 
> Low-limit (aka guarantee) is ignored when there is no group scanned
> during the first round of __shrink_zone. This approach doesn't work when
> multiple reclaimers race and reclaim the same hierarchy (e.g. kswapd
> vs. direct reclaim or multiple tasks hitting the hard limit) because
> memcg iterator makes sure that multiple reclaimers are interleaved
> in the hierarchy. This means that some reclaimers can see 0 scanned
> groups although there are groups which are above the low-limit and they
> were reclaimed on behalf of other reclaimers. This leads to a premature
> low-limit break.
> 
> This patch adds mem_cgroup_all_within_guarantee() which will check
> whether all the groups in the reclaimed hierarchy are within their low
> limit and shrink_zone will allow the fallback reclaim only when that is
> true. This alone is still not sufficient however because it would lead
> to another problem. If a reclaimer constantly fails to scan anything
> because it sees only groups within their guarantees while others do the
> reclaim then the reclaim priority would drop down very quickly.
> shrink_zone has to be careful to preserve the "scan at least one group"
> semantic so __shrink_zone has to be retried until at least one group
> is scanned.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/memcontrol.h |  5 +++++
>  mm/memcontrol.c            | 13 +++++++++++++
>  mm/vmscan.c                | 17 ++++++++++++-----
>  3 files changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c00ccc5f70b9..077a777bd9ff 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -94,6 +94,7 @@ bool task_in_mem_cgroup(struct task_struct *task,
>  
>  extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root);
> +extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -296,6 +297,10 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  {
>  	return false;
>  }
> +static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> +{
> +	return false;
> +}
>  
>  static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 58982d18f6ea..4fd4784d1548 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2833,6 +2833,19 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  	return false;
>  }
>  
> +bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> +{
> +	struct mem_cgroup *iter;
> +
> +	for_each_mem_cgroup_tree(iter, root)
> +		if (!mem_cgroup_within_guarantee(iter, root)) {
> +			mem_cgroup_iter_break(root, iter);
> +			return false;
> +		}
> +
> +	return true;
> +}
> +
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
>  	struct mem_cgroup *memcg = NULL;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5f923999bb79..2686e47f04cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2293,13 +2293,20 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	if (!__shrink_zone(zone, sc, true)) {
> +	bool honor_guarantee = true;
> +
> +	while (!__shrink_zone(zone, sc, honor_guarantee)) {
>  		/*
> -		 * First round of reclaim didn't find anything to reclaim
> -		 * because of the memory guantees for all memcgs in the
> -		 * reclaim target so try again and ignore guarantees this time.
> +		 * The previous round of reclaim didn't find anything to scan
> +		 * because
> +		 * a) the whole reclaimed hierarchy is within guarantee so
> +		 *    we fallback to ignore the guarantee because other option
> +		 *    would be the OOM
> +		 * b) multiple reclaimers are racing and so the first round
> +		 *    should be retried
>  		 */
> -		__shrink_zone(zone, sc, false);
> +		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> +			honor_guarantee = false;
>  	}

I don't like that this adds a nonchalant `for each memcg' here, we
can have a lot of memcgs.  Sooner or later we'll have to break up that
full hierarchy iteration in shrink_zone() because of scalability, I
want to avoid adding more of them.

How about these changes on top of what we currently have?  Sure it's
not as accurate, but it should be a good start, and it's a *lot* less
overhead.

mem_cgroup_watermark() is also a more fitting name, given that this
has nothing to do with a guarantee for now.

It can also be easily extended to support the MIN watermark while the
code in vmscan.c remains readable.

---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a5cf853129ec..6167bed81d78 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,11 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+enum memcg_watermark {
+	MEMCG_WMARK_NORMAL,
+	MEMCG_WMARK_LOW,
+};
+
 #ifdef CONFIG_MEMCG
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -92,9 +97,8 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
-extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root);
-extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
+enum memcg_watermark mem_cgroup_watermark(struct mem_cgroup *root,
+					  struct mem_cgroup *memcg);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -292,16 +296,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
-{
-	return false;
-}
-static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
-{
-	return false;
-}
-
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -319,6 +313,12 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline enum memcg_watermark
+mem_cgroup_watermark(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+	return MEMCG_WMARK_NORMAL;
+}
+
 static inline struct cgroup_subsys_state
 		*mem_cgroup_css(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ff5b8e297fd..8ee8786a286c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2780,44 +2780,20 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
-/**
- * mem_cgroup_within_guarantee - checks whether given memcg is within its
- * memory guarantee
- * @memcg: target memcg for the reclaim
- * @root: root of the reclaim hierarchy (null for the global reclaim)
- *
- * The given group is within its reclaim gurantee if it is below its low limit
- * or the same applies for any parent up the hierarchy until root (including).
- * Such a group might be excluded from the reclaim.
- */
-bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
+enum memcg_watermark mem_cgroup_watermark(struct mem_cgroup *root,
+					  struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
-		return false;
+		return MEMCG_WMARK_NORMAL;
 
 	do {
 		if (!res_counter_low_limit_excess(&memcg->res))
-			return true;
+			return MEMCG_WMARK_LOW;
 		if (memcg == root)
 			break;
-
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
-	return false;
-}
-
-bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
-{
-	struct mem_cgroup *iter;
-
-	for_each_mem_cgroup_tree(iter, root)
-		if (!mem_cgroup_within_guarantee(iter, root)) {
-			mem_cgroup_iter_break(root, iter);
-			return false;
-		}
-
-	return true;
+	return MEMCG_WMARK_NORMAL;
 }
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b19ebb3a666b..687076b7a1a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2231,21 +2231,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-/**
- * __shrink_zone - shrinks a given zone
- *
- * @zone: zone to shrink
- * @sc: scan control with additional reclaim parameters
- * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
- * guarantee
- *
- * Returns the number of reclaimed memcgs.
- */
-static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool honor_memcg_guarantee)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long nr_reclaimed, nr_scanned;
-	unsigned nr_scanned_groups = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2262,20 +2250,22 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			struct lruvec *lruvec;
 
-			/* Memcg might be protected from the reclaim */
-			if (honor_memcg_guarantee &&
-					mem_cgroup_within_guarantee(memcg, root)) {
+			switch (mem_cgroup_watermark(root, memcg)) {
+			case MEMCG_WMARK_LOW:
 				/*
-				 * It would be more optimal to skip the memcg
-				 * subtree now but we do not have a memcg iter
-				 * helper for that. Anyone?
+				 * Memcg within the configured low
+				 * watermark: try to avoid reclaim
+				 * until the reclaimer struggles.
 				 */
+				if (sc->priority < DEF_PRIORITY - 2)
+					break;
+
+				/* XXX: skip the whole subtree */
 				memcg = mem_cgroup_iter(root, memcg, &reclaim);
 				continue;
 			}
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-			nr_scanned_groups++;
 
 			sc->swappiness = mem_cgroup_swappiness(memcg);
 			shrink_lruvec(lruvec, sc);
@@ -2304,27 +2294,6 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
-
-	return nr_scanned_groups;
-}
-
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
-{
-	bool honor_guarantee = true;
-
-	while (!__shrink_zone(zone, sc, honor_guarantee)) {
-		/*
-		 * The previous round of reclaim didn't find anything to scan
-		 * because
-		 * a) the whole reclaimed hierarchy is within guarantee so
-		 *    we fallback to ignore the guarantee because other option
-		 *    would be the OOM
-		 * b) multiple reclaimers are racing and so the first round
-		 *    should be retried
-		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
-			honor_guarantee = false;
-	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
@ 2014-06-11 15:15                 ` Johannes Weiner
  0 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-11 15:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Mon, May 05, 2014 at 04:21:00PM +0200, Michal Hocko wrote:
> On Fri 02-05-14 18:00:56, Johannes Weiner wrote:
> > On Fri, May 02, 2014 at 06:49:30PM +0200, Michal Hocko wrote:
> > > On Fri 02-05-14 11:58:05, Johannes Weiner wrote:
> > > > On Fri, May 02, 2014 at 11:36:28AM +0200, Michal Hocko wrote:
> > > > > On Wed 30-04-14 18:55:50, Johannes Weiner wrote:
> > > > > > On Mon, Apr 28, 2014 at 02:26:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > index c1cd99a5074b..0f428158254e 100644
> > > > > > > --- a/mm/vmscan.c
> > > > > > > +++ b/mm/vmscan.c
> > > > > [...]
> > > > > > > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > > > > > > +{
> > > > > > > +	if (!__shrink_zone(zone, sc, true)) {
> > > > > > > +		/*
> > > > > > > +		 * First round of reclaim didn't find anything to reclaim
> > > > > > > +		 * because of low limit protection so try again and ignore
> > > > > > > +		 * the low limit this time.
> > > > > > > +		 */
> > > > > > > +		__shrink_zone(zone, sc, false);
> > > > > > > +	}
> > > > 
> > > > So I don't think this can work as it is, because we are not actually
> > > > changing priority levels yet. 
> > > 
> > > __shrink_zone returns with 0 only if the whole hierarchy is under low
> > > limit. This means that they are over-committed and it doesn't make much
> > > sense to play with priority. Low limit reclaimability is independent of
> > > the priority.
> > > 
> > > > It will give up on the guarantees of bigger groups way before smaller
> > > > groups are even seriously looked at.
> > > 
> > > How would that happen? Those (smaller) groups would get reclaimed and we
> > > wouldn't fallback. Or am I missing your point?
> > 
> > Lol, I hadn't updated my brain to a394cb8ee632 ("memcg,vmscan: do not
> > break out targeted reclaim without reclaimed pages") yet...  Yes, you
> > are right.
> 
> You made me think about this more and you are right ;).
> The code as is doesn't cope with many racing reclaimers when some
> threads can fall back to ignoring the lowlimit although there are groups to
> scan in the hierarchy but they were visited by other reclaimers.
> The patch below should help with that. What do you think?
> I am also thinking we want to add a fallback counter in memory.stat?
> ---
> >From e997b8b4ac724aa29bdeff998d2186ee3c0a97d8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 5 May 2014 15:12:18 +0200
> Subject: [PATCH] vmscan: memcg: check whether the low limit should be ignored
> 
> Low-limit (aka guarantee) is ignored when there is no group scanned
> during the first round of __shrink_zone. This approach doesn't work when
> multiple reclaimers race and reclaim the same hierarchy (e.g. kswapd
> vs. direct reclaim or multiple tasks hitting the hard limit) because
> memcg iterator makes sure that multiple reclaimers are interleaved
> in the hierarchy. This means that some reclaimers can see 0 scanned
> groups although there are groups which are above the low-limit and they
> were reclaimed on behalf of other reclaimers. This leads to a premature
> low-limit break.
> 
> This patch adds mem_cgroup_all_within_guarantee() which will check
> whether all the groups in the reclaimed hierarchy are within their low
> limit and shrink_zone will allow the fallback reclaim only when that is
> true. This alone is still not sufficient however because it would lead
> to another problem. If a reclaimer constantly fails to scan anything
> because it sees only groups within their guarantees while others do the
> reclaim then the reclaim priority would drop down very quickly.
> shrink_zone has to be careful to preserve the "scan at least one group"
> semantic so __shrink_zone has to be retried until at least one group
> is scanned.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/memcontrol.h |  5 +++++
>  mm/memcontrol.c            | 13 +++++++++++++
>  mm/vmscan.c                | 17 ++++++++++++-----
>  3 files changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c00ccc5f70b9..077a777bd9ff 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -94,6 +94,7 @@ bool task_in_mem_cgroup(struct task_struct *task,
>  
>  extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  		struct mem_cgroup *root);
> +extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
>  
>  extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -296,6 +297,10 @@ static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  {
>  	return false;
>  }
> +static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> +{
> +	return false;
> +}
>  
>  static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 58982d18f6ea..4fd4784d1548 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2833,6 +2833,19 @@ bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
>  	return false;
>  }
>  
> +bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
> +{
> +	struct mem_cgroup *iter;
> +
> +	for_each_mem_cgroup_tree(iter, root)
> +		if (!mem_cgroup_within_guarantee(iter, root)) {
> +			mem_cgroup_iter_break(root, iter);
> +			return false;
> +		}
> +
> +	return true;
> +}
> +
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
>  	struct mem_cgroup *memcg = NULL;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5f923999bb79..2686e47f04cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2293,13 +2293,20 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	if (!__shrink_zone(zone, sc, true)) {
> +	bool honor_guarantee = true;
> +
> +	while (!__shrink_zone(zone, sc, honor_guarantee)) {
>  		/*
> -		 * First round of reclaim didn't find anything to reclaim
> -		 * because of the memory guantees for all memcgs in the
> -		 * reclaim target so try again and ignore guarantees this time.
> +		 * The previous round of reclaim didn't find anything to scan
> +		 * because
> +		 * a) the whole reclaimed hierarchy is within guarantee so
> +		 *    we fallback to ignore the guarantee because other option
> +		 *    would be the OOM
> +		 * b) multiple reclaimers are racing and so the first round
> +		 *    should be retried
>  		 */
> -		__shrink_zone(zone, sc, false);
> +		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> +			honor_guarantee = false;
>  	}

I don't like that this adds a nonchalant `for each memcg' here, we
can have a lot of memcgs.  Sooner or later we'll have to break up that
full hierarchy iteration in shrink_zone() because of scalability, I
want to avoid adding more of them.

How about these changes on top of what we currently have?  Sure it's
not as accurate, but it should be a good start, and it's a *lot* less
overhead.

mem_cgroup_watermark() is also a more fitting name, given that this
has nothing to do with a guarantee for now.

It can also be easily extended to support the MIN watermark while the
code in vmscan.c remains readable.

---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a5cf853129ec..6167bed81d78 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,11 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+enum memcg_watermark {
+	MEMCG_WMARK_NORMAL,
+	MEMCG_WMARK_LOW,
+};
+
 #ifdef CONFIG_MEMCG
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -92,9 +97,8 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 bool task_in_mem_cgroup(struct task_struct *task,
 			const struct mem_cgroup *memcg);
 
-extern bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root);
-extern bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root);
+enum memcg_watermark mem_cgroup_watermark(struct mem_cgroup *root,
+					  struct mem_cgroup *memcg);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -292,16 +296,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
-{
-	return false;
-}
-static inline bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
-{
-	return false;
-}
-
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -319,6 +313,12 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline enum memcg_watermark
+mem_cgroup_watermark(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+	return MEMCG_WMARK_NORMAL;
+}
+
 static inline struct cgroup_subsys_state
 		*mem_cgroup_css(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ff5b8e297fd..8ee8786a286c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2780,44 +2780,20 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
-/**
- * mem_cgroup_within_guarantee - checks whether given memcg is within its
- * memory guarantee
- * @memcg: target memcg for the reclaim
- * @root: root of the reclaim hierarchy (null for the global reclaim)
- *
- * The given group is within its reclaim gurantee if it is below its low limit
- * or the same applies for any parent up the hierarchy until root (including).
- * Such a group might be excluded from the reclaim.
- */
-bool mem_cgroup_within_guarantee(struct mem_cgroup *memcg,
-		struct mem_cgroup *root)
+enum memcg_watermark mem_cgroup_watermark(struct mem_cgroup *root,
+					  struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
-		return false;
+		return MEMCG_WMARK_NORMAL;
 
 	do {
 		if (!res_counter_low_limit_excess(&memcg->res))
-			return true;
+			return MEMCG_WMARK_LOW;
 		if (memcg == root)
 			break;
-
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
-	return false;
-}
-
-bool mem_cgroup_all_within_guarantee(struct mem_cgroup *root)
-{
-	struct mem_cgroup *iter;
-
-	for_each_mem_cgroup_tree(iter, root)
-		if (!mem_cgroup_within_guarantee(iter, root)) {
-			mem_cgroup_iter_break(root, iter);
-			return false;
-		}
-
-	return true;
+	return MEMCG_WMARK_NORMAL;
 }
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b19ebb3a666b..687076b7a1a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2231,21 +2231,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-/**
- * __shrink_zone - shrinks a given zone
- *
- * @zone: zone to shrink
- * @sc: scan control with additional reclaim parameters
- * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
- * guarantee
- *
- * Returns the number of reclaimed memcgs.
- */
-static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
-		bool honor_memcg_guarantee)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long nr_reclaimed, nr_scanned;
-	unsigned nr_scanned_groups = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2262,20 +2250,22 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			struct lruvec *lruvec;
 
-			/* Memcg might be protected from the reclaim */
-			if (honor_memcg_guarantee &&
-					mem_cgroup_within_guarantee(memcg, root)) {
+			switch (mem_cgroup_watermark(root, memcg)) {
+			case MEMCG_WMARK_LOW:
 				/*
-				 * It would be more optimal to skip the memcg
-				 * subtree now but we do not have a memcg iter
-				 * helper for that. Anyone?
+				 * Memcg within the configured low
+				 * watermark: try to avoid reclaim
+				 * until the reclaimer struggles.
 				 */
+				if (priority < DEF_PRIORITY - 2)
+					break;
+
+				/* XXX: skip the whole subtree */
 				memcg = mem_cgroup_iter(root, memcg, &reclaim);
 				continue;
 			}
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-			nr_scanned_groups++;
 
 			sc->swappiness = mem_cgroup_swappiness(memcg);
 			shrink_lruvec(lruvec, sc);
@@ -2304,27 +2294,6 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
 
 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
-
-	return nr_scanned_groups;
-}
-
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
-{
-	bool honor_guarantee = true;
-
-	while (!__shrink_zone(zone, sc, honor_guarantee)) {
-		/*
-		 * The previous round of reclaim didn't find anything to scan
-		 * because
-		 * a) the whole reclaimed hierarchy is within guarantee so
-		 *    we fallback to ignore the guarantee because other option
-		 *    would be the OOM
-		 * b) multiple reclaimers are racing and so the first round
-		 *    should be retried
-		 */
-		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
-			honor_guarantee = false;
-	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
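
[The hierarchy walk in mem_cgroup_watermark() above can be modelled in
userspace roughly as follows.  This is only a sketch under stated
assumptions: `struct group`, its fields and `group_watermark()` are
hypothetical stand-ins for struct mem_cgroup, the res_counter and the
kernel function, and "usage <= low_limit" approximates
!res_counter_low_limit_excess().]

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the enum introduced in the patch above. */
enum memcg_watermark {
	MEMCG_WMARK_NORMAL,
	MEMCG_WMARK_LOW,
};

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct group {
	struct group *parent;    /* NULL at the top of the tree */
	unsigned long usage;     /* pages currently charged */
	unsigned long low_limit; /* protected amount, 0 by default */
};

/*
 * Walk from @memcg up to and including @root.  The group is reported as
 * LOW as soon as it, or any ancestor on that path, is at or below its
 * own low limit; otherwise it is NORMAL.
 */
static enum memcg_watermark group_watermark(const struct group *root,
					    const struct group *memcg)
{
	assert(memcg != NULL);

	do {
		if (memcg->usage <= memcg->low_limit)
			return MEMCG_WMARK_LOW;
		if (memcg == root)
			break;
	} while ((memcg = memcg->parent));

	return MEMCG_WMARK_NORMAL;
}
```

[The cost is one walk up the ancestor chain per visited group, rather
than a second pass over the whole hierarchy.]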

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/2] mm, memcg: allow OOM if no memcg is eligible during direct reclaim
  2014-06-11  8:00                                               ` Michal Hocko
@ 2014-06-11 15:20                                                 ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-11 15:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Wed, Jun 11, 2014 at 10:00:23AM +0200, Michal Hocko wrote:
> If there is no memcg eligible for reclaim because all groups under the
> reclaimed hierarchy are within their guarantee then the global direct
> reclaim would end up in an endless loop because zones in the zonelists
> are not considered unreclaimable (as per all_unreclaimable) and so the
> OOM killer would never fire and direct reclaim would be triggered
> without any chance to reclaim anything.
> 
> This is not possible yet because reclaim falls back to ignoring low_limit
> when nobody is eligible for reclaim. The following patch will allow setting
> the fallback mode to hard guarantee, though, so this is a preparatory
> patch.
> 
> Memcg reclaim doesn't suffer from this because the OOM killer is
> triggered after a few unsuccessful reclaim attempts.
> 
> Fix this by checking the number of scanned pages which is obviously 0 if
> nobody is eligible and also check that the whole tree hierarchy is not
> eligible and tell OOM it can go ahead.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/vmscan.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8041b0667673..99137aecd95f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2570,6 +2570,13 @@ out:
>  	if (aborted_reclaim)
>  		return 1;
>  
> +	/*
> +	 * If the target memcg is not eligible for reclaim then we have no option
> +	 * but OOM
> +	 */
> +	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> +		return 0;

We can't just sprinkle `for each memcg in hierarchy` loops like this,
they can get really expensive.

It's pretty stupid to not have a return value on shrink_zone(), which
could easily indicate whether a zone was reclaimable, and instead have
another iteration over the same zonelist and the same memcg hierarchy
afterwards to figure out if shrink_zone() was successful or not.
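
[A minimal userspace sketch of the return-value idea described above,
not kernel code.  `struct zone_model`, `shrink_zone_model()` and
`shrink_zones_model()` are hypothetical names; the only point is that
each per-zone call reports whether it found anything to scan, so the
caller needs no second walk over the zonelist or the memcg hierarchy.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone: just a count of scannable pages. */
struct zone_model {
	unsigned long reclaimable_pages;
};

/* Scan up to @nr_to_scan pages; return true if the zone offered any. */
static bool shrink_zone_model(struct zone_model *zone,
			      unsigned long nr_to_scan,
			      unsigned long *nr_scanned)
{
	unsigned long scan;

	assert(zone != NULL && nr_scanned != NULL);

	scan = zone->reclaimable_pages < nr_to_scan ?
		zone->reclaimable_pages : nr_to_scan;
	zone->reclaimable_pages -= scan;
	*nr_scanned += scan;

	return scan > 0;
}

/*
 * Shrink every zone once.  A false result means no zone was reclaimable,
 * which is the signal the caller can use to consider the OOM killer.
 */
static bool shrink_zones_model(struct zone_model *zones, size_t nr_zones,
			       unsigned long nr_to_scan,
			       unsigned long *nr_scanned)
{
	bool reclaimable = false;

	for (size_t i = 0; i < nr_zones; i++)
		if (shrink_zone_model(&zones[i], nr_to_scan, nr_scanned))
			reclaimable = true;

	return reclaimable;
}
```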

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
  2014-06-11 14:11                                                 ` Michal Hocko
@ 2014-06-11 15:34                                                   ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-11 15:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, LKML, linux-mm

On Wed, Jun 11, 2014 at 04:11:17PM +0200, Michal Hocko wrote:
> > I still think it'd be less useful than "high", but as there seem to be
> > use cases which can be served with that and especially as a part of a
> > consistent control scheme, I have no objection.
> > 
> > "low" definitely requires a notification mechanism tho.
> 
> Would vmpressure notification be sufficient? That one is in place for
> any memcg which is reclaimed.

Yeah, as long as it can reliably notify userland that the soft
guarantee has been breached, it'd be great as it means we'd have a
single mechanism to monitor both "low" and "high" while "min" and
"max" are oom based, which BTW needs more work but that's a separate
piece of work.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-11  8:00                                                 ` Michal Hocko
@ 2014-06-11 15:36                                                   ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-11 15:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Wed, Jun 11, 2014 at 10:00:24AM +0200, Michal Hocko wrote:
> Some users (e.g. Google) would like to have stronger semantics than the low
> limit currently offers. The fallback mode is not desirable and they
> prefer hitting the OOM killer rather than ignoring the low limit for protected
> groups.
> 
> There are other possible usecases which can benefit from hard
> guarantees. There are loads which will simply start thrashing if the
> memory working set drops under a certain level and it is more appropriate
> to simply kill and restart such a load if the required memory cannot
> be provided. Another usecase would be a hard memory isolation for
> containers.
> 
> The min_limit is initialized to 0 and it has precedence over low_limit.
> If the reclaim is not able to find any memcg in the reclaimed hierarchy
> above min_limit then the OOM killer is triggered to resolve the situation.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99137aecd95f..8e844bd42c51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2220,13 +2220,12 @@ static inline bool should_continue_reclaim(struct zone *zone,
>   *
>   * @zone: zone to shrink
>   * @sc: scan control with additional reclaim parameters
> - * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
> - * guarantee
> + * @soft_guarantee: Use soft guarantee reclaim target for memcg reclaim.
>   *
>   * Returns the number of reclaimed memcgs.
>   */
>  static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> -		bool honor_memcg_guarantee)
> +		bool soft_guarantee)
>  {
>  	unsigned long nr_reclaimed, nr_scanned;
>  	unsigned nr_scanned_groups = 0;
> @@ -2245,11 +2244,10 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  		memcg = mem_cgroup_iter(root, NULL, &reclaim);
>  		do {
>  			struct lruvec *lruvec;
> -			bool within_guarantee;
>  
>  			/* Memcg might be protected from the reclaim */
> -			within_guarantee = mem_cgroup_within_guarantee(memcg, root);
> -			if (honor_memcg_guarantee && within_guarantee) {
> +			if (mem_cgroup_within_guarantee(memcg, root,
> +						soft_guarantee)) {
>  				/*
>  				 * It would be more optimal to skip the memcg
>  				 * subtree now but we do not have a memcg iter
> @@ -2259,8 +2257,8 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  				continue;
>  			}
>  
> -			if (within_guarantee)
> -				mem_cgroup_guarantee_breached(memcg);
> +			if (!soft_guarantee)
> +				mem_cgroup_soft_guarantee_breached(memcg);
>  
>  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  			nr_scanned_groups++;
> @@ -2297,20 +2295,27 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	bool honor_guarantee = true;
> +	bool soft_guarantee = true;
>  
> -	while (!__shrink_zone(zone, sc, honor_guarantee)) {
> +	while (!__shrink_zone(zone, sc, soft_guarantee)) {
>  		/*
>  		 * The previous round of reclaim didn't find anything to scan
>  		 * because
> -		 * a) the whole reclaimed hierarchy is within guarantee so
> -		 *    we fallback to ignore the guarantee because other option
> -		 *    would be the OOM
> +		 * a) the whole reclaimed hierarchy is within soft guarantee so
> +		 *    we are switching to the hard guarantee reclaim target
>  		 * b) multiple reclaimers are racing and so the first round
>  		 *    should be retried
>  		 */
> -		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> -			honor_guarantee = false;
> +		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup,
> +					soft_guarantee)) {
> +			/*
> +			 * Nothing to reclaim even with hard guarantees so
> +			 * we have to OOM
> +			 */
> +			if (!soft_guarantee)
> +				break;
> +			soft_guarantee = false;
> +		}
>  	}
>  }
>  
> @@ -2574,7 +2579,8 @@ out:
>  	 * If the target memcg is not eligible for reclaim then we have no option
>  	 * but OOM
>  	 */
> -	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> +	if (!sc->nr_scanned &&
> +			mem_cgroup_all_within_guarantee(sc->target_mem_cgroup, false))
>  		return 0;

This code is truly dreadful.

Don't call it guarantee when it doesn't guarantee anything.  I thought
we agreed that min, low, high, max, is reasonable nomenclature, please
use it consistently.

With my proposed cleanups and scalability fixes in the other mail, the
vmscan.c changes to support the min watermark would be something like
the following.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 687076b7a1a6..cee19b6d04dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2259,7 +2259,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 				 */
 				if (priority < DEF_PRIORITY - 2)
 					break;
-
+			case MEMCG_WMARK_MIN:
 				/* XXX: skip the whole subtree */
 				memcg = mem_cgroup_iter(root, memcg, &reclaim);
 				continue;
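
[The skip decision that the switch above encodes can be written as a
standalone predicate.  This is a sketch, not the kernel code:
`should_skip_group()` is a hypothetical helper, the priority is passed
in as a plain parameter, and DEF_PRIORITY is defined locally with the
kernel's value of 12.  A group at the LOW watermark is skipped only
while reclaim is still at its first few priority levels; a group at the
MIN watermark is always skipped.]

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12 /* same value as the kernel's DEF_PRIORITY */

enum memcg_watermark {
	MEMCG_WMARK_NORMAL,
	MEMCG_WMARK_LOW,
	MEMCG_WMARK_MIN,
};

/* Return true if reclaim should move on to the next group. */
static bool should_skip_group(enum memcg_watermark wmark, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);

	switch (wmark) {
	case MEMCG_WMARK_LOW:
		/* The reclaimer is struggling: stop honoring "low". */
		if (priority < DEF_PRIORITY - 2)
			break;
		/* fall through */
	case MEMCG_WMARK_MIN:
		return true;
	case MEMCG_WMARK_NORMAL:
		break;
	}

	return false;
}
```

[Lower numeric priority means more reclaim pressure, so "low" protection
holds for priorities 12, 11 and 10 and is dropped from 9 downwards.]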


^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
@ 2014-06-11 15:36                                                   ` Johannes Weiner
  0 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-11 15:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Wed, Jun 11, 2014 at 10:00:24AM +0200, Michal Hocko wrote:
> Some users (e.g. Google) would like to have stronger semantic than low
> limit offers currently. The fallback mode is not desirable and they
> prefer hitting OOM killer rather than ignoring low limit for protected
> groups.
> 
> There are other possible usecases which can benefit from hard
> guarantees. There are loads which will simply start trashing if the
> memory working set drops under certain level and it is more appropriate
> to simply kill and restart such a load if the required memory cannot
> be provided. Another usecase would be a hard memory isolation for
> containers.
> 
> The min_limit is initialized to 0 and it has precedence over low_limit.
> If the reclaim is not able to find any memcg in the reclaimed hierarchy
> above min_limit then OOM killer is triggered to resolve the situation.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99137aecd95f..8e844bd42c51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2220,13 +2220,12 @@ static inline bool should_continue_reclaim(struct zone *zone,
>   *
>   * @zone: zone to shrink
>   * @sc: scan control with additional reclaim parameters
> - * @honor_memcg_guarantee: do not reclaim memcgs which are within their memory
> - * guarantee
> + * @soft_guarantee: Use soft guarantee reclaim target for memcg reclaim.
>   *
>   * Returns the number of reclaimed memcgs.
>   */
>  static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> -		bool honor_memcg_guarantee)
> +		bool soft_guarantee)
>  {
>  	unsigned long nr_reclaimed, nr_scanned;
>  	unsigned nr_scanned_groups = 0;
> @@ -2245,11 +2244,10 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  		memcg = mem_cgroup_iter(root, NULL, &reclaim);
>  		do {
>  			struct lruvec *lruvec;
> -			bool within_guarantee;
>  
>  			/* Memcg might be protected from the reclaim */
> -			within_guarantee = mem_cgroup_within_guarantee(memcg, root);
> -			if (honor_memcg_guarantee && within_guarantee) {
> +			if (mem_cgroup_within_guarantee(memcg, root,
> +						soft_guarantee)) {
>  				/*
>  				 * It would be more optimal to skip the memcg
>  				 * subtree now but we do not have a memcg iter
> @@ -2259,8 +2257,8 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  				continue;
>  			}
>  
> -			if (within_guarantee)
> -				mem_cgroup_guarantee_breached(memcg);
> +			if (!soft_guarantee)
> +				mem_cgroup_soft_guarantee_breached(memcg);
>  
>  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  			nr_scanned_groups++;
> @@ -2297,20 +2295,27 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
>  
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	bool honor_guarantee = true;
> +	bool soft_guarantee = true;
>  
> -	while (!__shrink_zone(zone, sc, honor_guarantee)) {
> +	while (!__shrink_zone(zone, sc, soft_guarantee)) {
>  		/*
>  		 * The previous round of reclaim didn't find anything to scan
>  		 * because
> -		 * a) the whole reclaimed hierarchy is within guarantee so
> -		 *    we fallback to ignore the guarantee because other option
> -		 *    would be the OOM
> +		 * a) the whole reclaimed hierarchy is within soft guarantee so
> +		 *    we are switching to the hard guarantee reclaim target
>  		 * b) multiple reclaimers are racing and so the first round
>  		 *    should be retried
>  		 */
> -		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> -			honor_guarantee = false;
> +		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup,
> +					soft_guarantee)) {
> +			/*
> +			 * Nothing to reclaim even with hard guarantees so
> +			 * we have to OOM
> +			 */
> +			if (!soft_guarantee)
> +				break;
> +			soft_guarantee = false;
> +		}
>  	}
>  }
>  
> @@ -2574,7 +2579,8 @@ out:
>  	 * If the target memcg is not eligible for reclaim then we have no option
>  	 * but OOM
>  	 */
> -	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> +	if (!sc->nr_scanned &&
> +			mem_cgroup_all_within_guarantee(sc->target_mem_cgroup, false))
>  		return 0;

This code is truly dreadful.

Don't call it guarantee when it doesn't guarantee anything.  I thought
we agreed that min, low, high, max, is reasonable nomenclature, please
use it consistently.

With my proposed cleanups and scalability fixes in the other mail, the
vmscan.c changes to support the min watermark would be something like
the following.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 687076b7a1a6..cee19b6d04dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2259,7 +2259,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 				 */
 				if (priority < DEF_PRIORITY - 2)
 					break;
-
+			case MEMCG_WMARK_MIN:
 				/* XXX: skip the whole subtree */
 				memcg = mem_cgroup_iter(root, memcg, &reclaim);
 				continue;
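
For illustration only, the switch this hunk extends could look roughly like
the user-space sketch below. Only MEMCG_WMARK_MIN and DEF_PRIORITY appear in
the diff; the other enum names, toy_classify() and toy_should_skip() are
assumptions, and the boundary handling (strictly below the watermark) is an
arbitrary choice.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12	/* the kernel's default scan priority */

/* Only MEMCG_WMARK_MIN is taken from the diff; the rest is assumed. */
enum toy_wmark {
	MEMCG_WMARK_MIN,	/* usage below the min watermark */
	MEMCG_WMARK_LOW,	/* usage below the low watermark */
	MEMCG_WMARK_NORMAL,	/* no protection applies */
};

static enum toy_wmark toy_classify(unsigned long usage,
				   unsigned long min, unsigned long low)
{
	if (usage < min)
		return MEMCG_WMARK_MIN;
	if (usage < low)
		return MEMCG_WMARK_LOW;
	return MEMCG_WMARK_NORMAL;
}

/* Decide whether reclaim should skip a group at the given scan priority. */
static bool toy_should_skip(enum toy_wmark wmark, int priority)
{
	switch (wmark) {
	case MEMCG_WMARK_LOW:
		/* Best effort: drop the protection once pressure builds up. */
		if (priority < DEF_PRIORITY - 2)
			break;
		/* fall through */
	case MEMCG_WMARK_MIN:
		/* Below min the group is skipped at every priority. */
		return true;
	default:
		break;
	}
	return false;
}
```

In this sketch a group below low is skipped for the first three priority
levels only, while a group below min is skipped unconditionally.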

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/4] memcg, mm: introduce lowlimit reclaim
  2014-06-11 15:15                 ` Johannes Weiner
@ 2014-06-11 16:08                   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11 16:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Greg Thelen,
	Michel Lespinasse, Tejun Heo, Hugh Dickins, Roman Gushchin, LKML,
	linux-mm

On Wed 11-06-14 11:15:44, Johannes Weiner wrote:
> On Mon, May 05, 2014 at 04:21:00PM +0200, Michal Hocko wrote:
[...]
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2293,13 +2293,20 @@ static unsigned __shrink_zone(struct zone *zone, struct scan_control *sc,
> >  
> >  static void shrink_zone(struct zone *zone, struct scan_control *sc)
> >  {
> > -	if (!__shrink_zone(zone, sc, true)) {
> > +	bool honor_guarantee = true;
> > +
> > +	while (!__shrink_zone(zone, sc, honor_guarantee)) {
> >  		/*
> > -		 * First round of reclaim didn't find anything to reclaim
> > -		 * because of the memory guantees for all memcgs in the
> > -		 * reclaim target so try again and ignore guarantees this time.
> > +		 * The previous round of reclaim didn't find anything to scan
> > +		 * because
> > +		 * a) the whole reclaimed hierarchy is within guarantee so
> > +		 *    we fallback to ignore the guarantee because other option
> > +		 *    would be the OOM
> > +		 * b) multiple reclaimers are racing and so the first round
> > +		 *    should be retried
> >  		 */
> > -		__shrink_zone(zone, sc, false);
> > +		if (mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> > +			honor_guarantee = false;
> >  	}
> 
> I don't like that this adds a nonchalant `for each memcg' here, we
> can have a lot of memcgs.  Sooner or later we'll have to break up that
> full hierarchy iteration in shrink_zone() because of scalability, I
> want to avoid adding more of them.

mem_cgroup_all_within_guarantee can easily be optimized to exclude the
whole subtree of each memcg which is mem_cgroup_within_guarantee. The
cgroup iterators make it easy and cheap to skip a whole subtree AFAIR, so
I do not see this as a bottleneck here.
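
A toy user-space sketch of such a pruned walk, not kernel code (struct
toy_group, toy_next(), toy_count_eligible() and toy_link() are all
hypothetical names, and a single flag stands in for the real limit check):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct toy_group {
	bool within_guarantee;
	struct toy_group *parent, *first_child, *next_sibling;
};

/*
 * Pre-order successor of pos inside the tree rooted at root.
 * When skip_subtree is set, the children of pos are never visited.
 */
static struct toy_group *toy_next(struct toy_group *root,
				  struct toy_group *pos, bool skip_subtree)
{
	if (!skip_subtree && pos->first_child)
		return pos->first_child;
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Count groups eligible for reclaim; *visited reports how many were examined. */
static unsigned toy_count_eligible(struct toy_group *root, unsigned *visited)
{
	struct toy_group *pos = root;
	unsigned eligible = 0;

	*visited = 0;
	while (pos) {
		bool shielded = pos->within_guarantee;

		(*visited)++;
		if (!shielded)
			eligible++;
		/* A shielded group covers its whole subtree, so prune it. */
		pos = toy_next(root, pos, shielded);
	}
	return eligible;
}

/* Attach child as the first child of parent. */
static void toy_link(struct toy_group *parent, struct toy_group *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}
```

An "all within guarantee" check then reduces to toy_count_eligible()
returning 0, and the number of groups examined shrinks with every shielded
subtree.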

> How about these changes on top of what we currently have?

I really do not like how you went back to the priority-based break out.
We discussed that 2 or so years ago and the main objection was
that it is really not useful. You do not want to suddenly scan/reclaim
so far "privileged" memcgs at high priority.

> Sure it's not as accurate, but it should be a good start, and it's a
> *lot* less overhead.
> 
> mem_cgroup_watermark() is also a more fitting name, given that this
> has nothing to do with a guarantee for now.

mem_cgroup_watermark sounds like a better name indeed.

> It can also be easily extended to support the MIN watermark while the
> code in vmscan.c remains readable.
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 1/2] mm, memcg: allow OOM if no memcg is eligible during direct reclaim
  2014-06-11 15:20                                                 ` Johannes Weiner
@ 2014-06-11 16:14                                                   ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-11 16:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Wed 11-06-14 11:20:30, Johannes Weiner wrote:
> On Wed, Jun 11, 2014 at 10:00:23AM +0200, Michal Hocko wrote:
> > If there is no memcg eligible for reclaim because all groups under the
> > reclaimed hierarchy are within their guarantee then the global direct
> > reclaim would end up in the endless loop because zones in the zonelists
> > are not considered unreclaimable (as per all_unreclaimable) and so the
> > OOM killer would never fire and direct reclaim would be triggered
> > without any chance to reclaim anything.
> > 
> > This is not possible yet because reclaim falls back to ignore low_limit
> > when nobody is eligible for reclaim. Following patch will allow to set
> > the fallback mode to hard guarantee, though, so this is a preparatory
> > patch.
> > 
> > Memcg reclaim doesn't suffer from this because the OOM killer is
> > triggered after few unsuccessful attempts of the reclaim.
> > 
> > Fix this by checking the number of scanned pages which is obviously 0 if
> > nobody is eligible and also check that the whole tree hierarchy is not
> > eligible and tell OOM it can go ahead.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > ---
> >  mm/vmscan.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8041b0667673..99137aecd95f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2570,6 +2570,13 @@ out:
> >  	if (aborted_reclaim)
> >  		return 1;
> >  
> > +	/*
> > +	 * If the target memcg is not eligible for reclaim then we have no option
> > +	 * but OOM
> > +	 */
> > +	if (!sc->nr_scanned && mem_cgroup_all_within_guarantee(sc->target_mem_cgroup))
> > +		return 0;
> 
> We can't just sprinkle `for each memcg in hierarchy` loops like this,
> they can get really expensive.

Yeah, I know. This one gets called only when nothing was scanned, which
shouldn't happen without the hard guarantee. And as said in the other
email we can optimize mem_cgroup_all_within_guarantee to skip all subtrees
that are within their guarantee.

> It's pretty stupid to not have a return value on shrink_zone(), which
> could easily indicate whether a zone was reclaimable, and instead have
> another iteration over the same zonelist and the same memcg hierarchy
> afterwards to figure out if shrink_zone() was successful or not.

I know it is stupid but this is the easiest way right now. We can/should
refactor shrink_zones to forward that information. I was playing with
sticking that information into scan_control but that was even uglier.
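
A rough user-space sketch of what forwarding that information could look
like. The toy_* names and the struct are hypothetical, and the real
shrink_zone()/shrink_zones() take different arguments:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy zone: only the number of pages eligible for scanning. */
struct toy_zone {
	unsigned long eligible_pages;
};

/* Returns true if the zone had anything to scan. */
static bool toy_shrink_zone(const struct toy_zone *zone,
			    unsigned long *nr_scanned)
{
	if (!zone->eligible_pages)
		return false;
	*nr_scanned += zone->eligible_pages;
	return true;
}

/* Returns true if at least one zone in the list was reclaimable. */
static bool toy_shrink_zones(const struct toy_zone *zones, size_t nr,
			     unsigned long *nr_scanned)
{
	bool reclaimable = false;
	size_t i;

	for (i = 0; i < nr; i++)
		if (toy_shrink_zone(&zones[i], nr_scanned))
			reclaimable = true;
	return reclaimable;
}

/* Returns 1 if progress is possible, 0 if the caller may go OOM. */
static int toy_try_to_free_pages(const struct toy_zone *zones, size_t nr)
{
	unsigned long nr_scanned = 0;

	/* No second walk over the hierarchy: the result is forwarded. */
	return toy_shrink_zones(zones, nr, &nr_scanned) ? 1 : 0;
}
```

The OOM decision here depends only on the value handed back by the zone
walk, so no additional iteration over the groups is needed afterwards.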

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-11 15:36                                                   ` Johannes Weiner
@ 2014-06-12 13:22                                                     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-12 13:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Wed 11-06-14 11:36:31, Johannes Weiner wrote:
[...]
> This code is truly dreadful.
> 
> Don't call it guarantee when it doesn't guarantee anything.  I thought
> we agreed that min, low, high, max, is reasonable nomenclature, please
> use it consistently.

I can certainly change the internal naming. I will use your wmark naming
suggestion.
 
> With my proposed cleanups and scalability fixes in the other mail, the
> vmscan.c changes to support the min watermark would be something like
> the following.

The semantics are, however, much different, as pointed out in the other
email. The following on top of your cleanup will lead to the same deadlock
described in the 1st patch (mm, memcg: allow OOM if no memcg is eligible
during direct reclaim).

Anyway, the situation now is pretty chaotic. I plan to gather all the
patches posted so far and repost them for further discussion. I just need
to finish some internal tasks and will post it soon.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 687076b7a1a6..cee19b6d04dc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2259,7 +2259,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  				 */
>  				if (priority < DEF_PRIORITY - 2)
>  					break;
> -
> +			case MEMCG_WMARK_MIN:
>  				/* XXX: skip the whole subtree */
>  				memcg = mem_cgroup_iter(root, memcg, &reclaim);
>  				continue;
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-12 13:22                                                     ` Michal Hocko
@ 2014-06-12 13:56                                                       ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-12 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Thu, Jun 12, 2014 at 03:22:07PM +0200, Michal Hocko wrote:
> On Wed 11-06-14 11:36:31, Johannes Weiner wrote:
> [...]
> > This code is truly dreadful.
> > 
> > Don't call it guarantee when it doesn't guarantee anything.  I thought
> > we agreed that min, low, high, max, is reasonable nomenclature, please
> > use it consistently.
> 
> I can certainly change the internal naming. I will use your wmark naming
> suggestion.

Cool, thanks.

> > With my proposed cleanups and scalability fixes in the other mail, the
> > vmscan.c changes to support the min watermark would be something like
> > the following.
> 
> The semantics are, however, much different, as pointed out in the other
> email. The following on top of your cleanup will lead to the same deadlock
> described in the 1st patch (mm, memcg: allow OOM if no memcg is eligible
> during direct reclaim).

I'm currently reworking shrink_zones() and getting rid of
all_unreclaimable() etc. to remove the code duplication.

> Anyway, the situation now is pretty chaotic. I plan to gather all the
> patches posted so far and repost them for further discussion. I just need
> to finish some internal tasks and will post it soon.

That would be great, thanks, it's really hard to follow this stuff
halfway in and halfway outside of -mm.

Now that we roughly figured out what knobs and semantics we want, it
would be great to figure out the merging logistics.

I would prefer if we could introduce max, high, low, min in unified
hierarchy, and *only* in there, so that we never have to worry about
it coexisting and interacting with the existing hard and soft limit.

It would also be beneficial to introduce them all close to each other,
develop them together, possibly submit them in the same patch series,
so that we know the requirements and what the code should look like in
the big picture and can offer a fully consistent and documented usage
model in the unified hierarchy.

Does that make sense?

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-12 13:56                                                       ` Johannes Weiner
@ 2014-06-12 14:22                                                         ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-12 14:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Thu 12-06-14 09:56:00, Johannes Weiner wrote:
> On Thu, Jun 12, 2014 at 03:22:07PM +0200, Michal Hocko wrote:
[...]
> > Anyway, the situation now is pretty chaotic. I plan to gather all the
> > patches posted so far and repost for further discussion. I just need
> > to finish some internal tasks and will post it soon.
> 
> That would be great, thanks, it's really hard to follow this stuff
> halfway in and halfway outside of -mm.
> 
> Now that we roughly figured out what knobs and semantics we want, it
> would be great to figure out the merging logistics.
> 
> I would prefer if we could introduce max, high, low, min in unified
> hierarchy, and *only* in there, so that we never have to worry about
> it coexisting and interacting with the existing hard and soft limit.

The primary question would be whether this is the best transition
strategy. I do not know how many users apart from developers are really
using unified hierarchy. I would be worried that we merge a feature which
will not be used for a long time.

Moreover, if somebody wants to transition from soft limit then it would
be really hard because switching to unified hierarchy might be a no-go.

I think that it is clear that we should deprecate soft_limit ASAP. I
also think it won't hurt to have min, low, high in both old and unified
API and strongly warn if somebody tries to use soft_limit along with any
of the new APIs in the first step. Later we can even forbid any
combination by a hard failure.

> It would also be beneficial to introduce them all close to each other,
> develop them together, possibly submit them in the same patch series,
> so that we know the requirements and what the code should look like in
> the big picture and can offer a fully consistent and documented usage
> model in the unified hierarchy.

Min and Low should definitely go together. High sounds like an
orthogonal problem (pro-active reclaim vs reclaim protection) so I think
it can go its own way and pace. We still have to discuss its semantics
and I feel it would be a bit disturbing to have everything in one
bundle.
I do understand your point about the global picture, though. Do you
think that there is a risk that formulating the semantics of the High
limit might change the way Min and Low would be defined?

> Does that make sense?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-12 14:22                                                         ` Michal Hocko
@ 2014-06-12 16:17                                                           ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-12 16:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

Hello, Michal.

On Thu, Jun 12, 2014 at 04:22:37PM +0200, Michal Hocko wrote:
> The primary question would be whether this is the best transition
> strategy. I do not know how many users apart from developers are really
> using unified hierarchy. I would be worried that we merge a feature which
> will not be used for a long time.

I'm planning to drop __DEVEL__ mask from the unified hierarchy in a
cycle, at most two.  The biggest hold up at the moment is
straightening out the interfaces and interaction between memcg and
blkcg because I think it'd be silly to have to go through another
round of interface versioning effort right after transitioning to
unified hierarchy.  I'm not too confident whether it'd be possible to
get blkcg completely in shape by that time, but, if that takes too
long, I'll just leave blkcg behind temporarily.  So, at least from
kernel side, it's not gonna be too long.

There sure is a question of how fast userland will move to the new
interface.  Some are already playing with unified hierarchy and
planning to migrate as soon as possible but there sure will be others
who will take more time.  Can't tell for sure, but the thing is that
migration to min/low/high/max scheme is a significant migration effort
too, so I'm not sure how much we'd gain by doing that separately.
It'd be an extra transition step for userland (optional but still),
more combinations of configuration to handle for memcg, and it's not
like unified hierarchy is that difficult to transition to.

> Moreover, if somebody wants to transition from soft limit then it would
> be really hard because switching to unified hierarchy might be a no-go.

Why would that be a no-go?  Its usage is mostly similar to
traditional hierarchies and can be used with other hierarchies, so
while it'd take some adaptation, in most cases gradual transition
shouldn't be a big problem.

> I think that it is clear that we should deprecate soft_limit ASAP. I
> also think it won't hurt to have min, low, high in both old and unified
> API and strongly warn if somebody tries to use soft_limit along with any
> of the new APIs in the first step. Later we can even forbid any
> combination by a hard failure.

I don't quite understand how you plan to deprecate it.  Sure you can
fail with -EINVAL or whatnot when the wrong combination is used but I
don't think there's any chance of removing the knob.  There's a reason
why we're introducing a new version of the whole cgroup interface
which can co-exist with the existing one after all.  If you wanna
version memcg interface separately, maybe that'd work but it sounds
like a lot of extra hassle for not much gain.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-12 14:22                                                         ` Michal Hocko
@ 2014-06-12 16:51                                                           ` Johannes Weiner
  -1 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2014-06-12 16:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Thu, Jun 12, 2014 at 04:22:37PM +0200, Michal Hocko wrote:
> On Thu 12-06-14 09:56:00, Johannes Weiner wrote:
> > On Thu, Jun 12, 2014 at 03:22:07PM +0200, Michal Hocko wrote:
> [...]
> > > Anyway, the situation now is pretty chaotic. I plan to gather all the
> > > patches posted so far and repost for the future discussion. I just need
> > > to finish some internal tasks and will post it soon.
> > 
> > That would be great, thanks, it's really hard to follow this stuff
> > halfway in and halfway outside of -mm.
> > 
> > Now that we roughly figured out what knobs and semantics we want, it
> > would be great to figure out the merging logistics.
> > 
> > I would prefer if we could introduce max, high, low, min in unified
> > hierarchy, and *only* in there, so that we never have to worry about
> > it coexisting and interacting with the existing hard and soft limit.
> 
> The primary question would be, whether this is the best transition
> strategy. I do not know how many users apart from developers are really
> using unified hierarchy. I would be worried that we merge a feature which
> will not be used for a long time.

Unified hierarchy is the next version of the cgroup interface, and
once the development tag drops I consider the old memcg interface
deprecated.  It makes very little sense to me to put up additional
incentives at this point to continue the use of the old interface,
when we already struggle with manpower to maintain even one of them.

> Moreover, if somebody wants to transition from soft limit then it would
> be really hard because switching to unified hierarchy might be a no-go.
>
> I think that it is clear that we should deprecate soft_limit ASAP. I
> also think it won't hurt to have min, low, high in both old and unified
> API and strongly warn if somebody tries to use soft_limit along with any
> of the new APIs in the first step. Later we can even forbid any
> combination by a hard failure.

Why would somebody NOT be able to convert to unified hierarchy
eventually?

How big is the intersection of cases that can't convert to unified
hierarchy AND are using the soft limit AND want to use the new low
limit?

Merging a different concept with its own naming scheme into an already
confusing interface, spamming the dmesg if someone gets it wrong,
potentially introducing more breakage with the hard failure, putting
up incentives to stick with a deprecated and confusing interface...
This is a lot of horrible stuff in an attempt to accommodate very few
usecases - if any - when we are *already versioning the interface* and
have the opportunity for a clean transition.

The transition to min, low, high, max is effort in itself.  Conflating
the two models sounds more detrimental than anything else, with a very
dubious upside at that.

> > It would also be beneficial to introduce them all close to each other,
> > develop them together, possibly submit them in the same patch series,
> > so that we know the requirements and what the code should look like in
> > the big picture and can offer a fully consistent and documented usage
> > model in the unified hierarchy.
> 
> Min and Low should definitely go together. High sounds like an
> orthogonal problem (pro-active reclaim vs reclaim protection) so I think
> it can go its own way and pace. We still have to discuss its semantic
> and I feel it would be a bit disturbing to have everything in one
> bundle.
>
> I do understand your point about the global picture, though. Do you
> think that there is a risk that formulating semantic for High limit
> might change the way how Min and Low would be defined?

I think one of the biggest hindrances in making forward progress on
individual limits is that we only had a laundry list of occasionally
conflicting requirements but never a consistent big picture to design
around and match full usecases to.  It's much easier and less error
prone to develop the concept as a whole, alongside full real-life
configurations.

They are symmetrical pieces whose semantics very much depend on each
other, so I wouldn't like too much lag between those.

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-12 16:17                                                           ` Tejun Heo
@ 2014-06-16 12:59                                                             ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-16 12:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

On Thu 12-06-14 12:17:33, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Jun 12, 2014 at 04:22:37PM +0200, Michal Hocko wrote:
> > The primary question would be, whether this is the best transition
> > strategy. I do not know how many users apart from developers are really
> > using unified hierarchy. I would be worried that we merge a feature which
> > will not be used for a long time.
> 
> I'm planning to drop __DEVEL__ mask from the unified hierarchy in a
> cycle, at most two. 

OK, I am obviously behind the current cgroup core changes. I thought
that unified hierarchy would stay development-only for much longer.

> The biggest hold up at the moment is
> straightening out the interfaces and interaction between memcg and
> blkcg because I think it'd be silly to have to go through another
> round of interface versioning effort right after transitioning to
> unified hierarchy.  I'm not too confident whether it'd be possible to
> get blkcg completely in shape by that time, but, if that takes too
> long, I'll just leave blkcg behind temporarily.  So, at least from
> kernel side, it's not gonna be too long.
> 
> There sure is a question of how fast userland will move to the new
> interface. 

Yeah, I was mostly thinking about those who would need to do bigger
changes. AFAIR threads will no longer be distributable between groups.

> Some are already playing with unified hierarchy and
> planning to migrate as soon as possible but there sure will be others
> who will take more time.  Can't tell for sure, but the thing is that
> migration to min/low/high/max scheme is a significant migration effort
> too, so I'm not sure how much we'd gain by doing that separately.
> It'd be an extra transition step for userland (optional but still),
> more combinations of configuration to handle for memcg, and it's not
> like unified hierarchy is that difficult to transition to.
> 
> > Moreover, if somebody wants to transition from soft limit then it would
> > be really hard because switching to unified hierarchy might be a no-go.
> 
> Why would that be a no-go? 

I remember discussions about per-thread distributions and some other
things missing from the new API.

> Its usage is mostly similar to
> traditional hierarchies and can be used with other hierarchies, so
> while it'd take some adaptation, in most cases gradual transition
> shouldn't be a big problem.

OK

> > I think that it is clear that we should deprecate soft_limit ASAP. I
> > also think it won't hurt to have min, low, high in both old and unified
> > API and strongly warn if somebody tries to use soft_limit along with any
> > of the new APIs in the first step. Later we can even forbid any
> > combination by a hard failure.
> 
> I don't quite understand how you plan to deprecate it.  Sure you can
> fail with -EINVAL or whatnot when the wrong combination

Yes, I was thinking in that direction. First warn and then EINVAL later.

> is used but I don't think there's any chance of removing the knob.
> There's a reason why we're introducing a new version of the whole
> cgroup interface which can co-exist with the existing one after all.
> If you wanna version memcg interface separately, maybe that'd work but
> it sounds like a lot of extra hassle for not much gain.

No, I didn't mean to version the interface. I just wanted to have
gradual transition for potential soft_limit users.

Maybe I am misunderstanding something but I thought that the new version
of the API will contain all knobs which are not marked
.flags = CFTYPE_INSANE while the old API will contain all of them.
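
For illustration only, the model described above can be written down as a
small userspace sketch (not kernel code). It assumes that every knob carries
a flags word, that the old hierarchy shows every knob, and that the unified
hierarchy skips the ones marked insane. Only the CFTYPE_INSANE name comes
from this thread; the struct, the flag value, the knob list and the sketch_*
helpers are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel flag; the value is an assumption. */
#define SKETCH_CFTYPE_INSANE (1u << 0)

/* Simplified model of a control file entry, not the kernel's struct cftype. */
struct sketch_cftype {
	const char *name;
	unsigned int flags;
};

/* Illustrative knob list, not the real memcg file table. */
static const struct sketch_cftype sketch_files[] = {
	{ "limit_in_bytes",      0 },
	{ "soft_limit_in_bytes", SKETCH_CFTYPE_INSANE },
	{ "usage_in_bytes",      0 },
};

/* Old API shows everything; the unified hierarchy hides insane knobs. */
bool sketch_visible(const struct sketch_cftype *cft, bool unified)
{
	return !(unified && (cft->flags & SKETCH_CFTYPE_INSANE));
}

/* Count how many knobs a given kind of mount would expose. */
size_t sketch_count_visible(bool unified)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(sketch_files) / sizeof(sketch_files[0]); i++)
		if (sketch_visible(&sketch_files[i], unified))
			n++;
	return n;
}
```

In this model a flag can only hide a knob from the unified hierarchy; showing
a knob only there would need a second, opposite flag.
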
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-12 16:51                                                           ` Johannes Weiner
@ 2014-06-16 13:22                                                             ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-16 13:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Thu 12-06-14 12:51:05, Johannes Weiner wrote:
> On Thu, Jun 12, 2014 at 04:22:37PM +0200, Michal Hocko wrote:
> > On Thu 12-06-14 09:56:00, Johannes Weiner wrote:
> > > On Thu, Jun 12, 2014 at 03:22:07PM +0200, Michal Hocko wrote:
> > [...]
> > > > Anyway, the situation now is pretty chaotic. I plan to gather all the
> > > > patches posted so far and repost for the future discussion. I just need
> > > > to finish some internal tasks and will post it soon.
> > > 
> > > That would be great, thanks, it's really hard to follow this stuff
> > > halfway in and halfway outside of -mm.
> > > 
> > > Now that we roughly figured out what knobs and semantics we want, it
> > > would be great to figure out the merging logistics.
> > > 
> > > I would prefer if we could introduce max, high, low, min in unified
> > > hierarchy, and *only* in there, so that we never have to worry about
> > > it coexisting and interacting with the existing hard and soft limit.

Btw. what is the way to introduce a knob _only_ in the new cgroup API?
I am only aware of .flags = CFTYPE_INSANE, which works the other way
around.

> > The primary question would be, whether this is the best transition
> > strategy. I do not know how many users apart from developers are really
> > using unified hierarchy. I would be worried that we merge a feature which
> > will not be used for a long time.
> 
> Unified hierarchy is the next version of the cgroup interface, and
> once the development tag drops I consider the old memcg interface
> deprecated. 

Deprecated in the unified hierarchy mount, right? The old API will still
be around AFAIU. The deprecated knobs will just not be visible in the
new API. So we cannot simply remove all the code after unified
hierarchy drops its DEVEL status, can we?

> It makes very little sense to me to put up additional
> incentives at this point to continue the use of the old interface,
> when we already struggle with manpower to maintain even one of them.
> 
> > Moreover, if somebody wants to transition from soft limit then it would
> > be really hard because switching to unified hierarchy might be a no-go.
> >
> > I think that it is clear that we should deprecate soft_limit ASAP. I
> > also think it won't hurt to have min, low, high in both old and unified
> > API and strongly warn if somebody tries to use soft_limit along with any
> > of the new APIs in the first step. Later we can even forbid any
> > combination by a hard failure.
> 
> Why would somebody NOT be able to convert to unified hierarchy
> eventually?

I've mentioned that in the other email. I remember people complaining
about threads not being distributable over groups in the past. Things
might have changed in the meantime; I was too busy to pay closer
attention, so I might be completely wrong here.

> How big is the intersection of cases that can't convert to unified
> hierarchy AND are using the soft limit AND want to use the new low
> limit?

I am not talking about intentional usage of soft limit with new knobs.
That would be unsupported of course and I meant to complain about that
in the logs and later even fail on an attempt.
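
A minimal userspace sketch of that two-step policy, for illustration only.
The per-group state, the sketch_set_low_limit() helper and the `strict`
switch that selects the later hard-failure phase are all hypothetical; none
of these names exist in the kernel.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-group state; the field names are assumptions. */
struct sketch_memcg {
	bool soft_limit_set;	/* legacy soft_limit was written */
	bool low_limit_set;	/* one of the new knobs was written */
	unsigned int warnings;	/* how often the mix was complained about */
};

/*
 * Called when a new knob (min/low/high) is written.
 * Step 1 (strict == false): complain in the log but accept the write.
 * Step 2 (strict == true): reject the combination with -EINVAL.
 */
int sketch_set_low_limit(struct sketch_memcg *memcg, bool strict)
{
	if (memcg->soft_limit_set) {
		if (strict)
			return -EINVAL;
		memcg->warnings++;
		fprintf(stderr, "soft_limit mixed with low limit: unsupported\n");
	}
	memcg->low_limit_set = true;
	return 0;
}
```

A group that never touched soft_limit is unaffected in both phases; only the
mixed configuration is first warned about and later refused.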

> Merging a different concept with its own naming scheme into an already
> confusing interface, spamming the dmesg if someone gets it wrong,
> potentially introducing more breakage with the hard failure, putting
> up incentives to stick with a deprecated and confusing interface...
> This is a lot of horrible stuff in an attempt to accommodate very few
> usecases - if any - when we are *already versioning the interface* and
> have the opportunity for a clean transition.
> 
> The transition to min, low, high, max is effort in itself.  Conflating
> the two models sounds more detrimental than anything else, with a very
> dubious upside at that.
>
> > > It would also be beneficial to introduce them all close to each other,
> > > develop them together, possibly submit them in the same patch series,
> > > so that we know the requirements and what the code should look like in
> > > the big picture and can offer a fully consistent and documented usage
> > > model in the unified hierarchy.
> > 
> > Min and Low should definitely go together. High sounds like an
> > orthogonal problem (pro-active reclaim vs reclaim protection) so I think
> > it can go its own way and pace. We still have to discuss its semantic
> > and I feel it would be a bit disturbing to have everything in one
> > bundle.
> >
> > I do understand your point about the global picture, though. Do you
> > think that there is a risk that formulating semantic for High limit
> > might change the way how Min and Low would be defined?
> 
> I think one of the biggest hindrances in making forward progress on
> individual limits is that we only had a laundry list of occasionally
> conflicting requirements but never a consistent big picture to design
> around and match full usecases to.  It's much easier and less error
> prone to develop the concept as a whole, alongside full real-life
> configurations.
> 
> They are symmetrical pieces whose semantics very much depend on each
> other, so I wouldn't like too much lag between those.

Sure, I think we can target them for the same merge window. I am just
not sure whether one patch series is the appropriate way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
@ 2014-06-16 13:22                                                             ` Michal Hocko
  0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-16 13:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Hugh Dickins, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Michel Lespinasse, Tejun Heo, Roman Gushchin,
	linux-mm, LKML

On Thu 12-06-14 12:51:05, Johannes Weiner wrote:
> On Thu, Jun 12, 2014 at 04:22:37PM +0200, Michal Hocko wrote:
> > On Thu 12-06-14 09:56:00, Johannes Weiner wrote:
> > > On Thu, Jun 12, 2014 at 03:22:07PM +0200, Michal Hocko wrote:
> > [...]
> > > > Anyway, the situation now is pretty chaotic. I plan to gather all the
> > > > patchse posted so far and repost for the future discussion. I just need
> > > > to finish some internal tasks and will post it soon.
> > > 
> > > That would be great, thanks, it's really hard to follow this stuff
> > > halfway in and halfway outside of -mm.
> > > 
> > > Now that we roughly figured out what knobs and semantics we want, it
> > > would be great to figure out the merging logistics.
> > > 
> > > I would prefer if we could introduce max, high, low, min in unified
> > > hierarchy, and *only* in there, so that we never have to worry about
> > > it coexisting and interacting with the existing hard and soft limit.

Btw. what is the way to introduce a knob _only_ in the new cgroup API?
I am aware only about .flags = CFTYPE_INSANE which works other way
around.

> > The primary question would be, whether this is is the best transition
> > strategy. I do not know how many users apart from developers are really
> > using unified hierarchy. I would be worried that we merge a feature which
> > will not be used for a long time.
> 
> Unified hierarchy is the next version of the cgroup interface, and
> once the development tag drops I consider the old memcg interface
> deprecated. 

Deprecated in the unified hierarchy mount, right? There will be still
the old API around AFAIU. The deprecated knobs will be only not visible
in the new API. So we cannot simply remove all the code after unified
hierarchy drops its DEVEL status, can we?

> It makes very little sense to me to put up additional
> incentives at this point to continue the use of the old interface,
> when we already struggle with manpower to maintain even one of them.
> 
> > Moreover, if somebody wants to transition from soft limit then it would
> > be really hard because switching to unified hierarchy might be a no-go.
> >
> > I think that it is clear that we should deprecate soft_limit ASAP. I
> > also think it wont't hurt to have min, low, high in both old and unified
> > API and strongly warn if somebody tries to use soft_limit along with any
> > of the new APIs in the first step. Later we can even forbid any
> > combination by a hard failure.
> 
> Why would somebody NOT be able to convert to unified hierarchy
> eventually?

I've mentioned that in other email. I remember people complaining about
threads not being distributable over groups in the past. Things might
have changed in the mean time, I was too busy to pay closer attention so
I might be completely wrong here.

> How big is the intersection of cases that can't convert to unified
> hierarchy AND are using the soft limit AND want to use the new low
> limit?

I am not talking about intentional usage of soft limit with new knobs.
That would be unsupported of course and I meant to complain about that
in the logs and later even fail on an attempt.

> Merging a different concept with its own naming scheme into an already
> confusing interface, spamming the dmesg if someone gets it wrong,
> potentially introducing more breakage with the hard failure, putting
> up incentives to stick with a deprecated and confusing interface...
> This is a lot of horrible stuff in an attempt to accomodate very few
> usecases - if any - when we are *already versioning the interface* and
> have the opportunity for a clean transition.
> 
> The transition to min, low, high, max is effort in itself.  Conflating
> the two models sounds more detrimental than anything else, with a very
> dubious upside at that.
>
> > > It would also be beneficial to introduce them all close to each other,
> > > develop them together, possibly submit them in the same patch series,
> > > so that we know the requirements and what the code should look like in
> > > the big picture and can offer a fully consistent and documented usage
> > > model in the unified hierarchy.
> > 
> > Min and Low should definitely go together. High sounds like an
> > orthogonal problem (pro-active reclaim vs reclaim protection) so I think
> > it can go its own way and pace. We still have to discuss its semantics
> > and I feel it would be a bit disturbing to have everything in one
> > bundle.
> >
> > I do understand your point about the global picture, though. Do you
> > think there is a risk that formulating the semantics of the High limit
> > might change how Min and Low would be defined?
> 
> I think one of the biggest hindrances in making forward progress on
> individual limits is that we only had a laundry list of occasionally
> conflicting requirements but never a consistent big picture to design
> around and match full usecases to.  It's much easier and less error
> prone to develop the concept as a whole, alongside full real-life
> configurations.
> 
> They are symmetrical pieces whose semantics very much depend on each
> other, so I wouldn't like too much lag between those.

Sure, I think we can target them for the same merge window. I am just
not sure whether one patch series is the appropriate way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-16 12:59                                                             ` Michal Hocko
@ 2014-06-16 13:57                                                               ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-16 13:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

Hello, Michal.

On Mon, Jun 16, 2014 at 02:59:15PM +0200, Michal Hocko wrote:
> > There sure is a question of how fast userland will move to the new
> > interface. 
> 
> Yeah, I was mostly thinking about those who would need to do bigger
> changes. AFAIR threads will no longer be distributable between groups.

Thread-level granularity should go away no matter what, but this is
completely irrelevant to memcg which can't do per-thread anyway.  If,
for whatever reason, a user is stuck with thread-level granularity for
controllers which work that way, the user can use the old hierarchies
for them for the time being.

> > is used but I don't think there's any chance of removing the knob.
> > There's a reason why we're introducing a new version of the whole
> > cgroup interface which can co-exist with the existing one after all.
> > If you wanna version memcg interface separately, maybe that'd work but
> > it sounds like a lot of extra hassle for not much gain.
> 
> No, I didn't mean to version the interface. I just wanted to have
> gradual transition for potential soft_limit users.
> 
> Maybe I am misunderstanding something but I thought that new version of
> API will contain all knobs which are not marked .flags = CFTYPE_INSANE
> while the old API will contain all of them.

Nope, some changes don't fit that model.  CFTYPE_ON_ON_DFL is the
opposite.  Knobs marked with the flag only appear on the default
hierarchy (cgroup core internally calls it the default hierarchy as
this is the tree all the controllers are attached to by default).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-16 13:57                                                               ` Tejun Heo
@ 2014-06-16 14:04                                                                 ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-16 14:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

On Mon 16-06-14 09:57:41, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Jun 16, 2014 at 02:59:15PM +0200, Michal Hocko wrote:
> > > There sure is a question of how fast userland will move to the new
> > > interface. 
> > 
> > Yeah, I was mostly thinking about those who would need to do bigger
> > changes. AFAIR threads will no longer be distributable between groups.
> 
> Thread-level granularity should go away no matter what, but this is
> completely irrelevant to memcg which can't do per-thread anyway.

Yes, I wasn't worried about memcg itself. I was worried about setups
which require more controllers.

> If, for whatever reason, a user is stuck with thread-level granularity for
> controllers which work that way, the user can use the old hierarchies
> for them for the time being.

So he can mount memcg with new cgroup API and others with old?

> > > is used but I don't think there's any chance of removing the knob.
> > > There's a reason why we're introducing a new version of the whole
> > > cgroup interface which can co-exist with the existing one after all.
> > > If you wanna version memcg interface separately, maybe that'd work but
> > > it sounds like a lot of extra hassle for not much gain.
> > 
> > No, I didn't mean to version the interface. I just wanted to have
> > gradual transition for potential soft_limit users.
> > 
> > Maybe I am misunderstanding something but I thought that new version of
> > API will contain all knobs which are not marked .flags = CFTYPE_INSANE
> > while the old API will contain all of them.
> 
> Nope, some changes don't fit that model.  CFTYPE_ON_ON_DFL is the
> opposite. 

OK, I wasn't aware of this. On which branch can I find this?

> Knobs marked with the flag only appear on the default
> hierarchy (cgroup core internally calls it the default hierarchy as
> this is the tree all the controllers are attached to by default).

I am not sure I understand. So they are visible only in the hierarchy
mounted with the new cgroup API (sane, or whatever it is called)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-16 14:04                                                                 ` Michal Hocko
@ 2014-06-16 14:12                                                                   ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-16 14:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

On Mon, Jun 16, 2014 at 04:04:48PM +0200, Michal Hocko wrote:
> > If, for whatever reason, a user is stuck with thread-level granularity for
> > controllers which work that way, the user can use the old hierarchies
> > for them for the time being.
> 
> So he can mount memcg with new cgroup API and others with old?

Yes, you can read Documentation/cgroups/unified-hierarchy.txt for more
details.  I think I cc'd you when posting unified hierarchy patchset,
didn't I?

> > Nope, some changes don't fit that model.  CFTYPE_ON_ON_DFL is the
> > opposite. 
> 
> OK, I wasn't aware of this. On which branch can I find this?

They're all in the mainline now.

> > Knobs marked with the flag only appear on the default
> > hierarchy (cgroup core internally calls it the default hierarchy as
> > this is the tree all the controllers are attached to by default).
> 
> I am not sure I understand. So they are visible only in the hierarchy
> mounted with the new cgroup API (sane, or whatever it is called)?

Yeap.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-16 14:12                                                                   ` Tejun Heo
@ 2014-06-16 14:29                                                                     ` Michal Hocko
  -1 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2014-06-16 14:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

On Mon 16-06-14 10:12:33, Tejun Heo wrote:
> On Mon, Jun 16, 2014 at 04:04:48PM +0200, Michal Hocko wrote:
> > > If, for whatever reason, a user is stuck with thread-level granularity for
> > > controllers which work that way, the user can use the old hierarchies
> > > for them for the time being.
> > 
> > So he can mount memcg with new cgroup API and others with old?
> 
> Yes, you can read Documentation/cgroups/unified-hierarchy.txt for more
> details.  I think I cc'd you when posting unified hierarchy patchset,
> didn't I?

OK, I've obviously pushed that out of my brain, because you are really
clear about it:
"
All controllers which are not bound to other hierarchies are
automatically bound to unified hierarchy and show up at the root of
it. Controllers which are enabled only in the root of unified
hierarchy can be bound to other hierarchies at any time.  This allows
mixing unified hierarchy with the traditional multiple hierarchies in
a fully backward compatible way.
"

This of course sorts out my concerns. Sorry about the noise!

> > > Nope, some changes don't fit that model.  CFTYPE_ON_ON_DFL is the
> > > opposite. 
> > 
> > OK, I wasn't aware of this. On which branch can I find this?
> 
> They're all in the mainline now.

git grep CFTYPE_ON_ON_DFL origin/master didn't show me anything.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 196+ messages in thread

* Re: [PATCH 2/2] memcg: Allow guarantee reclaim
  2014-06-16 14:29                                                                     ` Michal Hocko
@ 2014-06-16 14:40                                                                       ` Tejun Heo
  -1 siblings, 0 replies; 196+ messages in thread
From: Tejun Heo @ 2014-06-16 14:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Michel Lespinasse,
	Roman Gushchin, linux-mm, LKML, Li Zefan

On Mon, Jun 16, 2014 at 04:29:15PM +0200, Michal Hocko wrote:
> > They're all in the mainline now.
> 
> git grep CFTYPE_ON_ON_DFL origin/master didn't show me anything.

lol, it should have been CFTYPE_ONLY_ON_DFL.

-- 
tejun

^ permalink raw reply	[flat|nested] 196+ messages in thread
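
For readers following the flag discussion above, here is a small
userspace sketch of the visibility rule as described in this thread:
knobs marked CFTYPE_INSANE are hidden in the new API, while knobs marked
CFTYPE_ONLY_ON_DFL appear only on the default (unified) hierarchy. This
is not kernel code. struct knob, knob_visible() and the flag bit values
are made up for illustration, and the sketch assumes the default
hierarchy always runs with sane behavior.

```c
#include <stdbool.h>

/*
 * Illustrative flag bits only. The real definitions live in
 * include/linux/cgroup.h and may use different values.
 */
enum {
	CFTYPE_INSANE      = 1u << 0,	/* hidden when sane behavior is on */
	CFTYPE_ONLY_ON_DFL = 1u << 1,	/* shown only on the default hierarchy */
};

/* Hypothetical stand-in for struct cftype: a name and its flags. */
struct knob {
	const char *name;
	unsigned int flags;
};

/*
 * Decide whether a knob would be created in a hierarchy.
 * on_dfl: true for the default (unified) hierarchy, false for an old
 * style hierarchy. Assumption: default hierarchy implies sane behavior.
 */
static bool knob_visible(const struct knob *k, bool on_dfl)
{
	if ((k->flags & CFTYPE_ONLY_ON_DFL) && !on_dfl)
		return false;	/* new-only knob on an old hierarchy */
	if ((k->flags & CFTYPE_INSANE) && on_dfl)
		return false;	/* deprecated knob on the new hierarchy */
	return true;
}
```

Under this model a knob without either flag shows up everywhere, a
CFTYPE_INSANE knob shows up only on old hierarchies, and a
CFTYPE_ONLY_ON_DFL knob shows up only on the unified hierarchy, which is
the "opposite" case Tejun points out.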

