linux-kernel.vger.kernel.org archive mirror
* [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
@ 2019-01-09 19:14 Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 1/5] doc: memcontrol: fix the obsolete content about force empty Yang Shi
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Yang Shi @ 2019-01-09 19:14 UTC (permalink / raw)
  To: mhocko, hannes, shakeelb, akpm; +Cc: yang.shi, linux-mm, linux-kernel


We have some usecases which create and remove memcgs very frequently,
and the tasks in the memcg may access files which are unlikely to be
accessed by anyone else.  So, we prefer to force_empty the memcg before
rmdir'ing it to reclaim the page cache, so that it doesn't accumulate
and incur unnecessary memory pressure; such memory pressure may trigger
direct reclaim and harm latency-sensitive applications.

force_empty would help such usecases; however, it reclaims memory
synchronously when writing to memory.force_empty.  It may take some
time to return, and the operations that follow are blocked by it.
Although this can be done in the background, some usecases may need to
create a new memcg with the same name right after the old one is
deleted, so the creation might get blocked by the earlier
reclaim/remove operation.

Deferring memory reclaim to cgroup offline for such usecases sounds
reasonable.  This series introduces a new interface, called
wipe_on_offline, for both the default and legacy hierarchies, which
does memory reclaim in the css offline kworker.
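
As an illustration, the intended flow is roughly the following (a
sketch; the cgroup path is made up):

  # echo 1 > /sys/fs/cgroup/memory/job/memory.wipe_on_offline
  ... job runs and exits ...
  # rmdir /sys/fs/cgroup/memory/job    <- returns quickly
  # mkdir /sys/fs/cgroup/memory/job    <- not blocked by reclaim

The left-over page cache is reclaimed asynchronously by the offline
kworker, rather than synchronously by a write to memory.force_empty.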

v2 -> v3:
* Introduced a may_swap parameter to mem_cgroup_force_empty() to keep the
  force_empty behavior intact, per Shakeel
* Addressed other comments from Shakeel
 
v1 -> v2:
* Introduced wipe_on_offline interface suggested by Michal
* Bring force_empty into default hierarchy

Patch #1: Fix some obsolete information about force_empty in the document
Patch #2: Introduce may_swap parameter to mem_cgroup_force_empty()
Patch #3: Introduce wipe_on_offline interface
Patch #4: Bring force_empty into default hierarchy
Patch #5: Document update

Yang Shi (5):
      doc: memcontrol: fix the obsolete content about force empty
      mm: memcontrol: add may_swap parameter to mem_cgroup_force_empty()
      mm: memcontrol: introduce wipe_on_offline interface
      mm: memcontrol: bring force_empty into default hierarchy
      doc: memcontrol: add description for wipe_on_offline

 Documentation/admin-guide/cgroup-v2.rst | 23 ++++++++++++++++++++
 Documentation/cgroup-v1/memory.txt      | 17 ++++++++++++---
 include/linux/memcontrol.h              |  3 +++
 mm/memcontrol.c                         | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 98 insertions(+), 8 deletions(-)


* [v3 PATCH 1/5] doc: memcontrol: fix the obsolete content about force empty
  2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
@ 2019-01-09 19:14 ` Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 2/5] mm: memcontrol: add may_swap parameter to mem_cgroup_force_empty() Yang Shi
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2019-01-09 19:14 UTC (permalink / raw)
  To: mhocko, hannes, shakeelb, akpm; +Cc: yang.shi, linux-mm, linux-kernel

We don't reparent page cache anymore when offlining a memcg, so update
the force_empty related content accordingly.

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/cgroup-v1/memory.txt | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index 3682e99..8e2cb1d 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -70,7 +70,7 @@ Brief summary of control files.
  memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
- memory.force_empty		 # trigger forced move charge to parent
+ memory.force_empty		 # trigger forced page reclaim
  memory.pressure_level		 # set memory pressure notifications
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
@@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
   the cgroup will be reclaimed and as many pages reclaimed as possible.
 
   The typical use case for this interface is before calling rmdir().
-  Because rmdir() moves all pages to parent, some out-of-use page caches can be
-  moved to the parent. If you want to avoid that, force_empty will be useful.
+  Though rmdir() offlines the memcg, the memcg may still stay around due to
+  charged file caches. Some out-of-use page caches may remain charged until
+  memory pressure happens. If you want to avoid that, force_empty will be useful.
 
   Also, note that when memory.kmem.limit_in_bytes is set the charges due to
   kernel pages will still be seen. This is not considered a failure and the
-- 
1.8.3.1



* [v3 PATCH 2/5] mm: memcontrol: add may_swap parameter to mem_cgroup_force_empty()
  2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 1/5] doc: memcontrol: fix the obsolete content about force empty Yang Shi
@ 2019-01-09 19:14 ` Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 3/5] mm: memcontrol: introduce wipe_on_offline interface Yang Shi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2019-01-09 19:14 UTC (permalink / raw)
  To: mhocko, hannes, shakeelb, akpm; +Cc: yang.shi, linux-mm, linux-kernel

mem_cgroup_force_empty() will be reused by the following patch, which
does memory reclaim when offlining.  It is unnecessary to swap in that
path, but force_empty's behavior still needs to be kept intact since it
is also used by other usecases, per Shakeel.

So, introduce a may_swap parameter to mem_cgroup_force_empty().  This
is preparation for the following patch.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/memcontrol.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b..eaa3970 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2878,7 +2878,7 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
  *
  * Caller is responsible for holding css reference for memcg.
  */
-static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
+static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool may_swap)
 {
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 
@@ -2895,7 +2895,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
+							GFP_KERNEL, may_swap);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
@@ -2915,7 +2915,7 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
 
 	if (mem_cgroup_is_root(memcg))
 		return -EINVAL;
-	return mem_cgroup_force_empty(memcg) ?: nbytes;
+	return mem_cgroup_force_empty(memcg, true) ?: nbytes;
 }
 
 static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
-- 
1.8.3.1



* [v3 PATCH 3/5] mm: memcontrol: introduce wipe_on_offline interface
  2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 1/5] doc: memcontrol: fix the obsolete content about force empty Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 2/5] mm: memcontrol: add may_swap parameter to mem_cgroup_force_empty() Yang Shi
@ 2019-01-09 19:14 ` Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 4/5] mm: memcontrol: bring force_empty into default hierarchy Yang Shi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2019-01-09 19:14 UTC (permalink / raw)
  To: mhocko, hannes, shakeelb, akpm; +Cc: yang.shi, linux-mm, linux-kernel

We have some usecases which create and remove memcgs very frequently,
and the tasks in the memcg may access files which are unlikely to be
accessed by anyone else.  So, we prefer to force_empty the memcg before
rmdir'ing it to reclaim the page cache, so that it doesn't accumulate
and incur unnecessary memory pressure; such memory pressure may trigger
direct reclaim and harm latency-sensitive applications.

force_empty would help such usecases; however, it reclaims memory
synchronously when writing to memory.force_empty.  It may take some
time to return, and the operations that follow are blocked by it.
Although this can be done in the background, some usecases may need to
create a new memcg with the same name right after the old one is
deleted, so the creation might get blocked by the earlier
reclaim/remove operation.

Deferring memory reclaim to cgroup offline for such usecases sounds
reasonable.  Introduce a new interface, called wipe_on_offline, for
both the default and legacy hierarchies, which does memory reclaim in
the css offline kworker.

Writing 1 to it enables the behavior, writing 0 disables it.
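
For example, on the legacy hierarchy (a sketch; the mount point and
cgroup name are made up):

  # mkdir /sys/fs/cgroup/memory/job
  # echo 1 > /sys/fs/cgroup/memory/job/memory.wipe_on_offline
  ... run the job ...
  # rmdir /sys/fs/cgroup/memory/job

rmdir does not wait for the reclaim; the remaining page cache is
reclaimed in the css offline kworker.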

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83ae11c..2f1258a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -311,6 +311,9 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 
+	/* Reclaim as much memory as possible in the offline kworker */
+	bool wipe_on_offline;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eaa3970..ff50810 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2918,6 +2918,35 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
 	return mem_cgroup_force_empty(memcg, true) ?: nbytes;
 }
 
+static int wipe_on_offline_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	seq_printf(m, "%lu\n", (unsigned long)memcg->wipe_on_offline);
+
+	return 0;
+}
+
+static int wipe_on_offline_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val)
+{
+	int ret = 0;
+
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
+
+	if (val == 0)
+		memcg->wipe_on_offline = false;
+	else if (val == 1)
+		memcg->wipe_on_offline = true;
+	else
+		ret = -EINVAL;
+
+	return ret;
+}
+
 static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
 				     struct cftype *cft)
 {
@@ -4283,6 +4312,11 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 		.write = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read_u64,
 	},
+	{
+		.name = "wipe_on_offline",
+		.seq_show = wipe_on_offline_show,
+		.write_u64 = wipe_on_offline_write,
+	},
 	{ },	/* terminate */
 };
 
@@ -4569,11 +4603,20 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	page_counter_set_min(&memcg->memory, 0);
 	page_counter_set_low(&memcg->memory, 0);
 
+	/*
+	 * Reclaim as much memory as possible when offlining.
+	 *
+	 * Do it after min/low are reset, otherwise some memory might
+	 * be protected by min/low.
+	 */
+	if (memcg->wipe_on_offline)
+		mem_cgroup_force_empty(memcg, false);
+	else
+		drain_all_stock(memcg);
+
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
 
-	drain_all_stock(memcg);
-
 	mem_cgroup_id_put(memcg);
 }
 
@@ -5694,6 +5737,12 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "wipe_on_offline",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = wipe_on_offline_show,
+		.write_u64 = wipe_on_offline_write,
+	},
 	{ }	/* terminate */
 };
 
-- 
1.8.3.1



* [v3 PATCH 4/5] mm: memcontrol: bring force_empty into default hierarchy
  2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
                   ` (2 preceding siblings ...)
  2019-01-09 19:14 ` [v3 PATCH 3/5] mm: memcontrol: introduce wipe_on_offline interface Yang Shi
@ 2019-01-09 19:14 ` Yang Shi
  2019-01-09 19:14 ` [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline Yang Shi
  2019-01-09 19:32 ` [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Johannes Weiner
  5 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2019-01-09 19:14 UTC (permalink / raw)
  To: mhocko, hannes, shakeelb, akpm; +Cc: yang.shi, linux-mm, linux-kernel

The default hierarchy doesn't support force_empty, but there are some
usecases which create and remove memcgs very frequently, and the tasks
in the memcg may access files which are unlikely to be accessed by
anyone else. So, we prefer to force_empty the memcg before rmdir'ing it
to reclaim the page cache, so that it doesn't accumulate and incur
unnecessary memory pressure; such memory pressure may trigger direct
reclaim and harm latency-sensitive applications.

There is another patch which introduces asynchronous memory reclaim
when offlining, but the behavior of force_empty is still needed by some
usecases which want to get the memory reclaimed immediately.  So, bring
the force_empty interface into the default hierarchy too.
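
For example, on the default hierarchy (a sketch; the cgroup path is
made up):

  # echo 0 > /sys/fs/cgroup/job/memory.force_empty
  # rmdir /sys/fs/cgroup/job

As on the legacy hierarchy, the written value is ignored; any write
triggers the synchronous reclaim.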

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 14 ++++++++++++++
 mm/memcontrol.c                         |  4 ++++
 2 files changed, 18 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7bf3f12..0290c65 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1289,6 +1289,20 @@ PAGE_SIZE multiple when read back.
 	Shows pressure stall information for memory. See
 	Documentation/accounting/psi.txt for details.
 
+  memory.force_empty
+        This interface is provided to make the cgroup's memory usage empty.
+        When anything is written to this
+
+        # echo 0 > memory.force_empty
+
+        the cgroup will be reclaimed and as many pages reclaimed as possible.
+
+        The typical use case for this interface is before calling rmdir().
+        Though rmdir() offlines the memcg, the memcg may still stay around due to
+        charged file caches. Some out-of-use page caches may remain charged until
+        memory pressure happens. If you want to avoid that, force_empty will be
+        useful.
+
 
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff50810..5d42a19 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5743,6 +5743,10 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 		.seq_show = wipe_on_offline_show,
 		.write_u64 = wipe_on_offline_write,
 	},
+	{
+		.name = "force_empty",
+		.write = mem_cgroup_force_empty_write,
+	},
 	{ }	/* terminate */
 };
 
-- 
1.8.3.1



* [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline
  2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
                   ` (3 preceding siblings ...)
  2019-01-09 19:14 ` [v3 PATCH 4/5] mm: memcontrol: bring force_empty into default hierarchy Yang Shi
@ 2019-01-09 19:14 ` Yang Shi
  2019-01-10 12:00   ` William Kucharski
  2019-01-09 19:32 ` [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Johannes Weiner
  5 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2019-01-09 19:14 UTC (permalink / raw)
  To: mhocko, hannes, shakeelb, akpm; +Cc: yang.shi, linux-mm, linux-kernel

Add desprition of wipe_on_offline interface in cgroup documents.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 +++++++++
 Documentation/cgroup-v1/memory.txt      | 10 ++++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0290c65..e4ef08c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1303,6 +1303,15 @@ PAGE_SIZE multiple when read back.
         memory pressure happens. If you want to avoid that, force_empty will be
         useful.
 
+  memory.wipe_on_offline
+
+        This is similar to force_empty, but it just does memory reclaim
+        asynchronously in css offline kworker.
+
+        Writing into 1 will enable it, disable it by writing into 0.
+
+        It would reclaim as much as possible memory just as what force_empty does.
+
 
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index 8e2cb1d..1c6e1ca 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -71,6 +71,7 @@ Brief summary of control files.
  memory.stat			 # show various statistics
  memory.use_hierarchy		 # set/show hierarchical account enabled
  memory.force_empty		 # trigger forced page reclaim
+ memory.wipe_on_offline		 # trigger forced page reclaim when offlining
  memory.pressure_level		 # set memory pressure notifications
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
@@ -581,6 +582,15 @@ hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
 The "total" count is sum of file + anon + unevictable.
 
+5.7 wipe_on_offline
+
+This is similar to force_empty, but it just does memory reclaim asynchronously
+in css offline kworker.
+
+Writing into 1 will enable it, disable it by writing into 0.
+
+It would reclaim as much as possible memory just as what force_empty does.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.8.3.1



* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
                   ` (4 preceding siblings ...)
  2019-01-09 19:14 ` [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline Yang Shi
@ 2019-01-09 19:32 ` Johannes Weiner
  2019-01-09 20:36   ` Yang Shi
  5 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2019-01-09 19:32 UTC (permalink / raw)
  To: Yang Shi; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel

On Thu, Jan 10, 2019 at 03:14:40AM +0800, Yang Shi wrote:
> 
> We have some usecases which create and remove memcgs very frequently,
> and the tasks in the memcg may access files which are unlikely to be
> accessed by anyone else.  So, we prefer to force_empty the memcg before
> rmdir'ing it to reclaim the page cache, so that it doesn't accumulate
> and incur unnecessary memory pressure; such memory pressure may trigger
> direct reclaim and harm latency-sensitive applications.

We have kswapd for exactly this purpose. Can you lay out more details
on why that is not good enough, especially in conjunction with tuning
the watermark_scale_factor etc.?

We've been pretty adamant that users shouldn't use drop_caches for
performance for example, and that the need to do this usually is
indicative of a problem or suboptimal tuning in the VM subsystem.
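
For reference, that is the global knob

  # echo 3 > /proc/sys/vm/drop_caches

which drops clean page cache and reclaimable slab objects system-wide.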

How is this different?


* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-09 19:32 ` [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Johannes Weiner
@ 2019-01-09 20:36   ` Yang Shi
  2019-01-09 21:23     ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2019-01-09 20:36 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel



On 1/9/19 11:32 AM, Johannes Weiner wrote:
> On Thu, Jan 10, 2019 at 03:14:40AM +0800, Yang Shi wrote:
>> We have some usecases which create and remove memcgs very frequently,
>> and the tasks in the memcg may access files which are unlikely to be
>> accessed by anyone else.  So, we prefer to force_empty the memcg before
>> rmdir'ing it to reclaim the page cache, so that it doesn't accumulate
>> and incur unnecessary memory pressure; such memory pressure may trigger
>> direct reclaim and harm latency-sensitive applications.
> We have kswapd for exactly this purpose. Can you lay out more details
> on why that is not good enough, especially in conjunction with tuning
> the watermark_scale_factor etc.?

watermark_scale_factor does help some workloads in general. However,
in some of our workloads, memcgs might be created and then allocate
memory faster than kswapd can reclaim. And, the tuning may work for one
kind of machine or workload, but not for others. We may have different
kinds of workloads (for example, latency-sensitive and batch jobs)
running on the same machine, so it is hard for us to guarantee all the
workloads work well together by relying on kswapd and
watermark_scale_factor only.
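
For reference, the tuning in question is a single global knob, e.g.

  # sysctl vm.watermark_scale_factor=200

which makes kswapd wake up earlier and reclaim more before it goes back
to sleep, on every zone; there is no per-memcg equivalent of it.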

And, we know the page cache access pattern would be one-off for some
memcgs, and those page caches are unlikely to be shared by others, so
why not just drop them when the memcg is offlined? Reclaiming those
cold page caches earlier would also improve the efficiency of memcg
creation in the long run.

>
> We've been pretty adamant that users shouldn't use drop_caches for
> performance for example, and that the need to do this usually is
> indicative of a problem or suboptimal tuning in the VM subsystem.
>
> How is this different?

IMHO, that depends on the usecases and workloads. As I mentioned above, 
if we know some page caches from some memcgs are referenced one-off and 
unlikely shared, why just keep them around to increase memory pressure?

Thanks,
Yang




* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-09 20:36   ` Yang Shi
@ 2019-01-09 21:23     ` Johannes Weiner
  2019-01-09 22:09       ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2019-01-09 21:23 UTC (permalink / raw)
  To: Yang Shi; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel

On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
> As I mentioned above, if we know some page caches from some memcgs
> are referenced one-off and unlikely shared, why just keep them
> around to increase memory pressure?

It's just not clear to me that your scenarios are generic enough to
justify adding two interfaces that we have to maintain forever, and
that they couldn't be solved with existing mechanisms.

Please explain:

- Unmapped clean page cache isn't expensive to reclaim, certainly
  cheaper than the IO involved in new application startup. How could
  recycling clean cache be a prohibitive part of workload warmup?

- Why you cannot temporarily raise the kswapd watermarks right before
  an important application starts up (your answer was sorta handwavy)

- Why you cannot use madvise/fadvise when an application whose cache
  you won't reuse exits

- Why you couldn't set memory.high or memory.max to 0 after the
  application quits and before you call rmdir on the cgroup

Adding a permanent kernel interface is a serious measure. I think you
need to make a much better case for it, discuss why other options are
not practical, and show that this will be a generally useful thing for
cgroup users and not just a niche fix for very specific situations.


* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-09 21:23     ` Johannes Weiner
@ 2019-01-09 22:09       ` Yang Shi
  2019-01-09 22:51         ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2019-01-09 22:09 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel



On 1/9/19 1:23 PM, Johannes Weiner wrote:
> On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
>> As I mentioned above, if we know some page caches from some memcgs
>> are referenced one-off and unlikely shared, why just keep them
>> around to increase memory pressure?
> It's just not clear to me that your scenarios are generic enough to
> justify adding two interfaces that we have to maintain forever, and
> that they couldn't be solved with existing mechanisms.
>
> Please explain:
>
> - Unmapped clean page cache isn't expensive to reclaim, certainly
>    cheaper than the IO involved in new application startup. How could
>    recycling clean cache be a prohibitive part of workload warmup?

It is nothing about recycling. Those page caches might be referenced by 
memcg just once, then nobody touch them until memory pressure is hit. 
And, they might be not accessed again at any time soon.

>
> - Why you cannot temporarily raise the kswapd watermarks right before
>    an important application starts up (your answer was sorta handwavy)

It could, but the kswapd watermarks are global. Boosting them may cause
kswapd to reclaim memory from memcgs which we want to keep untouched.
Although v2's low/min could provide some protection, reclaim is still
not prohibited in general. And, v1 doesn't have such protection at all.

force_empty or wipe_on_offline could be used to target specific memcgs
for which we know exactly what they do, or from which it is safe to
reclaim memory. IMHO, this makes for better isolation.
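
For example, something like

  # echo 0 > /sys/fs/cgroup/memory/batch-job/memory.force_empty

(the cgroup name is made up) only touches the one memcg we know is safe
to reclaim from, while a raised global watermark affects every memcg on
the machine.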

>
> - Why you cannot use madvise/fadvise when an application whose cache
>    you won't reuse exits

Sure we can. But, we can't guarantee all applications use them properly.

>
> - Why you couldn't set memory.high or memory.max to 0 after the
>    application quits and before you call rmdir on the cgroup

I recall I explained this in the review email for the first version. Set 
memory.high or memory.max to 0 would trigger direct reclaim which may 
stall the offline of memcg. But, we have "restarting the same name job" 
logic in our usecase (I'm not quite sure why they do so). Basically, it 
means to create memcg with the exact same name right after the old one 
is deleted, but may have different limit or other settings. The creation 
has to wait for rmdir is done.

>
> Adding a permanent kernel interface is a serious measure. I think you
> need to make a much better case for it, discuss why other options are
> not practical, and show that this will be a generally useful thing for
> cgroup users and not just a niche fix for very specific situations.

I do understand your concern about the maintenance cost of a permanent
kernel interface. I'm not quite sure if this is generic enough;
however, Michal Hocko did mention "It seems we have several people
asking for something like that already.", so at least it doesn't sound
like "a niche fix for very specific situations".

In my first submission, I reused the force_empty interface to keep the
change less intrusive, so at least there was no new interface. Since we
have several people asking for something like that already, Michal
suggested a new knob instead of reusing force_empty.

Thanks,
Yang




* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-09 22:09       ` Yang Shi
@ 2019-01-09 22:51         ` Johannes Weiner
  2019-01-10  1:47           ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2019-01-09 22:51 UTC (permalink / raw)
  To: Yang Shi; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel

On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
> On 1/9/19 1:23 PM, Johannes Weiner wrote:
> > On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
> > > As I mentioned above, if we know some page caches from some memcgs
> > > are referenced one-off and unlikely shared, why just keep them
> > > around to increase memory pressure?
> > It's just not clear to me that your scenarios are generic enough to
> > justify adding two interfaces that we have to maintain forever, and
> > that they couldn't be solved with existing mechanisms.
> > 
> > Please explain:
> > 
> > - Unmapped clean page cache isn't expensive to reclaim, certainly
> >    cheaper than the IO involved in new application startup. How could
> >    recycling clean cache be a prohibitive part of workload warmup?
> 
> It is nothing about recycling. Those page caches might be referenced by
> memcg just once, then nobody touch them until memory pressure is hit. And,
> they might be not accessed again at any time soon.

I meant recycling the page frames, not the cache in them. So the new
workload as it starts up needs to take those pages from the LRU list
instead of just the allocator freelist. While that's obviously not the
same cost, it's not clear why the difference would be prohibitive to
application startup especially since app startup tends to be dominated
by things like IO to fault in executables etc.

> > - Why you couldn't set memory.high or memory.max to 0 after the
> >    application quits and before you call rmdir on the cgroup
> 
> I recall I explained this in the review email for the first version. Set
> memory.high or memory.max to 0 would trigger direct reclaim which may stall
> the offline of memcg. But, we have "restarting the same name job" logic in
> our usecase (I'm not quite sure why they do so). Basically, it means to
> create memcg with the exact same name right after the old one is deleted,
> but may have different limit or other settings. The creation has to wait for
> rmdir is done.

This really needs a fix on your end. We cannot add new cgroup control
files because you cannot handle a delayed release in the cgroupfs
namespace while you're reclaiming associated memory. A simple serial
number would fix this.

Whether others have asked for this knob or not, these patches should
come with a solid case in the cover letter and changelogs that explain
why this ABI is necessary to solve a generic cgroup usecase. But it
sounds to me that setting the limit to 0 once the group is empty would
meet the functional requirement (use fork() if you don't want to wait)
of what you are trying to do.

I don't think the new interface bar is met here.


* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-09 22:51         ` Johannes Weiner
@ 2019-01-10  1:47           ` Yang Shi
  2019-01-14 19:01             ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2019-01-10  1:47 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel



On 1/9/19 2:51 PM, Johannes Weiner wrote:
> On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
>> On 1/9/19 1:23 PM, Johannes Weiner wrote:
>>> On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
>>>> As I mentioned above, if we know some page caches from some memcgs
>>>> are referenced one-off and unlikely shared, why just keep them
>>>> around to increase memory pressure?
>>> It's just not clear to me that your scenarios are generic enough to
>>> justify adding two interfaces that we have to maintain forever, and
>>> that they couldn't be solved with existing mechanisms.
>>>
>>> Please explain:
>>>
>>> - Unmapped clean page cache isn't expensive to reclaim, certainly
>>>     cheaper than the IO involved in new application startup. How could
>>>     recycling clean cache be a prohibitive part of workload warmup?
>> It is nothing about recycling. Those page caches might be referenced by
>> memcg just once, then nobody touch them until memory pressure is hit. And,
>> they might be not accessed again at any time soon.
> I meant recycling the page frames, not the cache in them. So the new
> workload as it starts up needs to take those pages from the LRU list
> instead of just the allocator freelist. While that's obviously not the
> same cost, it's not clear why the difference would be prohibitive to
> application startup especially since app startup tends to be dominated
> by things like IO to fault in executables etc.

I'm a little bit confused here. Even though those page frames are not 
reclaimed by force_empty, they would be reclaimed by kswapd later when 
memory pressure is hit. For some usecases, they may prefer get recycled 
before kswapd kick them out LRU, but for some usecases avoiding memory 
pressure might outpace page frame recycling.

>
>>> - Why you couldn't set memory.high or memory.max to 0 after the
>>>     application quits and before you call rmdir on the cgroup
>> I recall I explained this in the review email for the first version. Set
>> memory.high or memory.max to 0 would trigger direct reclaim which may stall
>> the offline of memcg. But, we have "restarting the same name job" logic in
>> our usecase (I'm not quite sure why they do so). Basically, it means to
>> create memcg with the exact same name right after the old one is deleted,
>> but may have different limit or other settings. The creation has to wait for
>> rmdir is done.
> This really needs a fix on your end. We cannot add new cgroup control
> files because you cannot handle a delayed release in the cgroupfs
> namespace while you're reclaiming associated memory. A simple serial
> number would fix this.
>
> Whether others have asked for this knob or not, these patches should
> come with a solid case in the cover letter and changelogs that explain
> why this ABI is necessary to solve a generic cgroup usecase. But it
> sounds to me that setting the limit to 0 once the group is empty would
> meet the functional requirement (use fork() if you don't want to wait)
> of what you are trying to do.

Do you mean do something like the below:

echo 0 > cg1/memory.max &
rmdir cg1 &
mkdir cg1 &

But, the latency is still there, even though memcg creation (mkdir) can 
be done very fast by using fork(), the latency would delay afterwards 
operations, i.e. attaching tasks (echo PID > cg1/cgroup.procs). When we 
calculating the time consumption of the container deployment, we would 
count from mkdir to the job is actually launched.

So, without deferring force_empty to the offline kworker, we still
suffer from the latency.

Am I missing anything?

Thanks,
Yang

>
> I don't think the new interface bar is met here.



* Re: [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline
  2019-01-09 19:14 ` [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline Yang Shi
@ 2019-01-10 12:00   ` William Kucharski
  0 siblings, 0 replies; 15+ messages in thread
From: William Kucharski @ 2019-01-10 12:00 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, hannes, shakeelb, akpm, linux-mm, linux-kernel

Just a few grammar corrections since this is going into Documentation:


> On Jan 9, 2019, at 12:14 PM, Yang Shi <yang.shi@linux.alibaba.com> wrote:
> 
> Add desprition of wipe_on_offline interface in cgroup documents.
Add a description of the wipe_on_offline interface to the cgroup documents.

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
> Documentation/admin-guide/cgroup-v2.rst |  9 +++++++++
> Documentation/cgroup-v1/memory.txt      | 10 ++++++++++
> 2 files changed, 19 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 0290c65..e4ef08c 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1303,6 +1303,15 @@ PAGE_SIZE multiple when read back.
>         memory pressure happens. If you want to avoid that, force_empty will be
>         useful.
> 
> +  memory.wipe_on_offline
> +
> +        This is similar to force_empty, but it just does memory reclaim
> +        asynchronously in css offline kworker.
> +
> +        Writing into 1 will enable it, disable it by writing into 0.
Writing a 1 will enable it; writing a 0 will disable it.

> +
> +        It would reclaim as much as possible memory just as what force_empty does.
It will reclaim as much memory as possible, just as force_empty does.

> +
> 
> Usage Guidelines
> ~~~~~~~~~~~~~~~~
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index 8e2cb1d..1c6e1ca 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -71,6 +71,7 @@ Brief summary of control files.
>  memory.stat			 # show various statistics
>  memory.use_hierarchy		 # set/show hierarchical account enabled
>  memory.force_empty		 # trigger forced page reclaim
> + memory.wipe_on_offline		 # trigger forced page reclaim when offlining
>  memory.pressure_level		 # set memory pressure notifications
>  memory.swappiness		 # set/show swappiness parameter of vmscan
> 				 (See sysctl's vm.swappiness)
> @@ -581,6 +582,15 @@ hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
> 
> The "total" count is sum of file + anon + unevictable.
> 
> +5.7 wipe_on_offline
> +
> +This is similar to force_empty, but it just does memory reclaim asynchronously
> +in css offline kworker.
> +
> +Writing into 1 will enable it, disable it by writing into 0.
Writing a 1 will enable it; writing a 0 will disable it.

> +
> +It would reclaim as much as possible memory just as what force_empty does.
It will reclaim as much memory as possible, just as force_empty does.

> +
> 6. Hierarchy support
> 
> The memory controller supports a deep hierarchy and hierarchical accounting.
> -- 
> 1.8.3.1
> 



* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-10  1:47           ` Yang Shi
@ 2019-01-14 19:01             ` Johannes Weiner
  2019-01-17 22:55               ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2019-01-14 19:01 UTC (permalink / raw)
  To: Yang Shi; +Cc: mhocko, shakeelb, akpm, linux-mm, linux-kernel

On Wed, Jan 09, 2019 at 05:47:41PM -0800, Yang Shi wrote:
> On 1/9/19 2:51 PM, Johannes Weiner wrote:
> > On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
> > > On 1/9/19 1:23 PM, Johannes Weiner wrote:
> > > > On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
> > > > > As I mentioned above, if we know some page caches from some memcgs
> > > > > are referenced one-off and unlikely shared, why just keep them
> > > > > around to increase memory pressure?
> > > > It's just not clear to me that your scenarios are generic enough to
> > > > justify adding two interfaces that we have to maintain forever, and
> > > > that they couldn't be solved with existing mechanisms.
> > > > 
> > > > Please explain:
> > > > 
> > > > - Unmapped clean page cache isn't expensive to reclaim, certainly
> > > >     cheaper than the IO involved in new application startup. How could
> > > >     recycling clean cache be a prohibitive part of workload warmup?
> > > It is nothing about recycling. Those page caches might be referenced by
> > > memcg just once, then nobody touch them until memory pressure is hit. And,
> > > they might be not accessed again at any time soon.
> > I meant recycling the page frames, not the cache in them. So the new
> > workload as it starts up needs to take those pages from the LRU list
> > instead of just the allocator freelist. While that's obviously not the
> > same cost, it's not clear why the difference would be prohibitive to
> > application startup especially since app startup tends to be dominated
> > by things like IO to fault in executables etc.
> 
> I'm a little bit confused here. Even though those page frames are not
> reclaimed by force_empty, they would be reclaimed by kswapd later when
> memory pressure is hit. For some usecases, they may prefer get recycled
> before kswapd kick them out LRU, but for some usecases avoiding memory
> pressure might outpace page frame recycling.

I understand that, but you're not providing data for the "may prefer"
part. You haven't shown that any proactive reclaim actually matters
and is a significant net improvement to a real workload in a real
hardware environment, and that the usecase is generic and widespread
enough to warrant an entirely new kernel interface.

> > > > - Why you couldn't set memory.high or memory.max to 0 after the
> > > >     application quits and before you call rmdir on the cgroup
> > > I recall I explained this in the review email for the first version. Set
> > > memory.high or memory.max to 0 would trigger direct reclaim which may stall
> > > the offline of memcg. But, we have "restarting the same name job" logic in
> > > our usecase (I'm not quite sure why they do so). Basically, it means to
> > > create memcg with the exact same name right after the old one is deleted,
> > > but may have different limit or other settings. The creation has to wait for
> > > rmdir is done.
> > This really needs a fix on your end. We cannot add new cgroup control
> > files because you cannot handle a delayed release in the cgroupfs
> > namespace while you're reclaiming associated memory. A simple serial
> > number would fix this.
> > 
> > Whether others have asked for this knob or not, these patches should
> > come with a solid case in the cover letter and changelogs that explain
> > why this ABI is necessary to solve a generic cgroup usecase. But it
> > sounds to me that setting the limit to 0 once the group is empty would
> > meet the functional requirement (use fork() if you don't want to wait)
> > of what you are trying to do.
> 
> Do you mean do something like the below:
> 
> echo 0 > cg1/memory.max &
> rmdir cg1 &
> mkdir cg1 &
>
> But, the latency is still there, even though memcg creation (mkdir) can be
> done very fast by using fork(), the latency would delay afterwards
> operations, i.e. attaching tasks (echo PID > cg1/cgroup.procs). When we
> calculating the time consumption of the container deployment, we would count
> from mkdir to the job is actually launched.

I'm saying that the same-name requirement is your problem, not the
kernel's. It's not unreasonable for the kernel to say that as long as
you want to do something with the cgroup, such as forcibly emptying
out the left-over cache, that the group name stays in the namespace.

Requiring the same exact cgroup name for another instance of the same
job sounds like a bogus requirement. Surely you can use serial numbers
to denote subsequent invocations of the same job and handle that from
whatever job management software you're using:

	( echo 0 > job12345-1/memory.max; rmdir job12345-1 ) &
	mkdir job12345-2

See, completely decoupled.


* Re: [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining
  2019-01-14 19:01             ` Johannes Weiner
@ 2019-01-17 22:55               ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2019-01-17 22:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yang Shi, Michal Hocko, Shakeel Butt, Andrew Morton, Linux MM,
	Linux Kernel Mailing List

Not sure if you guys received my reply from yesterday or not. I sent
it twice, but both got bounced back. Maybe my company email server has
some problems, so I'm sending this from my personal email.


On Mon, Jan 14, 2019 at 11:01 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 09, 2019 at 05:47:41PM -0800, Yang Shi wrote:
> > On 1/9/19 2:51 PM, Johannes Weiner wrote:
> > > On Wed, Jan 09, 2019 at 02:09:20PM -0800, Yang Shi wrote:
> > > > On 1/9/19 1:23 PM, Johannes Weiner wrote:
> > > > > On Wed, Jan 09, 2019 at 12:36:11PM -0800, Yang Shi wrote:
> > > > > > As I mentioned above, if we know some page caches from some memcgs
> > > > > > are referenced one-off and unlikely shared, why just keep them
> > > > > > around to increase memory pressure?
> > > > > It's just not clear to me that your scenarios are generic enough to
> > > > > justify adding two interfaces that we have to maintain forever, and
> > > > > that they couldn't be solved with existing mechanisms.
> > > > >
> > > > > Please explain:
> > > > >
> > > > > - Unmapped clean page cache isn't expensive to reclaim, certainly
> > > > >     cheaper than the IO involved in new application startup. How could
> > > > >     recycling clean cache be a prohibitive part of workload warmup?
> > > > It is nothing about recycling. Those page caches might be referenced by
> > > > memcg just once, then nobody touch them until memory pressure is hit. And,
> > > > they might be not accessed again at any time soon.
> > > I meant recycling the page frames, not the cache in them. So the new
> > > workload as it starts up needs to take those pages from the LRU list
> > > instead of just the allocator freelist. While that's obviously not the
> > > same cost, it's not clear why the difference would be prohibitive to
> > > application startup especially since app startup tends to be dominated
> > > by things like IO to fault in executables etc.
> >
> > I'm a little bit confused here. Even though those page frames are not
> > reclaimed by force_empty, they would be reclaimed by kswapd later when
> > memory pressure is hit. For some usecases, they may prefer get recycled
> > before kswapd kick them out LRU, but for some usecases avoiding memory
> > pressure might outpace page frame recycling.
>
> I understand that, but you're not providing data for the "may prefer"
> part. You haven't shown that any proactive reclaim actually matters
> and is a significant net improvement to a real workload in a real
> hardware environment, and that the usecase is generic and widespread
> enough to warrant an entirely new kernel interface.

Proactive reclaim could prevent offline memcgs from accumulating. In
our production environment, we have seen the number of offline memcgs
reach over 450K (with just a few hundred online memcgs) in some cases.
kswapd is supposed to help remove offline memcgs when memory pressure
hits, but with such a huge number of offline memcgs, kswapd would take
a very long time to iterate over all of them. Such a huge number of
offline memcgs could also bring in other latency problems whenever
iterating memcgs is needed, e.g. showing memory.stat, direct reclaim,
oom, etc.

So, we also use force_empty to keep the number of offline memcgs reasonable.
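
On the default hierarchy, the pile-up is visible in cgroup.stat, for
example (illustrative numbers):

  # cat /sys/fs/cgroup/cgroup.stat
  nr_descendants 312
  nr_dying_descendants 458219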

And, Fam Zheng from Bytedance noticed that delayed force_empty gets
things done more effectively. Please see the discussion here:
https://www.spinics.net/lists/cgroups/msg21259.html

Thanks,
Yang

> > > > > - Why you couldn't set memory.high or memory.max to 0 after the
> > > > >     application quits and before you call rmdir on the cgroup
> > > > I recall I explained this in the review email for the first version. Set
> > > > memory.high or memory.max to 0 would trigger direct reclaim which may stall
> > > > the offline of memcg. But, we have "restarting the same name job" logic in
> > > > our usecase (I'm not quite sure why they do so). Basically, it means to
> > > > create memcg with the exact same name right after the old one is deleted,
> > > > but may have different limit or other settings. The creation has to wait for
> > > > rmdir is done.
> > > This really needs a fix on your end. We cannot add new cgroup control
> > > files because you cannot handle a delayed release in the cgroupfs
> > > namespace while you're reclaiming associated memory. A simple serial
> > > number would fix this.
> > >
> > > Whether others have asked for this knob or not, these patches should
> > > come with a solid case in the cover letter and changelogs that explain
> > > why this ABI is necessary to solve a generic cgroup usecase. But it
> > > sounds to me that setting the limit to 0 once the group is empty would
> > > meet the functional requirement (use fork() if you don't want to wait)
> > > of what you are trying to do.
> >
> > Do you mean do something like the below:
> >
> > echo 0 > cg1/memory.max &
> > rmdir cg1 &
> > mkdir cg1 &
> >
> > But, the latency is still there, even though memcg creation (mkdir) can be
> > done very fast by using fork(), the latency would delay afterwards
> > operations, i.e. attaching tasks (echo PID > cg1/cgroup.procs). When we
> > calculating the time consumption of the container deployment, we would count
> > from mkdir to the job is actually launched.
>
> I'm saying that the same-name requirement is your problem, not the
> kernel's. It's not unreasonable for the kernel to say that as long as
> you want to do something with the cgroup, such as forcibly emptying
> out the left-over cache, that the group name stays in the namespace.
>
> Requiring the same exact cgroup name for another instance of the same
> job sounds like a bogus requirement. Surely you can use serial numbers
> to denote subsequent invocations of the same job and handle that from
> whatever job management software you're using:
>
>         ( echo 0 > job12345-1/memory.max; rmdir job12345-1 ) &
>         mkdir job12345-2
>
> See, completely decoupled.
>


Thread overview: 15+ messages
2019-01-09 19:14 [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Yang Shi
2019-01-09 19:14 ` [v3 PATCH 1/5] doc: memcontrol: fix the obsolete content about force empty Yang Shi
2019-01-09 19:14 ` [v3 PATCH 2/5] mm: memcontrol: add may_swap parameter to mem_cgroup_force_empty() Yang Shi
2019-01-09 19:14 ` [v3 PATCH 3/5] mm: memcontrol: introduce wipe_on_offline interface Yang Shi
2019-01-09 19:14 ` [v3 PATCH 4/5] mm: memcontrol: bring force_empty into default hierarchy Yang Shi
2019-01-09 19:14 ` [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline Yang Shi
2019-01-10 12:00   ` William Kucharski
2019-01-09 19:32 ` [RFC v3 PATCH 0/5] mm: memcontrol: do memory reclaim when offlining Johannes Weiner
2019-01-09 20:36   ` Yang Shi
2019-01-09 21:23     ` Johannes Weiner
2019-01-09 22:09       ` Yang Shi
2019-01-09 22:51         ` Johannes Weiner
2019-01-10  1:47           ` Yang Shi
2019-01-14 19:01             ` Johannes Weiner
2019-01-17 22:55               ` Yang Shi
