All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19  0:01 ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19  0:01 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Michal Hocko, Greg Thelen, Johannes Weiner, Hugh Dickins,
	Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, linux-doc, Shakeel Butt

The memory controller in cgroup v1 provides the memory+swap (memsw)
interface to account to the combined usage of memory and swap of the
jobs. The memsw interface allows the users to limit or view the
consistent memory usage of their jobs irrespectibe of the presense of
swap on the system (consistent OOM and memory reclaim behavior). The
memory+swap accounting makes the job easier for centralized systems
doing resource usage monitoring, prediction or anomaly detection.

In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
interface has been introduced which allows to limit the actual usage of
swap by the job. For the systems where swap is a limited resource,
'swap' interface can be used to fairly distribute the swap resource
between different jobs. There is no easy way to limit the swap usage
using the 'memsw' interface.

However for the systems where the swap is cheap and can be increased
dynamically (like remote swap and swap on zram), the 'memsw' interface
is much more appropriate as it makes swap transparent to the jobs and
gives consistent memory usage history to centralized monitoring systems.

This patch adds memsw interface to cgroup v2 memory controller behind a
mount option 'memsw'. The memsw interface is mutually exclusive with
the existing swap interface. When 'memsw' is enabled, reading or writing
to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
or disabling memsw through remounting cgroup v2, will only be effective
if there are no decendants of the root cgroup.

When memsw accounting is enabled then "memory.high" is comapred with
memory+swap usage. So, when the allocating job's memsw usage hits its
high mark, the job will be throttled by triggering memory reclaim.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 Documentation/cgroup-v2.txt |  69 ++++++++++++++++++++--------
 include/linux/cgroup-defs.h |   5 +++
 kernel/cgroup/cgroup.c      |  12 +++++
 mm/memcontrol.c             | 107 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 157 insertions(+), 36 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 9a4f2e54a97d..1cbc51203b00 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -169,6 +169,12 @@ cgroup v2 currently supports the following mount options.
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
+  memsw
+
+	Allows the enforcement of memory+swap limit on cgroups. This
+	option is system wide and can only be set on mount and can only
+	be modified through remount from the init namespace and if root
+	cgroup has no children.
 
 Organizing Processes and Threads
 --------------------------------
@@ -1020,6 +1026,10 @@ PAGE_SIZE multiple when read back.
 	Going over the high limit never invokes the OOM killer and
 	under extreme conditions the limit may be breached.
 
+	If memsw (memory+swap) enforcement is enabled then the
+	cgroup's memory+swap usage is checked against memory.high
+	instead of just memory.
+
   memory.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
@@ -1207,18 +1217,39 @@ PAGE_SIZE multiple when read back.
 
   memory.swap.current
 	A read-only single value file which exists on non-root
-	cgroups.
+	cgroups. If memsw is enabled then reading this file will return
+	-ENOTSUPP.
 
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
   memory.swap.max
 	A read-write single value file which exists on non-root
-	cgroups.  The default is "max".
+	cgroups.  The default is "max". Accessing this file will return
+	-ENOTSUPP if memsw enforcement is enabled.
 
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous meomry of the cgroup will not be swapped out.
 
+  memory.memsw.current
+	A read-only single value file which exists on non-root
+	cgroups. -ENOTSUPP will be returned on read if memsw is not
+	enabled.
+
+	The total amount of memory+swap currently being used by the cgroup
+	and its descendants.
+
+  memory.memsw.max
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max". -ENOTSUPP will be returned on
+	access if memsw is not enabled.
+
+	Memory+swap usage hard limit. If a cgroup's memory+swap usage
+	reaches this limit and 	can't be reduced, the OOM killer is
+	invoked in the cgroup. Under certain circumstances, the usage
+	may go over the limit temporarily.
+
+
 
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
@@ -1243,6 +1274,23 @@ memory - is necessary to determine whether a workload needs more
 memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.
 
+Please note that when memory+swap accounting is enforced then the
+"memory.high" is checked and enforced against memory+swap usage instead
+of just memory usage.
+
+Memory+Swap interface
+~~~~~~~~~~~~~~~~~~~~~
+
+The memory+swap i.e. memsw interface allows to limit and view the
+combined usage of memory and swap of the jobs. It gives a consistent
+memory usage history or memory limit enforcement irrespective of the
+presense of swap on the system. The consistent memory usage history
+is useful for centralized systems doing resource usage monitoring,
+prediction or anomaly detection.
+
+Also when swap is cheap, can be increased dynamically, is a system
+level resource and transparent to jobs, the memsw interface is more
+appropriate to use than just swap interface.
 
 Memory Ownership
 ~~~~~~~~~~~~~~~~
@@ -1987,20 +2035,3 @@ subject to a race condition, where concurrent charges could cause the
 limit setting to fail. memory.max on the other hand will first set the
 limit to prevent new charges, and then reclaim and OOM kill until the
 new limit is met - or the task writing to memory.max is killed.
-
-The combined memory+swap accounting and limiting is replaced by real
-control over swap space.
-
-The main argument for a combined memory+swap facility in the original
-cgroup design was that global or parental pressure would always be
-able to swap all anonymous memory of a child group, regardless of the
-child's own (possibly untrusted) configuration.  However, untrusted
-groups can sabotage swapping by other means - such as referencing its
-anonymous memory in a tight loop - and an admin can not assume full
-swappability when overcommitting untrusted jobs.
-
-For trusted jobs, on the other hand, a combined counter is not an
-intuitive userspace interface, and it flies in the face of the idea
-that cgroup controllers should account and limit specific physical
-resources.  Swap space is a resource like all others in the system,
-and that's why unified hierarchy allows distributing it separately.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9fb99e25d654..d72c14eb1f5a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -86,6 +86,11 @@ enum {
 	 * Enable cgroup-aware OOM killer.
 	 */
 	CGRP_GROUP_OOM = (1 << 5),
+
+	/*
+	 * Enable memsw interface in cgroup-v2.
+	 */
+	CGRP_ROOT_MEMSW = (1 << 6),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 693443282fc1..bedc24391879 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1734,6 +1734,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		} else if (!strcmp(token, "groupoom")) {
 			*root_flags |= CGRP_GROUP_OOM;
 			continue;
+		} else if (!strcmp(token, "memsw")) {
+			*root_flags |= CGRP_ROOT_MEMSW;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1755,6 +1758,13 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
+
+		if (!cgrp_dfl_root.cgrp.nr_descendants) {
+			if (root_flags & CGRP_ROOT_MEMSW)
+				cgrp_dfl_root.flags |= CGRP_ROOT_MEMSW;
+			else
+				cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMSW;
+		}
 	}
 }
 
@@ -1764,6 +1774,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 		seq_puts(seq, ",nsdelegate");
 	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
 		seq_puts(seq, ",groupoom");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMSW)
+		seq_puts(seq, ",memsw");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f40b5ad3f959..b04ba19a8c64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,10 +94,12 @@ int do_swap_account __read_mostly;
 #define do_swap_account		0
 #endif
 
-/* Whether legacy memory+swap accounting is active */
+/* Whether memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
+	return do_swap_account &&
+		(!cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
+		 cgrp_dfl_root.flags & CGRP_ROOT_MEMSW);
 }
 
 static const char *const mem_cgroup_lru_names[] = {
@@ -1868,11 +1870,15 @@ static void reclaim_high(struct mem_cgroup *memcg,
 			 unsigned int nr_pages,
 			 gfp_t gfp_mask)
 {
+	struct page_counter *counter;
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
+
 	do {
-		if (page_counter_read(&memcg->memory) <= memcg->high)
+		counter = memsw ? &memcg->memsw : &memcg->memory;
+		if (page_counter_read(counter) <= memcg->high)
 			continue;
 		mem_cgroup_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, !memsw);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -1912,6 +1918,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -2040,7 +2047,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > memcg->high) {
+		counter = memsw ? &memcg->memsw : &memcg->memory;
+		if (page_counter_read(counter) > memcg->high) {
 			/* Don't bother a random interrupted task */
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
@@ -3906,6 +3914,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 
@@ -3919,6 +3928,9 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
 		unsigned long used = page_counter_read(&memcg->memory);
 
+		if (memsw)
+			ceiling = min(ceiling, memcg->memsw.limit);
+
 		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
 		memcg = parent;
 	}
@@ -5395,6 +5407,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
 	unsigned long nr_pages;
 	unsigned long high;
 	int err;
@@ -5406,10 +5419,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	memcg->high = high;
 
-	nr_pages = page_counter_read(&memcg->memory);
+	nr_pages = page_counter_read(memsw ? &memcg->memsw : &memcg->memory);
 	if (nr_pages > high)
 		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					     GFP_KERNEL, true);
+					     GFP_KERNEL, !memsw);
 
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
@@ -5428,10 +5441,10 @@ static int memory_max_show(struct seq_file *m, void *v)
 	return 0;
 }
 
-static ssize_t memory_max_write(struct kernfs_open_file *of,
-				char *buf, size_t nbytes, loff_t off)
+static ssize_t counter_max_write(struct mem_cgroup *memcg,
+				 struct page_counter *counter, char *buf,
+				 size_t nbytes, bool may_swap)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
 	bool drained = false;
 	unsigned long max;
@@ -5442,10 +5455,10 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	xchg(&memcg->memory.limit, max);
+	xchg(&counter->limit, max);
 
 	for (;;) {
-		unsigned long nr_pages = page_counter_read(&memcg->memory);
+		unsigned long nr_pages = page_counter_read(counter);
 
 		if (nr_pages <= max)
 			break;
@@ -5463,7 +5476,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-							  GFP_KERNEL, true))
+							  GFP_KERNEL, may_swap))
 				nr_reclaims--;
 			continue;
 		}
@@ -5477,6 +5490,14 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	return counter_max_write(memcg, &memcg->memory, buf, nbytes, true);
+}
+
 static int memory_oom_group_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -6311,7 +6332,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account)
+	if (!do_swap_account || do_memsw_account())
 		return 0;
 
 	memcg = page->mem_cgroup;
@@ -6356,7 +6377,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
-			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+			if (!do_memsw_account())
 				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
 				page_counter_uncharge(&memcg->memsw, nr_pages);
@@ -6371,7 +6392,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!do_swap_account || do_memsw_account())
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
@@ -6388,7 +6409,7 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!do_swap_account || do_memsw_account())
 		return false;
 
 	memcg = page->mem_cgroup;
@@ -6432,6 +6453,9 @@ static int swap_max_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	unsigned long max = READ_ONCE(memcg->swap.limit);
 
+	if (do_memsw_account())
+		return -ENOTSUPP;
+
 	if (max == PAGE_COUNTER_MAX)
 		seq_puts(m, "max\n");
 	else
@@ -6447,6 +6471,9 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
 	unsigned long max;
 	int err;
 
+	if (do_memsw_account())
+		return -ENOTSUPP;
+
 	buf = strstrip(buf);
 	err = page_counter_memparse(buf, "max", &max);
 	if (err)
@@ -6461,6 +6488,41 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memsw_current_read(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return (u64)page_counter_read(&memcg->memsw) * PAGE_SIZE;
+}
+
+static int memsw_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long max = READ_ONCE(memcg->memsw.limit);
+
+	if (!do_memsw_account())
+		return -ENOTSUPP;
+
+	if (max == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memsw_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	if (!do_memsw_account())
+		return -ENOTSUPP;
+
+	return counter_max_write(memcg, &memcg->memsw, buf, nbytes, false);
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -6473,6 +6535,17 @@ static struct cftype swap_files[] = {
 		.seq_show = swap_max_show,
 		.write = swap_max_write,
 	},
+	{
+		.name = "memsw.current",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = memsw_current_read,
+	},
+	{
+		.name = "memsw.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memsw_max_show,
+		.write = memsw_max_write,
+	},
 	{ }	/* terminate */
 };
 
-- 
2.15.1.504.g5279b80103-goog

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19  0:01 ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19  0:01 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Michal Hocko, Greg Thelen, Johannes Weiner, Hugh Dickins,
	Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, linux-doc, Shakeel Butt

The memory controller in cgroup v1 provides the memory+swap (memsw)
interface to account to the combined usage of memory and swap of the
jobs. The memsw interface allows the users to limit or view the
consistent memory usage of their jobs irrespectibe of the presense of
swap on the system (consistent OOM and memory reclaim behavior). The
memory+swap accounting makes the job easier for centralized systems
doing resource usage monitoring, prediction or anomaly detection.

In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
interface has been introduced which allows to limit the actual usage of
swap by the job. For the systems where swap is a limited resource,
'swap' interface can be used to fairly distribute the swap resource
between different jobs. There is no easy way to limit the swap usage
using the 'memsw' interface.

However for the systems where the swap is cheap and can be increased
dynamically (like remote swap and swap on zram), the 'memsw' interface
is much more appropriate as it makes swap transparent to the jobs and
gives consistent memory usage history to centralized monitoring systems.

This patch adds memsw interface to cgroup v2 memory controller behind a
mount option 'memsw'. The memsw interface is mutually exclusive with
the existing swap interface. When 'memsw' is enabled, reading or writing
to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
or disabling memsw through remounting cgroup v2, will only be effective
if there are no decendants of the root cgroup.

When memsw accounting is enabled then "memory.high" is comapred with
memory+swap usage. So, when the allocating job's memsw usage hits its
high mark, the job will be throttled by triggering memory reclaim.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 Documentation/cgroup-v2.txt |  69 ++++++++++++++++++++--------
 include/linux/cgroup-defs.h |   5 +++
 kernel/cgroup/cgroup.c      |  12 +++++
 mm/memcontrol.c             | 107 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 157 insertions(+), 36 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 9a4f2e54a97d..1cbc51203b00 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -169,6 +169,12 @@ cgroup v2 currently supports the following mount options.
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
+  memsw
+
+	Allows the enforcement of memory+swap limit on cgroups. This
+	option is system wide and can only be set on mount and can only
+	be modified through remount from the init namespace and if root
+	cgroup has no children.
 
 Organizing Processes and Threads
 --------------------------------
@@ -1020,6 +1026,10 @@ PAGE_SIZE multiple when read back.
 	Going over the high limit never invokes the OOM killer and
 	under extreme conditions the limit may be breached.
 
+	If memsw (memory+swap) enforcement is enabled then the
+	cgroup's memory+swap usage is checked against memory.high
+	instead of just memory.
+
   memory.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
@@ -1207,18 +1217,39 @@ PAGE_SIZE multiple when read back.
 
   memory.swap.current
 	A read-only single value file which exists on non-root
-	cgroups.
+	cgroups. If memsw is enabled then reading this file will return
+	-ENOTSUPP.
 
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
   memory.swap.max
 	A read-write single value file which exists on non-root
-	cgroups.  The default is "max".
+	cgroups.  The default is "max". Accessing this file will return
+	-ENOTSUPP if memsw enforcement is enabled.
 
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous meomry of the cgroup will not be swapped out.
 
+  memory.memsw.current
+	A read-only single value file which exists on non-root
+	cgroups. -ENOTSUPP will be returned on read if memsw is not
+	enabled.
+
+	The total amount of memory+swap currently being used by the cgroup
+	and its descendants.
+
+  memory.memsw.max
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max". -ENOTSUPP will be returned on
+	access if memsw is not enabled.
+
+	Memory+swap usage hard limit. If a cgroup's memory+swap usage
+	reaches this limit and 	can't be reduced, the OOM killer is
+	invoked in the cgroup. Under certain circumstances, the usage
+	may go over the limit temporarily.
+
+
 
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
@@ -1243,6 +1274,23 @@ memory - is necessary to determine whether a workload needs more
 memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.
 
+Please note that when memory+swap accounting is enforced then the
+"memory.high" is checked and enforced against memory+swap usage instead
+of just memory usage.
+
+Memory+Swap interface
+~~~~~~~~~~~~~~~~~~~~~
+
+The memory+swap i.e. memsw interface allows to limit and view the
+combined usage of memory and swap of the jobs. It gives a consistent
+memory usage history or memory limit enforcement irrespective of the
+presense of swap on the system. The consistent memory usage history
+is useful for centralized systems doing resource usage monitoring,
+prediction or anomaly detection.
+
+Also when swap is cheap, can be increased dynamically, is a system
+level resource and transparent to jobs, the memsw interface is more
+appropriate to use than just swap interface.
 
 Memory Ownership
 ~~~~~~~~~~~~~~~~
@@ -1987,20 +2035,3 @@ subject to a race condition, where concurrent charges could cause the
 limit setting to fail. memory.max on the other hand will first set the
 limit to prevent new charges, and then reclaim and OOM kill until the
 new limit is met - or the task writing to memory.max is killed.
-
-The combined memory+swap accounting and limiting is replaced by real
-control over swap space.
-
-The main argument for a combined memory+swap facility in the original
-cgroup design was that global or parental pressure would always be
-able to swap all anonymous memory of a child group, regardless of the
-child's own (possibly untrusted) configuration.  However, untrusted
-groups can sabotage swapping by other means - such as referencing its
-anonymous memory in a tight loop - and an admin can not assume full
-swappability when overcommitting untrusted jobs.
-
-For trusted jobs, on the other hand, a combined counter is not an
-intuitive userspace interface, and it flies in the face of the idea
-that cgroup controllers should account and limit specific physical
-resources.  Swap space is a resource like all others in the system,
-and that's why unified hierarchy allows distributing it separately.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9fb99e25d654..d72c14eb1f5a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -86,6 +86,11 @@ enum {
 	 * Enable cgroup-aware OOM killer.
 	 */
 	CGRP_GROUP_OOM = (1 << 5),
+
+	/*
+	 * Enable memsw interface in cgroup-v2.
+	 */
+	CGRP_ROOT_MEMSW = (1 << 6),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 693443282fc1..bedc24391879 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1734,6 +1734,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		} else if (!strcmp(token, "groupoom")) {
 			*root_flags |= CGRP_GROUP_OOM;
 			continue;
+		} else if (!strcmp(token, "memsw")) {
+			*root_flags |= CGRP_ROOT_MEMSW;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1755,6 +1758,13 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
+
+		if (!cgrp_dfl_root.cgrp.nr_descendants) {
+			if (root_flags & CGRP_ROOT_MEMSW)
+				cgrp_dfl_root.flags |= CGRP_ROOT_MEMSW;
+			else
+				cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMSW;
+		}
 	}
 }
 
@@ -1764,6 +1774,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 		seq_puts(seq, ",nsdelegate");
 	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
 		seq_puts(seq, ",groupoom");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMSW)
+		seq_puts(seq, ",memsw");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f40b5ad3f959..b04ba19a8c64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,10 +94,12 @@ int do_swap_account __read_mostly;
 #define do_swap_account		0
 #endif
 
-/* Whether legacy memory+swap accounting is active */
+/* Whether memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
+	return do_swap_account &&
+		(!cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
+		 cgrp_dfl_root.flags & CGRP_ROOT_MEMSW);
 }
 
 static const char *const mem_cgroup_lru_names[] = {
@@ -1868,11 +1870,15 @@ static void reclaim_high(struct mem_cgroup *memcg,
 			 unsigned int nr_pages,
 			 gfp_t gfp_mask)
 {
+	struct page_counter *counter;
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
+
 	do {
-		if (page_counter_read(&memcg->memory) <= memcg->high)
+		counter = memsw ? &memcg->memsw : &memcg->memory;
+		if (page_counter_read(counter) <= memcg->high)
 			continue;
 		mem_cgroup_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, !memsw);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -1912,6 +1918,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -2040,7 +2047,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > memcg->high) {
+		counter = memsw ? &memcg->memsw : &memcg->memory;
+		if (page_counter_read(counter) > memcg->high) {
 			/* Don't bother a random interrupted task */
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
@@ -3906,6 +3914,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 
@@ -3919,6 +3928,9 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
 		unsigned long used = page_counter_read(&memcg->memory);
 
+		if (memsw)
+			ceiling = min(ceiling, memcg->memsw.limit);
+
 		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
 		memcg = parent;
 	}
@@ -5395,6 +5407,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	bool memsw = cgrp_dfl_root.flags & CGRP_ROOT_MEMSW;
 	unsigned long nr_pages;
 	unsigned long high;
 	int err;
@@ -5406,10 +5419,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	memcg->high = high;
 
-	nr_pages = page_counter_read(&memcg->memory);
+	nr_pages = page_counter_read(memsw ? &memcg->memsw : &memcg->memory);
 	if (nr_pages > high)
 		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					     GFP_KERNEL, true);
+					     GFP_KERNEL, !memsw);
 
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
@@ -5428,10 +5441,10 @@ static int memory_max_show(struct seq_file *m, void *v)
 	return 0;
 }
 
-static ssize_t memory_max_write(struct kernfs_open_file *of,
-				char *buf, size_t nbytes, loff_t off)
+static ssize_t counter_max_write(struct mem_cgroup *memcg,
+				 struct page_counter *counter, char *buf,
+				 size_t nbytes, bool may_swap)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
 	bool drained = false;
 	unsigned long max;
@@ -5442,10 +5455,10 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	xchg(&memcg->memory.limit, max);
+	xchg(&counter->limit, max);
 
 	for (;;) {
-		unsigned long nr_pages = page_counter_read(&memcg->memory);
+		unsigned long nr_pages = page_counter_read(counter);
 
 		if (nr_pages <= max)
 			break;
@@ -5463,7 +5476,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-							  GFP_KERNEL, true))
+							  GFP_KERNEL, may_swap))
 				nr_reclaims--;
 			continue;
 		}
@@ -5477,6 +5490,14 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	return counter_max_write(memcg, &memcg->memory, buf, nbytes, true);
+}
+
 static int memory_oom_group_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -6311,7 +6332,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account)
+	if (!do_swap_account || do_memsw_account())
 		return 0;
 
 	memcg = page->mem_cgroup;
@@ -6356,7 +6377,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
-			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+			if (!do_memsw_account())
 				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
 				page_counter_uncharge(&memcg->memsw, nr_pages);
@@ -6371,7 +6392,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!do_swap_account || do_memsw_account())
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
@@ -6388,7 +6409,7 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!do_swap_account || do_memsw_account())
 		return false;
 
 	memcg = page->mem_cgroup;
@@ -6432,6 +6453,9 @@ static int swap_max_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	unsigned long max = READ_ONCE(memcg->swap.limit);
 
+	if (do_memsw_account())
+		return -ENOTSUPP;
+
 	if (max == PAGE_COUNTER_MAX)
 		seq_puts(m, "max\n");
 	else
@@ -6447,6 +6471,9 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
 	unsigned long max;
 	int err;
 
+	if (do_memsw_account())
+		return -ENOTSUPP;
+
 	buf = strstrip(buf);
 	err = page_counter_memparse(buf, "max", &max);
 	if (err)
@@ -6461,6 +6488,41 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memsw_current_read(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return (u64)page_counter_read(&memcg->memsw) * PAGE_SIZE;
+}
+
+static int memsw_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long max = READ_ONCE(memcg->memsw.limit);
+
+	if (!do_memsw_account())
+		return -ENOTSUPP;
+
+	if (max == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memsw_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	if (!do_memsw_account())
+		return -ENOTSUPP;
+
+	return counter_max_write(memcg, &memcg->memsw, buf, nbytes, false);
+}
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -6473,6 +6535,17 @@ static struct cftype swap_files[] = {
 		.seq_show = swap_max_show,
 		.write = swap_max_write,
 	},
+	{
+		.name = "memsw.current",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = memsw_current_read,
+	},
+	{
+		.name = "memsw.max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memsw_max_show,
+		.write = memsw_max_write,
+	},
 	{ }	/* terminate */
 };
 
-- 
2.15.1.504.g5279b80103-goog

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19  0:01 ` Shakeel Butt
  (?)
@ 2017-12-19 12:49   ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-12-19 12:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	linux-mm, linux-kernel, cgroups, linux-doc

On Mon 18-12-17 16:01:31, Shakeel Butt wrote:
> The memory controller in cgroup v1 provides the memory+swap (memsw)
> interface to account to the combined usage of memory and swap of the
> jobs. The memsw interface allows the users to limit or view the
> consistent memory usage of their jobs irrespectibe of the presense of
> swap on the system (consistent OOM and memory reclaim behavior). The
> memory+swap accounting makes the job easier for centralized systems
> doing resource usage monitoring, prediction or anomaly detection.
> 
> In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
> interface has been introduced which allows to limit the actual usage of
> swap by the job. For the systems where swap is a limited resource,
> 'swap' interface can be used to fairly distribute the swap resource
> between different jobs. There is no easy way to limit the swap usage
> using the 'memsw' interface.
> 
> However for the systems where the swap is cheap and can be increased
> dynamically (like remote swap and swap on zram), the 'memsw' interface
> is much more appropriate as it makes swap transparent to the jobs and
> gives consistent memory usage history to centralized monitoring systems.
> 
> This patch adds memsw interface to cgroup v2 memory controller behind a
> mount option 'memsw'. The memsw interface is mutually exclusive with
> the existing swap interface. When 'memsw' is enabled, reading or writing
> to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
> or disabling memsw through remounting cgroup v2, will only be effective
> if there are no decendants of the root cgroup.
> 
> When memsw accounting is enabled then "memory.high" is comapred with
> memory+swap usage. So, when the allocating job's memsw usage hits its
> high mark, the job will be throttled by triggering memory reclaim.

>From a quick look, this looks like a mess. We have agreed to go with
the current scheme for some good reasons. There are cons/pros for both
approaches but I am not convinced we should convolute the user API for
the usecase you describe.

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 12:49   ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-12-19 12:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	linux-mm, linux-kernel, cgroups, linux-doc

On Mon 18-12-17 16:01:31, Shakeel Butt wrote:
> The memory controller in cgroup v1 provides the memory+swap (memsw)
> interface to account to the combined usage of memory and swap of the
> jobs. The memsw interface allows the users to limit or view the
> consistent memory usage of their jobs irrespectibe of the presense of
> swap on the system (consistent OOM and memory reclaim behavior). The
> memory+swap accounting makes the job easier for centralized systems
> doing resource usage monitoring, prediction or anomaly detection.
> 
> In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
> interface has been introduced which allows to limit the actual usage of
> swap by the job. For the systems where swap is a limited resource,
> 'swap' interface can be used to fairly distribute the swap resource
> between different jobs. There is no easy way to limit the swap usage
> using the 'memsw' interface.
> 
> However for the systems where the swap is cheap and can be increased
> dynamically (like remote swap and swap on zram), the 'memsw' interface
> is much more appropriate as it makes swap transparent to the jobs and
> gives consistent memory usage history to centralized monitoring systems.
> 
> This patch adds memsw interface to cgroup v2 memory controller behind a
> mount option 'memsw'. The memsw interface is mutually exclusive with
> the existing swap interface. When 'memsw' is enabled, reading or writing
> to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
> or disabling memsw through remounting cgroup v2, will only be effective
> if there are no decendants of the root cgroup.
> 
> When memsw accounting is enabled then "memory.high" is comapred with
> memory+swap usage. So, when the allocating job's memsw usage hits its
> high mark, the job will be throttled by triggering memory reclaim.

>From a quick look, this looks like a mess. We have agreed to go with
the current scheme for some good reasons. There are cons/pros for both
approaches but I am not convinced we should convolute the user API for
the usecase you describe.

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 12:49   ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-12-19 12:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	linux-mm, linux-kernel, cgroups, linux-doc

On Mon 18-12-17 16:01:31, Shakeel Butt wrote:
> The memory controller in cgroup v1 provides the memory+swap (memsw)
> interface to account to the combined usage of memory and swap of the
> jobs. The memsw interface allows the users to limit or view the
> consistent memory usage of their jobs irrespectibe of the presense of
> swap on the system (consistent OOM and memory reclaim behavior). The
> memory+swap accounting makes the job easier for centralized systems
> doing resource usage monitoring, prediction or anomaly detection.
> 
> In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
> interface has been introduced which allows to limit the actual usage of
> swap by the job. For the systems where swap is a limited resource,
> 'swap' interface can be used to fairly distribute the swap resource
> between different jobs. There is no easy way to limit the swap usage
> using the 'memsw' interface.
> 
> However for the systems where the swap is cheap and can be increased
> dynamically (like remote swap and swap on zram), the 'memsw' interface
> is much more appropriate as it makes swap transparent to the jobs and
> gives consistent memory usage history to centralized monitoring systems.
> 
> This patch adds memsw interface to cgroup v2 memory controller behind a
> mount option 'memsw'. The memsw interface is mutually exclusive with
> the existing swap interface. When 'memsw' is enabled, reading or writing
> to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
> or disabling memsw through remounting cgroup v2, will only be effective
> if there are no decendants of the root cgroup.
> 
> When memsw accounting is enabled then "memory.high" is comapred with
> memory+swap usage. So, when the allocating job's memsw usage hits its
> high mark, the job will be throttled by triggering memory reclaim.

From a quick look, this looks like a mess. We have agreed to go with
the current scheme for some good reasons. There are cons/pros for both
approaches but I am not convinced we should convolute the user API for
the usecase you describe.

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 12:49   ` Michal Hocko
@ 2017-12-19 15:12     ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 15:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 4:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 18-12-17 16:01:31, Shakeel Butt wrote:
>> The memory controller in cgroup v1 provides the memory+swap (memsw)
>> interface to account to the combined usage of memory and swap of the
>> jobs. The memsw interface allows the users to limit or view the
>> consistent memory usage of their jobs irrespectibe of the presense of
>> swap on the system (consistent OOM and memory reclaim behavior). The
>> memory+swap accounting makes the job easier for centralized systems
>> doing resource usage monitoring, prediction or anomaly detection.
>>
>> In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
>> interface has been introduced which allows to limit the actual usage of
>> swap by the job. For the systems where swap is a limited resource,
>> 'swap' interface can be used to fairly distribute the swap resource
>> between different jobs. There is no easy way to limit the swap usage
>> using the 'memsw' interface.
>>
>> However for the systems where the swap is cheap and can be increased
>> dynamically (like remote swap and swap on zram), the 'memsw' interface
>> is much more appropriate as it makes swap transparent to the jobs and
>> gives consistent memory usage history to centralized monitoring systems.
>>
>> This patch adds memsw interface to cgroup v2 memory controller behind a
>> mount option 'memsw'. The memsw interface is mutually exclusive with
>> the existing swap interface. When 'memsw' is enabled, reading or writing
>> to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
>> or disabling memsw through remounting cgroup v2, will only be effective
>> if there are no decendants of the root cgroup.
>>
>> When memsw accounting is enabled then "memory.high" is comapred with
>> memory+swap usage. So, when the allocating job's memsw usage hits its
>> high mark, the job will be throttled by triggering memory reclaim.
>
> From a quick look, this looks like a mess.

The main motivation behind this patch is to convince that memsw has
genuine use-cases. How to provide memsw is still in RFC stage.
Suggestions and comments are welcomed.

> We have agreed to go with
> the current scheme for some good reasons.

Yes I agree, when the swap is a limited resource the current 'swap'
interface should be used to fairly distribute it between different
jobs.

> There are cons/pros for both
> approaches but I am not convinced we should convolute the user API for
> the usecase you describe.
>

Yes, there are pros & cons, therefore we should give users the option
to select the API that is better suited for their use-cases and
environment. Both approaches are not interchangeable. We use memsw
internally for use-cases I mentioned in commit message. This is one of
the main blockers for us to even consider cgroup-v2 for memory
controller.

>> Signed-off-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 15:12     ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 15:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 4:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 18-12-17 16:01:31, Shakeel Butt wrote:
>> The memory controller in cgroup v1 provides the memory+swap (memsw)
>> interface to account to the combined usage of memory and swap of the
>> jobs. The memsw interface allows the users to limit or view the
>> consistent memory usage of their jobs irrespectibe of the presense of
>> swap on the system (consistent OOM and memory reclaim behavior). The
>> memory+swap accounting makes the job easier for centralized systems
>> doing resource usage monitoring, prediction or anomaly detection.
>>
>> In cgroup v2, the 'memsw' interface was dropped and a new 'swap'
>> interface has been introduced which allows to limit the actual usage of
>> swap by the job. For the systems where swap is a limited resource,
>> 'swap' interface can be used to fairly distribute the swap resource
>> between different jobs. There is no easy way to limit the swap usage
>> using the 'memsw' interface.
>>
>> However for the systems where the swap is cheap and can be increased
>> dynamically (like remote swap and swap on zram), the 'memsw' interface
>> is much more appropriate as it makes swap transparent to the jobs and
>> gives consistent memory usage history to centralized monitoring systems.
>>
>> This patch adds memsw interface to cgroup v2 memory controller behind a
>> mount option 'memsw'. The memsw interface is mutually exclusive with
>> the existing swap interface. When 'memsw' is enabled, reading or writing
>> to 'swap' interface files will return -ENOTSUPP and vice versa. Enabling
>> or disabling memsw through remounting cgroup v2, will only be effective
>> if there are no decendants of the root cgroup.
>>
>> When memsw accounting is enabled then "memory.high" is comapred with
>> memory+swap usage. So, when the allocating job's memsw usage hits its
>> high mark, the job will be throttled by triggering memory reclaim.
>
> From a quick look, this looks like a mess.

The main motivation behind this patch is to convince that memsw has
genuine use-cases. How to provide memsw is still in RFC stage.
Suggestions and comments are welcomed.

> We have agreed to go with
> the current scheme for some good reasons.

Yes I agree, when the swap is a limited resource the current 'swap'
interface should be used to fairly distribute it between different
jobs.

> There are cons/pros for both
> approaches but I am not convinced we should convolute the user API for
> the usecase you describe.
>

Yes, there are pros & cons, therefore we should give users the option
to select the API that is better suited for their use-cases and
environment. Both approaches are not interchangeable. We use memsw
internally for use-cases I mentioned in commit message. This is one of
the main blockers for us to even consider cgroup-v2 for memory
controller.

>> Signed-off-by: Shakeel Butt <shakeelb@google.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 15:12     ` Shakeel Butt
@ 2017-12-19 15:24       ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-19 15:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello,

On Tue, Dec 19, 2017 at 07:12:19AM -0800, Shakeel Butt wrote:
> Yes, there are pros & cons, therefore we should give users the option
> to select the API that is better suited for their use-cases and

Heh, that's not how API decisions should be made.  The long term
outcome would be really really bad.

> environment. Both approaches are not interchangeable. We use memsw
> internally for use-cases I mentioned in commit message. This is one of
> the main blockers for us to even consider cgroup-v2 for memory
> controller.

Let's concentrate on the use case.  I couldn't quite understand what
was missing from your description.  You said that it'd make things
easier for the centralized monitoring system which isn't really a
description of a use case.  Can you please go into more details
focusing on the eventual goals (rather than what's currently
implemented)?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 15:24       ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-19 15:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello,

On Tue, Dec 19, 2017 at 07:12:19AM -0800, Shakeel Butt wrote:
> Yes, there are pros & cons, therefore we should give users the option
> to select the API that is better suited for their use-cases and

Heh, that's not how API decisions should be made.  The long term
outcome would be really really bad.

> environment. Both approaches are not interchangeable. We use memsw
> internally for use-cases I mentioned in commit message. This is one of
> the main blockers for us to even consider cgroup-v2 for memory
> controller.

Let's concentrate on the use case.  I couldn't quite understand what
was missing from your description.  You said that it'd make things
easier for the centralized monitoring system which isn't really a
description of a use case.  Can you please go into more details
focusing on the eventual goals (rather than what's currently
implemented)?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 15:24       ` Tejun Heo
@ 2017-12-19 17:23         ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 17:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 7:24 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 07:12:19AM -0800, Shakeel Butt wrote:
>> Yes, there are pros & cons, therefore we should give users the option
>> to select the API that is better suited for their use-cases and
>
> Heh, that's not how API decisions should be made.  The long term
> outcome would be really really bad.
>
>> environment. Both approaches are not interchangeable. We use memsw
>> internally for use-cases I mentioned in commit message. This is one of
>> the main blockers for us to even consider cgroup-v2 for memory
>> controller.
>
> Let's concentrate on the use case.  I couldn't quite understand what
> was missing from your description.  You said that it'd make things
> easier for the centralized monitoring system which isn't really a
> description of a use case.  Can you please go into more details
> focusing on the eventual goals (rather than what's currently
> implemented)?
>

The goal is to provide an interface that provides:

1. Consistent memory usage history
2. Consistent memory limit enforcement behavior

By consistent I mean, the environment should not affect the usage
history. For example, the presence or absence of swap or memory
pressure on the system should not affect the memory usage history i.e.
making environment an invariant. Similarly, the environment should not
affect the memcg OOM or memcg memory reclaim behavior.

To provide consistent memory usage history using the current
cgroup-v2's 'swap' interface, an additional metric expressing the
intersection of memory and swap has to be exposed. Basically memsw is
the union of memory and swap. So, if that additional metric can be
used to find the union. However for consistent memory limit
enforcement, I don't think there is an easy way to use current 'swap'
interface.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 17:23         ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 17:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 7:24 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 07:12:19AM -0800, Shakeel Butt wrote:
>> Yes, there are pros & cons, therefore we should give users the option
>> to select the API that is better suited for their use-cases and
>
> Heh, that's not how API decisions should be made.  The long term
> outcome would be really really bad.
>
>> environment. Both approaches are not interchangeable. We use memsw
>> internally for use-cases I mentioned in commit message. This is one of
>> the main blockers for us to even consider cgroup-v2 for memory
>> controller.
>
> Let's concentrate on the use case.  I couldn't quite understand what
> was missing from your description.  You said that it'd make things
> easier for the centralized monitoring system which isn't really a
> description of a use case.  Can you please go into more details
> focusing on the eventual goals (rather than what's currently
> implemented)?
>

The goal is to provide an interface that provides:

1. Consistent memory usage history
2. Consistent memory limit enforcement behavior

By consistent I mean, the environment should not affect the usage
history. For example, the presence or absence of swap or memory
pressure on the system should not affect the memory usage history i.e.
making environment an invariant. Similarly, the environment should not
affect the memcg OOM or memcg memory reclaim behavior.

To provide consistent memory usage history using the current
cgroup-v2's 'swap' interface, an additional metric expressing the
intersection of memory and swap has to be exposed. Basically memsw is
the union of memory and swap. So, if that additional metric can be
used to find the union. However for consistent memory limit
enforcement, I don't think there is an easy way to use current 'swap'
interface.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 17:23         ` Shakeel Butt
@ 2017-12-19 17:33           ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-19 17:33 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello,

On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
> To provide consistent memory usage history using the current
> cgroup-v2's 'swap' interface, an additional metric expressing the
> intersection of memory and swap has to be exposed. Basically memsw is
> the union of memory and swap. So, if that additional metric can be

Exposing anonymous pages with swap backing sounds pretty trivial.

> used to find the union. However for consistent memory limit
> enforcement, I don't think there is an easy way to use current 'swap'
> interface.

Can you please go into details on why this is important?  I get that
you can't do it as easily w/o memsw but I don't understand why this is
a critical feature.  Why is that?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 17:33           ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-19 17:33 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello,

On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
> To provide consistent memory usage history using the current
> cgroup-v2's 'swap' interface, an additional metric expressing the
> intersection of memory and swap has to be exposed. Basically memsw is
> the union of memory and swap. So, if that additional metric can be

Exposing anonymous pages with swap backing sounds pretty trivial.

> used to find the union. However for consistent memory limit
> enforcement, I don't think there is an easy way to use current 'swap'
> interface.

Can you please go into details on why this is important?  I get that
you can't do it as easily w/o memsw but I don't understand why this is
a critical feature.  Why is that?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 17:33           ` Tejun Heo
  (?)
@ 2017-12-19 18:25             ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 18:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 9:33 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
>> To provide consistent memory usage history using the current
>> cgroup-v2's 'swap' interface, an additional metric expressing the
>> intersection of memory and swap has to be exposed. Basically memsw is
>> the union of memory and swap. So, if that additional metric can be
>
> Exposing anonymous pages with swap backing sounds pretty trivial.
>
>> used to find the union. However for consistent memory limit
>> enforcement, I don't think there is an easy way to use current 'swap'
>> interface.
>
> Can you please go into details on why this is important?  I get that
> you can't do it as easily w/o memsw but I don't understand why this is
> a critical feature.  Why is that?
>

Making the runtime environment, an invariant is very critical to make
the management of a job easier whose instances run on different
clusters across the world. Some clusters might have different type of
swaps installed while some might not have one at all and the
availability of the swap can be dynamic (i.e. swap medium outage).

So, if users want to run multiple instances of a job across multiple
clusters, they should be able to specify the limits of their jobs
irrespective of the knowledge of cluster. The best case would be they
just submits their jobs without any config and the system figures out
the right limit and enforce that. And to figure out the right limit
and enforcing it, the consistent memory usage history and consistent
memory limit enforcement is very critical.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 18:25             ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 18:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 9:33 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
>> To provide consistent memory usage history using the current
>> cgroup-v2's 'swap' interface, an additional metric expressing the
>> intersection of memory and swap has to be exposed. Basically memsw is
>> the union of memory and swap. So, if that additional metric can be
>
> Exposing anonymous pages with swap backing sounds pretty trivial.
>
>> used to find the union. However for consistent memory limit
>> enforcement, I don't think there is an easy way to use current 'swap'
>> interface.
>
> Can you please go into details on why this is important?  I get that
> you can't do it as easily w/o memsw but I don't understand why this is
> a critical feature.  Why is that?
>

Making the runtime environment, an invariant is very critical to make
the management of a job easier whose instances run on different
clusters across the world. Some clusters might have different type of
swaps installed while some might not have one at all and the
availability of the swap can be dynamic (i.e. swap medium outage).

So, if users want to run multiple instances of a job across multiple
clusters, they should be able to specify the limits of their jobs
irrespective of the knowledge of cluster. The best case would be they
just submits their jobs without any config and the system figures out
the right limit and enforce that. And to figure out the right limit
and enforcing it, the consistent memory usage history and consistent
memory limit enforcement is very critical.

thanks,
Shakeel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 18:25             ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 18:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc-u79uwXL29TY76Z2rM5mHXA

On Tue, Dec 19, 2017 at 9:33 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
>> To provide consistent memory usage history using the current
>> cgroup-v2's 'swap' interface, an additional metric expressing the
>> intersection of memory and swap has to be exposed. Basically memsw is
>> the union of memory and swap. So, if that additional metric can be
>
> Exposing anonymous pages with swap backing sounds pretty trivial.
>
>> used to find the union. However for consistent memory limit
>> enforcement, I don't think there is an easy way to use current 'swap'
>> interface.
>
> Can you please go into details on why this is important?  I get that
> you can't do it as easily w/o memsw but I don't understand why this is
> a critical feature.  Why is that?
>

Making the runtime environment, an invariant is very critical to make
the management of a job easier whose instances run on different
clusters across the world. Some clusters might have different type of
swaps installed while some might not have one at all and the
availability of the swap can be dynamic (i.e. swap medium outage).

So, if users want to run multiple instances of a job across multiple
clusters, they should be able to specify the limits of their jobs
irrespective of the knowledge of cluster. The best case would be they
just submits their jobs without any config and the system figures out
the right limit and enforce that. And to figure out the right limit
and enforcing it, the consistent memory usage history and consistent
memory limit enforcement is very critical.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 18:25             ` Shakeel Butt
@ 2017-12-19 21:41               ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-19 21:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello,

On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
> Making the runtime environment, an invariant is very critical to make
> the management of a job easier whose instances run on different
> clusters across the world. Some clusters might have different type of
> swaps installed while some might not have one at all and the
> availability of the swap can be dynamic (i.e. swap medium outage).
> 
> So, if users want to run multiple instances of a job across multiple
> clusters, they should be able to specify the limits of their jobs
> irrespective of the knowledge of cluster. The best case would be they
> just submits their jobs without any config and the system figures out
> the right limit and enforce that. And to figure out the right limit
> and enforcing it, the consistent memory usage history and consistent
> memory limit enforcement is very critical.

I'm having a hard time extracting anything concrete from your
explanation on why memsw is required.  Can you please ELI5 with some
examples?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 21:41               ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-19 21:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello,

On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
> Making the runtime environment, an invariant is very critical to make
> the management of a job easier whose instances run on different
> clusters across the world. Some clusters might have different type of
> swaps installed while some might not have one at all and the
> availability of the swap can be dynamic (i.e. swap medium outage).
> 
> So, if users want to run multiple instances of a job across multiple
> clusters, they should be able to specify the limits of their jobs
> irrespective of the knowledge of cluster. The best case would be they
> just submits their jobs without any config and the system figures out
> the right limit and enforce that. And to figure out the right limit
> and enforcing it, the consistent memory usage history and consistent
> memory limit enforcement is very critical.

I'm having a hard time extracting anything concrete from your
explanation on why memsw is required.  Can you please ELI5 with some
examples?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 21:41               ` Tejun Heo
@ 2017-12-19 22:39                 ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 22:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 1:41 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
>> Making the runtime environment, an invariant is very critical to make
>> the management of a job easier whose instances run on different
>> clusters across the world. Some clusters might have different type of
>> swaps installed while some might not have one at all and the
>> availability of the swap can be dynamic (i.e. swap medium outage).
>>
>> So, if users want to run multiple instances of a job across multiple
>> clusters, they should be able to specify the limits of their jobs
>> irrespective of the knowledge of cluster. The best case would be they
>> just submits their jobs without any config and the system figures out
>> the right limit and enforce that. And to figure out the right limit
>> and enforcing it, the consistent memory usage history and consistent
>> memory limit enforcement is very critical.
>
> I'm having a hard time extracting anything concrete from your
> explanation on why memsw is required.  Can you please ELI5 with some
> examples?
>

Suppose a user wants to run multiple instances of a specific job on
different datacenters and s/he has budget of 100MiB for each instance.
The instances are schduled on the requested datacenters and the
scheduler has set the memory limit of those instances to 100MiB. Now,
some datacenters have swap deployed, so, there, let's say, the swap
limit of those instances are set according to swap medium
availability. In this setting the user will see inconsistent memcg OOM
behavior. Some of the instances see OOMs at 100MiB usage (suppose only
anon memory) while some will see OOMs way above 100MiB due to swap.
So, the user is required to know the internal knowledge of datacenters
(like which has swap or not and swap type) and has to set the limits
accordingly and thus increase the chance of config bugs.

Also different types and sizes of swap mediums in data center will
further complicates the configuration. One datacenter might have SSD
as a swap, another might be doing swap on zram and third might be
doing swap on nvdimm. Each can have different size and can be assigned
to jobs differently. So, it is possible that the instances of the same
job might be assigned different swap limit on different datacenters.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-19 22:39                 ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-19 22:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Tue, Dec 19, 2017 at 1:41 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
>> Making the runtime environment, an invariant is very critical to make
>> the management of a job easier whose instances run on different
>> clusters across the world. Some clusters might have different type of
>> swaps installed while some might not have one at all and the
>> availability of the swap can be dynamic (i.e. swap medium outage).
>>
>> So, if users want to run multiple instances of a job across multiple
>> clusters, they should be able to specify the limits of their jobs
>> irrespective of the knowledge of cluster. The best case would be they
>> just submits their jobs without any config and the system figures out
>> the right limit and enforce that. And to figure out the right limit
>> and enforcing it, the consistent memory usage history and consistent
>> memory limit enforcement is very critical.
>
> I'm having a hard time extracting anything concrete from your
> explanation on why memsw is required.  Can you please ELI5 with some
> examples?
>

Suppose a user wants to run multiple instances of a specific job on
different datacenters and s/he has budget of 100MiB for each instance.
The instances are schduled on the requested datacenters and the
scheduler has set the memory limit of those instances to 100MiB. Now,
some datacenters have swap deployed, so, there, let's say, the swap
limit of those instances are set according to swap medium
availability. In this setting the user will see inconsistent memcg OOM
behavior. Some of the instances see OOMs at 100MiB usage (suppose only
anon memory) while some will see OOMs way above 100MiB due to swap.
So, the user is required to know the internal knowledge of datacenters
(like which has swap or not and swap type) and has to set the limits
accordingly and thus increase the chance of config bugs.

Also different types and sizes of swap mediums in data center will
further complicates the configuration. One datacenter might have SSD
as a swap, another might be doing swap on zram and third might be
doing swap on nvdimm. Each can have different size and can be assigned
to jobs differently. So, it is possible that the instances of the same
job might be assigned different swap limit on different datacenters.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-19 22:39                 ` Shakeel Butt
  (?)
@ 2017-12-20 19:37                   ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-20 19:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
> Suppose a user wants to run multiple instances of a specific job on
> different datacenters and s/he has budget of 100MiB for each instance.
> The instances are schduled on the requested datacenters and the
> scheduler has set the memory limit of those instances to 100MiB. Now,
> some datacenters have swap deployed, so, there, let's say, the swap
> limit of those instances are set according to swap medium
> availability. In this setting the user will see inconsistent memcg OOM
> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
> anon memory) while some will see OOMs way above 100MiB due to swap.
> So, the user is required to know the internal knowledge of datacenters
> (like which has swap or not and swap type) and has to set the limits
> accordingly and thus increase the chance of config bugs.

I don't understand how this invariant is useful across different
backing swap devices and availability.  e.g. Our OOM decisions are
currently not great in that the kernel can easily thrash for a very
long time without making actual progresses.  If you combine that with
widely varying types and availability of swaps, whether something is
OOMing or not doesn't really tell you much.  The workload could be
running completely fine or have been thrashing without making any
meaningful forward progress for the past 15 mins.

Given that whether or not swap exists, how much is avialable and how
fast the backing swap device is all highly influential parameters in
how the workload behaves, I don't see what having sum of memory + swap
as an invariant actually buys.  And, even that essentially meaningless
invariant doesn't really exist - the performance of the swap device
absolutely affects when the OOM killer would kick in.

So, I don't see how the sum of memory+swap makes it possible to ignore
the swap type and availability.  Can you please explain that further?

> Also different types and sizes of swap mediums in data center will
> further complicates the configuration. One datacenter might have SSD
> as a swap, another might be doing swap on zram and third might be
> doing swap on nvdimm. Each can have different size and can be assigned
> to jobs differently. So, it is possible that the instances of the same
> job might be assigned different swap limit on different datacenters.

Sure, but what does memswap achieve?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-20 19:37                   ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-20 19:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
> Suppose a user wants to run multiple instances of a specific job on
> different datacenters and s/he has budget of 100MiB for each instance.
> The instances are schduled on the requested datacenters and the
> scheduler has set the memory limit of those instances to 100MiB. Now,
> some datacenters have swap deployed, so, there, let's say, the swap
> limit of those instances are set according to swap medium
> availability. In this setting the user will see inconsistent memcg OOM
> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
> anon memory) while some will see OOMs way above 100MiB due to swap.
> So, the user is required to know the internal knowledge of datacenters
> (like which has swap or not and swap type) and has to set the limits
> accordingly and thus increase the chance of config bugs.

I don't understand how this invariant is useful across different
backing swap devices and availability.  e.g. Our OOM decisions are
currently not great in that the kernel can easily thrash for a very
long time without making actual progresses.  If you combine that with
widely varying types and availability of swaps, whether something is
OOMing or not doesn't really tell you much.  The workload could be
running completely fine or have been thrashing without making any
meaningful forward progress for the past 15 mins.

Given that whether or not swap exists, how much is avialable and how
fast the backing swap device is all highly influential parameters in
how the workload behaves, I don't see what having sum of memory + swap
as an invariant actually buys.  And, even that essentially meaningless
invariant doesn't really exist - the performance of the swap device
absolutely affects when the OOM killer would kick in.

So, I don't see how the sum of memory+swap makes it possible to ignore
the swap type and availability.  Can you please explain that further?

> Also different types and sizes of swap mediums in data center will
> further complicates the configuration. One datacenter might have SSD
> as a swap, another might be doing swap on zram and third might be
> doing swap on nvdimm. Each can have different size and can be assigned
> to jobs differently. So, it is possible that the instances of the same
> job might be assigned different swap limit on different datacenters.

Sure, but what does memswap achieve?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-20 19:37                   ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-20 19:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc-u79uwXL29TY76Z2rM5mHXA

Hello, Shakeel.

On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
> Suppose a user wants to run multiple instances of a specific job on
> different datacenters and s/he has budget of 100MiB for each instance.
> The instances are schduled on the requested datacenters and the
> scheduler has set the memory limit of those instances to 100MiB. Now,
> some datacenters have swap deployed, so, there, let's say, the swap
> limit of those instances are set according to swap medium
> availability. In this setting the user will see inconsistent memcg OOM
> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
> anon memory) while some will see OOMs way above 100MiB due to swap.
> So, the user is required to know the internal knowledge of datacenters
> (like which has swap or not and swap type) and has to set the limits
> accordingly and thus increase the chance of config bugs.

I don't understand how this invariant is useful across different
backing swap devices and availability.  e.g. Our OOM decisions are
currently not great in that the kernel can easily thrash for a very
long time without making actual progresses.  If you combine that with
widely varying types and availability of swaps, whether something is
OOMing or not doesn't really tell you much.  The workload could be
running completely fine or have been thrashing without making any
meaningful forward progress for the past 15 mins.

Given that whether or not swap exists, how much is avialable and how
fast the backing swap device is all highly influential parameters in
how the workload behaves, I don't see what having sum of memory + swap
as an invariant actually buys.  And, even that essentially meaningless
invariant doesn't really exist - the performance of the swap device
absolutely affects when the OOM killer would kick in.

So, I don't see how the sum of memory+swap makes it possible to ignore
the swap type and availability.  Can you please explain that further?

> Also different types and sizes of swap mediums in data center will
> further complicates the configuration. One datacenter might have SSD
> as a swap, another might be doing swap on zram and third might be
> doing swap on nvdimm. Each can have different size and can be assigned
> to jobs differently. So, it is possible that the instances of the same
> job might be assigned different swap limit on different datacenters.

Sure, but what does memswap achieve?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-20 19:37                   ` Tejun Heo
@ 2017-12-20 20:15                     ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-20 20:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Wed, Dec 20, 2017 at 11:37 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
>> Suppose a user wants to run multiple instances of a specific job on
>> different datacenters and s/he has budget of 100MiB for each instance.
>> The instances are schduled on the requested datacenters and the
>> scheduler has set the memory limit of those instances to 100MiB. Now,
>> some datacenters have swap deployed, so, there, let's say, the swap
>> limit of those instances are set according to swap medium
>> availability. In this setting the user will see inconsistent memcg OOM
>> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
>> anon memory) while some will see OOMs way above 100MiB due to swap.
>> So, the user is required to know the internal knowledge of datacenters
>> (like which has swap or not and swap type) and has to set the limits
>> accordingly and thus increase the chance of config bugs.
>
> I don't understand how this invariant is useful across different
> backing swap devices and availability.  e.g. Our OOM decisions are
> currently not great in that the kernel can easily thrash for a very
> long time without making actual progresses.  If you combine that with
> widely varying types and availability of swaps,

The kernel never swaps out on hitting memsw limit. So, the varying
types and availability of swaps becomes invariant to the memcg OOM
behavior of the job.

> whether something is
> OOMing or not doesn't really tell you much.  The workload could be
> running completely fine or have been thrashing without making any
> meaningful forward progress for the past 15 mins.
>
> Given that whether or not swap exists, how much is avialable and how
> fast the backing swap device is all highly influential parameters in
> how the workload behaves, I don't see what having sum of memory + swap
> as an invariant actually buys.  And, even that essentially meaningless
> invariant doesn't really exist - the performance of the swap device
> absolutely affects when the OOM killer would kick in.
>

No, as I previously explained, the swap types and availability will be
transparent to the memcg OOM killer and memcg memory reclaim behavior.

> So, I don't see how the sum of memory+swap makes it possible to ignore
> the swap type and availability.  Can you please explain that further?
>
>> Also different types and sizes of swap mediums in data center will
>> further complicates the configuration. One datacenter might have SSD
>> as a swap, another might be doing swap on zram and third might be
>> doing swap on nvdimm. Each can have different size and can be assigned
>> to jobs differently. So, it is possible that the instances of the same
>> job might be assigned different swap limit on different datacenters.
>
> Sure, but what does memswap achieve?
>

1. memswap provides consistent memcg OOM killer and memcg memory
reclaim behavior independent to swap.
2. With memswap, the job owners do not have to think or worry about swaps.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-20 20:15                     ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-20 20:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Wed, Dec 20, 2017 at 11:37 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
>> Suppose a user wants to run multiple instances of a specific job on
>> different datacenters and s/he has budget of 100MiB for each instance.
>> The instances are schduled on the requested datacenters and the
>> scheduler has set the memory limit of those instances to 100MiB. Now,
>> some datacenters have swap deployed, so, there, let's say, the swap
>> limit of those instances are set according to swap medium
>> availability. In this setting the user will see inconsistent memcg OOM
>> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
>> anon memory) while some will see OOMs way above 100MiB due to swap.
>> So, the user is required to know the internal knowledge of datacenters
>> (like which has swap or not and swap type) and has to set the limits
>> accordingly and thus increase the chance of config bugs.
>
> I don't understand how this invariant is useful across different
> backing swap devices and availability.  e.g. Our OOM decisions are
> currently not great in that the kernel can easily thrash for a very
> long time without making actual progresses.  If you combine that with
> widely varying types and availability of swaps,

The kernel never swaps out on hitting memsw limit. So, the varying
types and availability of swaps becomes invariant to the memcg OOM
behavior of the job.

> whether something is
> OOMing or not doesn't really tell you much.  The workload could be
> running completely fine or have been thrashing without making any
> meaningful forward progress for the past 15 mins.
>
> Given that whether or not swap exists, how much is avialable and how
> fast the backing swap device is all highly influential parameters in
> how the workload behaves, I don't see what having sum of memory + swap
> as an invariant actually buys.  And, even that essentially meaningless
> invariant doesn't really exist - the performance of the swap device
> absolutely affects when the OOM killer would kick in.
>

No, as I previously explained, the swap types and availability will be
transparent to the memcg OOM killer and memcg memory reclaim behavior.

> So, I don't see how the sum of memory+swap makes it possible to ignore
> the swap type and availability.  Can you please explain that further?
>
>> Also different types and sizes of swap mediums in data center will
>> further complicates the configuration. One datacenter might have SSD
>> as a swap, another might be doing swap on zram and third might be
>> doing swap on nvdimm. Each can have different size and can be assigned
>> to jobs differently. So, it is possible that the instances of the same
>> job might be assigned different swap limit on different datacenters.
>
> Sure, but what does memswap achieve?
>

1. memswap provides consistent memcg OOM killer and memcg memory
reclaim behavior independent to swap.
2. With memswap, the job owners do not have to think or worry about swaps.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-20 20:15                     ` Shakeel Butt
@ 2017-12-20 20:27                       ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-20 20:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Wed, Dec 20, 2017 at 12:15 PM, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Dec 20, 2017 at 11:37 AM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Shakeel.
>>
>> On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
>>> Suppose a user wants to run multiple instances of a specific job on
>>> different datacenters and s/he has budget of 100MiB for each instance.
>>> The instances are schduled on the requested datacenters and the
>>> scheduler has set the memory limit of those instances to 100MiB. Now,
>>> some datacenters have swap deployed, so, there, let's say, the swap
>>> limit of those instances are set according to swap medium
>>> availability. In this setting the user will see inconsistent memcg OOM
>>> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
>>> anon memory) while some will see OOMs way above 100MiB due to swap.
>>> So, the user is required to know the internal knowledge of datacenters
>>> (like which has swap or not and swap type) and has to set the limits
>>> accordingly and thus increase the chance of config bugs.
>>
>> I don't understand how this invariant is useful across different
>> backing swap devices and availability.  e.g. Our OOM decisions are
>> currently not great in that the kernel can easily thrash for a very
>> long time without making actual progresses.  If you combine that with
>> widely varying types and availability of swaps,
>
> The kernel never swaps out on hitting memsw limit. So, the varying
> types and availability of swaps becomes invariant to the memcg OOM
> behavior of the job.
>
>> whether something is
>> OOMing or not doesn't really tell you much.  The workload could be
>> running completely fine or have been thrashing without making any
>> meaningful forward progress for the past 15 mins.
>>
>> Given that whether or not swap exists, how much is avialable and how
>> fast the backing swap device is all highly influential parameters in
>> how the workload behaves, I don't see what having sum of memory + swap
>> as an invariant actually buys.  And, even that essentially meaningless
>> invariant doesn't really exist - the performance of the swap device
>> absolutely affects when the OOM killer would kick in.
>>
>
> No, as I previously explained, the swap types and availability will be
> transparent to the memcg OOM killer and memcg memory reclaim behavior.
>
>> So, I don't see how the sum of memory+swap makes it possible to ignore
>> the swap type and availability.  Can you please explain that further?
>>
>>> Also different types and sizes of swap mediums in data center will
>>> further complicates the configuration. One datacenter might have SSD
>>> as a swap, another might be doing swap on zram and third might be
>>> doing swap on nvdimm. Each can have different size and can be assigned
>>> to jobs differently. So, it is possible that the instances of the same
>>> job might be assigned different swap limit on different datacenters.
>>
>> Sure, but what does memswap achieve?
>>
>
> 1. memswap provides consistent memcg OOM killer and memcg memory
> reclaim behavior independent to swap.
> 2. With memswap, the job owners do not have to think or worry about swaps.

When I say OOM and memory reclaim behavior, I specifically mean memcg
oom-kill and memcg memory reclaim behavior. These are different from
global oom-killer and global memory reclaim behaviors. The global
behaviors will be affected by the types and availability of swaps and
the jobs can suffer differently based on swap types and availability
on hitting global OOM scenario.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-20 20:27                       ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-20 20:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Wed, Dec 20, 2017 at 12:15 PM, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Dec 20, 2017 at 11:37 AM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Shakeel.
>>
>> On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
>>> Suppose a user wants to run multiple instances of a specific job on
>>> different datacenters and s/he has budget of 100MiB for each instance.
>>> The instances are schduled on the requested datacenters and the
>>> scheduler has set the memory limit of those instances to 100MiB. Now,
>>> some datacenters have swap deployed, so, there, let's say, the swap
>>> limit of those instances are set according to swap medium
>>> availability. In this setting the user will see inconsistent memcg OOM
>>> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
>>> anon memory) while some will see OOMs way above 100MiB due to swap.
>>> So, the user is required to know the internal knowledge of datacenters
>>> (like which has swap or not and swap type) and has to set the limits
>>> accordingly and thus increase the chance of config bugs.
>>
>> I don't understand how this invariant is useful across different
>> backing swap devices and availability.  e.g. Our OOM decisions are
>> currently not great in that the kernel can easily thrash for a very
>> long time without making actual progresses.  If you combine that with
>> widely varying types and availability of swaps,
>
> The kernel never swaps out on hitting memsw limit. So, the varying
> types and availability of swaps becomes invariant to the memcg OOM
> behavior of the job.
>
>> whether something is
>> OOMing or not doesn't really tell you much.  The workload could be
>> running completely fine or have been thrashing without making any
>> meaningful forward progress for the past 15 mins.
>>
>> Given that whether or not swap exists, how much is avialable and how
>> fast the backing swap device is all highly influential parameters in
>> how the workload behaves, I don't see what having sum of memory + swap
>> as an invariant actually buys.  And, even that essentially meaningless
>> invariant doesn't really exist - the performance of the swap device
>> absolutely affects when the OOM killer would kick in.
>>
>
> No, as I previously explained, the swap types and availability will be
> transparent to the memcg OOM killer and memcg memory reclaim behavior.
>
>> So, I don't see how the sum of memory+swap makes it possible to ignore
>> the swap type and availability.  Can you please explain that further?
>>
>>> Also different types and sizes of swap mediums in data center will
>>> further complicates the configuration. One datacenter might have SSD
>>> as a swap, another might be doing swap on zram and third might be
>>> doing swap on nvdimm. Each can have different size and can be assigned
>>> to jobs differently. So, it is possible that the instances of the same
>>> job might be assigned different swap limit on different datacenters.
>>
>> Sure, but what does memswap achieve?
>>
>
> 1. memswap provides consistent memcg OOM killer and memcg memory
> reclaim behavior independent to swap.
> 2. With memswap, the job owners do not have to think or worry about swaps.

When I say OOM and memory reclaim behavior, I specifically mean memcg
oom-kill and memcg memory reclaim behavior. These are different from
global oom-killer and global memory reclaim behaviors. The global
behaviors will be affected by the types and availability of swaps and
the jobs can suffer differently based on swap types and availability
on hitting global OOM scenario.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-20 20:15                     ` Shakeel Butt
@ 2017-12-20 23:36                       ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-20 23:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Wed, Dec 20, 2017 at 12:15:46PM -0800, Shakeel Butt wrote:
> > I don't understand how this invariant is useful across different
> > backing swap devices and availability.  e.g. Our OOM decisions are
> > currently not great in that the kernel can easily thrash for a very
> > long time without making actual progresses.  If you combine that with
> > widely varying types and availability of swaps,
> 
> The kernel never swaps out on hitting memsw limit. So, the varying
> types and availability of swaps becomes invariant to the memcg OOM
> behavior of the job.

The kernel doesn't swap because of memsw because that wouldn't change
the memsw number; however, that has nothing to do with whether the
underlying swap device affects OOM behavior or not.  That invariant
can't prevent memcg decisions from being affected by the performance
of the underlying swap device.  How could it possibly achieve that?

The only reason memsw was designed the way it was designed was to
avoid lower swap limit meaning more memory consumption.  It is true
that swap and memory consumptions are interlinked; however, so are
memory and io, and we can't solve these issues by interlinking
separate resources in a single resource knob and that's why they're
separate in cgroup2.

> > Sure, but what does memswap achieve?
> 
> 1. memswap provides consistent memcg OOM killer and memcg memory
> reclaim behavior independent to swap.
> 2. With memswap, the job owners do not have to think or worry about swaps.

To me, you sound massively confused on what memsw can do.  It could be
that I'm just not understanding what you're saying.  So, let's try
this one more time.  Can you please give one concrete example of memsw
achieving critical capabilities that aren't possible without it?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-20 23:36                       ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-20 23:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Wed, Dec 20, 2017 at 12:15:46PM -0800, Shakeel Butt wrote:
> > I don't understand how this invariant is useful across different
> > backing swap devices and availability.  e.g. Our OOM decisions are
> > currently not great in that the kernel can easily thrash for a very
> > long time without making actual progresses.  If you combine that with
> > widely varying types and availability of swaps,
> 
> The kernel never swaps out on hitting memsw limit. So, the varying
> types and availability of swaps becomes invariant to the memcg OOM
> behavior of the job.

The kernel doesn't swap because of memsw because that wouldn't change
the memsw number; however, that has nothing to do with whether the
underlying swap device affects OOM behavior or not.  That invariant
can't prevent memcg decisions from being affected by the performance
of the underlying swap device.  How could it possibly achieve that?

The only reason memsw was designed the way it was designed was to
avoid lower swap limit meaning more memory consumption.  It is true
that swap and memory consumptions are interlinked; however, so are
memory and io, and we can't solve these issues by interlinking
separate resources in a single resource knob and that's why they're
separate in cgroup2.

> > Sure, but what does memswap achieve?
> 
> 1. memswap provides consistent memcg OOM killer and memcg memory
> reclaim behavior independent to swap.
> 2. With memswap, the job owners do not have to think or worry about swaps.

To me, you sound massively confused on what memsw can do.  It could be
that I'm just not understanding what you're saying.  So, let's try
this one more time.  Can you please give one concrete example of memsw
achieving critical capabilities that aren't possible without it?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-20 23:36                       ` Tejun Heo
@ 2017-12-21  1:15                         ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-21  1:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Wed, Dec 20, 2017 at 3:36 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Wed, Dec 20, 2017 at 12:15:46PM -0800, Shakeel Butt wrote:
>> > I don't understand how this invariant is useful across different
>> > backing swap devices and availability.  e.g. Our OOM decisions are
>> > currently not great in that the kernel can easily thrash for a very
>> > long time without making actual progresses.  If you combine that with
>> > widely varying types and availability of swaps,
>>
>> The kernel never swaps out on hitting memsw limit. So, the varying
>> types and availability of swaps becomes invariant to the memcg OOM
>> behavior of the job.
>
> The kernel doesn't swap because of memsw because that wouldn't change
> the memsw number; however, that has nothing to do with whether the
> underlying swap device affects OOM behavior or not.  That invariant
> can't prevent memcg decisions from being affected by the performance
> of the underlying swap device.  How could it possibly achieve that?
>

I feel like you are confusing between global OOM and memcg OOM. Under
memsw, the memcg OOM behavior will not be affected by the underlying
swap device. See my example below.

> The only reason memsw was designed the way it was designed was to
> avoid lower swap limit meaning more memory consumption.  It is true
> that swap and memory consumptions are interlinked; however, so are
> memory and io, and we can't solve these issues by interlinking
> separate resources in a single resource knob and that's why they're
> separate in cgroup2.
>
>> > Sure, but what does memswap achieve?
>>
>> 1. memswap provides consistent memcg OOM killer and memcg memory
>> reclaim behavior independent to swap.
>> 2. With memswap, the job owners do not have to think or worry about swaps.
>
> To me, you sound massively confused on what memsw can do.  It could be
> that I'm just not understanding what you're saying.  So, let's try
> this one more time.  Can you please give one concrete example of memsw
> achieving critical capabilities that aren't possible without it?
>

Let's say we have a job that allocates 100 MiB memory and suppose 80
MiB is anon and 20 MiB is non-anon (file & kmem).

[With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
memory to max. Now suppose the job tries to allocates memory more than
100 MiB, it will hit the memsw limit and will try to reclaim non-anon
memory. The memcg OOM behavior will only depend on the reclaim of
non-anon memory and will be independent of the underlying swap device.

[Without memsw] Scheduler sets the memory limit to 100 MiB and swap to
50 MiB (based on availability). Now when the job tries to allocate
memory more than 100 MiB, it will hit memory limit and try to reclaim
anon and non-anon memory. The kernel will try to swapout anon memory,
write out dirty file pages, free clean file pages and shrink
reclaimable kernel memory. Here the memcg OOM behavior will depend on
the underlying swap device.

Without memsw, the underlying swap device will always affect the memcg
OOM and memcg reclaim behavior. We need memcg OOM and memcg memory
reclaim behavior independent to the availability and varieties of
swaps. This will allow to decouple the job owners decisions on their
job's memory budget from datacenter owners decisions on swap and
memory overcommit. The job owners should not have to worry or think
about swaps and be forced to have different configurations based on
types and availability of swaps in different datacenters.

Tejun, I think I have very clearly explained that without memsw,
consistent memcg OOM and reclaim behavior is not possible and why
consistent behavior is crucial. If you think otherwise, please
pinpoint where you disagree.

I really appreciate your time and patience.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-21  1:15                         ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-21  1:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Wed, Dec 20, 2017 at 3:36 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Wed, Dec 20, 2017 at 12:15:46PM -0800, Shakeel Butt wrote:
>> > I don't understand how this invariant is useful across different
>> > backing swap devices and availability.  e.g. Our OOM decisions are
>> > currently not great in that the kernel can easily thrash for a very
>> > long time without making actual progresses.  If you combine that with
>> > widely varying types and availability of swaps,
>>
>> The kernel never swaps out on hitting memsw limit. So, the varying
>> types and availability of swaps becomes invariant to the memcg OOM
>> behavior of the job.
>
> The kernel doesn't swap because of memsw because that wouldn't change
> the memsw number; however, that has nothing to do with whether the
> underlying swap device affects OOM behavior or not.  That invariant
> can't prevent memcg decisions from being affected by the performance
> of the underlying swap device.  How could it possibly achieve that?
>

I feel like you are confusing between global OOM and memcg OOM. Under
memsw, the memcg OOM behavior will not be affected by the underlying
swap device. See my example below.

> The only reason memsw was designed the way it was designed was to
> avoid lower swap limit meaning more memory consumption.  It is true
> that swap and memory consumptions are interlinked; however, so are
> memory and io, and we can't solve these issues by interlinking
> separate resources in a single resource knob and that's why they're
> separate in cgroup2.
>
>> > Sure, but what does memswap achieve?
>>
>> 1. memswap provides consistent memcg OOM killer and memcg memory
>> reclaim behavior independent to swap.
>> 2. With memswap, the job owners do not have to think or worry about swaps.
>
> To me, you sound massively confused on what memsw can do.  It could be
> that I'm just not understanding what you're saying.  So, let's try
> this one more time.  Can you please give one concrete example of memsw
> achieving critical capabilities that aren't possible without it?
>

Let's say we have a job that allocates 100 MiB memory and suppose 80
MiB is anon and 20 MiB is non-anon (file & kmem).

[With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
memory to max. Now suppose the job tries to allocates memory more than
100 MiB, it will hit the memsw limit and will try to reclaim non-anon
memory. The memcg OOM behavior will only depend on the reclaim of
non-anon memory and will be independent of the underlying swap device.

[Without memsw] Scheduler sets the memory limit to 100 MiB and swap to
50 MiB (based on availability). Now when the job tries to allocate
memory more than 100 MiB, it will hit memory limit and try to reclaim
anon and non-anon memory. The kernel will try to swapout anon memory,
write out dirty file pages, free clean file pages and shrink
reclaimable kernel memory. Here the memcg OOM behavior will depend on
the underlying swap device.

Without memsw, the underlying swap device will always affect the memcg
OOM and memcg reclaim behavior. We need memcg OOM and memcg memory
reclaim behavior independent to the availability and varieties of
swaps. This will allow to decouple the job owners decisions on their
job's memory budget from datacenter owners decisions on swap and
memory overcommit. The job owners should not have to worry or think
about swaps and be forced to have different configurations based on
types and availability of swaps in different datacenters.

Tejun, I think I have very clearly explained that without memsw,
consistent memcg OOM and reclaim behavior is not possible and why
consistent behavior is crucial. If you think otherwise, please
pinpoint where you disagree.

I really appreciate your time and patience.

thanks,
Shakeel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-21  1:15                         ` Shakeel Butt
@ 2017-12-21 13:37                           ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-21 13:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Wed, Dec 20, 2017 at 05:15:41PM -0800, Shakeel Butt wrote:
> Let's say we have a job that allocates 100 MiB memory and suppose 80
> MiB is anon and 20 MiB is non-anon (file & kmem).
> 
> [With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
> memory to max. Now suppose the job tries to allocates memory more than
> 100 MiB, it will hit the memsw limit and will try to reclaim non-anon
> memory. The memcg OOM behavior will only depend on the reclaim of
> non-anon memory and will be independent of the underlying swap device.

Sure, the direct reclaim on memsw limit won't reclaim anon pages, but
think about how the state at that point would have formed.  You're
claiming that memsw makes memory allocation and balancing behavior an
invariant against the performance of the swap device that the machine
has.  It's simply not possible.

On top of that, what's the point?

1. As I wrote earlier, given the current OOM killer implementation,
   whether OOM kicks in or not is not even that relevant in
   determining the health of the workload.  There are frequent failure
   modes where OOM killer fails to kick in while the workload isn't
   making any meaningful forward progress.

2. On hitting memsw limit, the OOM decision is dependent on the
   performance of the file backing devices.  Why is that necessarily
   better than being dependent on swap or both, which would increase
   the reclaim efficiency anyway?  You can't avoid being affected by
   the underlying hardware one way or the other.

3. The only thing memsw does is that memsw direct reclaim will only
   consider file backed pages, which I think is more of an accident
   (in an attemp to avoid lower swap setting meaning higher actual
   memory usage) than the intended outcome.  This is obviously
   suboptimal and an implementation detail.  I don't think it's
   something we want to expose to userland as a feature.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-21 13:37                           ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-21 13:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Wed, Dec 20, 2017 at 05:15:41PM -0800, Shakeel Butt wrote:
> Let's say we have a job that allocates 100 MiB memory and suppose 80
> MiB is anon and 20 MiB is non-anon (file & kmem).
> 
> [With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
> memory to max. Now suppose the job tries to allocates memory more than
> 100 MiB, it will hit the memsw limit and will try to reclaim non-anon
> memory. The memcg OOM behavior will only depend on the reclaim of
> non-anon memory and will be independent of the underlying swap device.

Sure, the direct reclaim on memsw limit won't reclaim anon pages, but
think about how the state at that point would have formed.  You're
claiming that memsw makes memory allocation and balancing behavior an
invariant against the performance of the swap device that the machine
has.  It's simply not possible.

On top of that, what's the point?

1. As I wrote earlier, given the current OOM killer implementation,
   whether OOM kicks in or not is not even that relevant in
   determining the health of the workload.  There are frequent failure
   modes where OOM killer fails to kick in while the workload isn't
   making any meaningful forward progress.

2. On hitting memsw limit, the OOM decision is dependent on the
   performance of the file backing devices.  Why is that necessarily
   better than being dependent on swap or both, which would increase
   the reclaim efficiency anyway?  You can't avoid being affected by
   the underlying hardware one way or the other.

3. The only thing memsw does is that memsw direct reclaim will only
   consider file backed pages, which I think is more of an accident
   (in an attemp to avoid lower swap setting meaning higher actual
   memory usage) than the intended outcome.  This is obviously
   suboptimal and an implementation detail.  I don't think it's
   something we want to expose to userland as a feature.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-21 13:37                           ` Tejun Heo
@ 2017-12-21 15:22                             ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-21 15:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Thu, Dec 21, 2017 at 5:37 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Wed, Dec 20, 2017 at 05:15:41PM -0800, Shakeel Butt wrote:
>> Let's say we have a job that allocates 100 MiB memory and suppose 80
>> MiB is anon and 20 MiB is non-anon (file & kmem).
>>
>> [With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
>> memory to max. Now suppose the job tries to allocates memory more than
>> 100 MiB, it will hit the memsw limit and will try to reclaim non-anon
>> memory. The memcg OOM behavior will only depend on the reclaim of
>> non-anon memory and will be independent of the underlying swap device.
>
> Sure, the direct reclaim on memsw limit won't reclaim anon pages, but
> think about how the state at that point would have formed.  You're
> claiming that memsw makes memory allocation and balancing behavior an
> invariant against the performance of the swap device that the machine
> has.  It's simply not possible.
>

I am claiming memory allocations under global pressure will be
affected by the performance of the underlying swap device. However
memory allocations under memcg memory pressure, with memsw, will not
be affected by the performance of the underlying swap device. A job
having 100 MiB limit running on a machine without global memory
pressure will never see swap on hitting 100 MiB memsw limit.

> On top of that, what's the point?
>
> 1. As I wrote earlier, given the current OOM killer implementation,
>    whether OOM kicks in or not is not even that relevant in
>    determining the health of the workload.  There are frequent failure
>    modes where OOM killer fails to kick in while the workload isn't
>    making any meaningful forward progress.
>

Deterministic oom-killer is not the point. The point is to
"consistently limit the anon memory" allocated by the job which only
memsw can provide. A job owner who has requested 100 MiB for a job
sees some instances of the job suffer at 100 MiB and other instances
suffer at 150 MiB, is an inconsistent behavior.

> 2. On hitting memsw limit, the OOM decision is dependent on the
>    performance of the file backing devices.  Why is that necessarily
>    better than being dependent on swap or both, which would increase
>    the reclaim efficiency anyway?  You can't avoid being affected by
>    the underlying hardware one way or the other.
>

This is a separate discussion but still the amount of file backed
pages is known and controlled by the job owner and they have the
option to use a storage service, providing a consistent performance
across different data centers, instead of the physical disks of the
system where the job is running and thus isolating the job's
performance from the speed of the local disk. This is not possible
with swap. The swap (and its performance) is and should be transparent
to the job owners.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-21 15:22                             ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-21 15:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

On Thu, Dec 21, 2017 at 5:37 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Wed, Dec 20, 2017 at 05:15:41PM -0800, Shakeel Butt wrote:
>> Let's say we have a job that allocates 100 MiB memory and suppose 80
>> MiB is anon and 20 MiB is non-anon (file & kmem).
>>
>> [With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
>> memory to max. Now suppose the job tries to allocates memory more than
>> 100 MiB, it will hit the memsw limit and will try to reclaim non-anon
>> memory. The memcg OOM behavior will only depend on the reclaim of
>> non-anon memory and will be independent of the underlying swap device.
>
> Sure, the direct reclaim on memsw limit won't reclaim anon pages, but
> think about how the state at that point would have formed.  You're
> claiming that memsw makes memory allocation and balancing behavior an
> invariant against the performance of the swap device that the machine
> has.  It's simply not possible.
>

I am claiming memory allocations under global pressure will be
affected by the performance of the underlying swap device. However
memory allocations under memcg memory pressure, with memsw, will not
be affected by the performance of the underlying swap device. A job
having 100 MiB limit running on a machine without global memory
pressure will never see swap on hitting 100 MiB memsw limit.

> On top of that, what's the point?
>
> 1. As I wrote earlier, given the current OOM killer implementation,
>    whether OOM kicks in or not is not even that relevant in
>    determining the health of the workload.  There are frequent failure
>    modes where OOM killer fails to kick in while the workload isn't
>    making any meaningful forward progress.
>

Deterministic oom-killer is not the point. The point is to
"consistently limit the anon memory" allocated by the job which only
memsw can provide. A job owner who has requested 100 MiB for a job
sees some instances of the job suffer at 100 MiB and other instances
suffer at 150 MiB, is an inconsistent behavior.

> 2. On hitting memsw limit, the OOM decision is dependent on the
>    performance of the file backing devices.  Why is that necessarily
>    better than being dependent on swap or both, which would increase
>    the reclaim efficiency anyway?  You can't avoid being affected by
>    the underlying hardware one way or the other.
>

This is a separate discussion but still the amount of file backed
pages is known and controlled by the job owner and they have the
option to use a storage service, providing a consistent performance
across different data centers, instead of the physical disks of the
system where the job is running and thus isolating the job's
performance from the speed of the local disk. This is not possible
with swap. The swap (and its performance) is and should be transparent
to the job owners.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-21 15:22                             ` Shakeel Butt
@ 2017-12-21 15:33                               ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-21 15:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

> The swap (and its performance) is and should be transparent
> to the job owners.

Please ignore this statement, I didn't mean to claim on the
independence of job performance and underlying swap performance, sorry
about that.

I meant to say that the amount of anon memory a job can allocate
should be independent to the underlying swap.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-21 15:33                               ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-21 15:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

> The swap (and its performance) is and should be transparent
> to the job owners.

Please ignore this statement, I didn't mean to claim on the
independence of job performance and underlying swap performance, sorry
about that.

I meant to say that the amount of anon memory a job can allocate
should be independent to the underlying swap.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-21 15:22                             ` Shakeel Butt
  (?)
@ 2017-12-21 17:29                               ` Tejun Heo
  -1 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-21 17:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
> I am claiming memory allocations under global pressure will be
> affected by the performance of the underlying swap device. However
> memory allocations under memcg memory pressure, with memsw, will not
> be affected by the performance of the underlying swap device. A job
> having 100 MiB limit running on a machine without global memory
> pressure will never see swap on hitting 100 MiB memsw limit.

But, without global memory pressure, the swap wouldn't be making any
difference to begin with.  Also, when multiple cgroups are hitting
memsw limits, they'd behave as if swappiness is zero increasing load
on the filesystems, which then then of course will affect everyone
under memory pressure whether memsw or not.

> > On top of that, what's the point?
> >
> > 1. As I wrote earlier, given the current OOM killer implementation,
> >    whether OOM kicks in or not is not even that relevant in
> >    determining the health of the workload.  There are frequent failure
> >    modes where OOM killer fails to kick in while the workload isn't
> >    making any meaningful forward progress.
> >
> 
> Deterministic oom-killer is not the point. The point is to
> "consistently limit the anon memory" allocated by the job which only
> memsw can provide. A job owner who has requested 100 MiB for a job
> sees some instances of the job suffer at 100 MiB and other instances
> suffer at 150 MiB, is an inconsistent behavior.

So, the first part, I get.  memsw happens to be be able to limit the
amount of anon memory.  I really don't think that was the intention
but more of a byproduct that some people might find useful.

The example you listed tho doesn't make much sense to me.  Given two
systems with differing level of memory pressures, two instances can
see wildly different performance regardless of memsw.

> > 2. On hitting memsw limit, the OOM decision is dependent on the
> >    performance of the file backing devices.  Why is that necessarily
> >    better than being dependent on swap or both, which would increase
> >    the reclaim efficiency anyway?  You can't avoid being affected by
> >    the underlying hardware one way or the other.
> 
> This is a separate discussion but still the amount of file backed
> pages is known and controlled by the job owner and they have the
> option to use a storage service, providing a consistent performance
> across different data centers, instead of the physical disks of the
> system where the job is running and thus isolating the job's
> performance from the speed of the local disk. This is not possible
> with swap. The swap (and its performance) is and should be transparent
> to the job owners.

And, for your use case, there is a noticeable difference between file
backed and anonymous memories and that's why you want to limit
anonymous memory independently from file backed memory.

It looks like what you actually want is limiting the amount of
anonymous memory independently from file-backed consumptions because,
in your setup, while swap is always on local disk the file storages
are over network and more configurable / flexible.

Assuming I'm not misunderstanding you, here are my thoughts.

* I'm not sure that distinguishing anon and file backed memories like
  that is the direction we want to head.  In fact, the more uniform we
  can behave across them, the more efficient we'd be as we wouldn't
  have that artificial barrier.  It is true that we don't have the
  same level of control for swap tho.

* Even if we want an independent anon limit, memsw isn't the solution.
  It's too conflated.  If you want to have anon limit, the right thing
  to do would be pushing for an independent anon limit, not memsw.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-21 17:29                               ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-21 17:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hello, Shakeel.

On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
> I am claiming memory allocations under global pressure will be
> affected by the performance of the underlying swap device. However
> memory allocations under memcg memory pressure, with memsw, will not
> be affected by the performance of the underlying swap device. A job
> having 100 MiB limit running on a machine without global memory
> pressure will never see swap on hitting 100 MiB memsw limit.

But, without global memory pressure, the swap wouldn't be making any
difference to begin with.  Also, when multiple cgroups are hitting
memsw limits, they'd behave as if swappiness is zero increasing load
on the filesystems, which then then of course will affect everyone
under memory pressure whether memsw or not.

> > On top of that, what's the point?
> >
> > 1. As I wrote earlier, given the current OOM killer implementation,
> >    whether OOM kicks in or not is not even that relevant in
> >    determining the health of the workload.  There are frequent failure
> >    modes where OOM killer fails to kick in while the workload isn't
> >    making any meaningful forward progress.
> >
> 
> Deterministic oom-killer is not the point. The point is to
> "consistently limit the anon memory" allocated by the job which only
> memsw can provide. A job owner who has requested 100 MiB for a job
> sees some instances of the job suffer at 100 MiB and other instances
> suffer at 150 MiB, is an inconsistent behavior.

So, the first part, I get.  memsw happens to be be able to limit the
amount of anon memory.  I really don't think that was the intention
but more of a byproduct that some people might find useful.

The example you listed tho doesn't make much sense to me.  Given two
systems with differing level of memory pressures, two instances can
see wildly different performance regardless of memsw.

> > 2. On hitting memsw limit, the OOM decision is dependent on the
> >    performance of the file backing devices.  Why is that necessarily
> >    better than being dependent on swap or both, which would increase
> >    the reclaim efficiency anyway?  You can't avoid being affected by
> >    the underlying hardware one way or the other.
> 
> This is a separate discussion but still the amount of file backed
> pages is known and controlled by the job owner and they have the
> option to use a storage service, providing a consistent performance
> across different data centers, instead of the physical disks of the
> system where the job is running and thus isolating the job's
> performance from the speed of the local disk. This is not possible
> with swap. The swap (and its performance) is and should be transparent
> to the job owners.

And, for your use case, there is a noticeable difference between file
backed and anonymous memories and that's why you want to limit
anonymous memory independently from file backed memory.

It looks like what you actually want is limiting the amount of
anonymous memory independently from file-backed consumptions because,
in your setup, while swap is always on local disk the file storages
are over network and more configurable / flexible.

Assuming I'm not misunderstanding you, here are my thoughts.

* I'm not sure that distinguishing anon and file backed memories like
  that is the direction we want to head.  In fact, the more uniform we
  can behave across them, the more efficient we'd be as we wouldn't
  have that artificial barrier.  It is true that we don't have the
  same level of control for swap tho.

* Even if we want an independent anon limit, memsw isn't the solution.
  It's too conflated.  If you want to have anon limit, the right thing
  to do would be pushing for an independent anon limit, not memsw.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-21 17:29                               ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2017-12-21 17:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc-u79uwXL29TY76Z2rM5mHXA

Hello, Shakeel.

On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
> I am claiming memory allocations under global pressure will be
> affected by the performance of the underlying swap device. However
> memory allocations under memcg memory pressure, with memsw, will not
> be affected by the performance of the underlying swap device. A job
> having 100 MiB limit running on a machine without global memory
> pressure will never see swap on hitting 100 MiB memsw limit.

But, without global memory pressure, the swap wouldn't be making any
difference to begin with.  Also, when multiple cgroups are hitting
memsw limits, they'd behave as if swappiness is zero increasing load
on the filesystems, which then then of course will affect everyone
under memory pressure whether memsw or not.

> > On top of that, what's the point?
> >
> > 1. As I wrote earlier, given the current OOM killer implementation,
> >    whether OOM kicks in or not is not even that relevant in
> >    determining the health of the workload.  There are frequent failure
> >    modes where OOM killer fails to kick in while the workload isn't
> >    making any meaningful forward progress.
> >
> 
> Deterministic oom-killer is not the point. The point is to
> "consistently limit the anon memory" allocated by the job which only
> memsw can provide. A job owner who has requested 100 MiB for a job
> sees some instances of the job suffer at 100 MiB and other instances
> suffer at 150 MiB, is an inconsistent behavior.

So, the first part, I get.  memsw happens to be be able to limit the
amount of anon memory.  I really don't think that was the intention
but more of a byproduct that some people might find useful.

The example you listed tho doesn't make much sense to me.  Given two
systems with differing level of memory pressures, two instances can
see wildly different performance regardless of memsw.

> > 2. On hitting memsw limit, the OOM decision is dependent on the
> >    performance of the file backing devices.  Why is that necessarily
> >    better than being dependent on swap or both, which would increase
> >    the reclaim efficiency anyway?  You can't avoid being affected by
> >    the underlying hardware one way or the other.
> 
> This is a separate discussion but still the amount of file backed
> pages is known and controlled by the job owner and they have the
> option to use a storage service, providing a consistent performance
> across different data centers, instead of the physical disks of the
> system where the job is running and thus isolating the job's
> performance from the speed of the local disk. This is not possible
> with swap. The swap (and its performance) is and should be transparent
> to the job owners.

And, for your use case, there is a noticeable difference between file
backed and anonymous memories and that's why you want to limit
anonymous memory independently from file backed memory.

It looks like what you actually want is limiting the amount of
anonymous memory independently from file-backed consumptions because,
in your setup, while swap is always on local disk the file storages
are over network and more configurable / flexible.

Assuming I'm not misunderstanding you, here are my thoughts.

* I'm not sure that distinguishing anon and file backed memories like
  that is the direction we want to head.  In fact, the more uniform we
  can behave across them, the more efficient we'd be as we wouldn't
  have that artificial barrier.  It is true that we don't have the
  same level of control for swap tho.

* Even if we want an independent anon limit, memsw isn't the solution.
  It's too conflated.  If you want to have anon limit, the right thing
  to do would be pushing for an independent anon limit, not memsw.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
  2017-12-21 17:29                               ` Tejun Heo
@ 2017-12-27 19:49                                 ` Shakeel Butt
  -1 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-27 19:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hi Tejun,

In my previous messages, I think the message "memsw improves the
performance of the job" might have been conveyed. Please ignore that.
The message I want to express is the "memsw provides users the ability
to consistently limit their job's memory (specifically anon memory)
irrespective of the presence of swap".

On Thu, Dec 21, 2017 at 9:29 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
>> I am claiming memory allocations under global pressure will be
>> affected by the performance of the underlying swap device. However
>> memory allocations under memcg memory pressure, with memsw, will not
>> be affected by the performance of the underlying swap device. A job
>> having 100 MiB limit running on a machine without global memory
>> pressure will never see swap on hitting 100 MiB memsw limit.
>
> But, without global memory pressure, the swap wouldn't be making any
> difference to begin with.

It would in the current cgroup-v2's swap interface. When a job hits
the memory.max (or memory.high), assuming the availability of swap and
job's swap usage, the kernel will swapout the anon pages of the job.

> Also, when multiple cgroups are hitting
> memsw limits, they'd behave as if swappiness is zero increasing load
> on the filesystems, which then then of course will affect everyone
> under memory pressure whether memsw or not.
>
>> > On top of that, what's the point?
>> >
>> > 1. As I wrote earlier, given the current OOM killer implementation,
>> >    whether OOM kicks in or not is not even that relevant in
>> >    determining the health of the workload.  There are frequent failure
>> >    modes where OOM killer fails to kick in while the workload isn't
>> >    making any meaningful forward progress.
>> >
>>
>> Deterministic oom-killer is not the point. The point is to
>> "consistently limit the anon memory" allocated by the job which only
>> memsw can provide. A job owner who has requested 100 MiB for a job
>> sees some instances of the job suffer at 100 MiB and other instances
>> suffer at 150 MiB, is an inconsistent behavior.
>
> So, the first part, I get.  memsw happens to be be able to limit the
> amount of anon memory.  I really don't think that was the intention
> but more of a byproduct that some people might find useful.
>
> The example you listed tho doesn't make much sense to me.  Given two
> systems with differing level of memory pressures, two instances can
> see wildly different performance regardless of memsw.
>

The word 'suffer' might have given the impression that I am concerned
about performance. Let me clarify, if the amount of memory a job can
allocate differs based on the swap availability of the system where it
ran, is an inconsistent behavior. The 'memsw' interface allows to
overcome that inconsistency.

>> > 2. On hitting memsw limit, the OOM decision is dependent on the
>> >    performance of the file backing devices.  Why is that necessarily
>> >    better than being dependent on swap or both, which would increase
>> >    the reclaim efficiency anyway?  You can't avoid being affected by
>> >    the underlying hardware one way or the other.
>>
>> This is a separate discussion but still the amount of file backed
>> pages is known and controlled by the job owner and they have the
>> option to use a storage service, providing a consistent performance
>> across different data centers, instead of the physical disks of the
>> system where the job is running and thus isolating the job's
>> performance from the speed of the local disk. This is not possible
>> with swap. The swap (and its performance) is and should be transparent
>> to the job owners.
>

Please ignore this "separate discussion" as I do not want to lead the
discussion towards the "performance" of file storages or swap mediums.

> And, for your use case, there is a noticeable difference between file
> backed and anonymous memories and that's why you want to limit
> anonymous memory independently from file backed memory.
>
> It looks like what you actually want is limiting the amount of
> anonymous memory independently from file-backed consumptions because,
> in your setup, while swap is always on local disk the file storages
> are over network and more configurable / flexible.
>

What I want is a job having 100 MiB memory limit should not be able to
allocate anon memory more than 100 MiB irrespective of the
availability of the swap. I can see that a separate anon limit can be
used to achieve my goal but I am failing to see why memsw, which can
also be used to achieve my goal and is already implemented, is not
right direction or solution.

> Assuming I'm not misunderstanding you, here are my thoughts.
>
> * I'm not sure that distinguishing anon and file backed memories like
>   that is the direction we want to head.  In fact, the more uniform we
>   can behave across them, the more efficient we'd be as we wouldn't
>   have that artificial barrier.  It is true that we don't have the
>   same level of control for swap tho.
>

I totally agree about the uniform behavior (& no artificial barriers)
but I don't understand how 'memsw' will lead to opposite direction.
'memsw' is an interface that can simultaneously limit anon, file and
kmem of the job.

> * Even if we want an independent anon limit, memsw isn't the solution.
>   It's too conflated.

I am confused. If we want a solution which has uniform behavior across
file, anon & kmem and without any artificial barrier, how would that
not be conflated?

>   If you want to have anon limit, the right thing
>   to do would be pushing for an independent anon limit, not memsw.
>

Though I agree that a separate anon limit will work for me. I am
hesitant to push for that direction due to:

1. The alternative solution, i.e. 'memsw', already exist.
2. What will be the semantics of memory.high under new anon limit?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2
@ 2017-12-27 19:49                                 ` Shakeel Butt
  0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2017-12-27 19:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Roman Gushchin, Vladimir Davydov,
	Greg Thelen, Johannes Weiner, Hugh Dickins, Andrew Morton,
	Linux MM, LKML, Cgroups, linux-doc

Hi Tejun,

In my previous messages, I think the message "memsw improves the
performance of the job" might have been conveyed. Please ignore that.
The message I want to express is the "memsw provides users the ability
to consistently limit their job's memory (specifically anon memory)
irrespective of the presence of swap".

On Thu, Dec 21, 2017 at 9:29 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Shakeel.
>
> On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
>> I am claiming memory allocations under global pressure will be
>> affected by the performance of the underlying swap device. However
>> memory allocations under memcg memory pressure, with memsw, will not
>> be affected by the performance of the underlying swap device. A job
>> having 100 MiB limit running on a machine without global memory
>> pressure will never see swap on hitting 100 MiB memsw limit.
>
> But, without global memory pressure, the swap wouldn't be making any
> difference to begin with.

It would in the current cgroup-v2's swap interface. When a job hits
the memory.max (or memory.high), assuming the availability of swap and
job's swap usage, the kernel will swapout the anon pages of the job.

> Also, when multiple cgroups are hitting
> memsw limits, they'd behave as if swappiness is zero increasing load
> on the filesystems, which then then of course will affect everyone
> under memory pressure whether memsw or not.
>
>> > On top of that, what's the point?
>> >
>> > 1. As I wrote earlier, given the current OOM killer implementation,
>> >    whether OOM kicks in or not is not even that relevant in
>> >    determining the health of the workload.  There are frequent failure
>> >    modes where OOM killer fails to kick in while the workload isn't
>> >    making any meaningful forward progress.
>> >
>>
>> Deterministic oom-killer is not the point. The point is to
>> "consistently limit the anon memory" allocated by the job which only
>> memsw can provide. A job owner who has requested 100 MiB for a job
>> sees some instances of the job suffer at 100 MiB and other instances
>> suffer at 150 MiB, is an inconsistent behavior.
>
> So, the first part, I get.  memsw happens to be be able to limit the
> amount of anon memory.  I really don't think that was the intention
> but more of a byproduct that some people might find useful.
>
> The example you listed tho doesn't make much sense to me.  Given two
> systems with differing level of memory pressures, two instances can
> see wildly different performance regardless of memsw.
>

The word 'suffer' might have given the impression that I am concerned
about performance. Let me clarify, if the amount of memory a job can
allocate differs based on the swap availability of the system where it
ran, is an inconsistent behavior. The 'memsw' interface allows to
overcome that inconsistency.

>> > 2. On hitting memsw limit, the OOM decision is dependent on the
>> >    performance of the file backing devices.  Why is that necessarily
>> >    better than being dependent on swap or both, which would increase
>> >    the reclaim efficiency anyway?  You can't avoid being affected by
>> >    the underlying hardware one way or the other.
>>
>> This is a separate discussion but still the amount of file backed
>> pages is known and controlled by the job owner and they have the
>> option to use a storage service, providing a consistent performance
>> across different data centers, instead of the physical disks of the
>> system where the job is running and thus isolating the job's
>> performance from the speed of the local disk. This is not possible
>> with swap. The swap (and its performance) is and should be transparent
>> to the job owners.
>

Please ignore this "separate discussion" as I do not want to lead the
discussion towards the "performance" of file storages or swap mediums.

> And, for your use case, there is a noticeable difference between file
> backed and anonymous memories and that's why you want to limit
> anonymous memory independently from file backed memory.
>
> It looks like what you actually want is limiting the amount of
> anonymous memory independently from file-backed consumptions because,
> in your setup, while swap is always on local disk the file storages
> are over network and more configurable / flexible.
>

What I want is a job having 100 MiB memory limit should not be able to
allocate anon memory more than 100 MiB irrespective of the
availability of the swap. I can see that a separate anon limit can be
used to achieve my goal but I am failing to see why memsw, which can
also be used to achieve my goal and is already implemented, is not
right direction or solution.

> Assuming I'm not misunderstanding you, here are my thoughts.
>
> * I'm not sure that distinguishing anon and file backed memories like
>   that is the direction we want to head.  In fact, the more uniform we
>   can behave across them, the more efficient we'd be as we wouldn't
>   have that artificial barrier.  It is true that we don't have the
>   same level of control for swap tho.
>

I totally agree about the uniform behavior (& no artificial barriers)
but I don't understand how 'memsw' will lead to opposite direction.
'memsw' is an interface that can simultaneously limit anon, file and
kmem of the job.

> * Even if we want an independent anon limit, memsw isn't the solution.
>   It's too conflated.

I am confused. If we want a solution which has uniform behavior across
file, anon & kmem and without any artificial barrier, how would that
not be conflated?

>   If you want to have anon limit, the right thing
>   to do would be pushing for an independent anon limit, not memsw.
>

Though I agree that a separate anon limit will work for me. I am
hesitant to push for that direction due to:

1. The alternative solution, i.e. 'memsw', already exist.
2. What will be the semantics of memory.high under new anon limit?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2017-12-27 19:49 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-19  0:01 [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2 Shakeel Butt
2017-12-19  0:01 ` Shakeel Butt
2017-12-19 12:49 ` Michal Hocko
2017-12-19 12:49   ` Michal Hocko
2017-12-19 12:49   ` Michal Hocko
2017-12-19 15:12   ` Shakeel Butt
2017-12-19 15:12     ` Shakeel Butt
2017-12-19 15:24     ` Tejun Heo
2017-12-19 15:24       ` Tejun Heo
2017-12-19 17:23       ` Shakeel Butt
2017-12-19 17:23         ` Shakeel Butt
2017-12-19 17:33         ` Tejun Heo
2017-12-19 17:33           ` Tejun Heo
2017-12-19 18:25           ` Shakeel Butt
2017-12-19 18:25             ` Shakeel Butt
2017-12-19 18:25             ` Shakeel Butt
2017-12-19 21:41             ` Tejun Heo
2017-12-19 21:41               ` Tejun Heo
2017-12-19 22:39               ` Shakeel Butt
2017-12-19 22:39                 ` Shakeel Butt
2017-12-20 19:37                 ` Tejun Heo
2017-12-20 19:37                   ` Tejun Heo
2017-12-20 19:37                   ` Tejun Heo
2017-12-20 20:15                   ` Shakeel Butt
2017-12-20 20:15                     ` Shakeel Butt
2017-12-20 20:27                     ` Shakeel Butt
2017-12-20 20:27                       ` Shakeel Butt
2017-12-20 23:36                     ` Tejun Heo
2017-12-20 23:36                       ` Tejun Heo
2017-12-21  1:15                       ` Shakeel Butt
2017-12-21  1:15                         ` Shakeel Butt
2017-12-21 13:37                         ` Tejun Heo
2017-12-21 13:37                           ` Tejun Heo
2017-12-21 15:22                           ` Shakeel Butt
2017-12-21 15:22                             ` Shakeel Butt
2017-12-21 15:33                             ` Shakeel Butt
2017-12-21 15:33                               ` Shakeel Butt
2017-12-21 17:29                             ` Tejun Heo
2017-12-21 17:29                               ` Tejun Heo
2017-12-21 17:29                               ` Tejun Heo
2017-12-27 19:49                               ` Shakeel Butt
2017-12-27 19:49                                 ` Shakeel Butt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.