linux-kernel.vger.kernel.org archive mirror
* Android low memory killer vs. memory pressure notifications
@ 2011-12-19  2:53 Anton Vorontsov
  2011-12-19  7:48 ` Minchan Kim
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Anton Vorontsov @ 2011-12-19  2:53 UTC (permalink / raw)
  To: KOSAKI Motohiro, Arve Hjønnevåg
  Cc: Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	David Rientjes, Michal Hocko, John Stultz, linux-mm,
	linux-kernel

Hello everyone,

Some background: Android apps never exit; instead they just save state
and become inactive, and only get killed when memory usage hits a
specific threshold. This strategy greatly improves user experience,
as "start-up" time becomes a non-issue. There are several application
categories, and each category has its own limit (e.g. background
vs. foreground app -- we never want to kill foreground tasks, but those
are details).

So, Android developers came up with a lowmemory killer driver: it receives
memory pressure notifications and then kills appropriate tasks when
memory resources become low.

Some time ago there were a lot of discussions regarding this driver,
and it seems that people see different ways of how this should be
implemented.

Today I'd like to resurrect the discussion, and eventually come to a
solution (or, if there is a group of people already working on this,
please let me know -- I'd readily help with anything I could).

Last time, two main approaches were proposed, both of which assume
that the kernel should not be responsible for killing tasks:

- Use memory controller cgroup (CGROUP_MEM_RES_CTLR) notifications from
  the kernel side, plus userland "manager" that would kill applications.

  The main downside of this approach is that mem_cg needs 20 bytes per
  page (on a 32 bit machine). So on a 32 bit machine with 4K pages
  that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.

  0.5% doesn't sound too bad, but 5MB does, quite a little bit. So,
  mem_cg feels like an overkill for this simple task (see the driver at
  the very bottom).

- Use some new low memory notifications mechanism from the kernel side +
  userland manager that would react to the notifications and would kill
  the tasks.

  The main downside of this approach is that the new mechanism does
  not exist. :-) "Big iron" people happily use mem_cg notifications,
  and things like /dev/mem_notify died circa 2008 as there was too
  little interest in it. See http://lkml.org/lkml/2009/1/20/404


(There were also suggestions to integrate lowmemory killer functionality
into OOM killer, but I see little point in doing this as the OOM
killer and lowmemory killer have different "triggers": OOM killer is
a quite simple last-resort thing for the kernel, it is called from
the kernel allocators' fail paths, and, IIRC, it is even synchronous w/
GFP_NOFAIL. I don't think that there could be any code or ABI reuse.)

So, the main difference between the current Android lowmemory killer and
the approaches above is that the "killer" function is suggested to be
factored out into userland code. This makes sense, as it is userland
that categorizes tasks-to-kill (in the current lowmemory killer
driver, by controlling the OOM adj value).

Personally I'd start thinking about the new [lightweight] notification
stuff, i.e. something without mem_cg's downsides. Though, I'm Cc'ing
Android folks so maybe they could enlighten us why in-kernel "lowmemory
manager" might be a better idea. Plus Cc'ing folks that I think might
be interested in this discussion.

Thanks!

p.s.

I'm inlining the android memory killer code down below, just for the
reference. It is quite small (and useful... though, currently only for
Android case).

- - - -
From: Arve Hjønnevåg <arve@android.com>
Subject: Android low memory killer driver

The lowmemorykiller driver lets user-space specify a set of memory thresholds
where processes with a range of oom_adj values will get killed. Specify the
minimum oom_adj values in /sys/module/lowmemorykiller/parameters/adj and the
number of free pages in /sys/module/lowmemorykiller/parameters/minfree. Both
files take a comma separated list of numbers in ascending order.

For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
"1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill processes
with a oom_adj value of 8 or higher when the free memory drops below 4096 pages
and kill processes with a oom_adj value of 0 or higher when the free memory
drops below 1024 pages.

The driver considers memory used for caches to be free, but if a large
percentage of the cached memory is locked this can be very inaccurate
and processes may not get killed until the normal oom killer is triggered.

---
 mm/Kconfig           |    7 ++
 mm/Makefile          |    1 +
 mm/lowmemorykiller.c |  175 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 183 insertions(+), 0 deletions(-)
 create mode 100644 mm/lowmemorykiller.c

diff --git a/mm/Kconfig b/mm/Kconfig
index 011b110..a2e7959 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -259,6 +259,12 @@ config DEFAULT_MMAP_MIN_ADDR
 	  This value can be changed after boot using the
 	  /proc/sys/vm/mmap_min_addr tunable.
 
+config LOW_MEMORY_KILLER
+	bool "Low Memory Killer"
+	help
+	  The lowmemorykiller driver lets user-space specify a set of memory
+	  thresholds where processes will get killed.
+
 config ARCH_SUPPORTS_MEMORY_FAILURE
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..10fb4ff 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_LOW_MEMORY_KILLER)	+= lowmemorykiller.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/lowmemorykiller.c b/mm/lowmemorykiller.c
new file mode 100644
index 0000000..4e51936
--- /dev/null
+++ b/mm/lowmemorykiller.c
@@ -0,0 +1,175 @@
+/*
+ * The lowmemorykiller driver lets user-space specify a set of memory thresholds
+ * where processes with a range of oom_adj values will get killed. Specify the
+ * minimum oom_adj values in /sys/module/lowmemorykiller/parameters/adj and the
+ * number of free pages in /sys/module/lowmemorykiller/parameters/minfree. Both
+ * files take a comma separated list of numbers in ascending order.
+ *
+ * For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
+ * "1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill processes
+ * with a oom_adj value of 8 or higher when the free memory drops below 4096 pages
+ * and kill processes with a oom_adj value of 0 or higher when the free memory
+ * drops below 1024 pages.
+ *
+ * The driver considers memory used for caches to be free, but if a large
+ * percentage of the cached memory is locked this can be very inaccurate
+ * and processes may not get killed until the normal oom killer is triggered.
+ *
+ * Copyright (C) 2007-2008 Google, Inc.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/oom.h>
+#include <linux/sched.h>
+#include <linux/notifier.h>
+
+static uint32_t lowmem_debug_level = 2;
+static int lowmem_adj[6] = {
+	0,
+	1,
+	6,
+	12,
+};
+static int lowmem_adj_size = 4;
+static size_t lowmem_minfree[6] = {
+	3 * 512,	/* 6MB */
+	2 * 1024,	/* 8MB */
+	4 * 1024,	/* 16MB */
+	16 * 1024,	/* 64MB */
+};
+static int lowmem_minfree_size = 4;
+
+#define lowmem_print(level, x...)			\
+	do {						\
+		if (lowmem_debug_level >= (level))	\
+			printk(x);			\
+	} while (0)
+
+static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
+{
+	struct task_struct *p;
+	struct task_struct *selected = NULL;
+	int rem = 0;
+	int tasksize;
+	int i;
+	int min_adj = OOM_ADJUST_MAX + 1;
+	int selected_tasksize = 0;
+	int selected_oom_adj;
+	int array_size = ARRAY_SIZE(lowmem_adj);
+	int other_free = global_page_state(NR_FREE_PAGES);
+	int other_file = global_page_state(NR_FILE_PAGES) -
+						global_page_state(NR_SHMEM);
+
+	if (lowmem_adj_size < array_size)
+		array_size = lowmem_adj_size;
+	if (lowmem_minfree_size < array_size)
+		array_size = lowmem_minfree_size;
+	for (i = 0; i < array_size; i++) {
+		if (other_free < lowmem_minfree[i] &&
+		    other_file < lowmem_minfree[i]) {
+			min_adj = lowmem_adj[i];
+			break;
+		}
+	}
+	if (sc->nr_to_scan > 0)
+		lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
+			     sc->nr_to_scan, sc->gfp_mask, other_free, other_file,
+			     min_adj);
+	rem = global_page_state(NR_ACTIVE_ANON) +
+		global_page_state(NR_ACTIVE_FILE) +
+		global_page_state(NR_INACTIVE_ANON) +
+		global_page_state(NR_INACTIVE_FILE);
+	if (sc->nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
+		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
+			     sc->nr_to_scan, sc->gfp_mask, rem);
+		return rem;
+	}
+	selected_oom_adj = min_adj;
+
+	read_lock(&tasklist_lock);
+	for_each_process(p) {
+		struct mm_struct *mm;
+		struct signal_struct *sig;
+		int oom_adj;
+
+		task_lock(p);
+		mm = p->mm;
+		sig = p->signal;
+		if (!mm || !sig) {
+			task_unlock(p);
+			continue;
+		}
+		oom_adj = sig->oom_adj;
+		if (oom_adj < min_adj) {
+			task_unlock(p);
+			continue;
+		}
+		tasksize = get_mm_rss(mm);
+		task_unlock(p);
+		if (tasksize <= 0)
+			continue;
+		if (selected) {
+			if (oom_adj < selected_oom_adj)
+				continue;
+			if (oom_adj == selected_oom_adj &&
+			    tasksize <= selected_tasksize)
+				continue;
+		}
+		selected = p;
+		selected_tasksize = tasksize;
+		selected_oom_adj = oom_adj;
+		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
+			     p->pid, p->comm, oom_adj, tasksize);
+	}
+	if (selected) {
+		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",
+			     selected->pid, selected->comm,
+			     selected_oom_adj, selected_tasksize);
+		force_sig(SIGKILL, selected);
+		rem -= selected_tasksize;
+	}
+	lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
+		     sc->nr_to_scan, sc->gfp_mask, rem);
+	read_unlock(&tasklist_lock);
+	return rem;
+}
+
+static struct shrinker lowmem_shrinker = {
+	.shrink = lowmem_shrink,
+	.seeks = DEFAULT_SEEKS * 16
+};
+
+static int __init lowmem_init(void)
+{
+	register_shrinker(&lowmem_shrinker);
+	return 0;
+}
+
+static void __exit lowmem_exit(void)
+{
+	unregister_shrinker(&lowmem_shrinker);
+}
+
+module_param_named(cost, lowmem_shrinker.seeks, int, S_IRUGO | S_IWUSR);
+module_param_array_named(adj, lowmem_adj, int, &lowmem_adj_size,
+			 S_IRUGO | S_IWUSR);
+module_param_array_named(minfree, lowmem_minfree, uint, &lowmem_minfree_size,
+			 S_IRUGO | S_IWUSR);
+module_param_named(debug_level, lowmem_debug_level, uint, S_IRUGO | S_IWUSR);
+
+module_init(lowmem_init);
+module_exit(lowmem_exit);
+
+MODULE_LICENSE("GPL");
-- 
1.7.7.3
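
As a rough illustration of the threshold rule described in the patch
description above, here is a user-space sketch of the lookup that
lowmem_shrink() does before it walks the task list. It is not part of the
patch. The names pick_min_adj and NO_KILL_ADJ are made up for this example;
the driver itself uses OOM_ADJUST_MAX + 1 as its "nothing to kill" marker,
and the value 16 below assumes OOM_ADJUST_MAX is 15.

```c
#include <stddef.h>

/* Hypothetical sentinel: "no threshold crossed, kill nothing".
 * Assumes OOM_ADJUST_MAX == 15, so this mirrors OOM_ADJUST_MAX + 1. */
#define NO_KILL_ADJ 16

/*
 * Return the minimum oom_adj that is eligible for killing, given the
 * current number of free pages and file-cache pages.
 *
 * minfree[] and adj[] are parallel arrays in ascending order, in the
 * same form as the "minfree" and "adj" module parameters. The first
 * (lowest) threshold that BOTH counters fall below wins.
 */
static int pick_min_adj(const long *minfree, const int *adj, size_t n,
			long free_pages, long file_pages)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (free_pages < minfree[i] && file_pages < minfree[i])
			return adj[i];
	}
	return NO_KILL_ADJ;
}
```

With minfree "1024,4096" and adj "0,8", 3000 free pages and 2000 cache
pages select adj 8, while 500 free pages and 800 cache pages select adj 0.
With 500 free pages but 5000 cache pages nothing is selected, because the
cache is counted as free.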



* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19  2:53 Android low memory killer vs. memory pressure notifications Anton Vorontsov
@ 2011-12-19  7:48 ` Minchan Kim
  2011-12-19 19:05   ` David Rientjes
  2011-12-19 10:39 ` Alan Cox
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2011-12-19  7:48 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	Michal Hocko, John Stultz, linux-mm, linux-kernel

On Mon, Dec 19, 2011 at 06:53:28AM +0400, Anton Vorontsov wrote:
> Hello everyone,
> 
> Some background: Android apps never exit, instead they just save state
> and become inactive, and only get killed when memory usage hits a
> specific threshold. This strategy greatly improves user experience,
> as "start-up" time becomes non-issue. There are several application
> categories and for each category there is its own limit (e.g. background
> vs. foreground app -- we never want to kill foreground tasks, but that's
> details).
> 
> So, Android developers came with a Lowmemory killer driver, it receives
> memory pressure notifications, and then kills appropriate tasks when
> memory resources become low.
> 
> Some time ago there were a lot of discussions regarding this driver,
> and it seems that people see different ways of how this should be
> implemented.
> 
> Today I'd like to resurrect the discussion, and eventually come to a
> solution (or, if there is a group of people already working on this,
> please let me know -- I'd readily help with anything I could).
> 
> The last time the two main approaches were spoken out, which both assume
> that kernel should not be responsible for killing tasks:

Right.
The kernel should only signal when resources run low.
Killing should be the role of user space.
The problem is receiving the signal in time.
For example, let's assume applications A, B, and C.

Application A wants to receive a signal when system memory drops below 4M.
When A receives the signal, it is supposed to kill B.

1. memory pressure builds
2. kernel detects that memory is under 4M
3. kernel signals A
4. B is scheduled in
5. B consumes lots of memory
6. OOM happens
7. OOM killer kills C, then A is scheduled
8. A kills B

B and C are both dead :(

That's not what we want.
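
To make the ordering problem concrete, here is a toy replay of the eight
steps in plain C. It only models the sequence, not real kernel behaviour:
the function name replay, the struct, and all the numbers are invented for
this sketch, and the "OOM killer picks C" outcome is simply assumed, as in
the scenario above.

```c
#include <stdbool.h>

struct race_result {
	bool b_alive;
	bool c_alive;
};

/*
 * free_mb:      free memory when the scenario starts
 * threshold_mb: level below which the kernel notifies manager A
 * b_alloc_mb:   how much B allocates once it gets the CPU
 * a_runs_first: whether A is scheduled before B after the notification
 */
static struct race_result replay(int free_mb, int threshold_mb,
				 int b_alloc_mb, bool a_runs_first)
{
	struct race_result r = { true, true };
	bool notified = free_mb < threshold_mb;	/* steps 1-3 */

	if (notified && a_runs_first) {
		r.b_alive = false;		/* A kills B right away */
		return r;
	}

	free_mb -= b_alloc_mb;			/* steps 4-5: B runs first */
	if (free_mb <= 0)
		r.c_alive = false;		/* steps 6-7: OOM kills C */
	if (notified)
		r.b_alive = false;		/* step 8: A finally kills B */
	return r;
}
```

If A happens to run first, only B dies. If B runs first and eats the
remaining memory, both B and C are gone, which is the case described above.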
 
> 
> - Use memory controller cgroup (CGROUP_MEM_RES_CTLR) notifications from
>   the kernel side, plus userland "manager" that would kill applications.
> 
>   The main downside of this approach is that mem_cg needs 20 bytes per
>   page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>   that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.
> 
>   0.5% doesn't sound too bad, but 5MB does, quite a little bit. So,
>   mem_cg feels like an overkill for this simple task (see the driver at
>   the very bottom).

Agreed.
Although current embedded systems have enough memory, enabling memcg
just for its notifications is overkill.

> 
> - Use some new low memory notifications mechanism from the kernel side +
>   userland manager that would react to the notifications and would kill
>   the tasks.
> 
>   The main downside of this approach is that the new mechanism does
>   not exist. :-) "Big iron" people happily use mem_cg notifications,
>   and things like /dev/mem_notify died circa 2008 as there was too
>   little interest in it. See http://lkml.org/lkml/2009/1/20/404

I like mem_notify if we can solve the problem I mentioned.

> 
> 
> (There were also suggestions to integrate lowmemory killer functionality
> into OOM killer, but I see little point in doing this as the OOM
> killer and lowmemory killer have different "triggers": OOM killer is
> a quite simple last-resort thing for the kernel, it is called from
> the kernel allocators' fail paths, and, IIRC, it is even synchronous w/
> GFP_NOFAIL. I don't think that there could be any code or ABI reuse.)
> 
> So, the main difference between current Android lowmemory killer and
> the approaches above is that the "killer" function suggested to be
> factored out to the userland code. This makes sense as it is userland
> that is categorizing tasks-to-kill (in the current lowmemory killer
> driver via controlling OOM adj value).
> 
> Personally I'd start thinking about the new [lightweight] notification
> stuff, i.e. something without mem_cg's downsides. Though, I'm Cc'ing
> Android folks so maybe they could enlighten us why in-kernel "lowmemory
> manager" might be a better idea. Plus Cc'ing folks that I think might
> be interested in this discussion.
> 
> Thanks!
> 

-- 
Kind regards,
Minchan Kim


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19  2:53 Android low memory killer vs. memory pressure notifications Anton Vorontsov
  2011-12-19  7:48 ` Minchan Kim
@ 2011-12-19 10:39 ` Alan Cox
  2011-12-19 16:16   ` KOSAKI Motohiro
  2011-12-19 12:12 ` Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Alan Cox @ 2011-12-19 10:39 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	Michal Hocko, John Stultz, linux-mm, linux-kernel

>   The main downside of this approach is that mem_cg needs 20 bytes per
>   page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>   that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.

The obvious question would be why? Would fixing memcg make more sense?

The only problem I see with having a user space manager is that the manager
probably has to be mlocked to avoid awkward failure cases, and that may in
fact make doing it kernel side the smaller option.
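
For reference, pinning such a manager in memory could look roughly like
the sketch below. mlockall() with MCL_CURRENT | MCL_FUTURE is the standard
call; the wrapper name pin_manager_memory is made up for this example, and
whether the call succeeds depends on RLIMIT_MEMLOCK and privileges.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Lock all current and future pages of the calling process into RAM,
 * so the manager cannot be paged out while it handles a low-memory
 * event.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
static int pin_manager_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -errno;
	return 0;
}
```

Every page the manager touches then stays resident, so its whole footprint
counts against the memory it is trying to protect.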


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19  2:53 Android low memory killer vs. memory pressure notifications Anton Vorontsov
  2011-12-19  7:48 ` Minchan Kim
  2011-12-19 10:39 ` Alan Cox
@ 2011-12-19 12:12 ` Michal Hocko
  2011-12-19 19:12   ` David Rientjes
  2011-12-20  2:16   ` Anton Vorontsov
  2011-12-19 16:11 ` KOSAKI Motohiro
  2011-12-19 17:30 ` KOSAKI Motohiro
  4 siblings, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2011-12-19 12:12 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	John Stultz, linux-mm, linux-kernel, Johannes Weiner,
	KAMEZAWA Hiroyuki

[Didn't get to the patch yet but a comment on memcg]

On Mon 19-12-11 06:53:28, Anton Vorontsov wrote:
[...]
> - Use memory controller cgroup (CGROUP_MEM_RES_CTLR) notifications from
>   the kernel side, plus userland "manager" that would kill applications.
> 
>   The main downside of this approach is that mem_cg needs 20 bytes per
>   page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>   that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.

page_cgroup is 16B per page, and with Johannes' current memcg
naturalization work (in the mmotm tree) we are down to 8B per page (we
got rid of lru). Kamezawa has some patches to get rid of the flags, so we
will be down to 4B per page on 32b. Is this still too much?
I would be really careful about yet another lowmem notification
mechanism.
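
To put the per-page figures mentioned in this thread (20, 16, 8 and 4
bytes) side by side, the arithmetic is just pages times bytes per page.
The helper name page_meta_bytes is made up for this sketch; it assumes a
flat per-page cost and nothing else.

```c
#include <stdint.h>

/* Bytes of per-page metadata needed to cover ram_bytes of RAM. */
static uint64_t page_meta_bytes(uint64_t ram_bytes, uint64_t page_size,
				uint64_t bytes_per_page)
{
	return (ram_bytes / page_size) * bytes_per_page;
}
```

A 1GB machine with 4K pages has 262144 pages, so 20B per page costs 5MB,
16B costs 4MB, 8B costs 2MB and 4B costs 1MB.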

>   0.5% doesn't sound too bad, but 5MB does, quite a little bit. So,
>   mem_cg feels like an overkill for this simple task (see the driver at
>   the very bottom).

Why is it overkill? I think that having 2 groups (active and
inactive) and moving tasks between them sounds quite elegant. You can
implement a user space oom handler in both groups (active will just
move a task to the inactive group, while inactive will kill the task which
hasn't been used for the longest time).
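
A minimal sketch of that two-group policy, with plain in-memory bookkeeping
instead of real cgroups. Everything here is hypothetical: the struct, the
function names, and the last_used timestamp. Picking the least recently
used task for demotion from the active group is also an assumption; the
idea above only says "a task".

```c
#include <stddef.h>

enum group { GROUP_ACTIVE, GROUP_INACTIVE, GROUP_DEAD };

struct task {
	int pid;
	enum group grp;
	unsigned long last_used;	/* hypothetical "last used" time */
};

/* Index of the least recently used task in group g, or -1 if none. */
static int oldest_in_group(const struct task *t, size_t n, enum group g)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (t[i].grp != g)
			continue;
		if (best < 0 || t[i].last_used < t[best].last_used)
			best = (int)i;
	}
	return best;
}

/* OOM event in the active group: demote one task. Returns pid or -1. */
static int on_active_oom(struct task *t, size_t n)
{
	int i = oldest_in_group(t, n, GROUP_ACTIVE);

	if (i < 0)
		return -1;
	t[i].grp = GROUP_INACTIVE;
	return t[i].pid;
}

/* OOM event in the inactive group: kill the task unused the longest. */
static int on_inactive_oom(struct task *t, size_t n)
{
	int i = oldest_in_group(t, n, GROUP_INACTIVE);

	if (i < 0)
		return -1;
	t[i].grp = GROUP_DEAD;	/* a real manager would send SIGKILL */
	return t[i].pid;
}
```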
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19  2:53 Android low memory killer vs. memory pressure notifications Anton Vorontsov
                   ` (2 preceding siblings ...)
  2011-12-19 12:12 ` Michal Hocko
@ 2011-12-19 16:11 ` KOSAKI Motohiro
  2011-12-20  0:30   ` Hiroyuki Kamezawa
  2011-12-19 17:30 ` KOSAKI Motohiro
  4 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2011-12-19 16:11 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	Michal Hocko, John Stultz, linux-mm, linux-kernel,
	KAMEZAWA Hiroyuki

> - Use memory controller cgroup (CGROUP_MEM_RES_CTLR) notifications from
>    the kernel side, plus userland "manager" that would kill applications.
>
>    The main downside of this approach is that mem_cg needs 20 bytes per
>    page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>    that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.
>
>    0.5% doesn't sound too bad, but 5MB does, quite a little bit. So,
>    mem_cg feels like an overkill for this simple task (see the driver at
>    the very bottom).

Kamezawa-san, is 20 bytes/page still correct now? If I remember
correctly, you improved the space efficiency of memcg.



* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 10:39 ` Alan Cox
@ 2011-12-19 16:16   ` KOSAKI Motohiro
  2011-12-19 16:24     ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2011-12-19 16:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Anton Vorontsov, KOSAKI Motohiro, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	David Rientjes, Michal Hocko, John Stultz, linux-mm,
	linux-kernel

(12/19/11 5:39 AM), Alan Cox wrote:
>>    The main downside of this approach is that mem_cg needs 20 bytes per
>>    page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>>    that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.
>
> The obvious question would be why? Would fixing memcg make more sense ?

Just historical reasons. The initial memcg implementation by IBM was just
crap. People needed a very long time to fix it.


> The only problem I see with having a user space manager is that manager
> probably has to be mlock to avoid awkward fail cases and that may in fact
> make it smaller kernel side.




* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 16:16   ` KOSAKI Motohiro
@ 2011-12-19 16:24     ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2011-12-19 16:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Alan Cox, Anton Vorontsov, KOSAKI Motohiro,
	Arve Hjønnevåg, Pavel Machek, Greg Kroah-Hartman,
	Andrew Morton, David Rientjes, Michal Hocko, John Stultz,
	linux-mm, linux-kernel

On 12/19/2011 11:16 AM, KOSAKI Motohiro wrote:
> (12/19/11 5:39 AM), Alan Cox wrote:
>>> The main downside of this approach is that mem_cg needs 20 bytes per
>>> page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>>> that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.
>>
>> The obvious question would be why? Would fixing memcg make more sense ?
>
> Just historical reason. Initial memcg implement by IBM was just crap.

And the reason for that, I suspect, is that the "proper"
implementation changes the VM by so much that it would
never have been merged in the first place...

-- 
All rights reversed


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19  2:53 Android low memory killer vs. memory pressure notifications Anton Vorontsov
                   ` (3 preceding siblings ...)
  2011-12-19 16:11 ` KOSAKI Motohiro
@ 2011-12-19 17:30 ` KOSAKI Motohiro
  2011-12-19 17:34   ` KOSAKI Motohiro
  4 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2011-12-19 17:30 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	Michal Hocko, John Stultz, linux-mm, linux-kernel

> Personally I'd start thinking about the new [lightweight] notification
> stuff, i.e. something without mem_cg's downsides. Though, I'm Cc'ing
> Android folks so maybe they could enlighten us why in-kernel "lowmemory
> manager" might be a better idea. Plus Cc'ing folks that I think might
> be interested in this discussion.
>
> Thanks!
>
> p.s.
>
> I'm inlining the android memory killer code down below, just for the
> reference. It is quite small (and useful... though, currently only for
> Android case).
>
> - - - -
> From: Arve Hjønnevåg <arve@android.com>
> Subject: Android low memory killer driver
>
> The lowmemorykiller driver lets user-space specify a set of memory thresholds
> where processes with a range of oom_adj values will get killed. Specify the
> minimum oom_adj values in /sys/module/lowmemorykiller/parameters/adj and the
> number of free pages in /sys/module/lowmemorykiller/parameters/minfree. Both
> files take a comma separated list of numbers in ascending order.
>
> For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
> "1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill processes
> with a oom_adj value of 8 or higher when the free memory drops below 4096 pages
> and kill processes with a oom_adj value of 0 or higher when the free memory
> drops below 1024 pages.
>
> The driver considers memory used for caches to be free, but if a large
> percentage of the cached memory is locked this can be very inaccurate
> and processes may not get killed until the normal oom killer is triggered.
>
> ---
>   mm/Kconfig           |    7 ++
>   mm/Makefile          |    1 +
>   mm/lowmemorykiller.c |  175 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 183 insertions(+), 0 deletions(-)
>   create mode 100644 mm/lowmemorykiller.c
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..a2e7959 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -259,6 +259,12 @@ config DEFAULT_MMAP_MIN_ADDR
>   	  This value can be changed after boot using the
>   	  /proc/sys/vm/mmap_min_addr tunable.
>
> +config LOW_MEMORY_KILLER
> +	bool "Low Memory Killer"
> +	help
> +	  The lowmemorykiller driver lets user-space specify a set of memory
> +	  thresholds where processes will get killed.
> +
>   config ARCH_SUPPORTS_MEMORY_FAILURE
>   	bool
>
> diff --git a/mm/Makefile b/mm/Makefile
> index 50ec00e..10fb4ff 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -47,6 +47,7 @@ obj-$(CONFIG_QUICKLIST) += quicklist.o
>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>   obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>   obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> +obj-$(CONFIG_LOW_MEMORY_KILLER)	+= lowmemorykiller.o
>   obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>   obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>   obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> diff --git a/mm/lowmemorykiller.c b/mm/lowmemorykiller.c
> new file mode 100644
> index 0000000..4e51936
> --- /dev/null
> +++ b/mm/lowmemorykiller.c
> @@ -0,0 +1,175 @@
> +/*
> + * The lowmemorykiller driver lets user-space specify a set of memory thresholds
> + * where processes with a range of oom_adj values will get killed. Specify the
> + * minimum oom_adj values in /sys/module/lowmemorykiller/parameters/adj and the
> + * number of free pages in /sys/module/lowmemorykiller/parameters/minfree. Both
> + * files take a comma separated list of numbers in ascending order.
> + *
> + * For example, write "0,8" to /sys/module/lowmemorykiller/parameters/adj and
> + * "1024,4096" to /sys/module/lowmemorykiller/parameters/minfree to kill processes
> + * with a oom_adj value of 8 or higher when the free memory drops below 4096 pages
> + * and kill processes with a oom_adj value of 0 or higher when the free memory
> + * drops below 1024 pages.
> + *
> + * The driver considers memory used for caches to be free, but if a large
> + * percentage of the cached memory is locked this can be very inaccurate
> + * and processes may not get killed until the normal oom killer is triggered.
> + *
> + * Copyright (C) 2007-2008 Google, Inc.
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/oom.h>
> +#include <linux/sched.h>
> +#include <linux/notifier.h>
> +
> +static uint32_t lowmem_debug_level = 2;
> +static int lowmem_adj[6] = {
> +	0,
> +	1,
> +	6,
> +	12,
> +};
> +static int lowmem_adj_size = 4;
> +static size_t lowmem_minfree[6] = {
> +	3 * 512,	/* 6MB */
> +	2 * 1024,	/* 8MB */
> +	4 * 1024,	/* 16MB */
> +	16 * 1024,	/* 64MB */
> +};
> +static int lowmem_minfree_size = 4;
> +
> +#define lowmem_print(level, x...)			\
> +	do {						\
> +		if (lowmem_debug_level >= (level))	\
> +			printk(x);			\
> +	} while (0)
> +
> +static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
> +{
> +	struct task_struct *p;
> +	struct task_struct *selected = NULL;
> +	int rem = 0;
> +	int tasksize;
> +	int i;
> +	int min_adj = OOM_ADJUST_MAX + 1;
> +	int selected_tasksize = 0;
> +	int selected_oom_adj;
> +	int array_size = ARRAY_SIZE(lowmem_adj);
> +	int other_free = global_page_state(NR_FREE_PAGES);
> +	int other_file = global_page_state(NR_FILE_PAGES) -
> +						global_page_state(NR_SHMEM);
> +
> +	if (lowmem_adj_size < array_size)
> +		array_size = lowmem_adj_size;
> +	if (lowmem_minfree_size < array_size)
> +		array_size = lowmem_minfree_size;
> +	for (i = 0; i < array_size; i++) {
> +		if (other_free < lowmem_minfree[i] &&
> +		    other_file < lowmem_minfree[i]) {
> +			min_adj = lowmem_adj[i];
> +			break;
> +		}
> +	}
> +	if (sc->nr_to_scan > 0)
> +		lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
> +			     sc->nr_to_scan, sc->gfp_mask, other_free, other_file,
> +			     min_adj);
> +	rem = global_page_state(NR_ACTIVE_ANON) +
> +		global_page_state(NR_ACTIVE_FILE) +
> +		global_page_state(NR_INACTIVE_ANON) +
> +		global_page_state(NR_INACTIVE_FILE);

This seems incorrect. Killing a process only frees anon pages, not file cache.


> +	if (sc->nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
> +		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
> +			     sc->nr_to_scan, sc->gfp_mask, rem);
> +		return rem;
> +	}
> +	selected_oom_adj = min_adj;
> +
> +	read_lock(&tasklist_lock);

Crazy inefficient. A mere slab shrinker shouldn't take tasklist_lock.
Imagine a system with a very large number of tasks...

Moreover, if the system has plenty of file cache, no process should be killed
at all! That's a fundamental downside of this patch.


> +	for_each_process(p) {
> +		struct mm_struct *mm;
> +		struct signal_struct *sig;
> +		int oom_adj;
> +
> +		task_lock(p);
> +		mm = p->mm;
> +		sig = p->signal;
> +		if (!mm || !sig) {
> +			task_unlock(p);
> +			continue;
> +		}
> +		oom_adj = sig->oom_adj;
> +		if (oom_adj < min_adj) {
> +			task_unlock(p);
> +			continue;
> +		}
> +		tasksize = get_mm_rss(mm);
> +		task_unlock(p);
> +		if (tasksize <= 0)
> +			continue;
> +		if (selected) {
> +			if (oom_adj < selected_oom_adj)
> +				continue;
> +			if (oom_adj == selected_oom_adj &&
> +			    tasksize <= selected_tasksize)
> +				continue;
> +		}
> +		selected = p;
> +		selected_tasksize = tasksize;
> +		selected_oom_adj = oom_adj;
> +		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
> +			     p->pid, p->comm, oom_adj, tasksize);
> +	}
> +	if (selected) {
> +		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",
> +			     selected->pid, selected->comm,
> +			     selected_oom_adj, selected_tasksize);
> +		force_sig(SIGKILL, selected);

Scarily naive assumption. Sending SIGKILL does not guarantee that the process
is killed immediately if the task is stuck in the kernel.


> +		rem -= selected_tasksize;
> +	}
> +	lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
> +		     sc->nr_to_scan, sc->gfp_mask, rem);
> +	read_unlock(&tasklist_lock);
> +	return rem;
> +}
> +
> +static struct shrinker lowmem_shrinker = {
> +	.shrink = lowmem_shrink,
> +	.seeks = DEFAULT_SEEKS * 16
> +};
> +
> +static int __init lowmem_init(void)
> +{
> +	register_shrinker(&lowmem_shrinker);
> +	return 0;
> +}
> +
> +static void __exit lowmem_exit(void)
> +{
> +	unregister_shrinker(&lowmem_shrinker);
> +}
> +
> +module_param_named(cost, lowmem_shrinker.seeks, int, S_IRUGO | S_IWUSR);
> +module_param_array_named(adj, lowmem_adj, int, &lowmem_adj_size,
> +			 S_IRUGO | S_IWUSR);
> +module_param_array_named(minfree, lowmem_minfree, uint, &lowmem_minfree_size,
> +			 S_IRUGO | S_IWUSR);
> +module_param_named(debug_level, lowmem_debug_level, uint, S_IRUGO | S_IWUSR);
> +
> +module_init(lowmem_init);
> +module_exit(lowmem_exit);
> +
> +MODULE_LICENSE("GPL");
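
For reference, the threshold lookup and the victim selection done by
lowmem_shrink() in the patch above can be modelled in plain userspace C.
This is only a sketch of the logic, not the driver itself: struct task_info
and the names lowmem_min_adj() and lowmem_select() are made up for
illustration, the tables are the defaults from the patch, and OOM_ADJUST_MAX
is assumed to be 15 as in <linux/oom.h> of that era.

```c
#include <assert.h>
#include <stddef.h>

#define OOM_ADJUST_MAX 15	/* assumed value, as in <linux/oom.h> */

/* Default tables copied from the patch; minfree is counted in pages. */
static const int lowmem_adj[] = { 0, 1, 6, 12 };
static const size_t lowmem_minfree[] = {
	3 * 512, 2 * 1024, 4 * 1024, 16 * 1024
};
#define LOWMEM_LEVELS (sizeof(lowmem_adj) / sizeof(lowmem_adj[0]))

/*
 * Return the lowest oom_adj that is eligible for killing, or
 * OOM_ADJUST_MAX + 1 when no threshold is crossed. As in the patch,
 * both the free page count and the file cache count must be below
 * the same minfree entry.
 */
static int lowmem_min_adj(size_t other_free, size_t other_file)
{
	size_t i;

	for (i = 0; i < LOWMEM_LEVELS; i++) {
		if (other_free < lowmem_minfree[i] &&
		    other_file < lowmem_minfree[i])
			return lowmem_adj[i];
	}
	return OOM_ADJUST_MAX + 1;
}

/* Hypothetical stand-in for the task_struct fields the driver reads. */
struct task_info {
	int pid;
	int oom_adj;
	long rss_pages;
};

/*
 * Pick a victim the way the for_each_process() loop does: the highest
 * oom_adj wins, ties are broken by the larger RSS. Returns NULL when
 * no task has oom_adj >= min_adj.
 */
static const struct task_info *
lowmem_select(const struct task_info *tasks, size_t n, int min_adj)
{
	const struct task_info *selected = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		const struct task_info *p = &tasks[i];

		if (p->oom_adj < min_adj || p->rss_pages <= 0)
			continue;
		if (selected &&
		    (p->oom_adj < selected->oom_adj ||
		     (p->oom_adj == selected->oom_adj &&
		      p->rss_pages <= selected->rss_pages)))
			continue;
		selected = p;
	}
	return selected;
}
```

With the default tables, lowmem_min_adj(3000, 3000) returns 6, while
lowmem_min_adj(1000, 100000) returns OOM_ADJUST_MAX + 1 (nothing is
eligible), because the file cache count is above every minfree entry.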



* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 17:30 ` KOSAKI Motohiro
@ 2011-12-19 17:34   ` KOSAKI Motohiro
  0 siblings, 0 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2011-12-19 17:34 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	Michal Hocko, John Stultz, linux-mm, linux-kernel

>> + read_lock(&tasklist_lock);
>
> Crazy inefficient. A mere slab shrinker shouldn't take tasklist_lock.
> Imagine a system with a very large number of tasks...
>
> Moreover, if the system has plenty of file cache, no process should be killed
> at all! That's a fundamental downside of this patch.

In addition, this code reuses a lot of oom-killer code, which is a bad idea.
The oom killer handles a really exceptional case, so it pays no attention to
fast processing. But having no free memory is not rare: we are short of free
memory ALL THE TIME, because we have file cache.



* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19  7:48 ` Minchan Kim
@ 2011-12-19 19:05   ` David Rientjes
  0 siblings, 0 replies; 23+ messages in thread
From: David Rientjes @ 2011-12-19 19:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Anton Vorontsov, KOSAKI Motohiro, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	Michal Hocko, John Stultz, linux-mm, linux-kernel

On Mon, 19 Dec 2011, Minchan Kim wrote:

> Kernel should have just signal role when resource is not enough.
> It is desirable that killing is role of user space.

The low memory killer becomes an out of memory killer very quickly if 
(1) userspace can't respond fast enough and (2) the killed thread cannot 
exit and free its memory fast enough.  It also requires userspace to know 
which threads are sharing memory such that they may all be killed; 
otherwise, killing one thread won't lead to future memory freeing.

If the system becomes oom before userspace can kill a thread, then there's 
no guarantee that it will ever be able to exit.  That's fixed in the 
kernel oom killer by allowing special access to memory reserves 
specifically for this purpose, which userspace can't provide.

So the prerequisites for this to work correctly every time would be to 
ensure that points (1) and (2) above can always happen.  I'm not seeing 
where that's proven, so presumably you'd still always need the kernel oom 
killer.
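
Point (2) can be illustrated from userspace: kill(2) only queues SIGKILL,
and the memory comes back once the victim has actually exited. A minimal
sketch, assuming the manager is the parent of the victim so that waitpid(2)
can be used to wait for the exit; kill_and_reap() and spawn_idle_child()
are hypothetical names, and a manager that is not the parent would need a
different way to wait.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demo helper: fork a child that sleeps in pause() as a stand-in victim. */
static pid_t spawn_idle_child(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;
}

/*
 * Send SIGKILL to a child and block until it has actually exited.
 * Returns 1 if the child was terminated by SIGKILL, 0 if it ended some
 * other way, -1 on error. Only after waitpid() returns has the victim
 * really gone away; kill() returning 0 says nothing about that.
 */
static int kill_and_reap(pid_t child)
{
	int status;

	if (child <= 0)		/* never signal a process group or everyone */
		return -1;
	if (kill(child, SIGKILL) < 0)
		return -1;
	if (waitpid(child, &status, 0) != child)
		return -1;
	/* Without WUNTRACED, waitpid() only reports terminated children. */
	assert(WIFEXITED(status) || WIFSIGNALED(status));
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

Note that waitpid() here can block for as long as the victim is stuck in
the kernel, which is exactly the window in which the system may go oom.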


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 12:12 ` Michal Hocko
@ 2011-12-19 19:12   ` David Rientjes
  2011-12-20 14:56     ` Anton Vorontsov
  2011-12-20  2:16   ` Anton Vorontsov
  1 sibling, 1 reply; 23+ messages in thread
From: David Rientjes @ 2011-12-19 19:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anton Vorontsov, KOSAKI Motohiro, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	John Stultz, linux-mm, linux-kernel, Johannes Weiner,
	KAMEZAWA Hiroyuki

On Mon, 19 Dec 2011, Michal Hocko wrote:

> page_cgroup is 16B per page and with the current Johannes' memcg
> naturalization work (in the mmotm tree) we are down to 8B per page (we
> got rid of lru). Kamezawa has some patches to get rid of the flags so we
> will be down to 4B per page on 32b. Is this still too much?
> I would be really careful about a yet another lowmem notification
> mechanism.
> 

There was always general interest in a low memory notification mechanism 
even prior to memcg, see http://lwn.net/Articles/268732/ from Marcelo and 
KOSAKI-san.  The desire is not only to avoid the metadata overhead of 
memcg, but also to avoid cgroups entirely.


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 16:11 ` KOSAKI Motohiro
@ 2011-12-20  0:30   ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 23+ messages in thread
From: Hiroyuki Kamezawa @ 2011-12-20  0:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Anton Vorontsov, KOSAKI Motohiro, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	David Rientjes, Michal Hocko, John Stultz, linux-mm,
	linux-kernel, KAMEZAWA Hiroyuki

2011/12/20 KOSAKI Motohiro <kosaki.motohiro@gmail.com>:
>> - Use memory controller cgroup (CGROUP_MEM_RES_CTLR) notifications from
>>   the kernel side, plus userland "manager" that would kill applications.
>>
>>   The main downside of this approach is that mem_cg needs 20 bytes per
>>   page (on a 32 bit machine). So on a 32 bit machine with 4K pages
>>   that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.
>>
>>   0.5% doesn't sound too bad, but 5MB does, quite a little bit. So,
>>   mem_cg feels like an overkill for this simple task (see the driver at
>>   the very bottom).
>
>
> Kamezawa-san, Is 20bytes/page still correct now? If I remember correctly,
> you improved space efficiency of memcg.
>
Johannes removed 4 bytes. It's in upstream.
Johannes removed 8 bytes. It's now in linux-next.
I'm preparing a patch to remove 4 more bytes.

Thanks,
-Kame


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 12:12 ` Michal Hocko
  2011-12-19 19:12   ` David Rientjes
@ 2011-12-20  2:16   ` Anton Vorontsov
  1 sibling, 0 replies; 23+ messages in thread
From: Anton Vorontsov @ 2011-12-20  2:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KOSAKI Motohiro, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, David Rientjes,
	John Stultz, linux-mm, linux-kernel, Johannes Weiner,
	KAMEZAWA Hiroyuki

On Mon, Dec 19, 2011 at 01:12:55PM +0100, Michal Hocko wrote:
> [Didn't get to the patch yet but a comment on memcg]
> 
> On Mon 19-12-11 06:53:28, Anton Vorontsov wrote:
> [...]
> > - Use memory controller cgroup (CGROUP_MEM_RES_CTLR) notifications from
> >   the kernel side, plus userland "manager" that would kill applications.
> > 
> >   The main downside of this approach is that mem_cg needs 20 bytes per
> >   page (on a 32 bit machine). So on a 32 bit machine with 4K pages
> >   that's approx. 0.5% of RAM, or, in other words, 5MB on a 1GB machine.
> 
> page_cgroup is 16B per page and with the current Johannes' memcg
> naturalization work (in the mmotm tree) we are down to 8B per page (we
> got rid of lru). Kamezawa has some patches to get rid of the flags so we
> will be down to 4B per page on 32b. Is this still too much?
> I would be really careful about a yet another lowmem notification
> mechanism.

4 bytes (1MB wastage on a 1GB machine) sounds much better. If there are no
other downsides of using a cgroups-based low memory killer, then maybe it's
not worth adding yet another low memory notification mechanism.
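
The overhead figures in this subthread follow from one multiplication:
number of pages times bytes of page_cgroup metadata per page. A small
sketch of that arithmetic, assuming 4K pages and 1GB meaning 2^30 bytes;
memcg_overhead_bytes() is a made-up helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of per-page metadata needed to cover ram_bytes of memory when
 * every page of page_size bytes carries meta_per_page bytes of state.
 */
static uint64_t memcg_overhead_bytes(uint64_t ram_bytes, uint64_t page_size,
				     uint64_t meta_per_page)
{
	assert(page_size > 0);
	return (ram_bytes / page_size) * meta_per_page;
}
```

For 1GB of RAM this gives 5MB at 20 bytes per page, 4MB at 16, 2MB at 8
and 1MB at 4, which matches the numbers quoted above.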

> >   0.5% doesn't sound too bad, but 5MB does, quite a little bit. So,
> >   mem_cg feels like an overkill for this simple task (see the driver at
> >   the very bottom).
> 
> Why is it an overkill? I think that having 2 groups (active and
> inactive) and moving tasks between them sounds quite elegant.

Yep, that was the original idea. But back then mem_cg was way too costly,
so nobody seriously considered this as a solution.

Thanks,

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-19 19:12   ` David Rientjes
@ 2011-12-20 14:56     ` Anton Vorontsov
  2011-12-20 21:36       ` David Rientjes
  0 siblings, 1 reply; 23+ messages in thread
From: Anton Vorontsov @ 2011-12-20 14:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, KOSAKI Motohiro, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	John Stultz, linux-mm, linux-kernel, Johannes Weiner,
	KAMEZAWA Hiroyuki

On Mon, Dec 19, 2011 at 11:12:09AM -0800, David Rientjes wrote:
> On Mon, 19 Dec 2011, Michal Hocko wrote:
> 
> > page_cgroup is 16B per page and with the current Johannes' memcg
> > naturalization work (in the mmotm tree) we are down to 8B per page (we
> > got rid of lru). Kamezawa has some patches to get rid of the flags so we
> > will be down to 4B per page on 32b. Is this still too much?
> > I would be really careful about a yet another lowmem notification
> > mechanism.
> > 
> 
> There was always general interest in a low memory notification mechanism 
> even prior to memcg, see http://lwn.net/Articles/268732/ from Marcelo and 
> KOSAKI-san.  The desire is not only to avoid the metadata overhead of 
> memcg, but also to avoid cgroups entirely.

Hm, assuming that metadata is no longer an issue, why do you think avoiding
cgroups would be a good idea?

Thanks,

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-20 14:56     ` Anton Vorontsov
@ 2011-12-20 21:36       ` David Rientjes
  2011-12-21  0:28         ` Anton Vorontsov
  0 siblings, 1 reply; 23+ messages in thread
From: David Rientjes @ 2011-12-20 21:36 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Michal Hocko, KOSAKI Motohiro, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	John Stultz, linux-mm, linux-kernel, Johannes Weiner,
	KAMEZAWA Hiroyuki

On Tue, 20 Dec 2011, Anton Vorontsov wrote:

> Hm, assuming that metadata is no longer an issue, why do you think avoiding
> cgroups would be a good idea?
> 

It's helpful for certain end users, particularly those in the embedded 
world, to be able to disable as many config options as possible to reduce 
the size of kernel image as much as possible, so they'll want a minimal 
amount of kernel functionality that allows such notifications.  Keep in 
mind that CONFIG_CGROUP_MEM_RES_CTLR is not enabled by default because of 
this (enabling it, CONFIG_RESOURCE_COUNTERS, and CONFIG_CGROUPS increases 
the size of the kernel text by ~1%), and it's becoming increasingly 
important for certain workloads to be notified of low memory conditions 
without any restriction on its usage other than the amount of RAM that the 
system has so that they can trigger internal memory freeing, explicit 
memory compaction from the command line, drop caches, reducing scheduling 
priority, etc.


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-20 21:36       ` David Rientjes
@ 2011-12-21  0:28         ` Anton Vorontsov
  2011-12-21  1:14           ` Frank Rowand
  2011-12-21  2:50           ` David Rientjes
  0 siblings, 2 replies; 23+ messages in thread
From: Anton Vorontsov @ 2011-12-21  0:28 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Michal Hocko, Arve Hjønnevåg, Rik van Riel,
	Pavel Machek, Greg Kroah-Hartman, Andrew Morton, John Stultz,
	linux-mm, linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki,
	Alan Cox

On Tue, Dec 20, 2011 at 01:36:00PM -0800, David Rientjes wrote:
> On Tue, 20 Dec 2011, Anton Vorontsov wrote:
> 
> > Hm, assuming that metadata is no longer an issue, why do you think avoiding
> > cgroups would be a good idea?
> > 
> 
> It's helpful for certain end users, particularly those in the embedded 
> world, to be able to disable as many config options as possible to reduce 
> the size of kernel image as much as possible, so they'll want a minimal 
> amount of kernel functionality that allows such notifications.  Keep in 
> mind that CONFIG_CGROUP_MEM_RES_CTLR is not enabled by default because of 
> this (enabling it, CONFIG_RESOURCE_COUNTERS, and CONFIG_CGROUPS increases 
> the size of the kernel text by ~1%),

So for 2MB kernel that's about 20KB of an additional text... This seems
affordable, especially as a trade-off for the things that cgroups may
provide.

The fact is, for desktop and server Linux, cgroups slowly becomes a
mandatory thing. And the reason for this is that cgroups mechanism
provides some very useful features (in an extensible way, like plugins),
i.e. a way to manage and track processes and its resources -- which is the
main purpose of cgroups.

And that's exactly what we want for low memory killer -- manage processes
and track its resources.

No doubt that Android is very different from desktop and server Linux
usage, but that does not mean that it has to use different kernel
interfaces.


As Alan Cox pointed out, we should probably focus on improving (if needed)
existing solutions, instead of duplicating functionality for the sake of
doing the same thing, but in a more "lightweight" and ad-hocish way.

By going the "alternative" (to cgroups) way, we risk ending up with the
same thing under a different name.

> and it's becoming increasingly 
> important for certain workloads to be notified of low memory conditions 
> without any restriction on its usage other than the amount of RAM that the 
> system has

I'm not sure what you mean here. Mem_cg may provide a way for userland
to be notified of low memory conditions, i.e. of the amount of RAM
that the system has -- the same thing as /dev/mem_notify would do...

(Though, as of current mem_cg, I believe that the root memory.usage_in_bytes
does not account for memory used by the kernel itself, so today it does not
seem possible to use the 'memory thresholds' feature to track the total amount
of RAM available in the system.)
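
For readers unfamiliar with the 'memory thresholds' feature: as described
in the kernel's cgroup v1 memory controller documentation, userland creates
an eventfd, opens memory.usage_in_bytes, and writes
"<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
cgroup.event_control. A rough sketch under those assumptions;
format_threshold() and memcg_arm_threshold() are hypothetical helper names,
and cg_dir is whatever path the memcg hierarchy happens to be mounted at.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <usage_fd> <threshold>" registration line. */
static int format_threshold(char *buf, size_t len, int efd, int usage_fd,
			    unsigned long long threshold_bytes)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%d %d %llu", efd, usage_fd, threshold_bytes);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Arm a usage threshold on the memcg directory cg_dir. On success,
 * returns an eventfd that becomes readable once usage crosses
 * threshold_bytes and stores the open memory.usage_in_bytes descriptor
 * in *usage_fd (kept open here to be conservative); the caller closes
 * both when done. Returns -1 on error.
 */
static int memcg_arm_threshold(const char *cg_dir,
			       unsigned long long threshold_bytes,
			       int *usage_fd)
{
	char path[256], line[64];
	int efd, ufd = -1, cfd = -1, n;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cg_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	ufd = open(path, O_RDONLY);
	if (ufd < 0)
		goto fail;

	n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cg_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		goto fail;
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	n = format_threshold(line, sizeof(line), efd, ufd, threshold_bytes);
	if (n < 0 || write(cfd, line, (size_t)n) != n)
		goto fail;

	close(cfd);
	*usage_fd = ufd;
	return efd;

fail:
	if (cfd >= 0)
		close(cfd);
	if (ufd >= 0)
		close(ufd);
	close(efd);
	return -1;
}
```

A manager would then block in read(2) on the returned eventfd (8 bytes)
and react when it wakes up. The caveat above still applies: this tracks
what the group is charged for, not the total RAM left in the system.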

> so that they can trigger internal memory freeing, explicit 
> memory compaction from the command line, drop caches, reducing scheduling 
> priority, etc.

Mem_cg provides a mere resources tracking and notification mechanism,
I'm not sure how it could restrict what exactly apps would do with it.
They as well may trigger internal memory freeing, drop caches etc., no?

Thanks!

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-21  0:28         ` Anton Vorontsov
@ 2011-12-21  1:14           ` Frank Rowand
  2011-12-21  2:07             ` Anton Vorontsov
  2011-12-22  1:16             ` KOSAKI Motohiro
  2011-12-21  2:50           ` David Rientjes
  1 sibling, 2 replies; 23+ messages in thread
From: Frank Rowand @ 2011-12-21  1:14 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: David Rientjes, KOSAKI Motohiro, Michal Hocko,
	Arve Hjønnevåg, Rik van Riel, Pavel Machek,
	Greg Kroah-Hartman, Andrew Morton, John Stultz, linux-mm,
	linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Alan Cox,
	tbird20d

On 12/20/11 16:28, Anton Vorontsov wrote:
> On Tue, Dec 20, 2011 at 01:36:00PM -0800, David Rientjes wrote:
>> On Tue, 20 Dec 2011, Anton Vorontsov wrote:
>>
>>> Hm, assuming that metadata is no longer an issue, why do you think avoiding
>>> cgroups would be a good idea?
>>>
>>
>> It's helpful for certain end users, particularly those in the embedded 
>> world, to be able to disable as many config options as possible to reduce 
>> the size of kernel image as much as possible, so they'll want a minimal 
>> amount of kernel functionality that allows such notifications.  Keep in 
>> mind that CONFIG_CGROUP_MEM_RES_CTLR is not enabled by default because of 
>> this (enabling it, CONFIG_RESOURCE_COUNTERS, and CONFIG_CGROUPS increases 
>> the size of the kernel text by ~1%),
> 
> So for 2MB kernel that's about 20KB of an additional text... This seems
> affordable, especially as a trade-off for the things that cgroups may
> provide.

A comment from http://lkml.indiana.edu/hypermail/linux/kernel/1102.1/00412.html:

"I care about 5K. (But honestly, I don't actively hunt stuff less than
10K in size, because there's too many of them to chase, currently)."

> 
> The fact is, for desktop and server Linux, cgroups slowly becomes a
> mandatory thing. And the reason for this is that cgroups mechanism
> provides some very useful features (in an extensible way, like plugins),
> i.e. a way to manage and track processes and its resources -- which is the
> main purpose of cgroups.

And for embedded and for real-time, some of us do not want cgroups to be
a mandatory thing.  We want it to remain configurable.  My personal
interest is in keeping the latency of certain critical paths (especially
in the scheduler) short and consistent.

-Frank



* Re: Android low memory killer vs. memory pressure notifications
  2011-12-21  1:14           ` Frank Rowand
@ 2011-12-21  2:07             ` Anton Vorontsov
  2011-12-21  2:30               ` Frank Rowand
  2011-12-22  1:16             ` KOSAKI Motohiro
  1 sibling, 1 reply; 23+ messages in thread
From: Anton Vorontsov @ 2011-12-21  2:07 UTC (permalink / raw)
  To: Frank Rowand
  Cc: David Rientjes, KOSAKI Motohiro, Michal Hocko,
	Arve Hjønnevåg, Rik van Riel, Pavel Machek,
	Greg Kroah-Hartman, Andrew Morton, John Stultz, linux-mm,
	linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Alan Cox,
	tbird20d

On Tue, Dec 20, 2011 at 05:14:18PM -0800, Frank Rowand wrote:
[...]
> >>> Hm, assuming that metadata is no longer an issue, why do you think avoiding
> >>> cgroups would be a good idea?
> >>>
> >>
> >> It's helpful for certain end users, particularly those in the embedded 
> >> world, to be able to disable as many config options as possible to reduce 
> >> the size of kernel image as much as possible, so they'll want a minimal 
> >> amount of kernel functionality that allows such notifications.  Keep in 
> >> mind that CONFIG_CGROUP_MEM_RES_CTLR is not enabled by default because of 
> >> this (enabling it, CONFIG_RESOURCE_COUNTERS, and CONFIG_CGROUPS increases 
> >> the size of the kernel text by ~1%),
> > 
> > So for 2MB kernel that's about 20KB of an additional text... This seems
> > affordable, especially as a trade-off for the things that cgroups may
> > provide.
> 
> A comment from http://lkml.indiana.edu/hypermail/linux/kernel/1102.1/00412.html:
> 
> "I care about 5K. (But honestly, I don't actively hunt stuff less than
> 10K in size, because there's too many of them to chase, currently)."

I have just tried to turn off CGROUPS on my qemu test kernels:

$ diff -u cgroups no_cgroups 
    text           data     bss     dec     hex filename
-3869810         465976  565248 4901034  4ac8aa vmlinux
+3806374         460544  540672 4807590  495ba6 vmlinux

So, that's actually ~60KB. Which is serious. memcontrol.o text size
is about 23KB.
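
The ~60KB figure is simply the difference of the two text columns. A small
sketch of that arithmetic with the two vmlinux rows from the diff above;
struct size_row and the helper names are made up for illustration.

```c
#include <assert.h>

/* One row of size(1) output: segment sizes in bytes. */
struct size_row {
	long text;
	long data;
	long bss;
};

/* Sum of all three segments, the "dec" column of size(1). */
static long size_total(struct size_row r)
{
	return r.text + r.data + r.bss;
}

/* Growth of the text segment in bytes when going from base to with. */
static long text_delta(struct size_row with, struct size_row base)
{
	return with.text - base.text;
}

/* The same growth as a percentage of the base text size. */
static double text_delta_percent(struct size_row with, struct size_row base)
{
	assert(base.text > 0);
	return 100.0 * (double)(with.text - base.text) / (double)base.text;
}
```

For the rows above the text delta is 63436 bytes (about 62KB, roughly 1.7%
of the cgroup-less text), and the total delta including data and bss is
93444 bytes.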

And my cgroups setup was just this:

$ cat .config | grep CGRO
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_CGROUP_MEM_RES_CTLR=y
# CONFIG_CGROUP_PERF is not set
# CONFIG_CGROUP_SCHED is not set
# CONFIG_BLK_CGROUP is not set

:-(

> > The fact is, for desktop and server Linux, cgroups slowly becomes a
> > mandatory thing. And the reason for this is that cgroups mechanism
> > provides some very useful features (in an extensible way, like plugins),
> > i.e. a way to manage and track processes and its resources -- which is the
> > main purpose of cgroups.
> 
> And for embedded and for real-time, some of us do not want cgroups to be
> a mandatory thing.  We want it to remain configurable.  My personal
> interest is in keeping the latency of certain critical paths (especially
> in the scheduler) short and consistent.

Many thanks for your input! That would be quite a strong argument for going
with the /dev/mem_notify approach. Do you have any specific numbers on how
cgroups make scheduler latencies worse?

Thanks!

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-21  2:07             ` Anton Vorontsov
@ 2011-12-21  2:30               ` Frank Rowand
  2011-12-21 23:41                 ` Anton Vorontsov
  0 siblings, 1 reply; 23+ messages in thread
From: Frank Rowand @ 2011-12-21  2:30 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Rowand, Frank, David Rientjes, KOSAKI Motohiro, Michal Hocko,
	Arve Hjønnevåg, Rik van Riel, Pavel Machek,
	Greg Kroah-Hartman, Andrew Morton, John Stultz, linux-mm,
	linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Alan Cox,
	tbird20d

On 12/20/11 18:07, Anton Vorontsov wrote:
> On Tue, Dec 20, 2011 at 05:14:18PM -0800, Frank Rowand wrote:

< snip >

>> And for embedded and for real-time, some of us do not want cgroups to be
>> a mandatory thing.  We want it to remain configurable.  My personal
>> interest is in keeping the latency of certain critical paths (especially
>> in the scheduler) short and consistent.
> 
> Many thanks for your input! That would be quite a strong argument for going
> with the /dev/mem_notify approach. Do you have any specific numbers on how
> cgroups make scheduler latencies worse?

Sorry, I don't have specific numbers.  And the numbers would be workload
specific anyway.

-Frank



* Re: Android low memory killer vs. memory pressure notifications
  2011-12-21  0:28         ` Anton Vorontsov
  2011-12-21  1:14           ` Frank Rowand
@ 2011-12-21  2:50           ` David Rientjes
  1 sibling, 0 replies; 23+ messages in thread
From: David Rientjes @ 2011-12-21  2:50 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: KOSAKI Motohiro, Michal Hocko, Arve Hjønnevåg,
	Rik van Riel, Pavel Machek, Greg Kroah-Hartman, Andrew Morton,
	John Stultz, linux-mm, linux-kernel, Johannes Weiner,
	KAMEZAWA Hiroyuki, Alan Cox

On Wed, 21 Dec 2011, Anton Vorontsov wrote:

> > It's helpful for certain end users, particularly those in the embedded 
> > world, to be able to disable as many config options as possible to reduce 
> > the size of kernel image as much as possible, so they'll want a minimal 
> > amount of kernel functionality that allows such notifications.  Keep in 
> > mind that CONFIG_CGROUP_MEM_RES_CTLR is not enabled by default because of 
> > this (enabling it, CONFIG_RESOURCE_COUNTERS, and CONFIG_CGROUPS increases 
> > the size of the kernel text by ~1%),
> 
> So for 2MB kernel that's about 20KB of an additional text... This seems
> affordable, especially as a trade-off for the things that cgroups may
> provide.
> 

No, the ~1% figure compares defconfig against defconfig + CONFIG_CGROUPS + 
CONFIG_RESOURCE_COUNTERS + CONFIG_CGROUP_MEM_RES_CTLR.  Configs that want 
a very small kernel image will definitely not be running with defconfig; 
they'll be using a stripped-down version that allows for the smallest 
footprint possible.  Requiring those config options would then increase 
the size of the kernel text by much more than 1%.
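
A rough sketch of that arithmetic follows. The defconfig text size used here
(10 MiB) is a hypothetical placeholder, not a number from this thread, and the
sketch assumes the options add about the same absolute amount of text whatever
the base config is. Only the ~1% figure and the 2MB kernel come from the
discussion above.

```python
def percent_increase(base_text_bytes, added_text_bytes):
    """Relative text growth, in percent, when added_text_bytes is added."""
    return 100.0 * added_text_bytes / base_text_bytes


MiB = 1024 * 1024
DEFCONFIG_TEXT = 10 * MiB  # ASSUMPTION: hypothetical defconfig text size
STRIPPED_TEXT = 2 * MiB    # the 2MB kernel from the quoted example

# ~1% of the defconfig text, per the measurement described above.
# ASSUMPTION: the same absolute amount is added to a stripped-down config.
added = 0.01 * DEFCONFIG_TEXT

print(f"added text: ~{added / 1024:.0f} KiB")
print(f"defconfig:  +{percent_increase(DEFCONFIG_TEXT, added):.1f}%")
print(f"stripped:   +{percent_increase(STRIPPED_TEXT, added):.1f}%")
```

With these placeholder sizes the same added text is 1% of the defconfig image
but 5% of the stripped-down one; the real ratio depends on the actual sizes.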

Compare this situation with using CONFIG_SLOB for embedded devices (which 
is actually quite popular) over CONFIG_SLAB and CONFIG_SLUB specifically 
for that low memory footprint.

> The fact is, for desktop and server Linux, cgroups slowly becomes a
> mandatory thing.

And that's definitely in the wrong direction for Linux.  It would be like 
asking users to convert to slab or slub because we don't want to maintain 
a slob allocator that is specifically designed for an extremely low memory 
footprint.  Such a proposal would be rejected outright unless you could 
match the same footprint with the alternatives.

> As Alan Cox pointed out, we should probably focus on improving (if needed)
> existing solutions, instead of duplicating functionality for the sake of
> doing the same thing, but in a more "lightweight" and ad-hocish way.
> 

I'm very much in favor of extracting the low-memory notifiers and 
extending them for global use rather than tying them specifically to the 
memory controller.  Then memcg would be responsible only for limiting 
resources, rather than also carrying functionality that would be 
generally useful to everyone (memory notifiers) and requiring them to 
incur the overhead of memcg.

> > and it's becoming increasingly 
> > important for certain workloads to be notified of low memory conditions 
> > without any restriction on its usage other than the amount of RAM that the 
> > system has
> 
> I'm not sure what you mean here. Mem_cg may provide a way to the
> userland to be notified on low memory conditions, i.e. amount of RAM
> that the system has -- the same thing as /dev/mem_notify would do...
> 

Yes, but without the requirements of the above-mentioned subsystems.  The 
point here is that some embedded devices may want notification of low-
memory conditions without the overhead (both size and performance) of 
cgroups or memcg.  Please focus on that specifically.


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-21  2:30               ` Frank Rowand
@ 2011-12-21 23:41                 ` Anton Vorontsov
  0 siblings, 0 replies; 23+ messages in thread
From: Anton Vorontsov @ 2011-12-21 23:41 UTC (permalink / raw)
  To: Frank Rowand
  Cc: Rowand, Frank, David Rientjes, KOSAKI Motohiro, Michal Hocko,
	Arve Hjønnevåg, Rik van Riel, Pavel Machek,
	Greg Kroah-Hartman, Andrew Morton, John Stultz, linux-mm,
	linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Alan Cox,
	tbird20d

On Tue, Dec 20, 2011 at 06:30:41PM -0800, Frank Rowand wrote:
> >> And for embedded and for real-time, some of us do not want cgroups to be
> >> a mandatory thing.  We want it to remain configurable.  My personal
> >> interest is in keeping the latency of certain critical paths (especially
> >> in the scheduler) short and consistent.
> > 
> > Many thanks for your input! That would be quite a strong argument for going
> > with the /dev/mem_notify approach. Do you have any specific numbers on how
> > cgroups make scheduler latencies worse?
> 
> Sorry, I don't have specific numbers.  And the numbers would be workload
> specific anyway.

OK, here are some numbers I captured using rt-tests suite.

I don't see any significant latency regression with cyclictest, but
hackbench is roughly 5-7% slower with cgroups enabled (comparing the mean
times below). Might be interesting to the cgroups folks?

Kernel config, w/ preempt and only minimal options enabled for mem_cg:
http://ix.io/22w

rt-tests: https://github.com/clrkwllms/rt-tests.git

- - - - - test script
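#!/bin/sh
# Run each rt-tests benchmark three times; run from the rt-tests build directory.
echo cyclic
# 50000 loops per run, print only the summary line (-q)
for i in $(seq 1 3); do ./cyclictest -l 50000 -q; done
echo signal
# 30000 loops per run, print only the summary lines (-q)
for i in $(seq 1 3); do ./signaltest -l 30000 -q; done
echo hackbench
# 1000 loops per run, keep only the elapsed-time line
for i in $(seq 1 3); do ./hackbench -l 1000 | grep Time; done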
- - - - -

I ran this script inside a QEMU/KVM guest on an idle host. The host's
cpufreq governor is set to powersave (so it effectively becomes an
800 MHz machine). I can re-run this on real hardware, but I don't think
the results would differ significantly.


Results:

bzImage_nocgroups_nopreempt
---------------------------
cyclic
T: 0 ( 2240) P: 0 I:1000 C:  50000 Min:     46 Act:  228 Avg:  226 Max:    5693
T: 0 ( 2242) P: 0 I:1000 C:  50000 Min:     57 Act:  234 Avg:  244 Max:    9041
T: 0 ( 2244) P: 0 I:1000 C:  50000 Min:     47 Act:  246 Avg:  227 Max:    6612
signal
T: 0 ( 2247) P: 0 C:  30000 Min:      5 Act:    5 Avg:    6 Max:     236
T: 1 ( 2248) P: 0 C:  30000 Min:      5 Act:    5 Avg:  645 Max:   11719
T: 0 ( 2250) P: 0 C:  30000 Min:      6 Act:    6 Avg:    7 Max:     248
T: 1 ( 2251) P: 0 C:  30000 Min:      6 Act:    6 Avg:  647 Max:   14581
T: 0 ( 2253) P: 0 C:  30000 Min:      5 Act:    5 Avg:    7 Max:     210
T: 1 ( 2254) P: 0 C:  30000 Min:      5 Act:    6 Avg:  646 Max:   13892
hackbench
Time: 14.940
Time: 14.883
Time: 14.959

bzImage_cgroups_nopreempt:
--------------------------
cyclic
T: 0 (  963) P: 0 I:1000 C:  50000 Min:     52 Act:  248 Avg:  235 Max:    6497
T: 0 (  965) P: 0 I:1000 C:  50000 Min:     55 Act:  230 Avg:  228 Max:   10438
T: 0 (  967) P: 0 I:1000 C:  50000 Min:     51 Act:  173 Avg:  183 Max:    4396
signal
T: 0 (  970) P: 0 C:  30000 Min:      5 Act:    5 Avg:    6 Max:      98
T: 1 (  971) P: 0 C:  30000 Min:      5 Act:    5 Avg:  646 Max:   13654
T: 0 (  973) P: 0 C:  30000 Min:      5 Act:    5 Avg:    6 Max:     150
T: 1 (  974) P: 0 C:  30000 Min:      5 Act:    5 Avg:  646 Max:   10560
T: 0 (  976) P: 0 C:  30000 Min:      5 Act:    5 Avg:    6 Max:     107
T: 1 (  977) P: 0 C:  30000 Min:      5 Act:    5 Avg:  646 Max:   13453
hackbench
Time: 15.857
Time: 15.745
Time: 15.588

bzImage_cgroups_preempt:
------------------------
cyclic
T: 0 (  986) P: 0 I:1000 C:  50000 Min:     50 Act:  278 Avg:  239 Max:    8259
T: 0 (  988) P: 0 I:1000 C:  50000 Min:     53 Act:  236 Avg:  228 Max:    3565
T: 0 (  990) P: 0 I:1000 C:  50000 Min:     76 Act:  242 Avg:  238 Max:    3902
signal
T: 0 (  993) P: 0 C:  30000 Min:      6 Act:    6 Avg:    7 Max:     102
T: 1 (  994) P: 0 C:  30000 Min:      6 Act:    6 Avg:  646 Max:   10683
T: 0 (  996) P: 0 C:  30000 Min:      6 Act:    6 Avg:    7 Max:     129
T: 1 (  997) P: 0 C:  30000 Min:      6 Act:    6 Avg:  647 Max:   10973
T: 0 (  999) P: 0 C:  30000 Min:      6 Act:   43 Avg:    7 Max:      95
T: 1 ( 1000) P: 0 C:  30000 Min:      6 Act:   44 Avg:  646 Max:   10552
hackbench
Time: 15.632
Time: 15.221
Time: 15.443

bzImage_nocgroups_preempt:
--------------------------
cyclic
T: 0 (  974) P: 0 I:1000 C:  50000 Min:     50 Act:  268 Avg:  258 Max:    8324
T: 0 (  976) P: 0 I:1000 C:  50000 Min:     61 Act:  185 Avg:  183 Max:    2998
T: 0 (  978) P: 0 I:1000 C:  50000 Min:     55 Act:  234 Avg:  236 Max:    2858
signal
T: 0 (  981) P: 0 C:  30000 Min:      6 Act:    6 Avg:    7 Max:      85
T: 1 (  982) P: 0 C:  30000 Min:      6 Act:    6 Avg:  647 Max:   10479
T: 0 (  984) P: 0 C:  30000 Min:      6 Act:    6 Avg:    7 Max:     129
T: 1 (  985) P: 0 C:  30000 Min:      6 Act:    6 Avg:  647 Max:   11178
T: 0 (  987) P: 0 C:  30000 Min:      6 Act:    6 Avg:    7 Max:      94
T: 1 (  988) P: 0 C:  30000 Min:      6 Act:    6 Avg:  647 Max:   11587
hackbench
Time: 14.488
Time: 14.390
Time: 14.310
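
As a cross-check of the numbers above, here is a minimal sketch that averages
the three runs per kernel and compares cgroups against no-cgroups builds. The
data is copied from the results; the helper names (`cyclictest_summary`,
`slowdown_percent`) are made up for this illustration and are not part of
rt-tests. cyclictest values are in microseconds.

```python
import re
from statistics import mean

# hackbench "Time:" values (seconds), copied from the results above.
HACKBENCH = {
    "nocgroups_nopreempt": [14.940, 14.883, 14.959],
    "cgroups_nopreempt": [15.857, 15.745, 15.588],
    "cgroups_preempt": [15.632, 15.221, 15.443],
    "nocgroups_preempt": [14.488, 14.390, 14.310],
}

# cyclictest summary lines, copied from the results above.
CYCLICTEST = {
    "nocgroups_nopreempt": """
T: 0 ( 2240) P: 0 I:1000 C:  50000 Min:     46 Act:  228 Avg:  226 Max:    5693
T: 0 ( 2242) P: 0 I:1000 C:  50000 Min:     57 Act:  234 Avg:  244 Max:    9041
T: 0 ( 2244) P: 0 I:1000 C:  50000 Min:     47 Act:  246 Avg:  227 Max:    6612
""",
    "cgroups_nopreempt": """
T: 0 (  963) P: 0 I:1000 C:  50000 Min:     52 Act:  248 Avg:  235 Max:    6497
T: 0 (  965) P: 0 I:1000 C:  50000 Min:     55 Act:  230 Avg:  228 Max:   10438
T: 0 (  967) P: 0 I:1000 C:  50000 Min:     51 Act:  173 Avg:  183 Max:    4396
""",
    "cgroups_preempt": """
T: 0 (  986) P: 0 I:1000 C:  50000 Min:     50 Act:  278 Avg:  239 Max:    8259
T: 0 (  988) P: 0 I:1000 C:  50000 Min:     53 Act:  236 Avg:  228 Max:    3565
T: 0 (  990) P: 0 I:1000 C:  50000 Min:     76 Act:  242 Avg:  238 Max:    3902
""",
    "nocgroups_preempt": """
T: 0 (  974) P: 0 I:1000 C:  50000 Min:     50 Act:  268 Avg:  258 Max:    8324
T: 0 (  976) P: 0 I:1000 C:  50000 Min:     61 Act:  185 Avg:  183 Max:    2998
T: 0 (  978) P: 0 I:1000 C:  50000 Min:     55 Act:  234 Avg:  236 Max:    2858
""",
}

# Matches the Min/Act/Avg/Max fields of one cyclictest summary line.
FIELDS = re.compile(r"Min:\s*(\d+)\s+Act:\s*(\d+)\s+Avg:\s*(\d+)\s+Max:\s*(\d+)")


def cyclictest_summary(text):
    """Return (mean of the Avg column, worst Max) over all lines in text."""
    rows = [tuple(map(int, m.groups())) for m in FIELDS.finditer(text)]
    return mean(r[2] for r in rows), max(r[3] for r in rows)


def slowdown_percent(baseline, candidate):
    """Percent by which the mean of candidate exceeds the mean of baseline."""
    return 100.0 * (mean(candidate) / mean(baseline) - 1.0)


for preempt in ("nopreempt", "preempt"):
    base, cg = f"nocgroups_{preempt}", f"cgroups_{preempt}"
    pct = slowdown_percent(HACKBENCH[base], HACKBENCH[cg])
    print(f"hackbench {preempt}: {mean(HACKBENCH[base]):.3f}s -> "
          f"{mean(HACKBENCH[cg]):.3f}s ({pct:+.1f}%)")
    for name in (base, cg):
        avg, worst = cyclictest_summary(CYCLICTEST[name])
        print(f"  cyclictest {name}: mean Avg {avg:.1f} us, worst Max {worst} us")
```

On these means hackbench is about 5.4% slower with cgroups without preemption
and about 7.2% slower with preemption. With only three runs per kernel inside
a VM, treat the percentages as indicative rather than precise.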

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-21  1:14           ` Frank Rowand
  2011-12-21  2:07             ` Anton Vorontsov
@ 2011-12-22  1:16             ` KOSAKI Motohiro
  2011-12-22 18:53               ` Frank Rowand
  1 sibling, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2011-12-22  1:16 UTC (permalink / raw)
  To: frank.rowand
  Cc: Anton Vorontsov, David Rientjes, Michal Hocko,
	Arve Hjønnevåg, Rik van Riel, Pavel Machek,
	Greg Kroah-Hartman, Andrew Morton, John Stultz, linux-mm,
	linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Alan Cox,
	tbird20d

>> So for 2MB kernel that's about 20KB of an additional text... This seems
>> affordable, especially as a trade-off for the things that cgroups may
>> provide.
>
> A comment from http://lkml.indiana.edu/hypermail/linux/kernel/1102.1/00412.html:
>
> "I care about 5K. (But honestly, I don't actively hunt stuff less than
> 10K in size, because there's too many of them to chase, currently)."

Hm, interesting, because the current memory cgroup notification was
implemented at the request of Sony and CELinux. AFAIK, Sony at least
is already using cgroups.


>> The fact is, for desktop and server Linux, cgroups slowly becomes a
>> mandatory thing. And the reason for this is that cgroups mechanism
>> provides some very useful features (in an extensible way, like plugins),
>> i.e. a way to manage and track processes and its resources -- which is the
>> main purpose of cgroups.
>
> And for embedded and for real-time, some of us do not want cgroups to be
> a mandatory thing.  We want it to remain configurable.  My personal
> interest is in keeping the latency of certain critical paths (especially
> in the scheduler) short and consistent.

As far as I have observed, modern embedded systems have both RT and
non-RT processes. A Java VM or user-downloadable programs may need memory
notifications because users may download badly behaved programs. On the
other hand, RT processes are not downloadable and are thoroughly tested by
the hardware vendor. So I think you only need to split the processes
between those under cgroups and those not under cgroups.

cgroups have zero, or very likely near-zero, overhead if the processes
don't use them. Of course, feedback is welcome. I'm interested in your
embedded use case.
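
One way to picture the split suggested above is the sketch below. It assumes
the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (a common
location, but an assumption here) and uses a hypothetical group named
"untrusted". Only the non-RT processes are attached; RT processes are simply
left in the root group.

```python
import os


def attach_to_cgroup(cgroup_dir, pid):
    """Attach one task to a cgroup by writing its ID to the v1 'tasks' file."""
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(f"{pid}\n")


def split_processes(non_rt_pids, cgroup_dir):
    """Put only the non-RT processes under the given memory cgroup.

    On a real cgroup filesystem, creating the directory creates the group;
    RT processes are never written anywhere and stay in the root group.
    """
    os.makedirs(cgroup_dir, exist_ok=True)
    for pid in non_rt_pids:
        attach_to_cgroup(cgroup_dir, pid)


# Hypothetical usage (needs root and a mounted v1 memory controller):
#   split_processes([1234, 1235], "/sys/fs/cgroup/memory/untrusted")
```

Note that in cgroup v1 the `tasks` file attaches a single task (thread) per
write, so a multi-threaded program needs one write per thread.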


* Re: Android low memory killer vs. memory pressure notifications
  2011-12-22  1:16             ` KOSAKI Motohiro
@ 2011-12-22 18:53               ` Frank Rowand
  0 siblings, 0 replies; 23+ messages in thread
From: Frank Rowand @ 2011-12-22 18:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rowand, Frank, Anton Vorontsov, David Rientjes, Michal Hocko,
	Arve Hjønnevåg, Rik van Riel, Pavel Machek,
	Greg Kroah-Hartman, Andrew Morton, John Stultz, linux-mm,
	linux-kernel, Johannes Weiner, KAMEZAWA Hiroyuki, Alan Cox,
	tbird20d

On 12/21/11 17:16, KOSAKI Motohiro wrote:
>>> So for 2MB kernel that's about 20KB of an additional text... This seems
>>> affordable, especially as a trade-off for the things that cgroups may
>>> provide.
>>
>> A comment from http://lkml.indiana.edu/hypermail/linux/kernel/1102.1/00412.html:
>>
>> "I care about 5K. (But honestly, I don't actively hunt stuff less than
>> 10K in size, because there's too many of them to chase, currently)."
> 
> Hm, interesting, because the current memory cgroup notification was
> implemented at the request of Sony and CELinux. AFAIK, Sony at least
> is already using cgroups.

Sony makes a very large range of products.  The memory available on the
different products can range from a few megabytes to hundreds of megabytes
(and I wouldn't be surprised if the top of the range is gigabytes).

Our low memory products lead us to be concerned about the growth in
memory usage by newer kernel versions.  Of course we also like additional
features and kernel improvements, so we understand the balancing act of
features requiring more memory, while at the same time discouraging
memory growth for resource constrained systems.  Config options are
one of the tools used to manage that balancing act.

>>> The fact is, for desktop and server Linux, cgroups slowly becomes a
>>> mandatory thing. And the reason for this is that cgroups mechanism
>>> provides some very useful features (in an extensible way, like plugins),
>>> i.e. a way to manage and track processes and its resources -- which is the
>>> main purpose of cgroups.
>>
>> And for embedded and for real-time, some of us do not want cgroups to be
>> a mandatory thing.  We want it to remain configurable.  My personal
>> interest is in keeping the latency of certain critical paths (especially
>> in the scheduler) short and consistent.
> 
> As far as I have observed, modern embedded systems have both RT and
> non-RT processes. A Java VM or user-downloadable programs may need memory
> notifications because users may download badly behaved programs. On the
> other hand, RT processes are not downloadable and are thoroughly tested by
> the hardware vendor. So I think you only need to split the processes
> between those under cgroups and those not under cgroups.
> 
> cgroups have zero, or very likely near-zero, overhead if the processes
> don't use them. Of course, feedback is welcome. I'm interested in your
> embedded use case.

No, cgroups have _near_ zero overhead when the cgroup configuration option is
turned off. :-)  (Sorry, being pedantic, but still serious.)

Again, we have many different products.  Some may find cgroups to be useful.
But at least one of our product groups totally removed the cgroups source code
from their scheduler as part of their focus on reducing latency.

We have to think about a wide range of (sometimes conflicting)
requirements.  Config options help us choose which features to enable
for each product, resolving some of the conflicting requirements.

-Frank



end of thread, other threads:[~2011-12-22 18:54 UTC | newest]

Thread overview: 23+ messages
2011-12-19  2:53 Android low memory killer vs. memory pressure notifications Anton Vorontsov
2011-12-19  7:48 ` Minchan Kim
2011-12-19 19:05   ` David Rientjes
2011-12-19 10:39 ` Alan Cox
2011-12-19 16:16   ` KOSAKI Motohiro
2011-12-19 16:24     ` Rik van Riel
2011-12-19 12:12 ` Michal Hocko
2011-12-19 19:12   ` David Rientjes
2011-12-20 14:56     ` Anton Vorontsov
2011-12-20 21:36       ` David Rientjes
2011-12-21  0:28         ` Anton Vorontsov
2011-12-21  1:14           ` Frank Rowand
2011-12-21  2:07             ` Anton Vorontsov
2011-12-21  2:30               ` Frank Rowand
2011-12-21 23:41                 ` Anton Vorontsov
2011-12-22  1:16             ` KOSAKI Motohiro
2011-12-22 18:53               ` Frank Rowand
2011-12-21  2:50           ` David Rientjes
2011-12-20  2:16   ` Anton Vorontsov
2011-12-19 16:11 ` KOSAKI Motohiro
2011-12-20  0:30   ` Hiroyuki Kamezawa
2011-12-19 17:30 ` KOSAKI Motohiro
2011-12-19 17:34   ` KOSAKI Motohiro
