linux-kernel.vger.kernel.org archive mirror
* [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
@ 2012-01-04 17:21 Leonid Moiseichuk
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 1/3] Making si_swapinfo exportable Leonid Moiseichuk
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Leonid Moiseichuk @ 2012-01-04 17:21 UTC
  To: linux-mm, linux-kernel
  Cc: cesarb, kamezawa.hiroyu, emunson, penberg, aarcange, riel, mel,
	rientjes, dima, gregkh, rebecca, san, akpm, vesa.jaaskelainen

The main idea of the Used Memory Meter (UMM) is to provide a low-cost interface
that notifies user space about memory consumption, using an approach similar to
/proc/meminfo but focusing only on "modified" pages, which cannot simply be discarded.

The calculation formula, in terms of meminfo, is as follows:
  UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
                                               (SwapTotal - SwapFree)
It reflects well the amount of system memory used by applications in heaps and shared pages.
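
As an illustration only, the formula above can be written as a small C helper.
This is a sketch and not part of the patch set: struct meminfo_kb and
used_memory_kb() are hypothetical names, and all values are assumed to be in kB
as reported by /proc/meminfo.

```c
/* Hypothetical container for the /proc/meminfo fields used by the formula (kB). */
struct meminfo_kb {
	unsigned long mem_total;
	unsigned long mem_free;
	unsigned long buffers;
	unsigned long cached;
	unsigned long swap_cached;
	unsigned long swap_total;
	unsigned long swap_free;
};

/*
 * UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
 *              (SwapTotal - SwapFree)
 */
unsigned long used_memory_kb(const struct meminfo_kb *mi)
{
	unsigned long ram_used = mi->mem_total - mi->mem_free - mi->buffers -
				 mi->cached - mi->swap_cached;
	unsigned long swap_used = mi->swap_total - mi->swap_free;

	return ram_used + swap_used;
}
```

For example, with MemTotal=1000000, MemFree=200000, Buffers=50000,
Cached=300000, SwapCached=10000, SwapTotal=500000 and SwapFree=450000 the
helper yields 440000 + 50000 = 490000 kB.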

Previously (n770..n900) we had lowmem.c [1], which used LSM and did a lot of other things.
The n9 implementation is based on memcg [2], which has its own problems, so I hope the
proposed variant is the best one for n9:
- Keeps connections from user space
- When the amount of modified pages crosses the value specified for a particular connection,
  it signals POLLIN and allows the user-space app to read the value and react
- As economical as possible: currently it operates only if an allocation is larger than 487
  pages or the last check happened more than 250 ms ago
Of course, if no allocation happens then no activity is performed, so use-time
must not be affected.
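
Seen from user space, the flow above (open a connection, write a threshold,
wait for POLLIN, read the value) might look like the sketch below. It is not
part of the patch set: umm_parse() and umm_wait() are hypothetical helper
names, the device path is supplied by the caller, and the threshold is assumed
to be written as a decimal number of pages.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Parse one "USED_PAGES:AVAILABLE_PAGES" line; 0 on success, -1 on bad input. */
int umm_parse(const char *line, unsigned long *used, unsigned long *avail)
{
	return sscanf(line, "%lu:%lu", used, avail) == 2 ? 0 : -1;
}

/*
 * Subscribe to a threshold (in pages) on the device at 'path' and wait for
 * a notification. Returns 1 if notified, 0 on timeout, -1 on error.
 */
int umm_wait(const char *path, unsigned long threshold_pages, int timeout_ms,
	     unsigned long *used, unsigned long *avail)
{
	char buf[64];
	struct pollfd pfd;
	ssize_t n;
	int len;
	int ret = -1;

	pfd.fd = open(path, O_RDWR);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* the threshold is sent as a decimal string */
	len = snprintf(buf, sizeof(buf), "%lu\n", threshold_pages);
	if (write(pfd.fd, buf, (size_t)len) != len)
		goto out;

	/* sleep until the threshold is crossed or the timeout expires */
	ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		goto out;

	/* a single read returns one USED:AVAILABLE line */
	n = read(pfd.fd, buf, sizeof(buf) - 1);
	if (n <= 0) {
		ret = -1;
		goto out;
	}
	buf[n] = '\0';
	ret = umm_parse(buf, used, avail) == 0 ? 1 : -1;
out:
	close(pfd.fd);
	return ret;
}
```

A caller would typically pass "/dev/used_memory" as the path and a negative
timeout to block until the threshold is crossed in either direction.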

Testing results:
- Checkpatch produced 0 warnings
- Sparse does not produce warnings
- One check costs ~20 us or less (can be verified by loading the module with probe=1)
- One connection costs 20 bytes of kernel space (see the observer structure) in the 32-bit variant
- For 10K connections a poll update takes ~10 ms, but in practice a device is expected
  to have about 10 connections (as n9 has now).

Known weak points which I do not know how to fix, but will if you have a brilliant idea:
- Having a hook in MM is nasty, but the MM shrinker cannot be used there and LSM is an even worse idea
- If I run
	$cat /dev/used_memory
  it produces lines non-stop. Adding a position check in umm_read does not seem to help,
  so "head -1 /dev/used_memory" should be used if you need a quick check
- The output format is USED_PAGES:AVAILABLE_PAGES, primitive but enough for the tasks the module does
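
On the non-stop "cat" output: cat keeps reading until read() returns 0, so a
read handler that ignores the file position never signals end of data. The
usual pattern is an offset-aware copy (in the kernel, the existing helper
simple_read_from_buffer() follows this logic). Below is a user-space model of
that logic only; read_from_record() is a hypothetical name, and whether this
resolves the behaviour seen with umm_read has not been verified here.

```c
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to 'count' bytes of a fixed record starting at *ppos, advance
 * *ppos, and return 0 once the whole record has been consumed.
 */
ssize_t read_from_record(char *to, size_t count, off_t *ppos,
			 const char *from, size_t available)
{
	off_t pos = *ppos;

	if (pos < 0)
		return -1;
	if ((size_t)pos >= available || count == 0)
		return 0;	/* end of record: lets the reader terminate */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;

	memcpy(to, from + pos, count);
	*ppos = pos + (off_t)count;
	return (ssize_t)count;
}
```

With a record of "10:20\n", the first call returns 6 bytes and the second call
returns 0, which is what makes a plain "cat" stop after one line.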

Tested on ARM, x86-32 and x86-64 with and without CONFIG_SWAP. It seems to work in all combinations.
Sorry for the wide distribution, but the list of names was produced by scripts/get_maintainer.pl

References
[1] http://elinux.org/Accurate_Memory_Measurement
[2] http://maemo.gitorious.org/maemo-tools/libmemnotify

Leonid Moiseichuk (3):
  Making si_swapinfo exportable
  MM hook for page allocation and release
  Used Memory Meter pseudo-device module

 drivers/misc/Kconfig  |   12 ++
 drivers/misc/Makefile |    1 +
 drivers/misc/umm.c    |  452 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h    |   15 ++
 include/linux/umm.h   |   42 +++++
 mm/Kconfig            |    8 +
 mm/page_alloc.c       |   31 ++++
 mm/swapfile.c         |    3 +
 8 files changed, 564 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/umm.c
 create mode 100644 include/linux/umm.h

-- 
1.7.7.3



* [PATCH 3.2.0-rc1 1/3] Making si_swapinfo exportable
  2012-01-04 17:21 [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Leonid Moiseichuk
@ 2012-01-04 17:21 ` Leonid Moiseichuk
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release Leonid Moiseichuk
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Leonid Moiseichuk @ 2012-01-04 17:21 UTC
  To: linux-mm, linux-kernel
  Cc: cesarb, kamezawa.hiroyu, emunson, penberg, aarcange, riel, mel,
	rientjes, dima, gregkh, rebecca, san, akpm, vesa.jaaskelainen

If we make si_swapinfo() exportable, it can be called from modules.
Otherwise modules have no interface to obtain information about swap usage.
The change is made in the same way as si_meminfo() is declared.

Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
---
 mm/swapfile.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b1cd120..192cc25 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -5,10 +5,12 @@
  *  Swap reorganised 29.12.95, Stephen Tweedie
  */
 
+#include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/mman.h>
 #include <linux/slab.h>
+#include <linux/kernel.h>
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
 #include <linux/vmalloc.h>
@@ -2177,6 +2179,7 @@ void si_swapinfo(struct sysinfo *val)
 	val->totalswap = total_swap_pages + nr_to_be_unused;
 	spin_unlock(&swap_lock);
 }
+EXPORT_SYMBOL(si_swapinfo);
 
 /*
  * Verify that a swap entry is valid and increment its swap map count.
-- 
1.7.7.3



* [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-04 17:21 [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Leonid Moiseichuk
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 1/3] Making si_swapinfo exportable Leonid Moiseichuk
@ 2012-01-04 17:21 ` Leonid Moiseichuk
  2012-01-04 20:40   ` Pekka Enberg
                     ` (2 more replies)
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module Leonid Moiseichuk
  2012-01-04 19:56 ` [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Greg KH
  3 siblings, 3 replies; 38+ messages in thread
From: Leonid Moiseichuk @ 2012-01-04 17:21 UTC
  To: linux-mm, linux-kernel
  Cc: cesarb, kamezawa.hiroyu, emunson, penberg, aarcange, riel, mel,
	rientjes, dima, gregkh, rebecca, san, akpm, vesa.jaaskelainen

This is required by the Used Memory Meter (UMM) pseudo-device
to track memory utilization in the system. The hook MUST be
very light to prevent performance impact on the hot allocation
path. The number of managed pages is not expected to be exact,
but every allocation or deallocation must be registered.
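
The hook contract (positive page count after allocation, negative after free,
old hook returned for chaining) can be modelled in user space as below. This
is only a sketch under assumptions: the names are hypothetical, and it uses
C11 atomic_exchange() to swap the pointer in one step, which differs from the
separate read and set used in the patch.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef void (*alloc_free_hook_t)(int nr_pages);

/* Currently installed hook; NULL means no hook is set. */
static _Atomic(alloc_free_hook_t) current_hook;

/* Install a new hook and return the previous one so callers can chain to it. */
alloc_free_hook_t set_alloc_free_hook(alloc_free_hook_t hook)
{
	return atomic_exchange(&current_hook, hook);
}

/* Called with a positive count after allocation, negative after free. */
void call_alloc_free_hook(int nr_pages)
{
	alloc_free_hook_t hook = atomic_load(&current_hook);

	if (hook)
		hook(nr_pages);
}

/* Example consumer: keeps a running balance of allocated pages. */
long page_balance;

void count_pages(int nr_pages)
{
	page_balance += nr_pages;
}
```

An order-3 allocation followed by an order-1 free would call the hook with
8 and -2, leaving page_balance at 6.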

Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
---
 include/linux/mm.h |   15 +++++++++++++++
 mm/Kconfig         |    8 ++++++++
 mm/page_alloc.c    |   31 +++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3dc3a8c..d133f73 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1618,6 +1618,21 @@ extern int soft_offline_page(struct page *page, int flags);
 
 extern void dump_page(struct page *page);
 
+#ifdef CONFIG_MM_ALLOC_FREE_HOOK
+/*
+ * Hook function type called when pages are allocated or released.
+ * Value of nr_pages is positive for post-allocation calls and negative
+ * after free.
+ */
+typedef void (*mm_alloc_free_hook_t)(int nr_pages);
+
+/*
+ * Sets up the specified hook function for tracking page allocation.
+ * Returns value of old hook to organize chains of calls if necessary.
+ */
+mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook);
+#endif
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr,
diff --git a/mm/Kconfig b/mm/Kconfig
index 011b110..2aaa1e9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -373,3 +373,11 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config MM_ALLOC_FREE_HOOK
+	bool "Enable callback support for pages allocation and releasing"
+	default n
+	help
+	  Required for some features like used memory meter.
+	  If unsure, say N to disable alloc/free hook.
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..9307800 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -236,6 +236,30 @@ static void set_pageblock_migratetype(struct page *page, int migratetype)
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MM_ALLOC_FREE_HOOK
+static atomic_long_t alloc_free_hook __read_mostly = ATOMIC_LONG_INIT(0);
+
+mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook)
+{
+	const mm_alloc_free_hook_t old_hook =
+		(mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
+
+	atomic_long_set(&alloc_free_hook, (long)hook);
+	pr_info("MM alloc/free hook set to 0x%p (was 0x%p)\n", hook, old_hook);
+
+	return old_hook;
+}
+EXPORT_SYMBOL(set_mm_alloc_free_hook);
+
+static inline void call_alloc_free_hook(int pages)
+{
+	const mm_alloc_free_hook_t hook =
+		(mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
+	if (hook)
+		hook(pages);
+}
+#endif
+
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2298,6 +2322,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	put_mems_allowed();
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
+#ifdef CONFIG_MM_ALLOC_FREE_HOOK
+	call_alloc_free_hook(1 << order);
+#endif
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2345,6 +2373,9 @@ void __free_pages(struct page *page, unsigned int order)
 			free_hot_cold_page(page, 0);
 		else
 			__free_pages_ok(page, order);
+#ifdef CONFIG_MM_ALLOC_FREE_HOOK
+		call_alloc_free_hook(-(1 << order));
+#endif
 	}
 }
 
-- 
1.7.7.3



* [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-04 17:21 [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Leonid Moiseichuk
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 1/3] Making si_swapinfo exportable Leonid Moiseichuk
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release Leonid Moiseichuk
@ 2012-01-04 17:21 ` Leonid Moiseichuk
  2012-01-04 19:55   ` Greg KH
  2012-01-04 19:56 ` [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Greg KH
  3 siblings, 1 reply; 38+ messages in thread
From: Leonid Moiseichuk @ 2012-01-04 17:21 UTC
  To: linux-mm, linux-kernel
  Cc: cesarb, kamezawa.hiroyu, emunson, penberg, aarcange, riel, mel,
	rientjes, dima, gregkh, rebecca, san, akpm, vesa.jaaskelainen

The Used Memory Meter (UMM) device tracks the level of memory utilization
and notifies subscribed processes when consumption crosses a specified
threshold, up or down. It can be used on embedded devices to implement
low-overhead reactions to memory usage, e.g. with libmemnotify or a
similar user-space component.
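
The "crossed threshold up or down" rule can be modelled in user space as
follows. This is a sketch that mirrors the logic of validate_observer() in
the patch; struct observer_model and observer_check() are hypothetical names
used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical user-space copy of the per-subscriber state. */
struct observer_model {
	unsigned long threshold;	/* pages; 0 means no subscription */
	bool active;			/* was usage above the threshold last time? */
	bool updated;			/* is a notification still pending? */
};

/* Returns true while a notification is pending for this subscriber. */
bool observer_check(struct observer_model *obs, unsigned long used_pages)
{
	bool active = obs->threshold && obs->threshold < used_pages;

	if (active != obs->active) {
		/* usage crossed the threshold, in either direction */
		obs->active = active;
		obs->updated = true;
	}

	return obs->updated;
}
```

The pending flag stays set until the notification is delivered (the driver
clears it in umm_poll), so a crossing is not lost if usage is re-evaluated
before user space has been woken up.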

Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
---
 drivers/misc/Kconfig  |   12 ++
 drivers/misc/Makefile |    1 +
 drivers/misc/umm.c    |  452 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/umm.h   |   42 +++++
 4 files changed, 507 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/umm.c
 create mode 100644 include/linux/umm.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index d593878..5d71960 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -499,6 +499,18 @@ config USB_SWITCH_FSA9480
 	  stereo and mono audio, video, microphone and UART data to use
 	  a common connector port.
 
+config USED_MEMORY_METER
+	tristate "Enables used memory meter pseudo-device"
+	default n
+	select MM_ALLOC_FREE_HOOK
+	help
+	  This option enables pseudo-device /dev/used_memory for tracking
+	  system memory utilization and updating state to subscribed clients
+	  when specified threshold reached.
+
+	  Say Y here if you want to support used memory monitor.
+	  If unsure, say N.
+
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index b26495a..eaec343 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -48,3 +48,4 @@ obj-y				+= lis3lv02d/
 obj-y				+= carma/
 obj-$(CONFIG_USB_SWITCH_FSA9480) += fsa9480.o
 obj-$(CONFIG_ALTERA_STAPL)	+=altera-stapl/
+obj-$(CONFIG_USED_MEMORY_METER)	+= umm.o
diff --git a/drivers/misc/umm.c b/drivers/misc/umm.c
new file mode 100644
index 0000000..a384be20
--- /dev/null
+++ b/drivers/misc/umm.c
@@ -0,0 +1,452 @@
+/*
+ * umm.c - system-wide Used Memory Meter pseudo-device implementation
+ *
+ * Copyright (C) 2011 Nokia Corporation.
+ *      Leonid Moiseichuk
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/jiffies.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/highmem.h>
+#include <linux/swap.h>
+#include <linux/list.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/spinlock_types.h>
+
+#include <linux/umm.h>
+
+
+
+/* subscriber information to be notified when level changed */
+struct observer {
+	/* list data to check from notify_memory_usage and wakeup user-space */
+	struct list_head list;
+	/* related file structure for open/close/read/write and poll */
+	struct file	*file;
+	/* threshold [pages] when we should trigger notification */
+	unsigned long	threshold;
+	/* did we cross the threshold on the last validation? */
+	bool		active;
+	/* set when a new notification is required */
+	bool		updated;
+};
+
+
+
+MODULE_AUTHOR("Leonid Moiseichuk (leonid.moiseichuk@nokia.com)");
+MODULE_DESCRIPTION("System used memory meter pseudo-device");
+MODULE_LICENSE("GPL v2");
+MODULE_VERSION("0.0.2");
+
+static int debug __read_mostly;
+module_param(debug, bool, 0);
+MODULE_PARM_DESC(debug, "More info about module parameters and operations");
+
+static int probe __read_mostly;
+module_param(probe, bool, 0);
+MODULE_PARM_DESC(probe, "Probe measurement overhead during loading");
+
+static char device_name[64] __read_mostly = UMM_DEVICE_NAME;
+module_param_string(device_name, device_name, sizeof(device_name), 0);
+MODULE_PARM_DESC(device_name, "Device name in /dev if need different");
+
+static unsigned update_period __read_mostly = UMM_UPDATE_PERIOD;
+module_param(update_period, uint, 0);
+MODULE_PARM_DESC(update_period, "Update interval period [ms]");
+
+static unsigned update_space __read_mostly;
+module_param(update_space, uint, 0);
+MODULE_PARM_DESC(update_space, "Clients granularity space in [kb], 0 - auto");
+
+/* Validated parameters in adequate units */
+static unsigned long update_period_jiffies __read_mostly;
+static unsigned      update_space_pages    __read_mostly;
+static mm_alloc_free_hook_t old_mm_hook    __read_mostly;
+
+/* Timestamp when last time memory usage was measured */
+static atomic_long_t last_time_jiffies __read_mostly =
+					ATOMIC_LONG_INIT(INITIAL_JIFFIES);
+/* Memory values which are used in measurements and notification */
+static unsigned long available_pages      __read_mostly;
+#ifdef CONFIG_SWAP
+static unsigned long available_swap_pages __read_mostly;
+#endif
+static atomic_long_t last_used_pages   __read_mostly = ATOMIC_LONG_INIT(0);
+static atomic_long_t last_nofity_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/* Subscribers in poll() call to be validated and notified */
+static atomic_t observer_counter = ATOMIC_INIT(0);
+static DEFINE_SPINLOCK(observer_lock);
+static LIST_HEAD(observer_list);
+static DECLARE_WAIT_QUEUE_HEAD(watcher_queue);
+
+
+
+static inline bool notification_required(unsigned long a, unsigned long b)
+{
+	return (a < b ? b - a : a - b) >= update_space_pages / 2;
+}
+
+static inline bool validate_observer(struct observer *obs, unsigned long used)
+{
+	/* evaluation of current state and compare to old one */
+	const bool active = (obs->threshold && obs->threshold < used);
+	/*
+	 * If we evaluated status just before and did not send update
+	 * yet to user-space we must preserve update flag.
+	 */
+	if (active != obs->active) {
+		obs->active  = active;
+		obs->updated = true;
+	}
+
+	return obs->updated;
+}
+
+static inline unsigned long get_memory_usage(void)
+{
+	/* calculate used pages by subtracting free memory */
+	unsigned long used = available_pages;
+
+	/* RAM part: free + slab rec + cached - shared */
+	used -= global_page_state(NR_FREE_PAGES);
+	used -= global_page_state(NR_SLAB_RECLAIMABLE);
+	used -= global_page_state(NR_FILE_PAGES);
+	used += global_page_state(NR_SHMEM);
+
+#ifdef CONFIG_SWAP
+	/* Swap if we have */
+	if (available_swap_pages) {
+		struct sysinfo si;
+		si_swapinfo(&si);
+		used -= si.freeswap;
+	}
+#endif
+
+	return used;
+}
+
+static inline void update_memory_usage(void)
+{
+	atomic_long_set(&last_used_pages, get_memory_usage());
+	atomic_long_set(&last_time_jiffies, jiffies + update_period_jiffies);
+}
+
+/* this code called from allocation hook = must be as fast as possible */
+static inline void notify_memory_usage(void)
+{
+	const unsigned long was = atomic_long_read(&last_nofity_pages);
+	const unsigned long now = atomic_long_read(&last_used_pages);
+
+	if (notification_required(was, now)) {
+		bool updated = false;
+		struct list_head *pos;
+
+		atomic_long_set(&last_nofity_pages, now);
+		spin_lock(&observer_lock);
+		list_for_each(pos, &observer_list) {
+			struct observer *obs = (struct observer *)pos;
+			if (validate_observer(obs, now)) {
+				updated = true;
+				/*
+				 * some watcher changed status
+				 * the rest of checks will be done in umm_poll
+				 */
+				break;
+			}
+		}
+		spin_unlock(&observer_lock);
+		if (updated) {
+			if (debug)
+				pr_info("UMM: wakeup polling tasks\n");
+			wake_up_all(&watcher_queue);
+		}
+	}
+}
+
+/* this method invoked from MM allocation hot path */
+static void mm_alloc_free_hook(int pages)
+{
+	const unsigned long last_measured =
+		(unsigned long)atomic_long_read(&last_time_jiffies);
+
+	if (abs(pages) >= update_space_pages ||
+		time_is_before_jiffies(last_measured)) {
+		update_memory_usage();
+		if (atomic_read(&observer_counter) > 0)
+			notify_memory_usage();
+	}
+
+	if (old_mm_hook)
+		old_mm_hook(pages);
+}
+
+
+
+static int umm_open(struct inode *inode, struct file *file)
+{
+	struct observer *obs;
+
+	obs = kmalloc(sizeof(*obs), GFP_KERNEL);
+	if (obs) {
+		/* object initialization */
+		memset(obs, 0, sizeof(*obs));
+		obs->file      = file;
+		file->private_data = obs;
+
+		/* place it into checking list */
+		spin_lock(&observer_lock);
+		list_add(&obs->list, &observer_list);
+		spin_unlock(&observer_lock);
+		atomic_inc(&observer_counter);
+
+		if (debug)
+			pr_info("UMM: 0x%p - observer %u created\n",
+				obs, atomic_read(&observer_counter));
+
+		return 0;
+	}
+
+	return -ENOMEM;
+}
+
+static int umm_release(struct inode *inode, struct file *file)
+{
+	struct observer *obs = (struct observer *)file->private_data;
+
+	if (obs) {
+		if (debug)
+			pr_info("UMM: 0x%p - observer released\n", obs);
+
+		/* remove from checking list */
+		atomic_dec(&observer_counter);
+		spin_lock(&observer_lock);
+		list_del(&obs->list);
+		spin_unlock(&observer_lock);
+
+		/* cleanup the memory */
+		file->private_data = NULL;
+		kfree(obs);
+	}
+
+	return 0;
+}
+
+static ssize_t umm_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	char tmp[128];
+	ssize_t retval;
+
+	retval = snprintf(tmp, sizeof(tmp), "%lu:%lu\n",
+			atomic_long_read(&last_used_pages), available_pages);
+	if (retval > count)
+		retval = count;
+	return copy_to_user(buf, tmp, retval) ? -EFAULT : retval;
+}
+
+static ssize_t umm_write(struct file *file, const char __user *buf,
+					size_t count, loff_t *offset)
+{
+	struct observer *obs = (struct observer *)file->private_data;
+
+	obs->updated = false;
+	if (kstrtoul_from_user(buf, count, 10, &obs->threshold) < 0) {
+		obs->threshold = 0;
+		obs->active = false;
+		return -EINVAL;
+	}
+	obs->active = (obs->threshold &&
+			obs->threshold < atomic_long_read(&last_used_pages));
+	if (debug)
+		pr_info("UMM: 0x%p - threshold set to %lu -> %d\n",
+					obs, obs->threshold, obs->active);
+
+	return (ssize_t)count;
+}
+
+static unsigned int umm_poll(struct file *file, poll_table *wait)
+{
+	struct observer *obs = (struct observer *)file->private_data;
+
+	if (NULL == obs || 0 == obs->threshold)
+		return 0;
+
+	poll_wait(file, &watcher_queue, wait);
+	if (validate_observer(obs, atomic_long_read(&last_used_pages))) {
+		if (debug)
+			pr_info("UMM: 0x%p - threshold %lu updated to %d\n",
+					obs, obs->threshold, obs->active);
+		obs->updated = false;
+		return POLLIN;
+	} else
+		return 0;
+}
+
+
+
+static const struct file_operations umm_fops = {
+	.llseek  = noop_llseek,
+	.open    = umm_open,
+	.release = umm_release,
+	.read    = umm_read,
+	.write   = umm_write,
+	.poll    = umm_poll,
+};
+
+static struct device *umm_device __read_mostly;
+static struct class  *umm_class  __read_mostly;
+static int            umm_major  __read_mostly = -1;
+
+
+static int __init umm_init(void)
+{
+	struct sysinfo si;
+	int error;
+
+	pr_info("UMM: Used Memory Meter loading to support /dev/%s\n",
+							device_name);
+
+	umm_major = register_chrdev(0, device_name, &umm_fops);
+	if (umm_major < 0) {
+		pr_err("UMM: unable to get major number for device %s\n",
+							device_name);
+		error = -EBUSY;
+		goto register_failed;
+	}
+
+	umm_class = class_create(THIS_MODULE, device_name);
+	if (IS_ERR(umm_class)) {
+		error = PTR_ERR(umm_class);
+		pr_err("UMM: unable to create class for device %s - %d\n",
+						device_name, error);
+		goto class_failed;
+	}
+
+	umm_device = device_create(
+			umm_class, NULL,
+			MKDEV(umm_major, 0),
+			NULL, device_name);
+	if (IS_ERR(umm_device)) {
+		error = PTR_ERR(umm_device);
+		pr_err("UMM: unable to create device %s - %d\n",
+						device_name, error);
+		goto device_failed;
+	}
+
+	update_period_jiffies = msecs_to_jiffies(update_period);
+	if (!update_period_jiffies)
+		update_period_jiffies = msecs_to_jiffies(UMM_UPDATE_PERIOD);
+
+	/* query amount of available ram and swap, mem_unit is PAGE_SIZE */
+	si_meminfo(&si);
+#ifdef CONFIG_SWAP
+	si_swapinfo(&si);
+	available_pages = si.totalram + si.totalswap;
+	available_swap_pages = si.totalswap;
+#else
+	available_pages = si.totalram;
+#endif
+	/* if autodetect then set granularity to 1/64 (~1.6%) of available memory */
+	if (update_space)
+		update_space_pages = update_space >> (PAGE_SHIFT - 10);
+	else
+		update_space_pages = available_pages >> 6;
+	if (!update_space_pages)
+		update_space_pages = UMM_UPDATE_SPACE >> (PAGE_SHIFT - 10);
+
+	update_memory_usage();
+	old_mm_hook = set_mm_alloc_free_hook(mm_alloc_free_hook);
+
+	if (debug) {
+		pr_info("UMM: /dev/%s got major %d\n", device_name, umm_major);
+		pr_info("UMM: update period set to %u ms or %lu jiffies\n",
+					update_period, update_period_jiffies);
+		pr_info("UMM: update space set to %u kb or %u pages\n",
+					update_space, update_space_pages);
+		pr_info("UMM: old mm alloc/free hook is 0x%p\n", old_mm_hook);
+		pr_info("UMM: now hook set to 0x%p\n", mm_alloc_free_hook);
+#ifndef CONFIG_SWAP
+		pr_info("UMM: %lu available pages found (only ram)\n",
+							available_pages);
+#else
+		pr_info("UMM: %lu available pages found (%lu ram + %lu swap)\n",
+				available_pages, si.totalram, si.totalswap);
+#endif
+		pr_info("UMM: %lu used pages, utilization %lu percents\n",
+					atomic_long_read(&last_used_pages),
+			(100 * atomic_long_read(&last_used_pages)) /
+							available_pages);
+		pr_info("UMM: overhead per client connection is %zu bytes\n",
+						sizeof(struct observer));
+	}
+
+	if (probe) {
+		unsigned long start;
+		unsigned long stop;
+		unsigned long time;
+		unsigned long counter = 0;
+
+		pr_info("UMM: probing measurements overhead for %u [ms] ...\n",
+							UMM_PROBE_PERIOD);
+		start = jiffies;
+		stop  = jiffies + msecs_to_jiffies(UMM_PROBE_PERIOD);
+		while (time_is_after_jiffies(stop)) {
+			update_memory_usage();
+			counter++;
+		}
+		time = jiffies_to_usecs(jiffies - start);
+		pr_info("UMM: %lu probes done in %lu us, %lu probes/us\n",
+					counter, time, counter / time);
+	}
+
+	return 0;
+
+device_failed:
+	class_destroy(umm_class);
+class_failed:
+	unregister_chrdev(umm_major, device_name);
+register_failed:
+	return error;
+}
+
+static void __exit umm_exit(void)
+{
+	mm_alloc_free_hook_t hook = set_mm_alloc_free_hook(old_mm_hook);
+
+	if (debug)
+		pr_info("UMM: old mm alloc/free hook 0x%p restored\n",
+							old_mm_hook);
+	if (mm_alloc_free_hook != hook)
+		pr_warning("UMM: restored 0x%p, expected 0x%p!\n",
+					hook, mm_alloc_free_hook);
+	if (umm_device)
+		device_del(umm_device);
+	if (umm_class)
+		class_destroy(umm_class);
+	if (umm_major >= 0)
+		unregister_chrdev(umm_major, device_name);
+
+	pr_info("UMM: Used Memory Meter unloaded, /dev/%s gone\n", device_name);
+}
+
+
+module_init(umm_init);
+module_exit(umm_exit);
diff --git a/include/linux/umm.h b/include/linux/umm.h
new file mode 100644
index 0000000..0bfc0b0
--- /dev/null
+++ b/include/linux/umm.h
@@ -0,0 +1,42 @@
+/*
+ * umm.h - system-wide Used Memory Meter definitions
+ *
+ * Copyright (C) 2011 Nokia Corporation.
+ *      Leonid Moiseichuk
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LINUX_UMM_H
+#define LINUX_UMM_H
+
+/*
+ * Pseudo-device name in /dev to subscribe or read data.
+ */
+#define UMM_DEVICE_NAME	"used_memory"
+
+/*
+ * How often [ms] usage information will be updated.
+ * It happened in alloc/free hook and needs to be tuned for your system.
+ */
+#define UMM_UPDATE_PERIOD	250
+
+/*
+ * Which minimal [kb] allocation change will produce notification for user-space
+ * to avoid too often jittering.
+ */
+#define UMM_UPDATE_SPACE	1024
+
+/*
+ * Probe period [ms] if it requested during module loading to clarify overhead.
+ */
+#define UMM_PROBE_PERIOD	100
+
+#endif
-- 
1.7.7.3



* Re: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module Leonid Moiseichuk
@ 2012-01-04 19:55   ` Greg KH
  2012-01-09  9:58     ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: Greg KH @ 2012-01-04 19:55 UTC
  To: Leonid Moiseichuk
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Wed, Jan 04, 2012 at 07:21:56PM +0200, Leonid Moiseichuk wrote:
> The Used Memory Meter (UMM) device tracks level of memory utilization
> and notifies subscribed processes when consumption crossed specified
> threshold up or down. It could be used on embedded devices to
> implementation of performance-cheap memory reacting by using
> e.g. libmemnotify or similar user-space component.
> 
> Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>

Note, I don't agree that this code is the correct thing to be doing
here, you'll have to get the buy-in from the mm developers on that, but
I do have some comments on the implementation:

> --- /dev/null
> +++ b/drivers/misc/umm.c
> @@ -0,0 +1,452 @@
> +/*
> + * umm.c - system-wide Used Memory Meter pseudo-device implementation
> + *
> + * Copyright (C) 2011 Nokia Corporation.
> + *      Leonid Moiseichuk
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/atomic.h>
> +#include <linux/jiffies.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/poll.h>
> +#include <linux/highmem.h>
> +#include <linux/swap.h>
> +#include <linux/list.h>
> +#include <linux/wait.h>
> +#include <linux/spinlock.h>
> +#include <linux/spinlock_types.h>
> +
> +#include <linux/umm.h>

Why do you need a header file at all?

> +/* subscriber information to be notified when level changed */
> +struct observer {
> +	/* list data to check from notify_memory_usage and wakeup user-space */
> +	struct list_head list;
> +	/* related file structure for open/close/read/write and poll */
> +	struct file	*file;
> +	/* threshold [pages] when we should trigger notification */
> +	unsigned long	threshold;
> +	/* did we crossed theshold on last validation? */
> +	bool		active;
> +	/* flag about new notification is required */
> +	bool		updated;
> +};
> +
> +
> +
> +MODULE_AUTHOR("Leonid Moiseichuk (leonid.moiseichuk@nokia.com)");
> +MODULE_DESCRIPTION("System used memory meter pseudo-device");
> +MODULE_LICENSE("GPL v2");
> +MODULE_VERSION("0.0.2");
> +
> +static int debug __read_mostly;
> +module_param(debug, bool, 0);
> +MODULE_PARM_DESC(debug, "More info about module parameters and operations");
> +
> +static int probe __read_mostly;
> +module_param(probe, bool, 0);
> +MODULE_PARM_DESC(probe, "Probe measurement overhead during loading");
> +
> +static char device_name[64] __read_mostly = UMM_DEVICE_NAME;
> +module_param_string(device_name, device_name, sizeof(device_name), 0);
> +MODULE_PARM_DESC(device_name, "Device name in /dev if need different");

This is pointless, right?

> +static struct device *umm_device __read_mostly;

Enough with the __read_mostly markings, they really aren't needed for
every single variable, right?  Especially for trivial stuff like this
one, and all of the module parameters.

> +static struct class  *umm_class  __read_mostly;

Just use a misc device, as you are only creating/needing one character
device, right?  That will make your init and destroy code a lot cleaner
and smaller.

> +	pr_info("UMM: Used Memory Meter loading to support /dev/%s\n",
> +							device_name);

Not needed.

> +
> +	umm_major = register_chrdev(0, device_name, &umm_fops);
> +	if (umm_major < 0) {
> +		pr_err("UMM: unable to get major number for device %s\n",
> +							device_name);
> +		error = -EBUSY;
> +		goto register_failed;
> +	}
> +
> +	umm_class = class_create(THIS_MODULE, device_name);
> +	if (IS_ERR(umm_class)) {
> +		error = PTR_ERR(umm_class);
> +		pr_err("UMM: unable to create class for device %s - %d\n",
> +						device_name, error);
> +		goto class_failed;
> +	}
> +
> +	umm_device = device_create(
> +			umm_class, NULL,
> +			MKDEV(umm_major, 0),
> +			NULL, device_name);
> +	if (IS_ERR(umm_device)) {
> +		error = PTR_ERR(umm_device);
> +		pr_err("UMM: unable to create device %s - %d\n",
> +						device_name, error);
> +		goto device_failed;
> +	}
> +
> +	update_period_jiffies = msecs_to_jiffies(update_period);
> +	if (!update_period_jiffies)
> +		update_period_jiffies = msecs_to_jiffies(UMM_UPDATE_PERIOD);
> +
> +	/* query amount of available ram and swap, mem_unit is PAGE_SIZE */
> +	si_meminfo(&si);
> +#ifdef CONFIG_SWAP
> +	si_swapinfo(&si);
> +	available_pages = si.totalram + si.totalswap;
> +	available_swap_pages = si.totalswap;
> +#else
> +	available_pages = si.totalram;
> +#endif
> +	/* if autodetect then set granularity to ~1.4% from available memory */
> +	if (update_space)
> +		update_space_pages = update_space >> (PAGE_SHIFT - 10);
> +	else
> +		update_space_pages = available_pages >> 6;
> +	if (!update_space_pages)
> +		update_space_pages = UMM_UPDATE_SPACE >> (PAGE_SHIFT - 10);
> +
> +	update_memory_usage();
> +	old_mm_hook = set_mm_alloc_free_hook(mm_alloc_free_hook);
> +
> +	if (debug) {
> +		pr_info("UMM: /dev/%s got major %d\n", device_name, umm_major);
> +		pr_info("UMM: update period set to %u ms or %lu jiffies\n",
> +					update_period, update_period_jiffies);
> +		pr_info("UMM: update space set to %u kb or %u pages\n",
> +					update_space, update_space_pages);
> +		pr_info("UMM: old mm alloc/free hook is 0x%p\n", old_mm_hook);
> +		pr_info("UMM: now hook set to 0x%p\n", mm_alloc_free_hook);
> +#ifdef CONFIG_SWAP
> +		pr_info("UMM: %lu available pages found (only ram)\n",
> +							available_pages);
> +#else
> +		pr_info("UMM: %lu available pages found (%lu ram + %lu swap)\n",
> +				available_pages, si.totalram, si.totalswap);
> +#endif
> +		pr_info("UMM: %lu used pages, utilization %lu percents\n",
> +					atomic_long_read(&last_used_pages),
> +			(100 * atomic_long_read(&last_used_pages)) /
> +							available_pages);
> +		pr_info("UMM: overhead per client connection is %lu bytes\n",
> +						sizeof(struct observer));

Please use the dev_dbg() macro instead, that removes the debug flag
here, and properly identifies your code/device in the kernel log.  Same
goes for other pr_* usages in the file, just use the appropriate dev_*
call instead.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-04 17:21 [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Leonid Moiseichuk
                   ` (2 preceding siblings ...)
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module Leonid Moiseichuk
@ 2012-01-04 19:56 ` Greg KH
  2012-01-04 20:17   ` Rik van Riel
  2012-01-05 11:47   ` leonid.moiseichuk
  3 siblings, 2 replies; 38+ messages in thread
From: Greg KH @ 2012-01-04 19:56 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Wed, Jan 04, 2012 at 07:21:53PM +0200, Leonid Moiseichuk wrote:
> The main idea of Used Memory Meter (UMM) is to provide low-cost interface
> for user-space to notify about memory consumption using similar approach /proc/meminfo
> does but focusing only on "modified" pages which cannot be fogotten.
> 
> The calculation formula in terms of meminfo looks the following:
>   UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
>                                                (SwapTotal - SwapFree)
> It reflects well amount of system memory used in applications in heaps and shared pages.
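
To make the formula concrete, here is a minimal user-space sketch that evaluates it over /proc/meminfo-style text. The helper names (meminfo_kb, used_memory_kb) are hypothetical and not part of the patch; missing keys are assumed to count as 0, which matches a kernel built without swap.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "Key:   <value> kB" in meminfo-style text.
 * Returns the value in kB, or 0 if the key is not present.
 */
static unsigned long meminfo_kb(const char *text, const char *key)
{
	size_t klen;
	const char *p = text;
	unsigned long val;

	assert(text != NULL && key != NULL);
	klen = strlen(key);

	while (p != NULL && *p != '\0') {
		/* Keys are matched at the start of a line, so "Cached"
		 * does not match the "SwapCached" line. */
		if (strncmp(p, key, klen) == 0 && p[klen] == ':' &&
		    sscanf(p + klen + 1, "%lu", &val) == 1)
			return val;
		p = strchr(p, '\n');
		if (p != NULL)
			p++;
	}
	return 0;
}

/* UsedMemory as defined in the cover letter, in kB. */
static unsigned long used_memory_kb(const char *text)
{
	unsigned long ram_used = meminfo_kb(text, "MemTotal")
			       - meminfo_kb(text, "MemFree")
			       - meminfo_kb(text, "Buffers")
			       - meminfo_kb(text, "Cached")
			       - meminfo_kb(text, "SwapCached");
	unsigned long swap_used = meminfo_kb(text, "SwapTotal")
				- meminfo_kb(text, "SwapFree");

	return ram_used + swap_used;
}
```

Feeding it the contents of /proc/meminfo read into a buffer gives the same number the module is meant to track, only in kB rather than pages.
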
> 
> Previously (n770..n900) we had lowmem.c [1] which used LSM and did a lot other things,
> n9 implementation based on memcg [2] which has own problems, so the proposed variant
> I hope is the best one for n9:
> - Keeps connections from user space
> - When amount of modified pages reaches crossed pointed value for particular connection
>   makes POLLIN and allow user-space app to read it and react
> - Economic as much as possible, so currently its operates if allocation higher than 487
>   pages or last check happened 250 ms before
> Of course if no allocation happened then no activities performed, use-time
> must be not affected.
> 
> Testing results:
> - Checkpatch produced 0 warning
> - Sparse does not produce warnings
> - One check costs ~20 us or less (could be checked with probe=1 insmod)
> - One connection costs 20 bytes in kernel-space  (see observer structure) for 32-bit variant
> - For 10K connections poll update in requested in ~10ms, but for practically device expected
>   to will have about 10 connections (like n9 has now).
> 
> Known weak points which I do not know how to fix but will if you have a brillian idea:
> - Having hook in MM is nasty but MM/shrinker cannot be used there and LSM even worse idea
> - If I made 
> 	$cat /dev/used_memory
>   it is produced lines in non-stop mode. Adding position check in umm_read seems doesn not help,
>   so "head -1 /dev/used_memory" should be used if you need to quick check
> - Format of output is USED_PAGES:AVAILABLE_PAGES, primitive but enough for tasks module does
> 
> Tested on ARM, x86-32 and x86-64 with and without CONFIG_SWAP. Seems works in all combinations.
> Sorry for wide distributions but list of names were produced by scripts/get_maintainer.pl

How does this compare with the lowmemorykiller.c driver from the android
developers that is currently in the linux-next tree?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-04 19:56 ` [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Greg KH
@ 2012-01-04 20:17   ` Rik van Riel
  2012-01-04 20:42     ` Pekka Enberg
  2012-01-05 12:22     ` leonid.moiseichuk
  2012-01-05 11:47   ` leonid.moiseichuk
  1 sibling, 2 replies; 38+ messages in thread
From: Rik van Riel @ 2012-01-04 20:17 UTC (permalink / raw)
  To: Greg KH
  Cc: Leonid Moiseichuk, linux-mm, linux-kernel, cesarb,
	kamezawa.hiroyu, emunson, penberg, aarcange, mel, rientjes, dima,
	rebecca, san, akpm, vesa.jaaskelainen, Minchan Kim,
	KOSAKI Motohiro

On 01/04/2012 02:56 PM, Greg KH wrote:

> How does this compare with the lowmemorykiller.c driver from the android
> developers that is currently in the linux-next tree?

Also, the low memory notification that Kosaki-san has worked on,
and which Minchan is looking at now.

We seem to have many mechanisms under development, all aimed at
similar goals. I believe it would be good to agree on one mechanism
that could solve multiple of these goals at once, instead of sticking
a handful of different partial solutions in the kernel...

Exactly what is the problem you are trying to solve?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release Leonid Moiseichuk
@ 2012-01-04 20:40   ` Pekka Enberg
  2012-01-05  6:59   ` KAMEZAWA Hiroyuki
  2012-01-05 15:22   ` Mel Gorman
  2 siblings, 0 replies; 38+ messages in thread
From: Pekka Enberg @ 2012-01-04 20:40 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	aarcange, riel, mel, rientjes, dima, gregkh, rebecca, san, akpm,
	vesa.jaaskelainen

On Wed, Jan 4, 2012 at 7:21 PM, Leonid Moiseichuk
<leonid.moiseichuk@nokia.com> wrote:
> That is required by Used Memory Meter (UMM) pseudo-device
> to track memory utilization in system. It is expected that
> hook MUST be very light to prevent performance impact
> on the hot allocation path. Accuracy of number managed pages
> does not expected to be absolute but fact of allocation or
> deallocation must be registered.
>
> Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
> ---
>  include/linux/mm.h |   15 +++++++++++++++
>  mm/Kconfig         |    8 ++++++++
>  mm/page_alloc.c    |   31 +++++++++++++++++++++++++++++++
>  3 files changed, 54 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3dc3a8c..d133f73 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1618,6 +1618,21 @@ extern int soft_offline_page(struct page *page, int flags);
>
>  extern void dump_page(struct page *page);
>
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +/*
> + * Hook function type which called when some pages allocated or released.
> + * Value of nr_pages is positive for post-allocation calls and negative
> + * after free.
> + */
> +typedef void (*mm_alloc_free_hook_t)(int nr_pages);
> +
> +/*
> + * Setups specified hook function for tracking pages allocation.
> + * Returns value of old hook to organize chains of calls if necessary.
> + */
> +mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook);
> +#endif
> +
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>  extern void clear_huge_page(struct page *page,
>                            unsigned long addr,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..2aaa1e9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -373,3 +373,11 @@ config CLEANCACHE
>          in a negligible performance hit.
>
>          If unsure, say Y to enable cleancache
> +
> +config MM_ALLOC_FREE_HOOK
> +       bool "Enable callback support for pages allocation and releasing"
> +       default n
> +       help
> +         Required for some features like used memory meter.
> +         If unsure, say N to disable alloc/free hook.
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..9307800 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -236,6 +236,30 @@ static void set_pageblock_migratetype(struct page *page, int migratetype)
>
>  bool oom_killer_disabled __read_mostly;
>
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +static atomic_long_t alloc_free_hook __read_mostly = ATOMIC_LONG_INIT(0);
> +
> +mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook)
> +{
> +       const mm_alloc_free_hook_t old_hook =
> +               (mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
> +
> +       atomic_long_set(&alloc_free_hook, (long)hook);
> +       pr_info("MM alloc/free hook set to 0x%p (was 0x%p)\n", hook, old_hook);
> +
> +       return old_hook;
> +}
> +EXPORT_SYMBOL(set_mm_alloc_free_hook);
> +
> +static inline void call_alloc_free_hook(int pages)
> +{
> +       const mm_alloc_free_hook_t hook =
> +               (mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
> +       if (hook)
> +               hook(pages);
> +}
> +#endif
> +
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
> @@ -2298,6 +2322,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>        put_mems_allowed();
>
>        trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +       call_alloc_free_hook(1 << order);
> +#endif
> +
>        return page;
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2345,6 +2373,9 @@ void __free_pages(struct page *page, unsigned int order)
>                        free_hot_cold_page(page, 0);
>                else
>                        __free_pages_ok(page, order);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +               call_alloc_free_hook(-(1 << order));
> +#endif
>        }
>  }

No, we definitely don't want to allow random modules to insert hooks
to the page allocator:

  Nacked-by: Pekka Enberg <penberg@kernel.org>

Can't we introduce some super-lightweight lowmem_{alloc|free}_hook()
hooks that live in mm/lowmem.c and call those directly? If you need to
support different ABIs for lowmem notifier, N9, and Android, you could
make that observer code more generic, no? The swaphook people might be
interested in that as well.

                           Pekka

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-04 20:17   ` Rik van Riel
@ 2012-01-04 20:42     ` Pekka Enberg
  2012-01-05 23:01       ` David Rientjes
  2012-01-05 12:22     ` leonid.moiseichuk
  1 sibling, 1 reply; 38+ messages in thread
From: Pekka Enberg @ 2012-01-04 20:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Greg KH, Leonid Moiseichuk, linux-mm, linux-kernel, cesarb,
	kamezawa.hiroyu, emunson, aarcange, mel, rientjes, dima, rebecca,
	san, akpm, vesa.jaaskelainen, Minchan Kim, KOSAKI Motohiro

On Wed, Jan 4, 2012 at 10:17 PM, Rik van Riel <riel@redhat.com> wrote:
> Also, the low memory notification that Kosaki-san has worked on,
> and which Minchan is looking at now.
>
> We seem to have many mechanisms under development, all aimed at
> similar goals. I believe it would be good to agree on one mechanism
> that could solve multiple of these goals at once, instead of sticking
> a handful of different partial solutions in the kernel...

And even if people want to support multiple ABIs and fight it out to
see which one wins, we should factor out the generic parts and put
them under mm/*.c and not hide them in random modules.

                        Pekka

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release Leonid Moiseichuk
  2012-01-04 20:40   ` Pekka Enberg
@ 2012-01-05  6:59   ` KAMEZAWA Hiroyuki
  2012-01-05 11:26     ` leonid.moiseichuk
  2012-01-05 15:22   ` Mel Gorman
  2 siblings, 1 reply; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-05  6:59 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: linux-mm, linux-kernel, cesarb, emunson, penberg, aarcange, riel,
	mel, rientjes, dima, gregkh, rebecca, san, akpm,
	vesa.jaaskelainen

On Wed,  4 Jan 2012 19:21:55 +0200
Leonid Moiseichuk <leonid.moiseichuk@nokia.com> wrote:

> That is required by Used Memory Meter (UMM) pseudo-device
> to track memory utilization in system. It is expected that
> hook MUST be very light to prevent performance impact
> on the hot allocation path. Accuracy of number managed pages
> does not expected to be absolute but fact of allocation or
> deallocation must be registered.
> 
> Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>

I never like arbitrary hooks to alloc_pages().
Could you find another way?

Hmm. memcg uses per-cpu counters to count alloc/free events and
triggers a threshold check once per 128 events on a cpu.
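
A minimal user-space sketch of that batching idea, not the memcg code itself: each CPU counts events locally and only every 128th event on that CPU takes the slow path. The names (cpu_events, account_event, check_thresholds) and the fixed CPU count are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SIM_CPUS		4	/* hypothetical: number of simulated CPUs */
#define EVENTS_PER_CHECK	128	/* batch size mentioned above */

struct cpu_events {
	unsigned int pending;	/* alloc/free events since the last check */
};

static struct cpu_events cpu_events[NR_SIM_CPUS];
static unsigned long threshold_checks;	/* how often the slow path ran */

/* Slow path: stand-in for comparing usage against registered thresholds. */
static void check_thresholds(void)
{
	threshold_checks++;
}

/*
 * Fast path: called for every alloc/free event on @cpu.
 * Returns true when this event triggered a threshold check.
 */
static bool account_event(unsigned int cpu)
{
	struct cpu_events *ev;

	assert(cpu < NR_SIM_CPUS);
	ev = &cpu_events[cpu];

	if (++ev->pending < EVENTS_PER_CHECK)
		return false;

	ev->pending = 0;
	check_thresholds();
	return true;
}
```

The point of the design is that the common case touches only a CPU-local counter, so the cost of the threshold comparison is paid once per batch instead of on every allocation.
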


Thanks,
-Kame


> ---
>  include/linux/mm.h |   15 +++++++++++++++
>  mm/Kconfig         |    8 ++++++++
>  mm/page_alloc.c    |   31 +++++++++++++++++++++++++++++++
>  3 files changed, 54 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3dc3a8c..d133f73 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1618,6 +1618,21 @@ extern int soft_offline_page(struct page *page, int flags);
>  
>  extern void dump_page(struct page *page);
>  
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +/*
> + * Hook function type which called when some pages allocated or released.
> + * Value of nr_pages is positive for post-allocation calls and negative
> + * after free.
> + */
> +typedef void (*mm_alloc_free_hook_t)(int nr_pages);
> +
> +/*
> + * Setups specified hook function for tracking pages allocation.
> + * Returns value of old hook to organize chains of calls if necessary.
> + */
> +mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook);
> +#endif
> +
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>  extern void clear_huge_page(struct page *page,
>  			    unsigned long addr,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..2aaa1e9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -373,3 +373,11 @@ config CLEANCACHE
>  	  in a negligible performance hit.
>  
>  	  If unsure, say Y to enable cleancache
> +
> +config MM_ALLOC_FREE_HOOK
> +	bool "Enable callback support for pages allocation and releasing"
> +	default n
> +	help
> +	  Required for some features like used memory meter.
> +	  If unsure, say N to disable alloc/free hook.
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..9307800 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -236,6 +236,30 @@ static void set_pageblock_migratetype(struct page *page, int migratetype)
>  
>  bool oom_killer_disabled __read_mostly;
>  
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +static atomic_long_t alloc_free_hook __read_mostly = ATOMIC_LONG_INIT(0);
> +
> +mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook)
> +{
> +	const mm_alloc_free_hook_t old_hook =
> +		(mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
> +
> +	atomic_long_set(&alloc_free_hook, (long)hook);
> +	pr_info("MM alloc/free hook set to 0x%p (was 0x%p)\n", hook, old_hook);
> +
> +	return old_hook;
> +}
> +EXPORT_SYMBOL(set_mm_alloc_free_hook);
> +
> +static inline void call_alloc_free_hook(int pages)
> +{
> +	const mm_alloc_free_hook_t hook =
> +		(mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
> +	if (hook)
> +		hook(pages);
> +}
> +#endif
> +
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
> @@ -2298,6 +2322,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	put_mems_allowed();
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +	call_alloc_free_hook(1 << order);
> +#endif
> +
>  	return page;
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2345,6 +2373,9 @@ void __free_pages(struct page *page, unsigned int order)
>  			free_hot_cold_page(page, 0);
>  		else
>  			__free_pages_ok(page, order);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +		call_alloc_free_hook(-(1 << order));
> +#endif
>  	}
>  }
>  
> -- 
> 1.7.7.3
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-05  6:59   ` KAMEZAWA Hiroyuki
@ 2012-01-05 11:26     ` leonid.moiseichuk
  2012-01-05 12:49       ` Pekka Enberg
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-05 11:26 UTC (permalink / raw)
  To: kamezawa.hiroyu
  Cc: linux-mm, linux-kernel, cesarb, emunson, penberg, aarcange, riel,
	mel, rientjes, dima, gregkh, rebecca, san, akpm,
	vesa.jaaskelainen

Hi,

I agree that hooking alloc_pages is an ugly way. The alternatives I see:

- Shrinkers (as e.g. the Android OOM driver uses). But shrink_slab is called only from try_to_free_pages, i.e. only when we are on the slow reclaim path of a memory allocation, so it cannot be used for e.g. tracking 75% memory usage, or for notifying user space that we are OK again when pages are released. In terms of ease of use, though, it would be the best approach.

- memcg-like changes such as mem_cgroup_newpage_charge/uncharge_page, but without the blocking decision-making logic. That seems to me to require more changes. The threshold in memcg is currently set to 128 pages per CPU, which is quite frequent for level-tracking needs.

- Tracking the situation with a timer? Probably not, because it will impact battery life.

With Best Wishes,
Leonid


-----Original Message-----
From: ext KAMEZAWA Hiroyuki [mailto:kamezawa.hiroyu@jp.fujitsu.com] 
Sent: 05 January, 2012 09:00
To: Moiseichuk Leonid (Nokia-MP/Helsinki)
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; cesarb@cesarb.net; emunson@mgebm.net; penberg@kernel.org; aarcange@redhat.com; riel@redhat.com; mel@csn.ul.ie; rientjes@google.com; dima@android.com; gregkh@suse.de; rebecca@android.com; san@google.com; akpm@linux-foundation.org; Jaaskelainen Vesa (Nokia-MP/Helsinki)
Subject: Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release

On Wed,  4 Jan 2012 19:21:55 +0200
Leonid Moiseichuk <leonid.moiseichuk@nokia.com> wrote:

> That is required by Used Memory Meter (UMM) pseudo-device to track 
> memory utilization in system. It is expected that hook MUST be very 
> light to prevent performance impact on the hot allocation path. 
> Accuracy of number managed pages does not expected to be absolute but 
> fact of allocation or deallocation must be registered.
> 
> Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>

I never like arbitrary hooks to alloc_pages().
Could you find another way?

Hmm. memcg uses per-cpu counters to count alloc/free events and triggers a threshold check once per 128 events on a cpu.


Thanks,
-Kame


> ---
>  include/linux/mm.h |   15 +++++++++++++++
>  mm/Kconfig         |    8 ++++++++
>  mm/page_alloc.c    |   31 +++++++++++++++++++++++++++++++
>  3 files changed, 54 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3dc3a8c..d133f73 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1618,6 +1618,21 @@ extern int soft_offline_page(struct page *page, int flags);
>  
>  extern void dump_page(struct page *page);
>  
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +/*
> + * Hook function type which called when some pages allocated or released.
> + * Value of nr_pages is positive for post-allocation calls and negative
> + * after free.
> + */
> +typedef void (*mm_alloc_free_hook_t)(int nr_pages);
> +
> +/*
> + * Setups specified hook function for tracking pages allocation.
> + * Returns value of old hook to organize chains of calls if necessary.
> + */
> +mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook);
> +#endif
> +
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>  extern void clear_huge_page(struct page *page,
>  			    unsigned long addr,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..2aaa1e9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -373,3 +373,11 @@ config CLEANCACHE
>  	  in a negligible performance hit.
>  
>  	  If unsure, say Y to enable cleancache
> +
> +config MM_ALLOC_FREE_HOOK
> +	bool "Enable callback support for pages allocation and releasing"
> +	default n
> +	help
> +	  Required for some features like used memory meter.
> +	  If unsure, say N to disable alloc/free hook.
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..9307800 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -236,6 +236,30 @@ static void set_pageblock_migratetype(struct page *page, int migratetype)
>  
>  bool oom_killer_disabled __read_mostly;
>  
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +static atomic_long_t alloc_free_hook __read_mostly = ATOMIC_LONG_INIT(0);
> +
> +mm_alloc_free_hook_t set_mm_alloc_free_hook(mm_alloc_free_hook_t hook)
> +{
> +	const mm_alloc_free_hook_t old_hook =
> +		(mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
> +
> +	atomic_long_set(&alloc_free_hook, (long)hook);
> +	pr_info("MM alloc/free hook set to 0x%p (was 0x%p)\n", hook, old_hook);
> +
> +	return old_hook;
> +}
> +EXPORT_SYMBOL(set_mm_alloc_free_hook);
> +
> +static inline void call_alloc_free_hook(int pages)
> +{
> +	const mm_alloc_free_hook_t hook =
> +		(mm_alloc_free_hook_t)atomic_long_read(&alloc_free_hook);
> +	if (hook)
> +		hook(pages);
> +}
> +#endif
> +
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
> @@ -2298,6 +2322,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	put_mems_allowed();
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +	call_alloc_free_hook(1 << order);
> +#endif
> +
>  	return page;
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2345,6 +2373,9 @@ void __free_pages(struct page *page, unsigned int order)
>  			free_hot_cold_page(page, 0);
>  		else
>  			__free_pages_ok(page, order);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +		call_alloc_free_hook(-(1 << order));
> +#endif
>  	}
>  }
>  
> --
> 1.7.7.3
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-04 19:56 ` [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Greg KH
  2012-01-04 20:17   ` Rik van Riel
@ 2012-01-05 11:47   ` leonid.moiseichuk
  2012-01-05 12:40     ` Pekka Enberg
  2012-01-06  0:26     ` KOSAKI Motohiro
  1 sibling, 2 replies; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-05 11:47 UTC (permalink / raw)
  To: gregkh
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

Hi,

Android OOM (AOOM) is a different thing. Briefly, Android OOM is a safety belt, whereas I am trying to introduce a look-ahead radar to stop before hitting the wall.

As I understand AOOM, it waits until the situation reaches bad conditions that require memory reclaiming, then selects an application according to free memory and oom_adj level and kills it.
So no intermediate levels can be checked (e.g. 75% usage), nothing can be done in user space to prevent the killing, and there is no notification when memory becomes OK again.

What I am trying to do is to deliver a notification to any application that memory is becoming low, so it can do something about it, such as stop processing data, close unused pages, or correctly shut down applications and daemons.
Applications might need to install several notification levels, so the reaction can be adjusted to the current utilization level for each application, not globally.
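
As a rough user-space sketch of the per-application levels: the code below parses one line in the USED_PAGES:AVAILABLE_PAGES format described in the cover letter and maps the utilization to the highest level reached. The helper names (umm_parse_line, umm_percent, umm_level) and the example levels are assumptions for illustration, not part of the module's interface.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one "USED_PAGES:AVAILABLE_PAGES" line.
 * Returns 0 on success, -1 on malformed input or zero available pages.
 */
static int umm_parse_line(const char *line, unsigned long *used,
			  unsigned long *avail)
{
	if (sscanf(line, "%lu:%lu", used, avail) != 2 || *avail == 0)
		return -1;
	return 0;
}

/* Utilization in percent, rounded down. */
static unsigned int umm_percent(unsigned long used, unsigned long avail)
{
	assert(avail != 0);
	return (unsigned int)((100ULL * used) / avail);
}

/*
 * Return the index of the highest level reached, or -1 if none.
 * @levels holds percentages sorted in ascending order.
 */
static int umm_level(unsigned int percent, const unsigned int *levels,
		     int nlevels)
{
	int i, hit = -1;

	for (i = 0; i < nlevels; i++)
		if (percent >= levels[i])
			hit = i;
	return hit;
}
```

An application could, for example, keep levels of 50, 75 and 90 percent and drop caches at the first, stop background processing at the second and shut down cleanly at the third.
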

Rik van Riel has pointed to Kosaki-san's low memory notification. I know about mem_notify, but according to Anton Vorontsov's statement [1] it has been dead since 2008, so it is really good news for me that it is not.
I need to re-investigate it.

With Best Wishes,
Leonid

[1] http://permalink.gmane.org/gmane.linux.kernel.mm/71626

-----Original Message-----
From: ext Greg KH [mailto:gregkh@suse.de] 
Sent: 04 January, 2012 21:56
To: Moiseichuk Leonid (Nokia-MP/Helsinki)
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; cesarb@cesarb.net; kamezawa.hiroyu@jp.fujitsu.com; emunson@mgebm.net; penberg@kernel.org; aarcange@redhat.com; riel@redhat.com; mel@csn.ul.ie; rientjes@google.com; dima@android.com; rebecca@android.com; san@google.com; akpm@linux-foundation.org; Jaaskelainen Vesa (Nokia-MP/Helsinki)
Subject: Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM

On Wed, Jan 04, 2012 at 07:21:53PM +0200, Leonid Moiseichuk wrote:
> The main idea of Used Memory Meter (UMM) is to provide low-cost interface
> for user-space to notify about memory consumption using similar approach /proc/meminfo
> does but focusing only on "modified" pages which cannot be fogotten.
> 
> The calculation formula in terms of meminfo looks the following:
>   UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
>                                                (SwapTotal - SwapFree)
> It reflects well amount of system memory used in applications in heaps and shared pages.
> 
> Previously (n770..n900) we had lowmem.c [1] which used LSM and did a lot other things,
> n9 implementation based on memcg [2] which has own problems, so the proposed variant
> I hope is the best one for n9:
> - Keeps connections from user space
> - When amount of modified pages reaches crossed pointed value for particular connection
>   makes POLLIN and allow user-space app to read it and react
> - Economic as much as possible, so currently its operates if allocation higher than 487
>   pages or last check happened 250 ms before
> Of course if no allocation happened then no activities performed, use-time
> must be not affected.
> 
> Testing results:
> - Checkpatch produced 0 warning
> - Sparse does not produce warnings
> - One check costs ~20 us or less (could be checked with probe=1 insmod)
> - One connection costs 20 bytes in kernel-space  (see observer structure) for 32-bit variant
> - For 10K connections poll update in requested in ~10ms, but for practically device expected
>   to will have about 10 connections (like n9 has now).
> 
> Known weak points which I do not know how to fix but will if you have a brillian idea:
> - Having hook in MM is nasty but MM/shrinker cannot be used there and LSM even worse idea
> - If I made 
> 	$cat /dev/used_memory
>   it is produced lines in non-stop mode. Adding position check in umm_read seems doesn not help,
>   so "head -1 /dev/used_memory" should be used if you need to quick check
> - Format of output is USED_PAGES:AVAILABLE_PAGES, primitive but enough for tasks module does
> 
> Tested on ARM, x86-32 and x86-64 with and without CONFIG_SWAP. Seems works in all combinations.
> Sorry for wide distributions but list of names were produced by scripts/get_maintainer.pl

How does this compare with the lowmemorykiller.c driver from the android developers that is currently in the linux-next tree?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-04 20:17   ` Rik van Riel
  2012-01-04 20:42     ` Pekka Enberg
@ 2012-01-05 12:22     ` leonid.moiseichuk
  1 sibling, 0 replies; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-05 12:22 UTC (permalink / raw)
  To: riel, gregkh
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen, minchan, kosaki.motohiro


Hi,

We (small and embedded) need some mechanism to notify user space that memory consumption is getting high, so it can react properly and keep the device fast and stable: close big graphics, shut down something not in use, stop processing data, etc. If we are not able to do it at the right moment, the OOM killer starts to work and applications just disappear, which in all cases is not nice for the user. So the proposed code is not a replacement for the OOM killer at all but, yes, it could be seen as part of a "cooperative OOM".

Since the n900 we have tried to use memcg and notifications. Unfortunately memcg is not well suited to handling processes because of how memory is accounted: taking caches into account produces false-positive notifications, plus there are stability issues, which could be a result of the outdated 2.6.32 kernel, e.g. unnecessary swapping overhead when a big process is moved from one group to another, or even device hangs.

With Best Wishes,
Leonid


-----Original Message-----
From: ext Rik van Riel [mailto:riel@redhat.com] 
Sent: 04 January, 2012 22:18
To: Greg KH
Cc: Moiseichuk Leonid (Nokia-MP/Helsinki); linux-mm@kvack.org; linux-kernel@vger.kernel.org; cesarb@cesarb.net; kamezawa.hiroyu@jp.fujitsu.com; emunson@mgebm.net; penberg@kernel.org; aarcange@redhat.com; mel@csn.ul.ie; rientjes@google.com; dima@android.com; rebecca@android.com; san@google.com; akpm@linux-foundation.org; Jaaskelainen Vesa (Nokia-MP/Helsinki); Minchan Kim; KOSAKI Motohiro
Subject: Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM

On 01/04/2012 02:56 PM, Greg KH wrote:

> How does this compare with the lowmemorykiller.c driver from the 
> android developers that is currently in the linux-next tree?

Also, the low memory notification that Kosaki-san has worked on, and which Minchan is looking at now.

We seem to have many mechanisms under development, all aimed at similar goals. I believe it would be good to agree on one mechanism that could solve multiple of these goals at once, instead of sticking a handful of different partial solutions in the kernel...

Exactly what is the problem you are trying to solve?

--
All rights reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 11:47   ` leonid.moiseichuk
@ 2012-01-05 12:40     ` Pekka Enberg
  2012-01-05 13:02       ` leonid.moiseichuk
  2012-01-06  0:26     ` KOSAKI Motohiro
  1 sibling, 1 reply; 38+ messages in thread
From: Pekka Enberg @ 2012-01-05 12:40 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Thu, Jan 5, 2012 at 1:47 PM,  <leonid.moiseichuk@nokia.com> wrote:
> As I understand AOOM it wait until situation is reached bad conditions which
> required memory reclaiming, selects application according to free memory and
> oom_adj level and kills it.  So no intermediate levels could be checked (e.g.
> 75% usage),  nothing could be done in user-space to prevent killing, no
> notification for case when memory becomes OK.
>
> What I try to do is to get notification in any application that memory
> becomes low, and do something about it like stop processing data, close
> unused pages or correctly shuts applications, daemons.  Application(s) might
> have necessity to install several notification levels, so reaction could be
> adjusted based on current utilization level per each application, not
> globally.

Sure. However, from VM point of view, both have the exact same
functionality: detect when we reach low memory condition (for some
configurable threshold) and notify userspace or kernel subsystem about
it.

That's the part I'd like to see implemented in mm/notify.c or similar.
I really don't care what Android or any other folks use it for exactly
as long as the generic code is light-weight, clean, and we can
reasonably assume that distros can actually enable it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-05 11:26     ` leonid.moiseichuk
@ 2012-01-05 12:49       ` Pekka Enberg
  2012-01-05 15:05         ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Pekka Enberg @ 2012-01-05 12:49 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: kamezawa.hiroyu, linux-mm, linux-kernel, cesarb, emunson,
	aarcange, riel, mel, rientjes, dima, gregkh, rebecca, san, akpm,
	vesa.jaaskelainen

On Thu, Jan 5, 2012 at 1:26 PM,  <leonid.moiseichuk@nokia.com> wrote:
> I agree that hooking alloc_pages is ugly way. So alternatives I see:
>
> - shrinkers (as e.g. Android OOM used) but shrink_slab called only from
> try_to_free_pages only if we are on slow reclaim path on memory allocation,
> so it cannot be used for e.g. 75% memory tracking or when pages released to
> notify user space that we are OK. But according to easy to use it will be the
> best approach.
>
> - memcg-kind of changes like mem_cgroup_newpage_charge/uncharge_page but
> without blocking decision making logic. Seems to me more changes. Threshold
> currently in memcg set 128 pages per CPU, that is quite often for level
> tracking needs.
>
> - tracking situation using timer? Maybe not due to will impact battery.

Can we hook into mm/vmscan.c and mm/page-writeback.c for this?

                                Pekka

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 12:40     ` Pekka Enberg
@ 2012-01-05 13:02       ` leonid.moiseichuk
  2012-01-05 14:57         ` Greg KH
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-05 13:02 UTC (permalink / raw)
  To: penberg
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

Well, mm/notify.c seems a bit global to me. As a first step I am handling the inputs from Greg and trying to find a less destructive approach to allocation tracking than hooking page_alloc.
The issue is that I know my own problem quite well, but other people who need memory tracking have their own requirements: how to account memory, how often to notify and with which granularity,
how many clients there could be, and so on. If I get some inputs I will be happy to implement them.

With Best Wishes,
Leonid


-----Original Message-----
From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg
Sent: 05 January, 2012 14:41
To: Moiseichuk Leonid (Nokia-MP/Helsinki)
Cc: gregkh@suse.de; linux-mm@kvack.org; linux-kernel@vger.kernel.org; cesarb@cesarb.net; kamezawa.hiroyu@jp.fujitsu.com; emunson@mgebm.net; aarcange@redhat.com; riel@redhat.com; mel@csn.ul.ie; rientjes@google.com; dima@android.com; rebecca@android.com; san@google.com; akpm@linux-foundation.org; Jaaskelainen Vesa (Nokia-MP/Helsinki)
Subject: Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM

On Thu, Jan 5, 2012 at 1:47 PM,  <leonid.moiseichuk@nokia.com> wrote:
> As I understand AOOM it wait until situation is reached bad conditions 
> which required memory reclaiming, selects application according to 
> free memory and oom_adj level and kills it.  So no intermediate levels could be checked (e.g.
> 75% usage),  nothing could be done in user-space to prevent killing, 
> no notification for case when memory becomes OK.
>
> What I try to do is to get notification in any application that memory 
> becomes low, and do something about it like stop processing data, 
> close unused pages or correctly shuts applications, daemons.  
> Application(s) might have necessity to install several notification 
> levels, so reaction could be adjusted based on current utilization 
> level per each application, not globally.

Sure. However, from VM point of view, both have the exact same
functionality: detect when we reach low memory condition (for some configurable threshold) and notify userspace or kernel subsystem about it.

That's the part I'd like to see implemented in mm/notify.c or similar.
I really don't care what Android or any other folks use it for exactly as long as the generic code is light-weight, clean, and we can reasonably assume that distros can actually enable it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 13:02       ` leonid.moiseichuk
@ 2012-01-05 14:57         ` Greg KH
  2012-01-05 16:13           ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: Greg KH @ 2012-01-05 14:57 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: penberg, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu,
	emunson, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Thu, Jan 05, 2012 at 01:02:23PM +0000, leonid.moiseichuk@nokia.com wrote:
> Well, mm/notify.c seems a bit global for me. At the first step I
> handle inputs from Greg and try to find less destructive approach to
> allocation tracking rather than page_alloc.

No, please listen to what people, including me, are saying, otherwise
your code will be totally ignored.

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-05 12:49       ` Pekka Enberg
@ 2012-01-05 15:05         ` Rik van Riel
  2012-01-05 15:17           ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2012-01-05 15:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: leonid.moiseichuk, kamezawa.hiroyu, linux-mm, linux-kernel,
	cesarb, emunson, aarcange, mel, rientjes, dima, gregkh, rebecca,
	san, akpm, vesa.jaaskelainen

On 01/05/2012 07:49 AM, Pekka Enberg wrote:
> On Thu, Jan 5, 2012 at 1:26 PM,<leonid.moiseichuk@nokia.com>  wrote:
>> I agree that hooking alloc_pages is ugly way. So alternatives I see:
>>
>> - shrinkers (as e.g. Android OOM used) but shrink_slab called only from
>> try_to_free_pages only if we are on slow reclaim path on memory allocation,
>> so it cannot be used for e.g. 75% memory tracking or when pages released to
>> notify user space that we are OK. But according to easy to use it will be the
>> best approach.

Well, there is always the page cache.

If, at reclaim time, the amount of page cache + free memory
is below the free threshold, we should still have space left
to handle userspace things.

It may be possible to hijack memcg accounting to get lower
usage thresholds for earlier notification.  That way the code
can stay out of the true fast paths like alloc_pages.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-05 15:05         ` Rik van Riel
@ 2012-01-05 15:17           ` leonid.moiseichuk
  0 siblings, 0 replies; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-05 15:17 UTC (permalink / raw)
  To: riel, penberg
  Cc: kamezawa.hiroyu, linux-mm, linux-kernel, cesarb, emunson,
	aarcange, mel, rientjes, dima, gregkh, rebecca, san, akpm,
	vesa.jaaskelainen

Huh,
Early notification is not much better than late. Cache size mostly impacts user-space responsiveness, so cache is accounted as free memory, but the device needs to be tuned at development time according to how fast it should be in different device-specific use-cases (this depends on zillions of technical and non-technical factors, e.g. producer needs).
I have fixed all known findings and am trying to find places to integrate into vmscan/page-writeback, as Pekka advised.

With Best Wishes,
Leonid


-----Original Message-----
From: ext Rik van Riel [mailto:riel@redhat.com] 
Sent: 05 January, 2012 17:05
To: Pekka Enberg
Cc: Moiseichuk Leonid (Nokia-MP/Helsinki); kamezawa.hiroyu@jp.fujitsu.com; linux-mm@kvack.org; linux-kernel@vger.kernel.org; cesarb@cesarb.net; emunson@mgebm.net; aarcange@redhat.com; mel@csn.ul.ie; rientjes@google.com; dima@android.com; gregkh@suse.de; rebecca@android.com; san@google.com; akpm@linux-foundation.org; Jaaskelainen Vesa (Nokia-MP/Helsinki)
Subject: Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release

On 01/05/2012 07:49 AM, Pekka Enberg wrote:
> On Thu, Jan 5, 2012 at 1:26 PM,<leonid.moiseichuk@nokia.com>  wrote:
>> I agree that hooking alloc_pages is ugly way. So alternatives I see:
>>
>> - shrinkers (as e.g. Android OOM used) but shrink_slab called only 
>> from try_to_free_pages only if we are on slow reclaim path on memory 
>> allocation, so it cannot be used for e.g. 75% memory tracking or when 
>> pages released to notify user space that we are OK. But according to 
>> easy to use it will be the best approach.

Well, there is always the page cache.

If, at reclaim time, the amount of page cache + free memory is below the free threshold, we should still have space left to handle userspace things.

It may be possible to hijack memcg accounting to get lower usage thresholds for earlier notification.  That way the code can stay out of the true fast paths like alloc_pages.

--
All rights reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release
  2012-01-04 17:21 ` [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release Leonid Moiseichuk
  2012-01-04 20:40   ` Pekka Enberg
  2012-01-05  6:59   ` KAMEZAWA Hiroyuki
@ 2012-01-05 15:22   ` Mel Gorman
  2 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2012-01-05 15:22 UTC (permalink / raw)
  To: Leonid Moiseichuk
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, rientjes, dima, gregkh, rebecca, san,
	akpm, vesa.jaaskelainen

On Wed, Jan 04, 2012 at 07:21:55PM +0200, Leonid Moiseichuk wrote:
> That is required by Used Memory Meter (UMM) pseudo-device
> to track memory utilization in system. It is expected that
> hook MUST be very light to prevent performance impact
> on the hot allocation path. Accuracy of number managed pages
> does not expected to be absolute but fact of allocation or
> deallocation must be registered.
> 
> Signed-off-by: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
> ---
>  include/linux/mm.h |   15 +++++++++++++++
>  mm/Kconfig         |    8 ++++++++
>  mm/page_alloc.c    |   31 +++++++++++++++++++++++++++++++
>  3 files changed, 54 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3dc3a8c..d133f73 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1618,6 +1618,21 @@ extern int soft_offline_page(struct page *page, int flags);
>  
>  extern void dump_page(struct page *page);
>  
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +/*
> + * Hook function type which called when some pages allocated or released.
> + * Value of nr_pages is positive for post-allocation calls and negative
> + * after free.
> + */
> +typedef void (*mm_alloc_free_hook_t)(int nr_pages);
> +
> +/*

I'm going to chime in and say that hooks like this into the page
allocator are a no-go unless there really is absolutely no other option.
There is too much scope for abuse.

Even if they were not, this takes no account of the zone or node
we are allocating from making it useful only in the case where the
system had a single node and zone. This applies to mobile devices
but not a lot of other systems.

It also would have very poor information about memory pressure which
is likely to be far more interesting and for that, awareness of what
is happening in page reclaim is required.

I haven't looked at the alternatives but there has been some vague
discussion recently on reviving the concept of a low memory notifier,
somehow making the existing memcg oom notifier global or maybe the
android lowmem killer can be adapted to your needs.

> @@ -2298,6 +2322,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	put_mems_allowed();
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> +#ifdef CONFIG_MM_ALLOC_FREE_HOOK
> +	call_alloc_free_hook(1 << order);
> +#endif
> +
>  	return page;
>  }

you are calling a free hook there in the alloc path. Seems odd.

This is just a side-note but as this information is meant to be
consumed by userspace you have the option of hooking into the
mm_page_alloc tracepoint. You get the same information about how
many pages are allocated or freed. I accept that it will probably be
a bit slower but on the plus side it'll be backwards compatible and
you don't need a kernel patch for it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 14:57         ` Greg KH
@ 2012-01-05 16:13           ` leonid.moiseichuk
  2012-01-05 23:10             ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-05 16:13 UTC (permalink / raw)
  To: gregkh
  Cc: penberg, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu,
	emunson, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

-----Original Message-----
From: ext Greg KH [mailto:gregkh@suse.de] 
>No, please listen to what people, including me, are saying, otherwise your code will be totally ignored.

I have tried to sort out all the incoming inputs. But before taking the next step I prefer to have the tests passed. The changes you proposed are straightforward and understandable.
Hooking into mm/vmscan.c and mm/page-writeback.c is not so easy; I need to find the proper place and make an adequate proposal.
Using memcg does not look like a good way to me now because I would not like to change memory accounting - memcg has a strong reason to keep caches.

Best Wishes,
Leonid

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-04 20:42     ` Pekka Enberg
@ 2012-01-05 23:01       ` David Rientjes
  0 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2012-01-05 23:01 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Rik van Riel, Greg KH, Leonid Moiseichuk, linux-mm, linux-kernel,
	cesarb, kamezawa.hiroyu, emunson, aarcange, mel, dima, rebecca,
	san, akpm, vesa.jaaskelainen, Minchan Kim, KOSAKI Motohiro

On Wed, 4 Jan 2012, Pekka Enberg wrote:

> And even if people want to support multiple ABIs and fight it out to
> see which one wins, we should factor out the generic parts and put
> them under mm/*.c and not hide them in random modules.
> 

Agreed.  This came up recently when another lowmem killer was proposed and 
the suggestion was to enable the memory controller to be able to have the 
memory threshold notifications with eventfd(2) and cgroup.event_control.  
It would be very nice to have a generic lowmem notifier (like 
/dev/mem_notify that has been reworked several times in the past) rather 
than tying it to a particular cgroup, especially when that cgroup incurs a 
substantial overhead for embedded users.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 16:13           ` leonid.moiseichuk
@ 2012-01-05 23:10             ` David Rientjes
  2012-01-09  8:27               ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2012-01-05 23:10 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, penberg, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu,
	emunson, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Thu, 5 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> I tried to sort out all inputs coming. But before doing the next step I 
> prefer to have tests passed. Changes you proposed are strain forward and 
> understandable. 
> Hooking in mm/vmscan.c and mm/page-writeback.c is not so easy, I need 
> to find proper place and make adequate proposal.
> Using memcg is doesn't not look for me now as a good way because I 
> wouldn't like to change memory accounting - memcg has strong reason to 
> keep caches.
> 

If you can accept the overhead of the memory controller (increase in 
kernel text size and amount of metadata for page_cgroup), then you can 
already do this with a combination of memory thresholds with 
cgroup.event_control and disabling of the oom killer entirely with 
memory.oom_control.  You can also get notified when the oom killer is 
triggered by using eventfd(2) on memory.oom_control even though it's 
disabled in the kernel.  Then, the userspace task attached to that control 
file can send signals to applications to free their memory or, in the 
worst case, choose to kill an application but have all that policy be 
implemented in userspace.

We actually have extended that internally to have an oom killer delay, 
i.e. a specific amount of time must pass for userspace to react to the oom 
situation or the oom killer will actually be triggered.  This is needed in 
case our userspace is blocked or can't respond for whatever reason and is 
a nice fallback so that we're guaranteed to never end up livelocked.  That 
delay gets reset anytime a page is uncharged to a memcg, the memcg limit 
is increased, or the delay is rewritten (for userspace to say "I've 
handled the event").  Those patches were posted on linux-mm several months 
ago but never merged upstream.  You should be able to use the same concept 
apart from the memory controller and implement it generically.

You also presented this as an alternative for "embedded or small" users so 
I wasn't aware that using the memory controller was an acceptable solution 
given its overhead.
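
For readers unfamiliar with the mechanism referred to above, the following is a minimal user-space sketch of registering a memory usage threshold through cgroup.event_control with eventfd(2). It assumes the cgroup v1 memory controller and its documented request format "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>". The function names and the cgroup directory passed in are hypothetical and do not come from this thread, and the assumption that the control file descriptors may be closed after registration should be checked against the kernel in use.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the request line written to cgroup.event_control (cgroup v1):
 * "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>".
 * Returns the string length, or -1 if it does not fit into buf.
 */
static int format_threshold_request(char *buf, size_t len, int event_fd,
                                    int usage_fd, unsigned long long bytes)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "%d %d %llu", event_fd, usage_fd, bytes);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register a usage threshold for the memory cgroup mounted at cgroup_dir
 * (hypothetical path supplied by the caller). Returns an eventfd that
 * becomes readable when usage crosses the threshold, or -1 on error.
 */
static int register_usage_threshold(const char *cgroup_dir,
                                    unsigned long long bytes)
{
    char path[512], req[64];
    int efd, ufd = -1, cfd = -1, n, ok = 0;

    efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cgroup_dir);
    if (n > 0 && (size_t)n < sizeof(path))
        ufd = open(path, O_RDONLY);

    n = snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
    if (n > 0 && (size_t)n < sizeof(path))
        cfd = open(path, O_WRONLY);

    if (ufd >= 0 && cfd >= 0) {
        n = format_threshold_request(req, sizeof(req), efd, ufd, bytes);
        if (n > 0 && write(cfd, req, (size_t)n) == (ssize_t)n)
            ok = 1;
    }

    /* Assumption: the control files are needed only while registering. */
    if (ufd >= 0)
        close(ufd);
    if (cfd >= 0)
        close(cfd);

    if (!ok) {
        close(efd);
        return -1;
    }
    return efd;
}
```

A caller would then block in read(2) on the returned descriptor with an 8-byte buffer, or add it to a poll(2) set, and apply its own policy (ask applications to free memory, shut something down) when it fires.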

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 11:47   ` leonid.moiseichuk
  2012-01-05 12:40     ` Pekka Enberg
@ 2012-01-06  0:26     ` KOSAKI Motohiro
  2012-01-09  8:49       ` leonid.moiseichuk
  1 sibling, 1 reply; 38+ messages in thread
From: KOSAKI Motohiro @ 2012-01-06  0:26 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: kosaki.motohiro, gregkh, linux-mm, linux-kernel, cesarb,
	kamezawa.hiroyu, emunson, penberg, aarcange, riel, mel, rientjes,
	dima, rebecca, san, akpm, vesa.jaaskelainen

> Android OOM (AOOM) is a different thing. Briefly Android OOM is a safety belt,
>but I try to introduce look-ahead radar to stop before hitting wall.

You explained why we should merge neither your nor the Android notification patches.
Many embedded developers have tried to merge their own patch and claimed "Hey! my patch
is completely different from another one". That said, their patches can't be used
for each other's use cases, only for their own.

Systemwide global notification itself is not a bad idea. But we definitely have to choose
just one implementation. Thus, you need to reach agreement with the other embedded people.

Again, lowmemorykiller.c should be dropped too.


>  UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
>                                               (SwapTotal - SwapFree)

If you had spent a little time reading the past discussion, you would have understood that your formula
is broken and unacceptable. Think: mlocked (or otherwise pinned) cache can't be
discarded. And when the system is under swap thrashing, userland notification is
useless. I don't think you tested heavily in a swap environment.

While you are stuck on making a Nokia-specific feature, I recommend that you
maintain your local patch yourself.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-05 23:10             ` David Rientjes
@ 2012-01-09  8:27               ` leonid.moiseichuk
  0 siblings, 0 replies; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-09  8:27 UTC (permalink / raw)
  To: rientjes
  Cc: gregkh, penberg, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu,
	emunson, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext David Rientjes [mailto:rientjes@google.com]
> Sent: 06 January, 2012 01:10

> If you can accept the overhead of the memory controller (increase in
> kernel text size and amount of metadata for page_cgroup), then you can
> already do this with a combination of memory thresholds with
> cgroup.event_control and disabling of the oom killer entirely with
> memory.oom_control.  You can also get notified when the oom killer is
> triggered by using eventfd(2) on memory.oom_control even though it's
> disabled in the kernel.  Then, the userspace task attached to that control
> file can send signals to applications to free their memory or, in the
> worst case, choose to kill an application but have all that policy be
> implemented in userspace.

We invested in memcg notification (Kiryl Shutsemau's patches) and already use a similar approach in the n9 (see libmemnotifyqt on gitorious).
Unfortunately it produces a number of side effects related to how memcg handles application injection into and removal from a group.
So I would like to try another approach.

> We actually have extended that internally to have an oom killer delay,
> i.e. a specific amount of time must pass for userspace to react to the oom
...
> handled the event").  Those patches were posted on linux-mm several
> months
> ago but never merged upstream.  You should be able to use the same
> concept
> apart from the memory controller and implement it generically.

Yep. But in n9 concept OOMing some application is acceptable, so I do not see such changes as very suitable.

> You also presented this as an alternative for "embedded or small" users so
> I wasn't aware that using the memory controller was an acceptable solution
> given its overhead.

The overhead, by the way, is fully acceptable, and I think in newer kernels (3.x) the situation will be much better.
But from my point of view memcg has fundamental problems when the set of applications in a cgroup is updated as applications are foregrounded/backgrounded; unfortunately that is how the n900 and n9 software is designed.

Best Wishes,
Leonid

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM
  2012-01-06  0:26     ` KOSAKI Motohiro
@ 2012-01-09  8:49       ` leonid.moiseichuk
  0 siblings, 0 replies; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-09  8:49 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com]
> Sent: 06 January, 2012 02:27
> To: Moiseichuk Leonid (Nokia-MP/Helsinki)
 
> > Android OOM (AOOM) is a different thing. Briefly Android OOM is a safety
> belt,
> >but I try to introduce look-ahead radar to stop before hitting wall.
> 
> You explained why we shouldn't merge neither you nor android notification
> patches.
> Many embedded developers tried to merge their own patch and claimed
> "Hey! my patch
> is completely different from another one". That said, their patches can't be
> used
> each other use case, just for them.

Pardon me, but these patches do really different things. Having notifications doesn't mean all your software will handle them correctly. On an open platform you might have a "bad entity" which will be killed by OOM.
We used the default OOM killer, but Android OOM probably works better in some other conditions, even though from my point of view it may trigger false OOMs because it is based on NR_FREE_PAGES, which is more interesting for the kernel than for user-space.

> Systemwide global notification itself is not bad idea. But we definitely choose
> just one implementation. thus, you need to get agree with other embedded
> people.

Agree. That is point for discussion. One is already available through memcg but problem is in memcg and how we use it.

> >  UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
> >                                               (SwapTotal - SwapFree)
> 
> If you spent a few time to read past discuttion, you should have understand
> your fomula
> is broken and unacceptable. Think, mlocked (or pinning by other way) cache
> can't be discarded. 

In theory you are right about mlocked pages, so I will add a deduction for NR_MLOCK.
In practice a typical desktop system has mlocked = 0. Also, code pages are shared, so mlocking has 0 effect.
For data pages a library like http://maemo.gitorious.org/maemo-tools/libmlocknice could be used.
Anyhow, on the n9 we have only 5.3 MB of mlocked memory out of 1024 MB.
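
For reference, the UsedMemory formula quoted above can be evaluated from user space with a few lines of C. This is only a sketch of the formula exactly as written in the cover letter: it does not include the NR_MLOCK adjustment mentioned here, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value of "Key:   N kB" from meminfo-style text, or -1. */
static long meminfo_field_kb(const char *text, const char *key)
{
    size_t klen;
    const char *line;

    assert(text != NULL && key != NULL);
    klen = strlen(key);
    for (line = text; line && *line; ) {
        long value;

        /* Keys are matched at line start, so "Cached" != "SwapCached". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
            sscanf(line + klen + 1, "%ld", &value) == 1)
            return value;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/*
 * UsedMemory = (MemTotal - MemFree - Buffers - Cached - SwapCached) +
 *              (SwapTotal - SwapFree), all in kB; -1 if a field is missing.
 */
static long used_memory_kb(const char *meminfo)
{
    static const char *const minus[] = {
        "MemFree", "Buffers", "Cached", "SwapCached"
    };
    long used = meminfo_field_kb(meminfo, "MemTotal");
    long swap_total = meminfo_field_kb(meminfo, "SwapTotal");
    long swap_free = meminfo_field_kb(meminfo, "SwapFree");
    size_t i;

    if (used < 0 || swap_total < 0 || swap_free < 0)
        return -1;
    for (i = 0; i < sizeof(minus) / sizeof(minus[0]); i++) {
        long v = meminfo_field_kb(meminfo, minus[i]);

        if (v < 0)
            return -1;
        used -= v;
    }
    return used + (swap_total - swap_free);
}

/* Convenience wrapper: evaluate the formula on the live /proc/meminfo. */
static long used_memory_kb_from_proc(void)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return used_memory_kb(buf);
}
```

Comparing the result with the Mlocked line of /proc/meminfo on a given device shows how much the objection about pinned cache matters there in practice.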

> And, When system is under swap thrashing, userland notification is useless.

Well, cgroups CPU shares and ionice seems to me better but as a quick solution extension with LRU_ACTIVE_ANON + LRU_ACTIVE_FILE could be done easily.

>  I don't think you tested w/ swap environment heavily.

n770, n800, n810 have optional in-file swapping.
n900 has permanent 768 MB swap partition.
n9 uses in-RAM lzo compressed 256 MB swap.

All of them are tested, tuned and work fine for the majority of use-cases.

> While you are getting stuck to make nokia specific feature, I'm
> recommending you
> maintain your local patch yourself.

Thanks for the advice, but I have a better idea which is less destructive for MM.
Maybe it will be more successful, at least for maintenance as a local patch.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-04 19:55   ` Greg KH
@ 2012-01-09  9:58     ` leonid.moiseichuk
  2012-01-09 10:09       ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-09  9:58 UTC (permalink / raw)
  To: gregkh
  Cc: linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, rientjes, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext Greg KH [mailto:gregkh@suse.de]
> Sent: 04 January, 2012 21:55
...
> Note, I don't agree that this code is the correct thing to be doing here, you'll
> have to get the buy-in from the mm developers on that, but I do have some
> comments on the implementation:

Hello everyone and thanks for comments.

If I am not wrong, in addition to Greg's remarks about polishing I got 14 findings (see details below):
1. Alternative solutions: why not Android OOM or memcg
2. How to connect to MM - the current variant is no-go and that is a critical part
3. What should be tracked (e.g. memory pressure 3.1)

For sure I used the wrong approach to solve the notification problem. The user-space reaction should fit under 1 s, so with a 250-500 ms reaction on the kernel side it is absolutely not necessary to hook page_alloc, because this component
should be used only for notification and not for denying allocations. It is also an inadequate idea because I need only data from global_page_state/vm_stat, which is cpu-independent and has a lot of places in MM where
it could be updated.

So the major changes in the coming version will be:
1. timer-based access to global_page_state() data. If I understand the documentation right, the deferred timer will not wake up if the cpu is frozen. Otherwise the timer must be set using register_cpu_notifier
2. to track high memory pressure cases, a shrinker should be added without filtering by last call time
3. the used memory calculation will be changed and the active page set added
4. the file will be renamed to memnotify.c and the interface to /dev/memnotify, because it will report not only used memory, plus there is a low probability it will be accepted as mm/notify.c as advised below (but maybe someone will use it).

With Best Wishes,
Leonid

Remarks collected from emails
=======================

1. Alternative solutions
------------------------

1.1. Pekka Enberg
> However, from VM point of view, both have the exact same functionality: detect when we reach low memory condition
> (for some configurable threshold) and notify userspace or kernel subsystem about it.

Well, I cannot say that SIGKILL is a notification. From the kernel side, maybe. But Android OOM uses different memory
tracking rules. In my opinion the OOM killer should be as reliable as the default one is, but the functionality the Android OOM killer
provides should be done in user space by some "smart killer" which closes applications the correct way (save data, notify the user etc.).
It heavily depends on product design.

1.2. Pekka Enberg
> That's the part I'd like to see implemented in mm/notify.c or similar.
> I really don't care what Android or any other folks use it for exactly as long as the generic code is light-weight,
> clean, and we can reasonably assume that distros can actually enable it.

I will try to do memnotify.c, but because I am not sure it will be done well enough to be accepted, it will live in drivers.

1.3. Rik van Riel
> Also, the low memory notification that Kosaki-san has worked on, and which Minchan is looking at now.
In the end I found only patches from 2009, which do not look good to me from the user-space point of view.
For example, I do not understand how to specify application limit(s).

1.4. Mel Gorman
> I haven't looked at the alternatives but there has been some vague discussion recently on reviving the concept of
> a low memory notifier, somehow making the existing memcg oom notifier global or maybe the andro lowmem killer 
> can be adapted to your needs.

Most likely not. The memcg OOM handling could be, but the idea is to not have memcg/partitions.

1.5. David Rientjes
> If you can accept the overhead of the memory controller (increase in
> kernel text size and amount of metadata for page_cgroup), then you can
> already do this with a combination of memory thresholds with
> cgroup.event_control and disabling of the oom killer entirely with
> memory.oom_control. 
already done in libmemnotifyqt used in n9


1.6. David Rientjes
> Agreed.  This came up recently when another lowmem killer was proposed and the suggestion was to enable the memory
> controller to be able to have the memory threshold notifications with eventfd(2) and cgroup.event_control.

Already done in libmemnotifyqt, which is used in the N9.

1.7. David Rientjes
> This is just a side-note but as this information is meant to be consumed by userspace you have the option of hooking
> into the mm_page_alloc tracepoint. You get the same information about how many pages are allocated or freed. I accept
> that it will probably be a bit slower but on the plus side it'll be backwards compatible and you don't need a kernel
> patch for it.

That is odd for sure, I have to use another kind of access to vm_stat.


2. How to hook MM
-----------------

2.1. Pekka Enberg
> Can we hook into mm/vmscan.c and mm/page-writeback.c for this?
Thanks for the pointer. For vmscan I plan to use a shrinker. But changes in page-writeback seem to be just as bad as hooking page-alloc.

2.2. Rik van Riel
> It may be possible to hijack memcg accounting to get lower usage thresholds for earlier notification.  
> That way the code can stay out of the true fast paths like alloc_pages

That is an option, but memcg is not well suited when processes migrate between cgroups, e.g. they are forced to be swapped out
and the device becomes sluggish, or when a process is so big that it cannot be injected into a cgroup and stays in the root group without
any restrictions.

2.3. Mel Gorman
> I'm going to chime in and say that hooks like this into the page allocator are a no-go unless there really 
> is absolutely no other option. There is too much scope for abuse.

Agreed. The idea is based on vm_stat, which is global, and tracking it does not require hooking into page_alloc at all.


2.4. David Rientjes

> It would be very nice to have a generic lowmem notifier (like /dev/mem_notify that has been reworked several times 
> in the past) rather than tying it to a particular cgroup, especially when that cgroup incurs a substantial overhead 
> for embedded users.

OK, I will try to make it more generic and re-use the memnotify name. But due to the high risk of it not being accepted in mainline I will keep it as drivers/misc/memnotify.c


3. What to track
----------------

3.1. Mel Gorman
> It also would have very poor information about memory pressure which is likely to be far more interesting and for that,
> awareness of what is happening in page reclaim is required.
That could be added later; for now I am focusing on vm_stat because it is simpler.

3.2. KOSAKI Motohiro 
> If you spent a few time to read past discussion, you should have understood
> your formula
> is broken and unacceptable. Think, mlocked (or pinning by other way) cache
> can't be discarded.

NR_MLOCK will be added

3.3. KOSAKI Motohiro 
> And, When system is under swap thrashing, userland notification is useless.
Well, cgroups CPU shares and ionice seem better to me, but as a quick solution an extension with LRU_ACTIVE_ANON + LRU_ACTIVE_FILE could be done easily.




* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-09  9:58     ` leonid.moiseichuk
@ 2012-01-09 10:09       ` David Rientjes
  2012-01-09 10:19         ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2012-01-09 10:09 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Mon, 9 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> 1.1. Pekka Enberg
> > However, from VM point of view, both have the exact same 
> > functionality: detect when we reach low memory condition
> > (for some configurable threshold) and notify userspace or kernel 
> > subsystem about it.
> 
> Well, I cannot say that SIGKILL is a notification. From kernel side 
> maybe. But Android OOM uses different memory tracking rules. From my 
> opinion OOM killer should be as reliable as default is but functionality 
> Android OOM killer does should be done in user space by some "smart 
> killer" which closes application correct way (save data, notify user 
> etc.). It heavily depends from product design.
> 

I'm not sure why you need to detect low memory thresholds if you're not
interested in using the memory controller.  Why not just use the oom killer
delay that I suggested earlier and allow userspace to respond to
conditions when you are known to have failed reclaim and require that something
be killed?  Userspace should be able to make sane decisions or trigger 
external knobs to be able to free memory much better than having the 
kernel handling signals or notification to individual applications.

> 1.7. David Rientjes
> > This is just a side-note but as this information is meant to be consumed by userspace you have the option of hooking
> > into the mm_page_alloc tracepoint. You get the same information about how many pages are allocated or freed. I accept
> > that it will probably be a bit slower but on the plus side it'll be backwards compatible and you don't need a kernel
> > patch for it.
> 

I didn't write that.


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-09 10:09       ` David Rientjes
@ 2012-01-09 10:19         ` leonid.moiseichuk
  2012-01-09 20:55           ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-09 10:19 UTC (permalink / raw)
  To: rientjes
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext David Rientjes [mailto:rientjes@google.com]
> Sent: 09 January, 2012 12:09
...
> 
> I'm not sure why you need to detect low memory thresholds if you're not
> interested in using the memory controller, why not just use the oom killer
> delay that I suggested earlier and allow userspace to respond to conditions
> when you are known to failed reclaim and require that something be killed?

As I understand it, that requires turning on memcg, and memcg is something I am trying to avoid.

> > 1.7. David Rientjes
> > > This is just a side-note but as this information is meant to be
> > > consumed by userspace you have the option of hooking into the
> > > mm_page_alloc tracepoint. You get the same information about how
> > > many pages are allocated or freed. I accept that it will probably be a bit
> > > slower but on the plus side it'll be backwards compatible and you don't need
> > > a kernel patch for it.
> >
> 
> I didn't write that.

Sorry, it was Mel Gorman. Copy-paste problem.


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-09 10:19         ` leonid.moiseichuk
@ 2012-01-09 20:55           ` David Rientjes
  2012-01-11 12:46             ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2012-01-09 20:55 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Mon, 9 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> > I'm not sure why you need to detect low memory thresholds if you're not
> > interested in using the memory controller, why not just use the oom killer
> > delay that I suggested earlier and allow userspace to respond to conditions
> > when you are known to failed reclaim and require that something be killed?
> 
> As I understand that is required to turn on memcg and memcg is a thing 
> I try to avoid.
> 

Maybe there's some confusion: the proposed oom killer delay that I'm 
referring to here is not upstream and has never been written for global 
oom conditions.  My reference to it earlier was as an internal patch that 
we carry on top of memory controller, but what I'm proposing here is for 
it to be implemented globally.

So if the page allocator can make no progress in freeing memory, we would 
introduce a delay in out_of_memory() if it were configured via a sysctl 
from userspace.  When this delay is started, applications waiting on this 
event can be notified with eventfd(2) that the delay has started and they 
have however many milliseconds to address the situation.  When they 
rewrite the sysctl, the delay is cleared.  If they don't rewrite the 
sysctl and the delay expires, the oom killer proceeds with killing.

What's missing for your usecase with this proposal?


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-09 20:55           ` David Rientjes
@ 2012-01-11 12:46             ` leonid.moiseichuk
  2012-01-11 21:44               ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-11 12:46 UTC (permalink / raw)
  To: rientjes
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext David Rientjes [mailto:rientjes@google.com]
> Sent: 09 January, 2012 21:55
...
> 
> Maybe there's some confusion: the proposed oom killer delay that I'm
> referring to here is not upstream and has never been written for global oom
> conditions.  My reference to it earlier was as an internal patch that we carry
> on top of memory controller, but what I'm proposing here is for it to be
> implemented globally.

That explains the situation: I know how memcg can handle OOM in a cgroup, but I did not know about the internal patch.

> So if the page allocator can make no progress in freeing memory, we would
> introduce a delay in out_of_memory() if it were configured via a sysctl from
> userspace.  When this delay is started, applications waiting on this event can
> be notified with eventfd(2) that the delay has started and they have
> however many milliseconds to address the situation.  When they rewrite the
> sysctl, the delay is cleared.  If they don't rewrite the sysctl and the delay
> expires, the oom killer proceeds with killing.
> 
> What's missing for your use case with this proposal?

Timed delays in multi-process OOM handling look to me like a fragile construction, because the delays are not predictable.
Memcg supports [1] a better approach: freeze the whole group and kick a designated user-space application to handle it. We planned
to use it as:
- enlarge cgroup
- send SIGTERM to selected "bad" application e.g. based on oom_score
- wait a bit
- send SIGKILL to "bad" application
- reduce group size
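
The signal steps in the list above (SIGTERM, wait a bit, SIGKILL) can be sketched in user space roughly as follows. This is a minimal Python sketch and not code from this thread: `terminate_then_kill` and `grace_seconds` are hypothetical names, and the steps that enlarge and reduce the cgroup limit are left out because they depend on the memcg setup.

```python
import signal
import subprocess


def terminate_then_kill(proc, grace_seconds=1.0):
    """Ask a process to exit with SIGTERM, then force it with SIGKILL.

    proc is a subprocess.Popen object. Returns the name of the last
    signal that had to be sent.
    """
    proc.send_signal(signal.SIGTERM)      # polite request: save data, exit
    try:
        proc.wait(timeout=grace_seconds)  # "wait a bit"
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX systems
        proc.wait()                       # reap the child
        return "SIGKILL"
```

A process that handles or ignores SIGTERM for longer than the grace period ends up killed, which is why the length of the wait is a product-level tuning decision.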

But in the end the default OOM killer started to work fine.

[1] http://lwn.net/Articles/377708/

 


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-11 12:46             ` leonid.moiseichuk
@ 2012-01-11 21:44               ` David Rientjes
  2012-01-12  8:32                 ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2012-01-11 21:44 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Wed, 11 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> > So if the page allocator can make no progress in freeing memory, we would
> > introduce a delay in out_of_memory() if it were configured via a sysctl from
> > userspace.  When this delay is started, applications waiting on this event can
> > be notified with eventfd(2) that the delay has started and they have
> > however many milliseconds to address the situation.  When they rewrite the
> > sysctl, the delay is cleared.  If they don't rewrite the sysctl and the delay
> > expires, the oom killer proceeds with killing.
> > 
> > What's missing for your use case with this proposal?
> 
> Timed delays in multi-process handling in case OOM looks for me fragile 
> construction due to delays are not predicable.

Not sure what you mean by predictable; the oom conditions themselves 
certainly aren't predictable, otherwise you wouldn't need notification at 
all.  The delays are predictable since you configure it to be a number of 
millisecs via a global sysctl.  Userspace can either handle the oom itself 
and rewrite that sysctl to reset the delay or write 0 to make the kernel 
immediately oom.  If the delay expires, then it is assumed that userspace 
is dead and the kernel will proceed to avoid livelock.
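
The delay semantics described above can be modelled as a small state machine. This is only a toy Python model under stated assumptions: the delay is not upstream, so `OomDelayModel`, `on_reclaim_failed` and `rewrite_sysctl` are hypothetical names, and the returned strings stand in for what the kernel would do at that point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OomDelayModel:
    """Toy model of the proposed oom killer delay; not a kernel interface."""

    delay_ms: int                        # value of the hypothetical sysctl
    started_at_ms: Optional[int] = None  # when the current delay began

    def on_reclaim_failed(self, now_ms: int) -> str:
        """Called where an allocation would otherwise enter the oom killer."""
        if self.delay_ms == 0:
            return "kill"                # no delay configured: oom kill at once
        if self.started_at_ms is None:
            self.started_at_ms = now_ms
            return "notify"              # waiters would be woken via eventfd(2)
        if now_ms - self.started_at_ms >= self.delay_ms:
            return "kill"                # delay expired: userspace assumed dead
        return "wait"                    # allocation keeps waiting

    def rewrite_sysctl(self, delay_ms: int) -> None:
        """Userspace rewrites the sysctl: the running delay is cleared."""
        self.delay_ms = delay_ms
        self.started_at_ms = None
```

In this model a rewrite with the same value re-arms the delay for the next oom condition, and a rewrite with 0 makes the next failed reclaim go straight to the oom killer.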

> Memcg supports [1] better approach to freeze whole group and kick 
> pointed user-space application to handle it. We planned
> to use it as:
> - enlarge cgroup
> - send SIGTERM to selected "bad" application e.g. based on oom_score
> - wait a bit
> - send SIGKILL to "bad" application
> - reduce group size
> 
> But finally default OOM killer starts to work fine.
> 

I think you're misunderstanding the proposal; in the case of a global oom 
(that means without memcg) then, by definition, all threads that are 
allocating memory would be frozen and incur the delay at the point they 
would currently call into the oom killer.  If your userspace is alive, 
i.e. the application responsible for managing oom killing, then it can 
wait on eventfd(2), wake up, and then send SIGTERM and SIGKILL to the 
appropriate threads based on priority.

So, again, why wouldn't this work for you?


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-11 21:44               ` David Rientjes
@ 2012-01-12  8:32                 ` leonid.moiseichuk
  2012-01-12 20:54                   ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-12  8:32 UTC (permalink / raw)
  To: rientjes
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext David Rientjes [mailto:rientjes@google.com]
> Sent: 11 January, 2012 22:45
 
> I think you're misunderstanding the proposal; in the case of a global oom
> (that means without memcg) then, by definition, all threads that are
> allocating memory would be frozen and incur the delay at the point they
> would currently call into the oom killer.  If your userspace is alive, i.e. the
> application responsible for managing oom killing, then it can wait on
> eventfd(2), wake up, and then send SIGTERM and SIGKILL to the appropriate
> threads based on priority.
> 
> So, again, why wouldn't this work for you?

As I wrote, the proposed change is not a safety belt but a look-ahead radar.
If it detects that we are close to a wall it starts to sound an alarm, and the alarm volume is proportional to the distance.
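
As a rough illustration of the radar idea, the alarm can be thought of as the number of usage thresholds already crossed. This is a minimal Python sketch, not the module's actual logic: `alarm_level` is a hypothetical name and the default threshold fractions are illustrative values only.

```python
def alarm_level(used_pages, total_pages, thresholds=(0.5, 0.7, 0.9)):
    """Return how many usage thresholds have been crossed (0 means quiet).

    thresholds are fractions of total memory; the defaults are made-up
    example values, a real product would tune them.
    """
    usage = used_pages / total_pages
    return sum(1 for t in thresholds if usage >= t)
```

The closer the system gets to the wall, the higher the level, so user-space can react more aggressively at each step (for example shrink caches first, close unused apps later).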

In close-to-OOM situations the device becomes very slow, which is not good for the user. The performance difference depends on code size and on storage performance
when thrashing code pages, but even 20% is noticeable. In practice a 2x-5x slowdown was observed.

We can take some actions ahead of time to try to prevent OOM altogether, like shrinking caches in applications, closing unused apps etc.  If OOM still happens due to
3rd party components or misbehaving software, even the default OOM killer works well enough if oom_score_adj values are properly set.

Thus, controlling the device over a wider set of memory situations looks more beneficial to me than trying to recover when the situation is already bad. And increasing the complexity
of the recovery mechanism (OOM, Android OOM, OOM with delay), involving user-space in decision-making, makes recovery _potentially_ less predictable.

Best Wishes,
Leonid



* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-12  8:32                 ` leonid.moiseichuk
@ 2012-01-12 20:54                   ` David Rientjes
  2012-01-13  9:34                     ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2012-01-12 20:54 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Thu, 12 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> As I wrote the proposed change is not safety belt but looking ahead 
> radar.
> If it detects that we are close to wall it starts to alarm and alarm 
> volume is proportional to distance.
> 

Then it's fundamentally flawed since there's no guarantee that coming within
100MB of the min watermark, for example, means that an oom is imminent and
will just result in unnecessary notification to userspace that will cause 
some action to be taken that may not be necessary.  If the setting of 
these thresholds depends on some pattern that is guaranteed to be along 
the path to oom for a certain workload, then that will also change 
depending on VM implementation changes, kernel versions, other 
applications, etc., and simply is unmaintainable.

> In close-to-OOM situations device becomes very slow, which is not good 
> for user. The performance difference depends on code size and storage 
> performance to trash code pages but even 20% is noticeable. Practically 
> 2x-5x times slowdown was observed.
> 

It would be much better to address the slowdown when running out of memory 
rather than requiring userspace to react and unnecessarily send signals to 
threads that may or may not have the ability to respond because they may 
already be oom themselves.  You can do crazy things to reduce latency in 
lowmem memory allocations like changing gfp_allowed_mask to be GFP_ATOMIC 
so that direct reclaim is never called, for example, and then use the 
proposed oom killer delay to handle the situation at the time of oom.

Regardless, you should be addressing the slowness in lowmem situations 
rather than implementing notifiers to userspace to handle the events 
itself, so nack on this proposal.


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-12 20:54                   ` David Rientjes
@ 2012-01-13  9:34                     ` leonid.moiseichuk
  2012-01-13 11:06                       ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-13  9:34 UTC (permalink / raw)
  To: rientjes
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext David Rientjes [mailto:rientjes@google.com]
> Sent: 12 January, 2012 21:55
> To: Moiseichuk Leonid (Nokia-MP/Helsinki)
....
> 
> Then it's fundamentally flawed since there's no guarantee that coming with
> 100MB of the min watermark, for example, means that an oom is imminent
> and will just result in unnecessary notification to userspace that will cause
> some action to be taken that may not be necessary.  If the setting of these
> thresholds depends on some pattern that is guaranteed to be along the path
> to oom for a certain workload, then that will also change depending on VM
> implementation changes, kernel versions, other applications, etc., and simply
> is unmaintainable.

Why? It is expected that the product is tested and tuned properly, applications are fixed, and at least no apps are installed which might consume 100 MB in a second or two.
If you have another product with a big difference in memory size, applications etc., you might need to re-calibrate the reactions.
Let's focus on realistic cases.

> It would be much better to address the slowdown when running out of
> memory rather than requiring userspace to react and unnecessarily send
> signals to threads that may or may not have the ability to respond because
> they may already be oom themselves.

That is not possible: signals are usually set at a level where you still have 20-50 MB to react.
Slowdown is a natural thing if you lack space for code paging; I do not see any way to fix it.

>  You can do crazy things to reduce
> latency in lowmem memory allocations like changing gfp_allowed_mask to
> be GFP_ATOMIC so that direct reclaim is never called, for example, and then
> use the proposed oom killer delay to handle the situation at the time of oom.

It is not necessary.

> Regardless, you should be addressing the slowness in lowmem situations
> rather than implementing notifiers to userspace to handle the events itself,
> so nack on this proposal.

Define "lowmem situation" first.  For the proposed approach it is anywhere from 50-90% of memory usage, as long as user-space can still do something.


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-13  9:34                     ` leonid.moiseichuk
@ 2012-01-13 11:06                       ` David Rientjes
  2012-01-13 11:51                         ` leonid.moiseichuk
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2012-01-13 11:06 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Fri, 13 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> > Then it's fundamentally flawed since there's no guarantee that coming with
> > 100MB of the min watermark, for example, means that an oom is imminent
> > and will just result in unnecessary notification to userspace that will cause
> > some action to be taken that may not be necessary.  If the setting of these
> > thresholds depends on some pattern that is guaranteed to be along the path
> > to oom for a certain workload, then that will also change depending on VM
> > implementation changes, kernel versions, other applications, etc., and simply
> > is unmaintainable.
> 
> Why? That is expected that product tested and tuned properly, 
> applications fixed, and at least no apps installed which might consume 
> 100 MB in second or two.

I'm trying to make this easy for you, if you haven't noticed.  Your memory 
threshold, as proposed, will have values that are tied directly to the 
implementation of the VM in the kernel when it's under memory pressure and
that implementation evolves at a constant rate.

What I'm proposing is limiting the amount of latency that the VM incurs 
when under memory pressure, notify userspace, and allow it to react to the 
situation until the delay expires.  This doesn't require recalibration for 
other products or upgraded kernels, it just works all the time.

> Slowdown is natural thing if you have lack of space for code paging, I 
> do not see any ways to fix it.
> 

mlock() the memory that your userspace monitoring needs to send signals to 
applications, whether those signals are handled to free memory internally 
or it's SIGTERM or SIGKILL.


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-13 11:06                       ` David Rientjes
@ 2012-01-13 11:51                         ` leonid.moiseichuk
  2012-01-13 21:35                           ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: leonid.moiseichuk @ 2012-01-13 11:51 UTC (permalink / raw)
  To: rientjes
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

> -----Original Message-----
> From: ext David Rientjes [mailto:rientjes@google.com]
> Sent: 13 January, 2012 12:06
> To: Moiseichuk Leonid (Nokia-MP/Helsinki)
...
> > Why? That is expected that product tested and tuned properly,
> > applications fixed, and at least no apps installed which might consume
> > 100 MB in second or two.
> 
> I'm trying to make this easy for you, if you haven't noticed.
Thanks, I did.

> Your memory threshold, as proposed, will have values that are tied directly to the
> implementation of the VM in the kernel when its under memory pressure
> and that implementation evolves at a constant rate.

Not sure that I understand this statement. Free/Used/Active page sets are properties of any VM.
I have a new implementation, but it is in testing now. I do not see any relation to the VM implementation except statistics, and it could be extended with
"virtual values" which are suitable for user-space, e.g. the active page set. It could be extended with something else if someone needs it.
The thresholds are set by user-space and are individual for each application which would like to be informed.

> mlock() the memory that your userspace monitoring needs to send signals to
> applications, whether those signals are handled to free memory internally or
> its SIGTERM or SIGKILL.

Mlocked memory should be avoided as much as possible because its efficiency is the lowest possible and it makes the situation for non-mlocked pages even worse.
You cannot mlock the whole UI, only the most critical parts.
Thus, handling time in the case of 3rd party apps will not be controllable, and the user will observe it as a device jam/hang/freeze.


* RE: [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module
  2012-01-13 11:51                         ` leonid.moiseichuk
@ 2012-01-13 21:35                           ` David Rientjes
  0 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2012-01-13 21:35 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: gregkh, linux-mm, linux-kernel, cesarb, kamezawa.hiroyu, emunson,
	penberg, aarcange, riel, mel, dima, rebecca, san, akpm,
	vesa.jaaskelainen

On Fri, 13 Jan 2012, leonid.moiseichuk@nokia.com wrote:

> > Your memory threshold, as proposed, will have values that are tied directly to the
> > implementation of the VM in the kernel when its under memory pressure
> > and that implementation evolves at a constant rate.
> 
> Not sure that I understand this statement. Free/Used/Active page sets 
> are properties of any VM.

The point at which the latency is deemed to be unacceptable in your 
trial-and-error is tied directly to the implementation of the VM and must 
be recalibrated with each userspace change or kernel upgrade.  I assume 
here that some reclaim is allowed in the VM for your usecase; if not, then 
I already gave a solution for how to disable that entirely.

>  The thresholds are set by user-space and individual for applications 
> which likes to be informed.
> 

You haven't given a usecase for the thresholds for anything other than 
when you're just about oom, and I think it's much simpler if you actually 
get to the point of oom and your userspace notifier is guaranteed to be 
able to respond over a preconfigured delay.  It works pretty well for us 
internally, you should consider it.

> > mlock() the memory that your userspace monitoring needs to send signals to
> > applications, whether those signals are handled to free memory internally or
> > its SIGTERM or SIGKILL.
> 
> Mlocked memory should be avoid as much as possible because efficiency 
> rate is lowest possible and makes situation for non-mlocked pages even 
> worse.

It's used only to protect the thread that is notified right before the oom 
killer is triggered so that it can send the appropriate signals.  If it 
can't do that, the oom killer delay will expire on subsequent memory 
allocation attempts and kill something itself.  This thread should have a 
minimal memory footprint, be mlock()'d into memory, and have an 
oom_score_adj of OOM_SCORE_ADJ_MIN.
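
A protected notifier of this kind could be set up along the following lines. This is a minimal Python sketch under stated assumptions, not code from this thread: `lock_memory` and `set_oom_score_adj` are hypothetical helper names, the `MCL_*` values are the ones used on common Linux architectures such as x86 and ARM (a few architectures differ), and both steps usually need extra privileges, so each helper reports failure instead of raising.

```python
import ctypes
import ctypes.util

MCL_CURRENT = 1            # assumed value; matches x86/ARM Linux headers
MCL_FUTURE = 2             # assumed value; matches x86/ARM Linux headers
OOM_SCORE_ADJ_MIN = -1000  # lowest oom_score_adj the kernel accepts


def lock_memory():
    """Try mlockall(MCL_CURRENT | MCL_FUTURE); return True on success."""
    path = ctypes.util.find_library("c")
    if path is None:
        return False
    libc = ctypes.CDLL(path, use_errno=True)
    try:
        return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
    except AttributeError:  # libc without mlockall
        return False


def set_oom_score_adj(value=OOM_SCORE_ADJ_MIN,
                      path="/proc/self/oom_score_adj"):
    """Write value to the oom_score_adj file; return True on success."""
    try:
        with open(path, "w") as f:
            f.write(str(value))
        return True
    except OSError:         # missing file or insufficient privileges
        return False
```

A notifier would call both helpers once at startup, before it begins waiting for events, and keep its own allocations to a minimum afterwards.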


end of thread, other threads:[~2012-01-13 21:35 UTC | newest]

Thread overview: 38+ messages
2012-01-04 17:21 [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Leonid Moiseichuk
2012-01-04 17:21 ` [PATCH 3.2.0-rc1 1/3] Making si_swapinfo exportable Leonid Moiseichuk
2012-01-04 17:21 ` [PATCH 3.2.0-rc1 2/3] MM hook for page allocation and release Leonid Moiseichuk
2012-01-04 20:40   ` Pekka Enberg
2012-01-05  6:59   ` KAMEZAWA Hiroyuki
2012-01-05 11:26     ` leonid.moiseichuk
2012-01-05 12:49       ` Pekka Enberg
2012-01-05 15:05         ` Rik van Riel
2012-01-05 15:17           ` leonid.moiseichuk
2012-01-05 15:22   ` Mel Gorman
2012-01-04 17:21 ` [PATCH 3.2.0-rc1 3/3] Used Memory Meter pseudo-device module Leonid Moiseichuk
2012-01-04 19:55   ` Greg KH
2012-01-09  9:58     ` leonid.moiseichuk
2012-01-09 10:09       ` David Rientjes
2012-01-09 10:19         ` leonid.moiseichuk
2012-01-09 20:55           ` David Rientjes
2012-01-11 12:46             ` leonid.moiseichuk
2012-01-11 21:44               ` David Rientjes
2012-01-12  8:32                 ` leonid.moiseichuk
2012-01-12 20:54                   ` David Rientjes
2012-01-13  9:34                     ` leonid.moiseichuk
2012-01-13 11:06                       ` David Rientjes
2012-01-13 11:51                         ` leonid.moiseichuk
2012-01-13 21:35                           ` David Rientjes
2012-01-04 19:56 ` [PATCH 3.2.0-rc1 0/3] Used Memory Meter pseudo-device and related changes in MM Greg KH
2012-01-04 20:17   ` Rik van Riel
2012-01-04 20:42     ` Pekka Enberg
2012-01-05 23:01       ` David Rientjes
2012-01-05 12:22     ` leonid.moiseichuk
2012-01-05 11:47   ` leonid.moiseichuk
2012-01-05 12:40     ` Pekka Enberg
2012-01-05 13:02       ` leonid.moiseichuk
2012-01-05 14:57         ` Greg KH
2012-01-05 16:13           ` leonid.moiseichuk
2012-01-05 23:10             ` David Rientjes
2012-01-09  8:27               ` leonid.moiseichuk
2012-01-06  0:26     ` KOSAKI Motohiro
2012-01-09  8:49       ` leonid.moiseichuk
