linux-kernel.vger.kernel.org archive mirror
* [RFC 0/3] low memory notify
@ 2012-01-17  8:13 Minchan Kim
  2012-01-17  8:13 ` [RFC 1/3] /dev/low_mem_notify Minchan Kim
                   ` (4 more replies)
  0 siblings, 5 replies; 62+ messages in thread
From: Minchan Kim @ 2012-01-17  8:13 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, leonid.moiseichuk, kamezawa.hiroyu, penberg, Rik van Riel,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, Minchan Kim

As you can see, this is a respin of the mem_notify core from KOSAKI and
Marcelo. (KOSAKI's original patchset includes more logic, but I
intentionally left most of it out because I want to start again from the
beginning.) Recently, there have been several requests for notification
of system memory pressure. It would be very useful in various cases.
For example, big memory hogs such as QEMU, the JVM, or Firefox can release
memory when memory pressure happens. Another example, on the embedded side:
the system can close background applications. There have been some attempts
at this, but we need a more general one that does not hack the alloc/free
hot path.

I think the biggest cause of system slowness is swap-in.
Swap-in is a synchronous operation, so the application's latency becomes
large. The solution is to prevent swap-out in the first place. We cannot
prevent swap-out completely, but we can reduce it with this patch.

On a swapless system, code pages are very important for system response,
so we have to keep code pages, too. I used a very naive heuristic in this
patch, and any ideas are welcome.

I want to keep the kernel logic as simple as possible and just notify user
space. Of course, there are lots of things we have to consider, but this
simple patch should be a good starting point for discussion.

This version is strictly an RFC, so any comments are welcome.

Minchan Kim (3):
  [RFC 1/3] /dev/low_mem_notify
  [RFC 2/3] vmscan hook
  [RFC 3/3] test program

 drivers/char/mem.c             |    7 ++
 include/linux/low_mem_notify.h |    6 ++
 mm/Kconfig                     |    7 ++
 mm/Makefile                    |    1 +
 mm/low_mem_notify.c            |   61 ++++++++++++++++++++
 mm/vmscan.c                    |   28 +++++++++
 poll.c                         |  121 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 231 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/low_mem_notify.h
 create mode 100644 mm/low_mem_notify.c
 create mode 100644 poll.c

-- 
1.7.7.5



* [RFC 1/3] /dev/low_mem_notify
  2012-01-17  8:13 [RFC 0/3] low memory notify Minchan Kim
@ 2012-01-17  8:13 ` Minchan Kim
  2012-01-17  9:27   ` Pekka Enberg
  2012-01-17  9:45   ` Pekka Enberg
  2012-01-17  8:13 ` [RFC 2/3] vmscan hook Minchan Kim
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 62+ messages in thread
From: Minchan Kim @ 2012-01-17  8:13 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, leonid.moiseichuk, kamezawa.hiroyu, penberg, Rik van Riel,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, Minchan Kim, KOSAKI Motohiro

This patch adds a new device file, "/dev/low_mem_notify".
If an application polls it, it receives an event when system
memory pressure happens.

This patch is based on work KOSAKI and Marcelo did a long time ago.
http://lwn.net/Articles/268732/

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/char/mem.c             |    7 ++++
 include/linux/low_mem_notify.h |    6 ++++
 mm/Kconfig                     |    7 ++++
 mm/Makefile                    |    1 +
 mm/low_mem_notify.c            |   61 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 82 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/low_mem_notify.h
 create mode 100644 mm/low_mem_notify.c

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index d6e9d08..72bc12b 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -35,6 +35,10 @@
 # include <linux/efi.h>
 #endif
 
+#ifdef CONFIG_LOW_MEM_NOTIFY
+extern struct file_operations low_mem_notify_fops;
+#endif
+
 static inline unsigned long size_inside_page(unsigned long start,
 					     unsigned long size)
 {
@@ -867,6 +871,9 @@ static const struct memdev {
 #ifdef CONFIG_CRASH_DUMP
 	[12] = { "oldmem", 0, &oldmem_fops, NULL },
 #endif
+#ifdef CONFIG_LOW_MEM_NOTIFY
+	[13] = { "low_mem_notify", 0666, &low_mem_notify_fops, NULL },
+#endif
 };
 
 static int memory_open(struct inode *inode, struct file *filp)
diff --git a/include/linux/low_mem_notify.h b/include/linux/low_mem_notify.h
new file mode 100644
index 0000000..bc0fc89
--- /dev/null
+++ b/include/linux/low_mem_notify.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_LOW_MEM_NOTIFY_H
+#define _LINUX_LOW_MEM_NOTIFY_H
+
+void low_memory_pressure(void);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index e338407..a2f48c6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -379,3 +379,10 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config LOW_MEM_NOTIFY
+	bool "Enable low memory notification"
+	default n
+	help
+	  If the system suffers from low memory, the kernel can notify user space
+	  through /dev/low_mem_notify.
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..7856357 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_LOW_MEM_NOTIFY) += low_mem_notify.o
diff --git a/mm/low_mem_notify.c b/mm/low_mem_notify.c
new file mode 100644
index 0000000..7432307
--- /dev/null
+++ b/mm/low_mem_notify.c
@@ -0,0 +1,61 @@
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+
+static DECLARE_WAIT_QUEUE_HEAD(low_mem_wait);
+static atomic_t nr_low_mem = ATOMIC_INIT(0);
+
+struct low_mem_notify_file_info {
+        unsigned long last_proc_notify;
+};
+
+void low_memory_pressure(void)
+{
+        atomic_inc(&nr_low_mem);
+        wake_up(&low_mem_wait);
+}
+
+static int low_mem_notify_open(struct inode *inode, struct file *file)
+{
+        struct low_mem_notify_file_info *info;
+        int err = 0;
+
+        info = kmalloc(sizeof(*info), GFP_KERNEL);
+        if (!info) {
+                err = -ENOMEM;
+                goto out;
+        }
+
+        file->private_data = info;
+out:
+        return err;
+}
+
+static int low_mem_notify_release(struct inode *inode, struct file *file)
+{
+        kfree(file->private_data);
+        return 0;
+}
+
+static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait)
+{
+        unsigned int ret = 0;
+
+        poll_wait(file, &low_mem_wait, wait);
+
+        if (atomic_read(&nr_low_mem) != 0) {
+                ret = POLLIN;
+                atomic_set(&nr_low_mem, 0);
+        }
+
+        return ret;
+}
+
+struct file_operations low_mem_notify_fops = {
+        .open = low_mem_notify_open,
+        .release = low_mem_notify_release,
+        .poll = low_mem_notify_poll,
+};
+EXPORT_SYMBOL(low_mem_notify_fops);
-- 
1.7.7.5



* [RFC 2/3] vmscan hook
  2012-01-17  8:13 [RFC 0/3] low memory notify Minchan Kim
  2012-01-17  8:13 ` [RFC 1/3] /dev/low_mem_notify Minchan Kim
@ 2012-01-17  8:13 ` Minchan Kim
  2012-01-17  8:39   ` KAMEZAWA Hiroyuki
  2012-01-17  8:13 ` [RFC 3/3] test program Minchan Kim
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Minchan Kim @ 2012-01-17  8:13 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, leonid.moiseichuk, kamezawa.hiroyu, penberg, Rik van Riel,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, Minchan Kim

This patch inserts a memory pressure notification point into vmscan.c.
Most system slowness comes from swap-in. Swap-in is a synchronous
operation, so it heavily affects system response.

This patch raises the alert when the reclaimer starts to reclaim the
inactive anon list. That seems rather early, but early is better than
too late.

The other alert point is when there are few page cache pages.
In this implementation, the notification happens if (cache < free pages).
It needs more testing and tuning, or a different heuristic.
Any suggestions are welcome.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2880396..cfa2e2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/low_mem_notify.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
+
 	enum lru_list lru;
 	unsigned long nr_reclaimed, nr_scanned;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	struct blk_plug plug;
+#ifdef CONFIG_LOW_MEM_NOTIFY
+	bool low_mem = false;
+	unsigned long free, file;
+#endif
 
 restart:
 	nr_reclaimed = 0;
 	nr_scanned = sc->nr_scanned;
 	get_scan_count(mz, sc, nr, priority);
+#ifdef CONFIG_LOW_MEM_NOTIFY
+	/* We want to avoid swapout */
+	if (nr[LRU_INACTIVE_ANON])
+		low_mem = true;
+	/*
+	 * We want to avoid dropping page cache excessively
+	 * in no swap system
+	 */
+	if (nr_swap_pages <= 0) {
+		free = zone_page_state(mz->zone, NR_FREE_PAGES);
+		file = zone_page_state(mz->zone, NR_ACTIVE_FILE) +
+			zone_page_state(mz->zone, NR_INACTIVE_FILE);
+		/*
+		 * If we have very few page cache pages,
+		 * notify to user
+		 */
+		if (file < free)
+			low_mem = true;
+	}
 
+	if (low_mem)
+		low_memory_pressure();
+#endif
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
-- 
1.7.7.5



* [RFC 3/3] test program
  2012-01-17  8:13 [RFC 0/3] low memory notify Minchan Kim
  2012-01-17  8:13 ` [RFC 1/3] /dev/low_mem_notify Minchan Kim
  2012-01-17  8:13 ` [RFC 2/3] vmscan hook Minchan Kim
@ 2012-01-17  8:13 ` Minchan Kim
  2012-01-17 14:38 ` [RFC 0/3] low memory notify Colin Walters
  2012-01-17 17:16 ` Olof Johansson
  4 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2012-01-17  8:13 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, leonid.moiseichuk, kamezawa.hiroyu, penberg, Rik van Riel,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, Minchan Kim

This test program allocates 10M per second and, when a
memory pressure notification happens, releases 20M.

I tested this patch on a 512M QEMU machine running 3 instances of the
test program. I saw some swap-out, but not much, and did not see any OOM.
It obviously reduces swap-out.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 poll.c |  121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 121 insertions(+), 0 deletions(-)
 create mode 100644 poll.c

diff --git a/poll.c b/poll.c
new file mode 100644
index 0000000..3215f8b
--- /dev/null
+++ b/poll.c
@@ -0,0 +1,121 @@
+#include <poll.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <pthread.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+
+#define ALLOC_UNIT	10 /* MB */
+#define FREE_UNIT	20 /* MB */
+
+void alloc_memory();
+void free_memory();
+
+unsigned int total_memory = 0; /* MB */
+
+pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; 
+
+/*
+ * True once total allocated memory has reached 400M
+ */
+bool memory_full()
+{
+	return total_memory >= 400;
+}
+
+struct alloc_chunk {
+	void *ptr;
+	struct alloc_chunk *next;
+};
+
+struct alloc_chunk head_chunk;
+
+void init_alloc_chunk(void)
+{
+	head_chunk.ptr = NULL;
+	head_chunk.next = NULL;
+}
+
+void add_memory(void *ptr)
+{
+	struct alloc_chunk *new_chunk = malloc(sizeof(struct alloc_chunk));
+	new_chunk->ptr = ptr;
+
+	pthread_mutex_lock(&mutex);
+	new_chunk->next = head_chunk.next;
+	head_chunk.next = new_chunk;
+	total_memory += ALLOC_UNIT;
+	pthread_mutex_unlock(&mutex);
+
+	printf("[%d] Add total memory %u(MB)\n", getpid(), total_memory);
+}
+
+void alloc_memory(void)
+{
+	while(1) {
+		if (memory_full()) {
+			sleep(10);
+			continue;
+		}
+
+		void *new = malloc(ALLOC_UNIT*1024*1024);
+		memset(new, 0, ALLOC_UNIT*1024*1024);
+		add_memory(new);
+		sleep(1);
+	}
+}
+
+void free_memory(void)
+{
+	int count = FREE_UNIT / ALLOC_UNIT;
+	while(count--) {
+		struct alloc_chunk *chunk = head_chunk.next;
+		if (chunk == NULL)
+			break;
+
+		pthread_mutex_lock(&mutex);
+		head_chunk.next = chunk->next;
+		total_memory -= ALLOC_UNIT;
+		pthread_mutex_unlock(&mutex);
+
+		free(chunk->ptr);
+		free(chunk);
+
+		printf("[%d] Free total memory %u(MB)\n", getpid(), total_memory);
+	}
+}
+
+void *poll_thread(void *dummy)
+{
+	struct pollfd pfd;	
+	int fd = open("/dev/low_mem_notify", O_RDONLY); 
+	if (fd == -1) {
+		fprintf(stderr, "Fail to open\n");
+		return NULL;
+	}
+
+	pfd.fd = fd;
+	pfd.events = POLLIN;
+
+	while(1) {
+		poll(&pfd, 1, -1);
+		free_memory();
+	}
+}
+
+int main()
+{
+	pthread_t threadid;
+	init_alloc_chunk();
+
+	if (pthread_create(&threadid, NULL, poll_thread, NULL)) {
+		fprintf(stderr, "pthread create fail\n");
+		return 1;
+	}
+
+	alloc_memory();
+	return 0;
+}
-- 
1.7.7.5



* Re: [RFC 2/3] vmscan hook
  2012-01-17  8:13 ` [RFC 2/3] vmscan hook Minchan Kim
@ 2012-01-17  8:39   ` KAMEZAWA Hiroyuki
  2012-01-17  9:13     ` Minchan Kim
  0 siblings, 1 reply; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-17  8:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, penberg, Rik van Riel, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Tue, 17 Jan 2012 17:13:57 +0900
Minchan Kim <minchan@kernel.org> wrote:

> This patch insert memory pressure notify point into vmscan.c
> Most problem in system slowness is swap-in. swap-in is a synchronous
> opeartion so that it affects heavily system response.
> 
> This patch alert it when reclaimer start to reclaim inactive anon list.
> It seems rather earlier but not bad than too late.
> 
> Other alert point is when there is few cache pages
> In this implementation, if it is (cache < free pages),
> memory pressure notify happens. It has to need more testing and tuning
> or other hueristic. Any suggesion are welcome.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

My first impression: isn't this too simple?


> ---
>  mm/vmscan.c |   28 ++++++++++++++++++++++++++++
>  1 files changed, 28 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2880396..cfa2e2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -43,6 +43,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
>  #include <linux/prefetch.h>
> +#include <linux/low_mem_notify.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
>  {
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> +
>  	enum lru_list lru;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>  	struct blk_plug plug;
> +#ifdef CONFIG_LOW_MEM_NOTIFY
> +	bool low_mem = false;
> +	unsigned long free, file;
> +#endif
>  
>  restart:
>  	nr_reclaimed = 0;
>  	nr_scanned = sc->nr_scanned;
>  	get_scan_count(mz, sc, nr, priority);
> +#ifdef CONFIG_LOW_MEM_NOTIFY
> +	/* We want to avoid swapout */
> +	if (nr[LRU_INACTIVE_ANON])
> +		low_mem = true;

IIUC, nr[LRU_INACTIVE_ANON] can easily be > 0.
And get_scan_count() now checks the per-memcg LRU, so this only works when
memcg is not used.


> +	/*
> +	 * We want to avoid dropping page cache excessively
> +	 * in no swap system
> +	 */
> +	if (nr_swap_pages <= 0) {
> +		free = zone_page_state(mz->zone, NR_FREE_PAGES);
> +		file = zone_page_state(mz->zone, NR_ACTIVE_FILE) +
> +			zone_page_state(mz->zone, NR_INACTIVE_FILE);
> +		/*
> +		 * If we have very few page cache pages,
> +		 * notify to user
> +		 */
> +		if (file < free)
> +			low_mem = true;
> +	}

I can't understand why you think you can check the low-memory condition
with "file < free". And I don't think using per-zone data is good.
(I'm not sure how many zones embedded guys are using..)

Other ideas:
1. Can't we use some technique like cleancache to detect the condition?
2. Can't we measure the page-in/page-out distance by recording something?
3. Couldn't NR_ANON + NR_FILE_MAPPED represent the amount of core memory,
   if we can ignore the data file cache?
4. How about checking kswapd's busy status?



Thanks,
-Kame



* Re: [RFC 2/3] vmscan hook
  2012-01-17  8:39   ` KAMEZAWA Hiroyuki
@ 2012-01-17  9:13     ` Minchan Kim
  2012-01-17 10:05       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 62+ messages in thread
From: Minchan Kim @ 2012-01-17  9:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, LKML, leonid.moiseichuk, penberg, Rik van Riel, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 17 Jan 2012 17:13:57 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > This patch insert memory pressure notify point into vmscan.c
> > Most problem in system slowness is swap-in. swap-in is a synchronous
> > opeartion so that it affects heavily system response.
> > 
> > This patch alert it when reclaimer start to reclaim inactive anon list.
> > It seems rather earlier but not bad than too late.
> > 
> > Other alert point is when there is few cache pages
> > In this implementation, if it is (cache < free pages),
> > memory pressure notify happens. It has to need more testing and tuning
> > or other hueristic. Any suggesion are welcome.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> In my 1st impression, isn't this too simple ?

I agree it's too simple. It should be a good starting point, rather than
something unnecessarily complicated.

> 
> 
> > ---
> >  mm/vmscan.c |   28 ++++++++++++++++++++++++++++
> >  1 files changed, 28 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2880396..cfa2e2d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -43,6 +43,7 @@
> >  #include <linux/sysctl.h>
> >  #include <linux/oom.h>
> >  #include <linux/prefetch.h>
> > +#include <linux/low_mem_notify.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/div64.h>
> > @@ -2082,16 +2083,43 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
> >  {
> >  	unsigned long nr[NR_LRU_LISTS];
> >  	unsigned long nr_to_scan;
> > +
> >  	enum lru_list lru;
> >  	unsigned long nr_reclaimed, nr_scanned;
> >  	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> >  	struct blk_plug plug;
> > +#ifdef CONFIG_LOW_MEM_NOTIFY
> > +	bool low_mem = false;
> > +	unsigned long free, file;
> > +#endif
> >  
> >  restart:
> >  	nr_reclaimed = 0;
> >  	nr_scanned = sc->nr_scanned;
> >  	get_scan_count(mz, sc, nr, priority);
> > +#ifdef CONFIG_LOW_MEM_NOTIFY
> > +	/* We want to avoid swapout */
> > +	if (nr[LRU_INACTIVE_ANON])
> > +		low_mem = true;
> 
> IIUC, nr[LRU_INACTIVE_ANON] can be easily > 0.

Yes. But I thought it would be better than a late notification.
A late notification ends up in swap-out, which is the main concern of this
patch. A suggestion for better timing would help me a lot.

> And get_scan_count() now check per-memcg-lru. So, this only works when
> memcg is not used.

Hmm, I haven't looked at Johannes's recent patch unifying memcg and global
reclaim. I need time to look at it.
Thanks.

> 
> 
> > +	/*
> > +	 * We want to avoid dropping page cache excessively
> > +	 * in no swap system
> > +	 */
> > +	if (nr_swap_pages <= 0) {
> > +		free = zone_page_state(mz->zone, NR_FREE_PAGES);
> > +		file = zone_page_state(mz->zone, NR_ACTIVE_FILE) +
> > +			zone_page_state(mz->zone, NR_INACTIVE_FILE);
> > +		/*
> > +		 * If we have very few page cache pages,
> > +		 * notify to user
> > +		 */
> > +		if (file < free)
> > +			low_mem = true;
> > +	}
> 
> I can't understand why you think you can check lowmem condition by "file < free".

The reason is that I want to keep some amount of page cache.
But I admit it's a very naive heuristic and should be improved.

> And I don't think using per-zone data is good.
> (I'm not sure how many zones embeded guys using..)

Agreed. For swapless systems, we need another heuristic.

> 
> Another idea:
> 1. can't we use some technique like cleancache to detect the condition ?

I totally forgot the cleancache approach. Could you remind me of it?

> 2. can't we measure page-in/page-out distance by recording something ?

I don't understand your point. How does it relate to preventing swap-out?

> 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
>    ignore the data file cache ?

It's good, but how do we define the amount?
It's very vague, but I guess we can get a good idea from that.
Perhaps you already have one.

> 4. how about checking kswapd's busy status ?

Could you elaborate on your idea?

Kame, thanks for the reply.

> 
> 
> 
> Thanks,
> -Kame
> 


* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17  8:13 ` [RFC 1/3] /dev/low_mem_notify Minchan Kim
@ 2012-01-17  9:27   ` Pekka Enberg
  2012-01-17 16:35     ` Rik van Riel
  2012-01-17  9:45   ` Pekka Enberg
  1 sibling, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17  9:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu, Rik van Riel,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim <minchan@kernel.org> wrote:
> +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait)
> +{
> +        unsigned int ret = 0;
> +
> +        poll_wait(file, &low_mem_wait, wait);
> +
> +        if (atomic_read(&nr_low_mem) != 0) {
> +                ret = POLLIN;
> +                atomic_set(&nr_low_mem, 0);
> +        }
> +
> +        return ret;
> +}

Doesn't this mean that only one application will receive the notification?
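
To make the concern concrete, here is a minimal userspace model of the
quoted poll logic (not kernel code). It shows the shared flag being
consumed by the first poller, next to one possible alternative in which
every open file keeps its own cursor. The names event_seq, struct reader,
last_seen, poll_shared() and poll_per_reader() are hypothetical and are
not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of this is kernel code. */
int nr_low_mem;           /* shared flag, as in the patch */
unsigned long event_seq;  /* hypothetical: global event sequence number */

/* Hypothetical per-open state, one instance per reader. */
struct reader {
	unsigned long last_seen;
};

/* Models low_memory_pressure(): record one pressure event. */
void model_pressure(void)
{
	nr_low_mem++;
	event_seq++;
}

/* Patch behaviour: the first poller clears the shared flag for everyone. */
bool poll_shared(void)
{
	if (nr_low_mem != 0) {
		nr_low_mem = 0;
		return true;
	}
	return false;
}

/* Alternative: each reader compares its own cursor with the sequence. */
bool poll_per_reader(struct reader *r)
{
	assert(r != NULL);
	if (r->last_seen != event_seq) {
		r->last_seen = event_seq;
		return true;
	}
	return false;
}
```

With the shared flag, a second poller that runs after the first one sees
nothing. With per-reader cursors, every reader observes each event once,
which in turn raises the thundering-herd question.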

                        Pekka


* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17  8:13 ` [RFC 1/3] /dev/low_mem_notify Minchan Kim
  2012-01-17  9:27   ` Pekka Enberg
@ 2012-01-17  9:45   ` Pekka Enberg
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17  9:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu, Rik van Riel,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim <minchan@kernel.org> wrote:
> This patch makes new device file "/dev/low_mem_notify".
> If application polls it, it can receive event when system
> memory pressure happens.
>
> This patch is based on KOSAKI and Marcelo's long time ago work.
> http://lwn.net/Articles/268732/

I'm not loving the ABI. Alternative solutions:

  - SIGDANGER + signalfd() for poll

  - sys_eventfd()

  - sys_mem_notify_open() similar to sys_perf_event_open()
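
For reference, a rough userspace sketch of what consuming an eventfd-based
notification could look like. eventfd(), poll() and read() are the real
Linux interfaces; wait_for_pressure() and simulate_pressure() are
hypothetical helpers, and the write() only stands in for whatever the
kernel side would do in an actual design.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for an event on efd. Returns the accumulated
 * event count, 0 on timeout, or -1 on error.
 */
long wait_for_pressure(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	int ret;

	assert(efd >= 0);
	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	/* Reading an eventfd returns the counter and resets it to zero. */
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (long)count;
}

/* Stand-in for the kernel side: add one event to the counter. */
int simulate_pressure(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

One property worth noting: events that arrive while the consumer is busy
are accumulated in the counter rather than lost.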

                        Pekka


* Re: [RFC 2/3] vmscan hook
  2012-01-17  9:13     ` Minchan Kim
@ 2012-01-17 10:05       ` KAMEZAWA Hiroyuki
  2012-01-17 23:08         ` Minchan Kim
  0 siblings, 1 reply; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-17 10:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, penberg, Rik van Riel, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Tue, 17 Jan 2012 18:13:56 +0900
Minchan Kim <minchan@kernel.org> wrote:

> On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 17 Jan 2012 17:13:57 +0900
> > Minchan Kim <minchan@kernel.org> wrote:
> > 
> > 
> > > +	/*
> > > +	 * We want to avoid dropping page cache excessively
> > > +	 * in no swap system
> > > +	 */
> > > +	if (nr_swap_pages <= 0) {
> > > +		free = zone_page_state(mz->zone, NR_FREE_PAGES);
> > > +		file = zone_page_state(mz->zone, NR_ACTIVE_FILE) +
> > > +			zone_page_state(mz->zone, NR_INACTIVE_FILE);
> > > +		/*
> > > +		 * If we have very few page cache pages,
> > > +		 * notify to user
> > > +		 */
> > > +		if (file < free)
> > > +			low_mem = true;
> > > +	}
> > 
> > I can't understand why you think you can check lowmem condition by "file < free".
> 
> The reason I thought so is I want to maintain some page cache to some degree.
> But I admit It's very naive heuristic and should be improved.
> 
> > And I don't think using per-zone data is good.
> > (I'm not sure how many zones embeded guys using..)
> 
> Agree. In case of swapless system, we need another heuristic.
> 
> > 
> > Another idea:
> > 1. can't we use some technique like cleancache to detect the condition ?
> 
> I totally forgot cleancache approach. Could you remind that?
> 

It is similar to a 'victim cache': some clean pages are cached somewhere
when vmscan pages them out.

   page -> vmscan's pageout -> cleancache  -> may be discarded.

If a filesystem looks up a page that is in cleancache, it is a cache hit
and the page is brought back to the radix-tree. If not, it is read from
disk again. A cleancache for swap (frontswap) was posted, too.


> > 2. can't we measure page-in/page-out distance by recording something ?
> 
> I can't understand your point. What's relation does it with swapout prevent?
> 

If the distance between pageout -> pagein is short, it means thrashing.
For example, record a timestamp when the page (mapping, index) is
paged out, and check it at page-in.
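
A small userspace sketch of that bookkeeping, purely as an illustration.
The table size, the single `key` standing in for (mapping, index), the
logical clock and the `window` parameter are all assumptions; a kernel
implementation would look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 64	/* hypothetical table size */

struct evict_record {
	unsigned long key;	/* stands in for (mapping, index) */
	unsigned long when;	/* logical time of the page-out */
	bool valid;
};

static struct evict_record evict_table[NR_SLOTS];

/* Remember when a page was paged out (a colliding entry is overwritten). */
void record_pageout(unsigned long key, unsigned long now)
{
	struct evict_record *rec = &evict_table[key % NR_SLOTS];

	rec->key = key;
	rec->when = now;
	rec->valid = true;
}

/* At page-in: true if the page came back within `window` ticks. */
bool pagein_is_thrashing(unsigned long key, unsigned long now,
			 unsigned long window)
{
	struct evict_record *rec = &evict_table[key % NR_SLOTS];

	if (!rec->valid || rec->key != key)
		return false;
	rec->valid = false;	/* each record is consumed once */
	assert(now >= rec->when);
	return now - rec->when <= window;
}
```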


> > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
> >    ignore the data file cache ?
> 
> It's good but how do we define some amount?
> It's very vague but I guess we can get a good idea from that.
> Perhaps, you already has it.
> 

Hm, a rough idea is...

  - we now have rss counter per mm.
    - mapped anon
    - mapped file
    - swapents
 
Ok, here, add one more counter.

    - paged-out file. (I think this can be recorded in pte.)
      +1 when try_to_unmap_file() unmaps it.
      -1 when a page is back or unmapped.

Then, scan all tasks and compute:

                                 mapped_anon + mapped_file
active_map_ratio =   ----------------------------------------------------- * 100
                     mapped_anon + mapped_file + swapents + paged_out_file

Ok, how to use this value...

Like memcg's threshold notification interface, you can change the
mem_notify interface to use eventfd(), registering

   <event_fd, fd of /dev/mem_notify, threshold of active_map_ratio>

This will notify you with an event when active_map_ratio crosses the given
threshold.
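
As arithmetic, the ratio and the threshold check could be sketched as
below. The counter names follow the list above; struct mm_counters,
crossed_below() and the choice to report 100 when nothing is mapped at all
are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters summed over all tasks; paged_out_file is the proposed new one. */
struct mm_counters {
	unsigned long mapped_anon;
	unsigned long mapped_file;
	unsigned long swapents;
	unsigned long paged_out_file;
};

/* Percentage of the mapped working set that is still resident. */
unsigned int active_map_ratio(const struct mm_counters *c)
{
	unsigned long long resident, total;

	assert(c != NULL);
	resident = (unsigned long long)c->mapped_anon + c->mapped_file;
	total = resident + c->swapents + c->paged_out_file;
	if (total == 0)
		return 100;	/* assumption: nothing mapped, no pressure */
	return (unsigned int)(resident * 100 / total);
}

/* True when the ratio has just dropped below the registered threshold. */
bool crossed_below(unsigned int prev, unsigned int cur, unsigned int threshold)
{
	return prev >= threshold && cur < threshold;
}
```

For example, mapped_anon = 600, mapped_file = 200, swapents = 150 and
paged_out_file = 50 give 800 * 100 / 1000 = 80.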

complicated ? 


> > 4. how about checking kswapd's busy status ?
> 
> Could you elaborate on your idea?
> 

I just thought kswapd may not stop when the situation is very bad.

Thanks,
-Kame





* Re: [RFC 0/3] low memory notify
  2012-01-17  8:13 [RFC 0/3] low memory notify Minchan Kim
                   ` (2 preceding siblings ...)
  2012-01-17  8:13 ` [RFC 3/3] test program Minchan Kim
@ 2012-01-17 14:38 ` Colin Walters
  2012-01-17 15:04   ` Pekka Enberg
  2012-01-17 16:44   ` Rik van Riel
  2012-01-17 17:16 ` Olof Johansson
  4 siblings, 2 replies; 62+ messages in thread
From: Colin Walters @ 2012-01-17 14:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu, penberg,
	Rik van Riel, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Marcelo Tosatti, Andrew Morton, Ronen Hod

On Tue, 2012-01-17 at 17:13 +0900, Minchan Kim wrote:
> As you can see, it's respin of mem_notify core of KOSAKI and Marcelo.
> (Of course, KOSAKI's original patchset includes more logics but I didn't
> include all things intentionally because I want to start from beginning
> again) Recently, there are some requirements of notification of system
> memory pressure.

How does this relate to the existing cgroups memory notifications?  See
Documentation/cgroups/memory.txt under "10. OOM Control"

>  It would be very useful for various cases.
> For example, QEMU/JVM/Firefox like big memory hogger can release their memory
> when memory pressure happens.

I don't know about QEMU, but the key characteristic of the JVM and
Firefox is that they use garbage collection.  Which also applies to
Python, Ruby, Google Go, Haskell, OCaml...

So what you really want to be investigating here is integration between
a garbage collector and the system VM.  Your test program looks nothing
like a garbage collector.  I'd expect most of the performance tradeoffs
to be similar between these runtimes.  The Azul people have been doing
something like this: http://www.managedruntime.org/

In Firefox' case though it can also drop other caches, e.g.:

http://people.gnome.org/~federico/news-2007-09.html#firefox-memory-1

As far as the desktop goes, I want to get notified if we're going to hit
swap, not if we're close to exhausting the total of RAM+swap.  While
swap may make sense for servers that care about throughput mainly, I
care a lot about latency.




* Re: [RFC 0/3] low memory notify
  2012-01-17 14:38 ` [RFC 0/3] low memory notify Colin Walters
@ 2012-01-17 15:04   ` Pekka Enberg
  2012-01-17 16:44   ` Rik van Riel
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17 15:04 UTC (permalink / raw)
  To: Colin Walters
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	Rik van Riel, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Marcelo Tosatti, Andrew Morton, Ronen Hod

On Tue, Jan 17, 2012 at 4:38 PM, Colin Walters <walters@verbum.org> wrote:
> So what you really want to be investigating here is integration between
> a garbage collector and the system VM.  Your test program looks nothing
> like a garbage collector.  I'd expect most of the performance tradeoffs
> to be similar between these runtimes.  The Azul people have been doing
> something like this: http://www.managedruntime.org/

The interaction isn't all that complex, really. I'd expect most VMs
to simply wake up the GC thread when poll() returns. GCs that are able
to compact the heap can madvise(MADV_DONTNEED) or even munmap() unused
parts of the heap.
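
A rough sketch of the last step, assuming a heap obtained from mmap() whose
live data has already been compacted to the front. madvise() and
MADV_DONTNEED are the real interfaces; release_unused_heap() and its
parameters are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hand the unused tail of a heap back to the kernel. `heap` must be
 * page-aligned (for example, returned by mmap()); `used` is the number of
 * live bytes at the start of the heap. Returns the number of bytes
 * released, or -1 on error.
 */
long release_unused_heap(void *heap, size_t size, size_t used)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t keep = (used + page - 1) / page * page;	/* round up to a page */

	assert(used <= size);
	if (keep >= size)
		return 0;	/* nothing beyond the live data */
	if (madvise((char *)heap + keep, size - keep, MADV_DONTNEED) != 0)
		return -1;
	return (long)(size - keep);
}
```

For a private anonymous mapping, the released range stays mapped and reads
back as zero-filled pages on the next access, so the GC can keep using the
same address range later.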

                                Pekka


* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17  9:27   ` Pekka Enberg
@ 2012-01-17 16:35     ` Rik van Riel
  2012-01-17 18:51       ` Pekka Enberg
  0 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2012-01-17 16:35 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On 01/17/2012 04:27 AM, Pekka Enberg wrote:
> On Tue, Jan 17, 2012 at 10:13 AM, Minchan Kim <minchan@kernel.org> wrote:
>> +static unsigned int low_mem_notify_poll(struct file *file, poll_table *wait)
>> +{
>> +        unsigned int ret = 0;
>> +
>> +        poll_wait(file, &low_mem_wait, wait);
>> +
>> +        if (atomic_read(&nr_low_mem) != 0) {
>> +                ret = POLLIN;
>> +                atomic_set(&nr_low_mem, 0);
>> +        }
>> +
>> +        return ret;
>> +}
>
> Doesn't this mean that only one application will receive the notification?

One at a time, which could be a good thing since the last
thing we want to do when the system is under memory
pressure is create a thundering herd.

OTOH, we do need to ensure that programs take turns getting
the memory pressure notification.  I do not know whether
poll_wait automatically takes care of that...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 0/3] low memory notify
  2012-01-17 14:38 ` [RFC 0/3] low memory notify Colin Walters
  2012-01-17 15:04   ` Pekka Enberg
@ 2012-01-17 16:44   ` Rik van Riel
  1 sibling, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2012-01-17 16:44 UTC (permalink / raw)
  To: Colin Walters
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	penberg, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Marcelo Tosatti, Andrew Morton, Ronen Hod

On 01/17/2012 09:38 AM, Colin Walters wrote:

> How does this relate to the existing cgroups memory notifications?  See
> Documentation/cgroups/memory.txt under "10. OOM Control"

> As far as the desktop goes, I want to get notified if we're going to hit
> swap, not if we're close to exhausting the total of RAM+swap.  While
> swap may make sense for servers that care about throughput mainly, I
> care a lot about latency.

You just answered your own question :)

This code is indeed meant to avoid/reduce swap use and
improve userspace latencies.

Minchan posted a very simple example patch set, so we
can get an idea in what direction people would want
the code to go.  This often beats working on complex
code for weeks, and then having people tell you they
wanted something else :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 0/3] low memory notify
  2012-01-17  8:13 [RFC 0/3] low memory notify Minchan Kim
                   ` (3 preceding siblings ...)
  2012-01-17 14:38 ` [RFC 0/3] low memory notify Colin Walters
@ 2012-01-17 17:16 ` Olof Johansson
  4 siblings, 0 replies; 62+ messages in thread
From: Olof Johansson @ 2012-01-17 17:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu, penberg,
	Rik van Riel, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Marcelo Tosatti, Andrew Morton, Ronen Hod

Hi,

On Tue, Jan 17, 2012 at 12:13 AM, Minchan Kim <minchan@kernel.org> wrote:
> As you can see, it's respin of mem_notify core of KOSAKI and Marcelo.
> (Of course, KOSAKI's original patchset includes more logics but I didn't
> include all things intentionally because I want to start from beginning
> again) Recently, there are some requirements of notification of system
> memory pressure. It would be very useful for various cases.
> For example, QEMU/JVM/Firefox like big memory hogger can release their memory
> when memory pressure happens. Another example in embedded side,
> they can close background application. For this, there are some trial but
> we need more general one and not-hacked alloc/free hot path.
>
> I think most big problem of system slowness is swap-in operation.
> Swap-in is a synchronous operation so application's latency would be
> big. Solution for that is prevent swap-out itself. We couldn't prevent
> swapout totally but could reduce it with this patch.
>
> In case of swapless system, code page is very important for system response.
> So we have to keep code page, too. I used very naive heuristic in this patch
> but welcome to any idea.
>
> I want to make kernel logic simple if possible and just notify to user space.
> Of course, there are lots of thing we have to consider but for discussion
> this simple patch would be a good start point.

This is almost exactly what we've been looking at doing for Chrome OS
(which is swapless). In our case, the browser is by far the largest
memory consumer on the system, and we have for quite a while been
playing tricks with OOM scores trying to make the interaction between
the VM and the application happen right such that if we're OOM, the
"right" tab process gets killed, etc. But it's not enough (and it's
not always accurate enough). Chrome definitely knows already what it
would prefer to do to release memory, so having a simple notifier for
low memory condition is preferred.

We have considered doing it through cgroups but it adds a level of
complexity that we don't need for this use case (we do already use
cgroups for other reasons though). If this simpler solution is heading
towards inclusion we'll probably use it instead.


-Olof

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 16:35     ` Rik van Riel
@ 2012-01-17 18:51       ` Pekka Enberg
  2012-01-17 19:30         ` Rik van Riel
                           ` (4 more replies)
  0 siblings, 5 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17 18:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

Hello,

Ok, so here's a proof-of-concept patch that implements sample-based,
per-process free-threshold VM event watching using a perf-like syscall ABI.
I'd really like to see something like this, which is much more extensible and
clean than the /dev-based ABIs that people have proposed so far.

 			Pekka

------------------->

From a07f93fdca360b20daef4a5d66f2a5746f31f6a6 Mon Sep 17 00:00:00 2001
From: Pekka Enberg <penberg@kernel.org>
Date: Tue, 17 Jan 2012 17:51:48 +0200
Subject: [PATCH] vmnotify: VM event notification system

This patch implements a new sys_vmnotify_fd() system call that returns a
pollable file descriptor that can be used to watch VM events.

For example, to watch for a VM event when free memory is at or below 99% of
available memory using a 1-second sample period, you'd do something like this:

     struct vmnotify_config config;
     struct vmnotify_event event;
     struct pollfd pollfd;
     int fd;

     config = (struct vmnotify_config) {
             .type                   = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
             .sample_period_ns       = 1000000000L,
             .free_threshold         = 99,
     };

     fd = sys_vmnotify_fd(&config);

     pollfd.fd               = fd;
     pollfd.events           = POLLIN;

     if (poll(&pollfd, 1, -1) < 0) {
             perror("poll failed");
             exit(1);
     }

     memset(&event, 0, sizeof(event));

     if (read(fd, &event, sizeof(event)) < 0) {
             perror("read failed");
             exit(1);
     }

Signed-off-by: Pekka Enberg <penberg@kernel.org>
---
  arch/x86/include/asm/unistd_64.h       |    2 +
  include/linux/vmnotify.h               |   44 ++++++
  mm/Kconfig                             |    6 +
  mm/Makefile                            |    1 +
  mm/vmnotify.c                          |  235 ++++++++++++++++++++++++++++++++
  tools/testing/vmnotify/vmnotify-test.c |   68 +++++++++
  6 files changed, 356 insertions(+), 0 deletions(-)
  create mode 100644 include/linux/vmnotify.h
  create mode 100644 mm/vmnotify.c
  create mode 100644 tools/testing/vmnotify/vmnotify-test.c

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 0431f19..b0928cd 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -686,6 +686,8 @@ __SYSCALL(__NR_getcpu, sys_getcpu)
  __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv)
  #define __NR_process_vm_writev			311
  __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev)
+#define __NR_vmnotify_fd			312
+__SYSCALL(__NR_vmnotify_fd, sys_vmnotify_fd)

  #ifndef __NO_STUBS
  #define __ARCH_WANT_OLD_READDIR
diff --git a/include/linux/vmnotify.h b/include/linux/vmnotify.h
new file mode 100644
index 0000000..8f8642b
--- /dev/null
+++ b/include/linux/vmnotify.h
@@ -0,0 +1,44 @@
+#ifndef _LINUX_VMNOTIFY_H
+#define _LINUX_VMNOTIFY_H
+
+#include <linux/types.h>
+
+enum {
+	VMNOTIFY_TYPE_FREE_THRESHOLD	= 1ULL << 0,
+	VMNOTIFY_TYPE_SAMPLE		= 1ULL << 1,
+};
+
+struct vmnotify_config {
+	/*
+	 * Size of the struct for ABI extensibility.
+	 */
+	__u32		   size;
+
+	/*
+	 * Notification type bitmask
+	 */
+	__u64			type;
+
+	/*
+	 * Free memory threshold in percentages [1..99]
+	 */
+	__u32			free_threshold;
+
+	/*
+	 * Sample period in nanoseconds
+	 */
+	__u64			sample_period_ns;
+};
+
+struct vmnotify_event {
+	/* Size of the struct for ABI extensibility. */
+	__u32			size;
+
+	__u64			nr_avail_pages;
+
+	__u64			nr_swap_pages;
+
+	__u64			nr_free_pages;
+};
+
+#endif /* _LINUX_VMNOTIFY_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 011b110..6631167 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -373,3 +373,9 @@ config CLEANCACHE
  	  in a negligible performance hit.

  	  If unsure, say Y to enable cleancache
+
+config VMNOTIFY
+	bool "Enable VM event notification system"
+	default n
+	help
+	  If unsure, say N to disable vmnotify
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..e1b5db3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
  obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_VMNOTIFY) += vmnotify.o
diff --git a/mm/vmnotify.c b/mm/vmnotify.c
new file mode 100644
index 0000000..6800450
--- /dev/null
+++ b/mm/vmnotify.c
@@ -0,0 +1,235 @@
+#include <linux/anon_inodes.h>
+#include <linux/vmnotify.h>
+#include <linux/syscalls.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+
+#define VMNOTIFY_MAX_FREE_THRESHOD	100
+
+struct vmnotify_watch {
+	struct vmnotify_config		config;
+
+	struct mutex			mutex;
+	bool				pending;
+	struct vmnotify_event		event;
+
+	/* sampling */
+	struct hrtimer			timer;
+
+	/* poll */
+	wait_queue_head_t		waitq;
+};
+
+static bool vmnotify_match(struct vmnotify_watch *watch, struct vmnotify_event *event)
+{
+	if (watch->config.type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
+		u64 threshold;
+
+		if (!event->nr_avail_pages)
+			return false;
+
+		threshold = event->nr_free_pages * 100 / event->nr_avail_pages;
+		if (threshold > watch->config.free_threshold)
+			return false;
+	}
+
+	return true;
+}
+
+static void vmnotify_sample(struct vmnotify_watch *watch)
+{
+	struct vmnotify_event event;
+	struct sysinfo si;
+
+	memset(&event, 0, sizeof(event));
+
+	event.size		= sizeof(event);
+	event.nr_free_pages	= global_page_state(NR_FREE_PAGES);
+
+	si_meminfo(&si);
+	event.nr_avail_pages	= si.totalram;
+
+#ifdef CONFIG_SWAP
+	si_swapinfo(&si);
+	event.nr_swap_pages	= si.totalswap;
+#endif
+
+	if (!vmnotify_match(watch, &event))
+		return;
+
+	mutex_lock(&watch->mutex);
+
+	watch->pending = true;
+
+	memcpy(&watch->event, &event, sizeof(event));
+
+	mutex_unlock(&watch->mutex);
+}
+
+static enum hrtimer_restart vmnotify_timer_fn(struct hrtimer *hrtimer)
+{
+	struct vmnotify_watch *watch = container_of(hrtimer, struct vmnotify_watch, timer);
+	u64 sample_period = watch->config.sample_period_ns;
+
+	vmnotify_sample(watch);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
+
+	wake_up(&watch->waitq);
+
+	return HRTIMER_RESTART;
+}
+
+static void vmnotify_start_timer(struct vmnotify_watch *watch)
+{
+	u64 sample_period = watch->config.sample_period_ns;
+
+	hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	watch->timer.function = vmnotify_timer_fn;
+
+	hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
+}
+
+static unsigned int vmnotify_poll(struct file *file, poll_table *wait)
+{
+	struct vmnotify_watch *watch = file->private_data;
+	unsigned int events = 0;
+
+	poll_wait(file, &watch->waitq, wait);
+
+	mutex_lock(&watch->mutex);
+
+	if (watch->pending)
+		events |= POLLIN;
+
+	mutex_unlock(&watch->mutex);
+
+	return events;
+}
+
+static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct vmnotify_watch *watch = file->private_data;
+	int ret = 0;
+
+	mutex_lock(&watch->mutex);
+
+	if (!watch->pending)
+		goto out_unlock;
+
+	if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	ret = watch->event.size;
+
+	watch->pending = false;
+
+out_unlock:
+	mutex_unlock(&watch->mutex);
+
+	return ret;
+}
+
+static int vmnotify_release(struct inode *inode, struct file *file)
+{
+	struct vmnotify_watch *watch = file->private_data;
+
+	hrtimer_cancel(&watch->timer);
+
+	kfree(watch);
+
+	return 0;
+}
+
+static const struct file_operations vmnotify_fops = {
+	.poll		= vmnotify_poll,
+	.read		= vmnotify_read,
+	.release	= vmnotify_release,
+};
+
+static struct vmnotify_watch *vmnotify_watch_alloc(void)
+{
+	struct vmnotify_watch *watch;
+
+	watch = kzalloc(sizeof *watch, GFP_KERNEL);
+	if (!watch)
+		return NULL;
+
+	mutex_init(&watch->mutex);
+
+	init_waitqueue_head(&watch->waitq);
+
+	return watch;
+}
+
+static int vmnotify_copy_config(struct vmnotify_config __user *uconfig,
+				struct vmnotify_config *config)
+{
+	int ret;
+
+	ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config));
+	if (ret)
+		return -EFAULT;
+
+	if (!config->type)
+		return -EINVAL;
+
+	if (config->type & VMNOTIFY_TYPE_SAMPLE) {
+		if (config->sample_period_ns < NSEC_PER_MSEC)
+			return -EINVAL;
+	}
+
+	if (config->type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
+		if (config->free_threshold > VMNOTIFY_MAX_FREE_THRESHOD)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE1(vmnotify_fd,
+		struct vmnotify_config __user *, uconfig)
+{
+	struct vmnotify_watch *watch;
+	struct file *file;
+	int err;
+	int fd;
+
+	watch = vmnotify_watch_alloc();
+	if (!watch)
+		return -ENOMEM;
+
+	err = vmnotify_copy_config(uconfig, &watch->config);
+	if (err)
+		goto err_free;
+
+	fd = get_unused_fd_flags(O_RDONLY);
+	if (fd < 0) {
+		err = fd;
+		goto err_free;
+	}
+
+	file = anon_inode_getfile("[vmnotify]", &vmnotify_fops, watch, O_RDONLY);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_fd;
+	}
+
+	fd_install(fd, file);
+
+	if (watch->config.type & VMNOTIFY_TYPE_SAMPLE)
+		vmnotify_start_timer(watch);
+
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_free:
+	kfree(watch);
+	return err;
+}
diff --git a/tools/testing/vmnotify/vmnotify-test.c b/tools/testing/vmnotify/vmnotify-test.c
new file mode 100644
index 0000000..3c6b26d
--- /dev/null
+++ b/tools/testing/vmnotify/vmnotify-test.c
@@ -0,0 +1,68 @@
+#include "../../../include/linux/vmnotify.h"
+
+#if defined(__x86_64__)
+#include "../../../arch/x86/include/asm/unistd.h"
+#endif
+
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <stdio.h>
+#include <poll.h>
+
+static int sys_vmnotify_fd(struct vmnotify_config *config)
+{
+	config->size = sizeof(*config);
+
+	return syscall(__NR_vmnotify_fd, config);
+}
+
+int main(int argc, char *argv[])
+{
+	struct vmnotify_config config;
+	struct vmnotify_event event;
+	struct pollfd pollfd;
+	int i;
+	int fd;
+
+	config = (struct vmnotify_config) {
+		.type			= VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
+		.sample_period_ns	= 1000000000L,
+		.free_threshold		= 99,
+	};
+
+	fd = sys_vmnotify_fd(&config);
+	if (fd < 0) {
+		perror("vmnotify_fd failed");
+		exit(1);
+	}
+
+	for (i = 0; i < 10; i++) {
+		pollfd.fd		= fd;
+		pollfd.events		= POLLIN;
+
+		if (poll(&pollfd, 1, -1) < 0) {
+			perror("poll failed");
+			exit(1);
+		}
+
+		memset(&event, 0, sizeof(event));
+
+		if (read(fd, &event, sizeof(event)) < 0) {
+			perror("read failed");
+			exit(1);
+		}
+
+		printf("VM event:\n");
+		printf("\tsize=%u\n", event.size);
+		printf("\tnr_avail_pages=%llu\n", (unsigned long long)event.nr_avail_pages);
+		printf("\tnr_swap_pages=%llu\n", (unsigned long long)event.nr_swap_pages);
+		printf("\tnr_free_pages=%llu\n", (unsigned long long)event.nr_free_pages);
+	}
+	if (close(fd) < 0) {
+		perror("close failed");
+		exit(1);
+	}
+
+	return 0;
+}
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 18:51       ` Pekka Enberg
@ 2012-01-17 19:30         ` Rik van Riel
  2012-01-17 19:49           ` Pekka Enberg
  2012-01-17 23:20         ` Minchan Kim
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2012-01-17 19:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On 01/17/2012 01:51 PM, Pekka Enberg wrote:
> Hello,
>
> Ok, so here's a proof of concept patch that implements sample-based
> per-process free threshold VM event watching using perf-like syscall
> ABI. I'd really like to see something like this that's much more
> extensible and clean than the /dev based ABIs that people have proposed
> so far.

Looks like a nice extensible interface to me.

The only thing is, I expect we will not want to wake
up processes most of the time, when there is no memory
pressure, because that would just waste battery power
and/or cpu time that could be used for something else.

The desire to avoid such wakeups makes it harder to
wake up processes at arbitrary points set by the API.

Another issue is that we might be running two programs
on the system, each with a different threshold for
"lets free some of my cache".  Say one program sets
the threshold at 20% free/cache memory, the other
program at 10%.

We could end up with the first process continually
throwing away its caches, while the second process
never gives its unused memory back to the kernel.

I am not sure what the right thing to do would be...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 19:30         ` Rik van Riel
@ 2012-01-17 19:49           ` Pekka Enberg
  2012-01-17 19:54             ` Pekka Enberg
  2012-01-17 19:57             ` Pekka Enberg
  0 siblings, 2 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17 19:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 9:30 PM, Rik van Riel <riel@redhat.com> wrote:
> Looks like a nice extensible interface to me.
>
> The only thing is, I expect we will not want to wake
> up processes most of the time, when there is no memory
> pressure, because that would just waste battery power
> and/or cpu time that could be used for something else.
>
> The desire to avoid such wakeups makes it harder to
> wake up processes at arbitrary points set by the API.

Sure. You could either bump up the threshold or use Minchan's hooks - or both.

On Tue, Jan 17, 2012 at 9:30 PM, Rik van Riel <riel@redhat.com> wrote:
> Another issue is that we might be running two programs
> on the system, each with a different threshold for
> "lets free some of my cache".  Say one program sets
> the threshold at 20% free/cache memory, the other
> program at 10%.
>
> We could end up with the first process continually
> throwing away its caches, while the second process
> never gives its unused memory back to the kernel.
>
> I am not sure what the right thing to do would be...

One option is to use per-process thresholds on RSS, for example, and
also support system-wide thresholds.

That said, I'd really like to see the N9 and Android policies
supported with this ABI. It's much easier to make it generic once we
support real-world use cases.

                        Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 19:49           ` Pekka Enberg
@ 2012-01-17 19:54             ` Pekka Enberg
  2012-01-17 19:57             ` Pekka Enberg
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17 19:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 9:49 PM, Pekka Enberg <penberg@kernel.org> wrote:
> That said, I'd really like to see the N9 and Android policies
> supported with this ABI. It's much easier to make it generic once we
> support real-world use cases.

If people are interested in hacking on the thing, I pushed the commit
in 'vmnotify/core' branch of

    git://github.com/penberg/linux.git

                        Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 19:49           ` Pekka Enberg
  2012-01-17 19:54             ` Pekka Enberg
@ 2012-01-17 19:57             ` Pekka Enberg
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-17 19:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 9:49 PM, Pekka Enberg <penberg@kernel.org> wrote:
>> The desire to avoid such wakeups makes it harder to
>> wake up processes at arbitrary points set by the API.
>
> Sure. You could either bump up the threshold or use Minchan's hooks - or both.

s/threshold/sample period/g

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 2/3] vmscan hook
  2012-01-17 10:05       ` KAMEZAWA Hiroyuki
@ 2012-01-17 23:08         ` Minchan Kim
  2012-01-18  0:18           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 62+ messages in thread
From: Minchan Kim @ 2012-01-17 23:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, LKML, leonid.moiseichuk, penberg, Rik van Riel, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Tue, Jan 17, 2012 at 07:05:12PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 17 Jan 2012 18:13:56 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > On Tue, Jan 17, 2012 at 05:39:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 17 Jan 2012 17:13:57 +0900
> > > Minchan Kim <minchan@kernel.org> wrote:
> > > 
> > > 
> > > > +	/*
> > > > +	 * We want to avoid dropping page cache excessively
> > > > +	 * in no swap system
> > > > +	 */
> > > > +	if (nr_swap_pages <= 0) {
> > > > +		free = zone_page_state(mz->zone, NR_FREE_PAGES);
> > > > +		file = zone_page_state(mz->zone, NR_ACTIVE_FILE) +
> > > > +			zone_page_state(mz->zone, NR_INACTIVE_FILE);
> > > > +		/*
> > > > +		 * If we have very few page cache pages,
> > > > +		 * notify to user
> > > > +		 */
> > > > +		if (file < free)
> > > > +			low_mem = true;
> > > > +	}
> > > 
> > > I can't understand why you think you can check lowmem condition by "file < free".
> > 
> > The reason I thought so is I want to maintain some page cache to some degree.
> > But I admit It's very naive heuristic and should be improved.
> > 
> > > And I don't think using per-zone data is good.
> > > (I'm not sure how many zones embeded guys using..)
> > 
> > Agree. In case of swapless system, we need another heuristic.
> > 
> > > 
> > > Another idea:
> > > 1. can't we use some technique like cleancache to detect the condition ?
> > 
> > I totally forgot cleancache approach. Could you remind that?
> > 
> 
> Similar to 'victim cache'. Then, cache some clean pages somewhere when
> vmscan pageout it.
> 
>    page -> vmscan's pageout -> cleancache  -> may be discarded.
> 
> If a filesystem look up a page which is in a cleancache, cache-hit and
> bring it back to radix-tree. If not, read from disk again.
> And cleancache for swap(frontswap) was posted, too.

I am not sure this can prevent swapout.
I think it ends up evicting pages into swap devices.

> 
> 
> > > 2. can't we measure page-in/page-out distance by recording something ?
> > 
> > I can't understand your point. What's relation does it with swapout prevent?
> > 
> 
> If distance between pageout -> pagein is short, it means thrashing.
> For example, recoding the timestamp when the page(mapping, index) was
> paged-out, and check it at page-in.

Our goal is to prevent swapout. By the time we detect thrashing, it's too late.

> 
> 
> > > 3. NR_ANON + NR_FILE_MAPPED can't mean the amount of core memory if we can
> > >    ignore the data file cache ?
> > 
> > It's good but how do we define some amount?
> > It's very vague but I guess we can get a good idea from that.
> > Perhaps, you already has it.
> > 
> 
> Hm, a rough idea is...
> 
>   - we now have rss counter per mm.
>     - mapped anon
>     - mapped file
>     - swapents
>  
> Ok, here, add one more counter.
> 
>     - paged-out file. (I think this can be recorded in pte.)
>       +1 when try_to_unmap_file() unmaps it.
>       -1 when a page is back or unmapped.
> 
> Then, scanning all tasks. Then,
> 
>                                  mapped_anon + mapped_file
> active_map_ratio =   ----------------------------------------------------- * 100
>                      mapped_anon + mapped_file + swapents + paged_out_file
> 
> Ok, how to use this value...
> 
> Like memcg's threshold notify interface, you can change the mem_notify interface
> to use eventfd() as
> 
>    <event_fd, fd of /dev/mem_notify, threshold of active_map_ratio>
> 
> This will inform you an event when active_map_ratio crosses passed threshold.
> 
> complicated ? 

Yes. :)
I want to keep this simple if possible.

> 
> 
> > > 4. how about checking kswapd's busy status ?
> > 
> > Could you elaborate on your idea?
> > 
> 
> I just thought kswapd may not stop when the situation is very bad.

As I said earlier, the goal is to prevent swapping.
By the time we find kswapd busy, many pages may already have been swapped out, so it's too late.

> 
> Thanks,
> -Kame
> 
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 18:51       ` Pekka Enberg
  2012-01-17 19:30         ` Rik van Riel
@ 2012-01-17 23:20         ` Minchan Kim
  2012-01-18  7:16           ` Pekka Enberg
  2012-01-18  9:06         ` leonid.moiseichuk
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Minchan Kim @ 2012-01-17 23:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Rik van Riel, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 08:51:13PM +0200, Pekka Enberg wrote:
> Hello,
> 
> Ok, so here's a proof of concept patch that implements sample-based
> per-process free threshold VM event watching using perf-like syscall
> ABI. I'd really like to see something like this that's much more
> extensible and clean than the /dev based ABIs that people have
> proposed so far.
> 
> 			Pekka
> 
> ------------------->
> 
> From a07f93fdca360b20daef4a5d66f2a5746f31f6a6 Mon Sep 17 00:00:00 2001
> From: Pekka Enberg <penberg@kernel.org>
> Date: Tue, 17 Jan 2012 17:51:48 +0200
> Subject: [PATCH] vmnotify: VM event notification system
> 
> This patch implements a new sys_vmnotify_fd() system call that returns a
> pollable file descriptor that can be used to watch VM events.
> 
> For example, to watch for VM event when free memory is below 99% of available
> memory using 1 second sample period, you'd do something like this:
> 
>     struct vmnotify_config config;
>     struct vmnotify_event event;
>     struct pollfd pollfd;
>     int fd;
> 
>     config = (struct vmnotify_config) {
>             .type                   = VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
>             .sample_period_ns       = 1000000000L,
>             .free_threshold         = 99,
>     };
> 
>     fd = sys_vmnotify_fd(&config);
> 
>     pollfd.fd               = fd;
>     pollfd.events           = POLLIN;
> 
>     if (poll(&pollfd, 1, -1) < 0) {
>             perror("poll failed");
>             exit(1);
>     }
> 
>     memset(&event, 0, sizeof(event));
> 
>     if (read(fd, &event, sizeof(event)) < 0) {
>             perror("read failed");
>             exit(1);
>     }

Hi Pekka,

I haven't looked at your code yet (I will), but from the description
I'm still not convinced we really need a process-specific threshold like 99%.
I think an application can already get that by polling /proc/meminfo, without
this mechanism, if it really wants to.

I would like to notify when the system is in trouble with memory pressure,
without any process-specific threshold. Of course, an application can't predict
that (i.e., an application can learn about system memory pressure from
/proc/meminfo, but it can't know when swapout really happens). The kernel
low-memory notification has to give that kind of notification to user space, I think.

> 
> Signed-off-by: Pekka Enberg <penberg@kernel.org>
> ---
>  arch/x86/include/asm/unistd_64.h       |    2 +
>  include/linux/vmnotify.h               |   44 ++++++
>  mm/Kconfig                             |    6 +
>  mm/Makefile                            |    1 +
>  mm/vmnotify.c                          |  235 ++++++++++++++++++++++++++++++++
>  tools/testing/vmnotify/vmnotify-test.c |   68 +++++++++
>  6 files changed, 356 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/vmnotify.h
>  create mode 100644 mm/vmnotify.c
>  create mode 100644 tools/testing/vmnotify/vmnotify-test.c
> 
> diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
> index 0431f19..b0928cd 100644
> --- a/arch/x86/include/asm/unistd_64.h
> +++ b/arch/x86/include/asm/unistd_64.h
> @@ -686,6 +686,8 @@ __SYSCALL(__NR_getcpu, sys_getcpu)
>  __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv)
>  #define __NR_process_vm_writev			311
>  __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev)
> +#define __NR_vmnotify_fd			312
> +__SYSCALL(__NR_vmnotify_fd, sys_vmnotify_fd)
> 
>  #ifndef __NO_STUBS
>  #define __ARCH_WANT_OLD_READDIR
> diff --git a/include/linux/vmnotify.h b/include/linux/vmnotify.h
> new file mode 100644
> index 0000000..8f8642b
> --- /dev/null
> +++ b/include/linux/vmnotify.h
> @@ -0,0 +1,44 @@
> +#ifndef _LINUX_VMNOTIFY_H
> +#define _LINUX_VMNOTIFY_H
> +
> +#include <linux/types.h>
> +
> +enum {
> +	VMNOTIFY_TYPE_FREE_THRESHOLD	= 1ULL << 0,
> +	VMNOTIFY_TYPE_SAMPLE		= 1ULL << 1,
> +};
> +
> +struct vmnotify_config {
> +	/*
> +	 * Size of the struct for ABI extensibility.
> +	 */
> +	__u32		   size;
> +
> +	/*
> +	 * Notification type bitmask
> +	 */
> +	__u64			type;
> +
> +	/*
> +	 * Free memory threshold in percentages [1..99]
> +	 */
> +	__u32			free_threshold;
> +
> +	/*
> +	 * Sample period in nanoseconds
> +	 */
> +	__u64			sample_period_ns;
> +};
> +
> +struct vmnotify_event {
> +	/* Size of the struct for ABI extensibility. */
> +	__u32			size;
> +
> +	__u64			nr_avail_pages;
> +
> +	__u64			nr_swap_pages;
> +
> +	__u64			nr_free_pages;
> +};
> +
> +#endif /* _LINUX_VMNOTIFY_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..6631167 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -373,3 +373,9 @@ config CLEANCACHE
>  	  in a negligible performance hit.
> 
>  	  If unsure, say Y to enable cleancache
> +
> +config VMNOTIFY
> +	bool "Enable VM event notification system"
> +	default n
> +	help
> +	  If unsure, say N to disable vmnotify
> diff --git a/mm/Makefile b/mm/Makefile
> index 50ec00e..e1b5db3 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
> +obj-$(CONFIG_VMNOTIFY) += vmnotify.o
> diff --git a/mm/vmnotify.c b/mm/vmnotify.c
> new file mode 100644
> index 0000000..6800450
> --- /dev/null
> +++ b/mm/vmnotify.c
> @@ -0,0 +1,235 @@
> +#include <linux/anon_inodes.h>
> +#include <linux/vmnotify.h>
> +#include <linux/syscalls.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/swap.h>
> +
> +#define VMNOTIFY_MAX_FREE_THRESHOD	100
> +
> +struct vmnotify_watch {
> +	struct vmnotify_config		config;
> +
> +	struct mutex			mutex;
> +	bool				pending;
> +	struct vmnotify_event		event;
> +
> +	/* sampling */
> +	struct hrtimer			timer;
> +
> +	/* poll */
> +	wait_queue_head_t		waitq;
> +};
> +
> +static bool vmnotify_match(struct vmnotify_watch *watch, struct vmnotify_event *event)
> +{
> +	if (watch->config.type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
> +		u64 threshold;
> +
> +		if (!event->nr_avail_pages)
> +			return false;
> +
> +		threshold = event->nr_free_pages * 100 / event->nr_avail_pages;
> +		if (threshold > watch->config.free_threshold)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void vmnotify_sample(struct vmnotify_watch *watch)
> +{
> +	struct vmnotify_event event;
> +	struct sysinfo si;
> +
> +	memset(&event, 0, sizeof(event));
> +
> +	event.size		= sizeof(event);
> +	event.nr_free_pages	= global_page_state(NR_FREE_PAGES);
> +
> +	si_meminfo(&si);
> +	event.nr_avail_pages	= si.totalram;
> +
> +#ifdef CONFIG_SWAP
> +	si_swapinfo(&si);
> +	event.nr_swap_pages	= si.totalswap;
> +#endif
> +
> +	if (!vmnotify_match(watch, &event))
> +		return;
> +
> +	mutex_lock(&watch->mutex);
> +
> +	watch->pending = true;
> +
> +	memcpy(&watch->event, &event, sizeof(event));
> +
> +	mutex_unlock(&watch->mutex);
> +}
> +
> +static enum hrtimer_restart vmnotify_timer_fn(struct hrtimer *hrtimer)
> +{
> +	struct vmnotify_watch *watch = container_of(hrtimer, struct vmnotify_watch, timer);
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	vmnotify_sample(watch);
> +
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
> +
> +	wake_up(&watch->waitq);
> +
> +	return HRTIMER_RESTART;
> +}
> +
> +static void vmnotify_start_timer(struct vmnotify_watch *watch)
> +{
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	hrtimer_init(&watch->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	watch->timer.function = vmnotify_timer_fn;
> +
> +	hrtimer_start(&watch->timer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static unsigned int vmnotify_poll(struct file *file, poll_table *wait)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +	unsigned int events = 0;
> +
> +	poll_wait(file, &watch->waitq, wait);
> +
> +	mutex_lock(&watch->mutex);
> +
> +	if (watch->pending)
> +		events |= POLLIN;
> +
> +	mutex_unlock(&watch->mutex);
> +
> +	return events;
> +}
> +
> +static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +	int ret = 0;
> +
> +	mutex_lock(&watch->mutex);
> +
> +	if (!watch->pending)
> +		goto out_unlock;
> +
> +	if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}
> +
> +	ret = watch->event.size;
> +
> +	watch->pending = false;
> +
> +out_unlock:
> +	mutex_unlock(&watch->mutex);
> +
> +	return ret;
> +}
> +
> +static int vmnotify_release(struct inode *inode, struct file *file)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +
> +	hrtimer_cancel(&watch->timer);
> +
> +	kfree(watch);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vmnotify_fops = {
> +	.poll		= vmnotify_poll,
> +	.read		= vmnotify_read,
> +	.release	= vmnotify_release,
> +};
> +
> +static struct vmnotify_watch *vmnotify_watch_alloc(void)
> +{
> +	struct vmnotify_watch *watch;
> +
> +	watch = kzalloc(sizeof *watch, GFP_KERNEL);
> +	if (!watch)
> +		return NULL;
> +
> +	mutex_init(&watch->mutex);
> +
> +	init_waitqueue_head(&watch->waitq);
> +
> +	return watch;
> +}
> +
> +static int vmnotify_copy_config(struct vmnotify_config __user *uconfig,
> +				struct vmnotify_config *config)
> +{
> +	int ret;
> +
> +	ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config));
> +	if (ret)
> +		return -EFAULT;
> +
> +	if (!config->type)
> +		return -EINVAL;
> +
> +	if (config->type & VMNOTIFY_TYPE_SAMPLE) {
> +		if (config->sample_period_ns < NSEC_PER_MSEC)
> +			return -EINVAL;
> +	}
> +
> +	if (config->type & VMNOTIFY_TYPE_FREE_THRESHOLD) {
> +		if (config->free_threshold > VMNOTIFY_MAX_FREE_THRESHOD)
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE1(vmnotify_fd,
> +		struct vmnotify_config __user *, uconfig)
> +{
> +	struct vmnotify_watch *watch;
> +	struct file *file;
> +	int err;
> +	int fd;
> +
> +	watch = vmnotify_watch_alloc();
> +	if (!watch)
> +		return -ENOMEM;
> +
> +	err = vmnotify_copy_config(uconfig, &watch->config);
> +	if (err)
> +		goto err_free;
> +
> +	fd = get_unused_fd_flags(O_RDONLY);
> +	if (fd < 0) {
> +		err = fd;
> +		goto err_free;
> +	}
> +
> +	file = anon_inode_getfile("[vmnotify]", &vmnotify_fops, watch, O_RDONLY);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		goto err_fd;
> +	}
> +
> +	fd_install(fd, file);
> +
> +	if (watch->config.type & VMNOTIFY_TYPE_SAMPLE)
> +		vmnotify_start_timer(watch);
> +
> +	return fd;
> +
> +err_fd:
> +	put_unused_fd(fd);
> +err_free:
> +	kfree(watch);
> +	return err;
> +}
> diff --git a/tools/testing/vmnotify/vmnotify-test.c b/tools/testing/vmnotify/vmnotify-test.c
> new file mode 100644
> index 0000000..3c6b26d
> --- /dev/null
> +++ b/tools/testing/vmnotify/vmnotify-test.c
> @@ -0,0 +1,68 @@
> +#include "../../../include/linux/vmnotify.h"
> +
> +#if defined(__x86_64__)
> +#include "../../../arch/x86/include/asm/unistd.h"
> +#endif
> +
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <poll.h>
> +
> +static int sys_vmnotify_fd(struct vmnotify_config *config)
> +{
> +	config->size = sizeof(*config);
> +
> +	return syscall(__NR_vmnotify_fd, config);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct vmnotify_config config;
> +	struct vmnotify_event event;
> +	struct pollfd pollfd;
> +	int i;
> +	int fd;
> +
> +	config = (struct vmnotify_config) {
> +		.type			= VMNOTIFY_TYPE_SAMPLE|VMNOTIFY_TYPE_FREE_THRESHOLD,
> +		.sample_period_ns	= 1000000000L,
> +		.free_threshold		= 99,
> +	};
> +
> +	fd = sys_vmnotify_fd(&config);
> +	if (fd < 0) {
> +		perror("vmnotify_fd failed");
> +		exit(1);
> +	}
> +
> +	for (i = 0; i < 10; i++) {
> +		pollfd.fd		= fd;
> +		pollfd.events		= POLLIN;
> +
> +		if (poll(&pollfd, 1, -1) < 0) {
> +			perror("poll failed");
> +			exit(1);
> +		}
> +
> +		memset(&event, 0, sizeof(event));
> +
> +		if (read(fd, &event, sizeof(event)) < 0) {
> +			perror("read failed");
> +			exit(1);
> +		}
> +
> +		printf("VM event:\n");
> +		printf("\tsize=%u\n", event.size);
> +		printf("\tnr_avail_pages=%llu\n", (unsigned long long) event.nr_avail_pages);
> +		printf("\tnr_swap_pages=%llu\n", (unsigned long long) event.nr_swap_pages);
> +		printf("\tnr_free_pages=%llu\n", (unsigned long long) event.nr_free_pages);
> +	}
> +	if (close(fd) < 0) {
> +		perror("close failed");
> +		exit(1);
> +	}
> +
> +	return 0;
> +}
> -- 
> 1.7.6.4
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 2/3] vmscan hook
  2012-01-17 23:08         ` Minchan Kim
@ 2012-01-18  0:18           ` KAMEZAWA Hiroyuki
  2012-01-18 14:17             ` Rik van Riel
  0 siblings, 1 reply; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-18  0:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, LKML, leonid.moiseichuk, penberg, Rik van Riel, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Wed, 18 Jan 2012 08:08:01 +0900
Minchan Kim <minchan@kernel.org> wrote:

> > 
> > 
> > > > 2. can't we measure page-in/page-out distance by recording something ?
> > > 
> > > I can't understand your point. What's relation does it with swapout prevent?
> > > 
> > 
> > If distance between pageout -> pagein is short, it means thrashing.
> > > For example, recording the timestamp when the page(mapping, index) was
> > paged-out, and check it at page-in.
> 
> Our goal is prevent swapout. When we found thrashing, it's too late.
> 

If you want to prevent swap-out, don't swapon anything. That's all.
Then you can check the amount of FILE_CACHE and set a threshold on it.

Thanks,
-Kame
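
For reference, the page-out -> page-in distance idea quoted above could be
sketched in plain userspace C roughly as below. This is only an illustration:
struct evict_record, struct evict_table, EVICT_SLOTS and both helpers are
made-up names, not existing kernel code, and the "timestamp" is any monotonic
counter the caller chooses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVICT_SLOTS 64	/* assumed table size, for illustration only */

/* Hypothetical record kept per paged-out page; not a kernel structure. */
struct evict_record {
	uint64_t index;		/* page offset within its mapping */
	uint64_t evicted_at;	/* monotonic timestamp taken at page-out */
	bool valid;
};

struct evict_table {
	struct evict_record slot[EVICT_SLOTS];
};

/* Remember when a page was paged out (newer entries overwrite older ones). */
static void note_pageout(struct evict_table *t, uint64_t index, uint64_t now)
{
	struct evict_record *r = &t->slot[index % EVICT_SLOTS];

	r->index = index;
	r->evicted_at = now;
	r->valid = true;
}

/*
 * At page-in, report whether the page returned within max_distance of its
 * page-out. A short distance suggests thrashing. The record is consumed.
 */
static bool pagein_is_thrashing(struct evict_table *t, uint64_t index,
				uint64_t now, uint64_t max_distance)
{
	struct evict_record *r = &t->slot[index % EVICT_SLOTS];

	if (!r->valid || r->index != index)
		return false;

	assert(now >= r->evicted_at);	/* timestamps must be monotonic */

	r->valid = false;
	return now - r->evicted_at <= max_distance;
}
```

As Minchan notes, a check like this only fires after swap-out has already
happened, so it detects thrashing rather than preventing it.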


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 23:20         ` Minchan Kim
@ 2012-01-18  7:16           ` Pekka Enberg
  2012-01-18  7:49             ` Minchan Kim
  0 siblings, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-18  7:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Wed, 18 Jan 2012, Minchan Kim wrote:
> I haven't looked into your code yet (will do), but from the description
> I'm still not convinced we really need a process-specific threshold like 99%.
> I think applications can get that by polling /proc/meminfo without this mechanism
> if they really want.

I'm not sure if we need arbitrary threshold either. However, we need to 
support the following cases:

   - We're about to swap

   - We're about to run out of memory

   - We're about to start OOM killing

and I don't think your patch solves that. One possibility is to implement:

   VMNOTIFY_TYPE_ABOUT_TO_SWAP
   VMNOTIFY_TYPE_ABOUT_TO_OOM
   VMNOTIFY_TYPE_ABOUT_TO_OOM_KILL

and maybe rip out support for arbitrary thresholds. Does that sound more
reasonable?

As for polling /proc/meminfo, I'd much rather deliver stats as part of 
vmnotify_read() because it's easier to extend the ABI rather than adding 
new fields to /proc/meminfo.

On Wed, 18 Jan 2012, Minchan Kim wrote:
> I would like to notify when the system is in trouble with memory pressure, without
> any process-specific threshold. Of course, an application can't anticipate that (i.e.,
> an application can learn about system memory pressure from /proc/meminfo, but it can't know
> when swapout really happens). The kernel low-memory notification has to deliver that
> to user space, I think.

It should be simple to add support for VMNOTIFY_TYPE_MEM_PRESSURE that 
uses your hooks.

 			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18  7:16           ` Pekka Enberg
@ 2012-01-18  7:49             ` Minchan Kim
  0 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2012-01-18  7:49 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Rik van Riel, linux-mm, LKML, leonid.moiseichuk, kamezawa.hiroyu,
	mel, rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Wed, Jan 18, 2012 at 09:16:49AM +0200, Pekka Enberg wrote:
> On Wed, 18 Jan 2012, Minchan Kim wrote:
> >I haven't looked into your code yet (will do), but from the description
> >I'm still not convinced we really need a process-specific threshold like 99%.
> >I think applications can get that by polling /proc/meminfo without this mechanism
> >if they really want.
> 
> I'm not sure if we need arbitrary threshold either. However, we need
> to support the following cases:
> 
>   - We're about to swap
> 
>   - We're about to run out of memory
> 
>   - We're about to start OOM killing
> 
> and I don't think your patch solves that. One possibility is to implement:

I think my patch could be extended to cover that, but your ABI looks better to me than my approach.

> 
>   VMNOTIFY_TYPE_ABOUT_TO_SWAP
>   VMNOTIFY_TYPE_ABOUT_TO_OOM
>   VMNOTIFY_TYPE_ABOUT_TO_OOM_KILL

Yes. We can define some levels.

1. page cache reclaim
2. code page reclaim
3. anonymous page swap out
4. OOM kill.


Applications might handle it differently depending on the memory pressure level.
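
A rough sketch of how an application could map those four levels to
responses. All enum names and the level-to-action mapping below are
hypothetical; the thread lists the levels only in prose and defines no ABI
for them.

```c
#include <assert.h>

/* Hypothetical level names, in the order listed above. */
enum mem_pressure_level {
	MEM_PRESSURE_PAGE_CACHE_RECLAIM = 1,
	MEM_PRESSURE_CODE_PAGE_RECLAIM,
	MEM_PRESSURE_ANON_SWAP_OUT,
	MEM_PRESSURE_OOM_KILL,
};

/* Hypothetical application responses, from cheapest to most drastic. */
enum app_action {
	APP_DO_NOTHING,
	APP_DROP_CACHES,
	APP_SHRINK_HEAP,
	APP_CLOSE_BACKGROUND_WORK,
};

/* Pick a response for a given pressure level (one possible policy). */
static enum app_action action_for_level(enum mem_pressure_level level)
{
	assert(level >= MEM_PRESSURE_PAGE_CACHE_RECLAIM &&
	       level <= MEM_PRESSURE_OOM_KILL);

	switch (level) {
	case MEM_PRESSURE_PAGE_CACHE_RECLAIM:
		return APP_DO_NOTHING;
	case MEM_PRESSURE_CODE_PAGE_RECLAIM:
		return APP_DROP_CACHES;
	case MEM_PRESSURE_ANON_SWAP_OUT:
		return APP_SHRINK_HEAP;
	case MEM_PRESSURE_OOM_KILL:
		return APP_CLOSE_BACKGROUND_WORK;
	}

	return APP_DO_NOTHING;
}
```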

> 
> and maybe rip out support for arbitrary thresholds. Does that sound more
> reasonable?

Currently, Nokia people seem to want process specific thresholds so 
we might need it.

> 
> As for polling /proc/meminfo, I'd much rather deliver stats as part
> of vmnotify_read() because it's easier to extend the ABI rather than
> adding new fields to /proc/meminfo.

Agree.

> 
> On Wed, 18 Jan 2012, Minchan Kim wrote:
> >I would like to notify when the system is in trouble with memory pressure, without
> >any process-specific threshold. Of course, an application can't anticipate that (i.e.,
> >an application can learn about system memory pressure from /proc/meminfo, but it can't know
> >when swapout really happens). The kernel low-memory notification has to deliver that
> >to user space, I think.
> 
> It should be simple to add support for VMNOTIFY_TYPE_MEM_PRESSURE
> that uses your hooks.

Indeed.

> 
> 			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 18:51       ` Pekka Enberg
  2012-01-17 19:30         ` Rik van Riel
  2012-01-17 23:20         ` Minchan Kim
@ 2012-01-18  9:06         ` leonid.moiseichuk
  2012-01-18  9:15           ` Pekka Enberg
  2012-01-18 14:30           ` Rik van Riel
  2012-01-24 15:40         ` Marcelo Tosatti
  2012-01-24 21:57         ` Jonathan Corbet
  4 siblings, 2 replies; 62+ messages in thread
From: leonid.moiseichuk @ 2012-01-18  9:06 UTC (permalink / raw)
  To: penberg, riel
  Cc: minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel, rientjes,
	kosaki.motohiro, hannes, mtosatti, akpm, rhod, kosaki.motohiro

Hi,

Just a couple of observations below, which may be wrong.

> -----Original Message-----
> From: Pekka Enberg [mailto:penberg@gmail.com] On Behalf Of ext Pekka
> Enberg
> Sent: 17 January, 2012 20:51
....

> +struct vmnotify_config {
> +	/*
> +	 * Size of the struct for ABI extensibility.
> +	 */
> +	__u32		   size;
> +
> +	/*
> +	 * Notification type bitmask
> +	 */
> +	__u64			type;
> +
> +	/*
> +	 * Free memory threshold in percentages [1..99]
> +	 */
> +	__u32			free_threshold;

Would it be possible not to use percentages for thresholds? Accounting in pages is not difficult for user-space.
Also, looking at vmnotify_match, I understand that events are propagated to user-space only when the threshold trigger changes state from 0 to 1, but not back. The 1 -> 0 transition is a very useful event as well.

Would it be possible to apply thresholds to specific counters, e.g. those in enum zone_stat_item, because the kinds of memory to track could differ?
E.g. for tracking paging activity, NR_ACTIVE_ANON and NR_ACTIVE_FILE could be interesting, not only free memory.

> +
> +	/*
> +	 * Sample period in nanoseconds
> +	 */
> +	__u64			sample_period_ns;
> +};
> +
....
> +struct vmnotify_event {
> +	/* Size of the struct for ABI extensibility. */
> +	__u32			size;
> +
> +	__u64			nr_avail_pages;
> +
> +	__u64			nr_swap_pages;
> +
> +	__u64			nr_free_pages;
> +};

Two fields here (nr_avail_pages and nr_swap_pages) are most likely constant for the session, so it does not make much sense to report them in every event.
If memory or swap is hotplugged, user-space can use the sysinfo() call.

> +static void vmnotify_sample(struct vmnotify_watch *watch) {
...
> +	si_meminfo(&si);
> +	event.nr_avail_pages	= si.totalram;
> +
> +#ifdef CONFIG_SWAP
> +	si_swapinfo(&si);
> +	event.nr_swap_pages	= si.totalswap;
> +#endif
> +

Why not use global_page_state() directly? si_meminfo() and especially si_swapinfo() are quite expensive calls.

> +static void vmnotify_start_timer(struct vmnotify_watch *watch) {
> +	u64 sample_period = watch->config.sample_period_ns;
> +
> +	hrtimer_init(&watch->timer, CLOCK_MONOTONIC,
> HRTIMER_MODE_REL);
> +	watch->timer.function = vmnotify_timer_fn;
> +
> +	hrtimer_start(&watch->timer, ns_to_ktime(sample_period),
> +HRTIMER_MODE_REL_PINNED); }

Do I understand correctly that you allocate a timer for every user-space client and propagate events at every specified interval?
What will happen to the system if we have a timer but need to turn the CPU off? The timer must not be a reason to wake up if user-space is sleeping.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18  9:06         ` leonid.moiseichuk
@ 2012-01-18  9:15           ` Pekka Enberg
  2012-01-18  9:41             ` leonid.moiseichuk
  2012-01-24 16:22             ` Arnd Bergmann
  2012-01-18 14:30           ` Rik van Riel
  1 sibling, 2 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-18  9:15 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm, rhod,
	kosaki.motohiro

On Wed, Jan 18, 2012 at 11:06 AM,  <leonid.moiseichuk@nokia.com> wrote:
> Would be possible to not use percents for thesholds? Accounting in pages even
> not so difficult to user-space.

How does that work with memory hotplug?

On Wed, Jan 18, 2012 at 11:06 AM,  <leonid.moiseichuk@nokia.com> wrote:
> Also, looking on vmnotify_match I understand that events propagated to
> user-space only in case threshold trigger change state from 0 to 1 but not
> back, 1-> 0 is very useful event as well.
>
> Would be possible to use for threshold pointed value(s) e.g. according to
> enum zone_state_item, because kinds of memory to track could be different?
> E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be
> interesting, not only free.

I don't think there's anything in the ABI that would prevent that.

>> +struct vmnotify_event {
>> +     /* Size of the struct for ABI extensibility. */
>> +     __u32                   size;
>> +
>> +     __u64                   nr_avail_pages;
>> +
>> +     __u64                   nr_swap_pages;
>> +
>> +     __u64                   nr_free_pages;
>> +};
>
> Two fields here most likely session-constant, (nr_avail_pages and
> nr_swap_pages), seems not much sense to report them in every event.  If we
> have memory/swap hotplug user-space can use sysinfo() call.

I actually changed the ABI to look like this:

struct vmnotify_event {
        /*
         * Size of the struct for ABI extensibility.
         */
        __u32                   size;

        __u64                   attrs;

        __u64                   attr_values[];
};

So userspace can decide which fields to include in notifications.
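
One way userspace might locate a value in attr_values[] is sketched below.
It assumes, and this is only an assumption, that the kernel packs one value
per set bit of attrs in ascending bit order. The VMNOTIFY_ATTR_* names and
vmnotify_attr_index() are hypothetical and not part of any posted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical attribute bits; the thread does not define any. */
#define VMNOTIFY_ATTR_NR_AVAIL_PAGES	(1ULL << 0)
#define VMNOTIFY_ATTR_NR_SWAP_PAGES	(1ULL << 1)
#define VMNOTIFY_ATTR_NR_FREE_PAGES	(1ULL << 2)

/*
 * Return the position of 'attr' inside attr_values[], or -1 if the event
 * does not carry it. Assumes values are packed in ascending bit order.
 */
static int vmnotify_attr_index(uint64_t attrs, uint64_t attr)
{
	uint64_t bit;
	int idx = 0;

	/* 'attr' must name exactly one attribute bit. */
	assert(attr != 0 && (attr & (attr - 1)) == 0);

	if (!(attrs & attr))
		return -1;

	/* Count the attributes present below the requested bit. */
	for (bit = 1; bit != attr; bit <<= 1)
		if (attrs & bit)
			idx++;

	return idx;
}
```

With an event carrying only the avail and free attributes, the free page
count would then be read as event->attr_values[1].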

On Wed, Jan 18, 2012 at 11:06 AM,  <leonid.moiseichuk@nokia.com> wrote:
>> +static void vmnotify_sample(struct vmnotify_watch *watch) {
> ...
>> +     si_meminfo(&si);
>> +     event.nr_avail_pages    = si.totalram;
>> +
>> +#ifdef CONFIG_SWAP
>> +     si_swapinfo(&si);
>> +     event.nr_swap_pages     = si.totalswap;
>> +#endif
>> +
>
> Why not to use global_page_state() directly? si_meminfo() and especial
> si_swapinfo are quite expensive call.

Sure, we can do that. Feel free to send a patch :-).

>> +static void vmnotify_start_timer(struct vmnotify_watch *watch) {
>> +     u64 sample_period = watch->config.sample_period_ns;
>> +
>> +     hrtimer_init(&watch->timer, CLOCK_MONOTONIC,
>> HRTIMER_MODE_REL);
>> +     watch->timer.function = vmnotify_timer_fn;
>> +
>> +     hrtimer_start(&watch->timer, ns_to_ktime(sample_period),
>> +HRTIMER_MODE_REL_PINNED); }
>
> Do I understand correct you allocate timer for every user-space client and
> propagate events every pointed interval?  What will happened with system if
> we have a timer but need to turn CPU off? The timer must not be a reason to
> wakeup if user-space is sleeping.

No idea what happens. The sampling code is just a proof of concept thing and I
expect it to be buggy as hell. :-)

			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-18  9:15           ` Pekka Enberg
@ 2012-01-18  9:41             ` leonid.moiseichuk
  2012-01-18 10:40               ` Pekka Enberg
  2012-01-24 16:22             ` Arnd Bergmann
  1 sibling, 1 reply; 62+ messages in thread
From: leonid.moiseichuk @ 2012-01-18  9:41 UTC (permalink / raw)
  To: penberg
  Cc: riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm, rhod,
	kosaki.motohiro

> -----Original Message-----
> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
> Pekka Enberg
> Sent: 18 January, 2012 11:16
...
> > Would be possible to not use percents for thesholds? Accounting in pages
> even
> > not so difficult to user-space.
> 
> How does that work with memory hotplug?

Not worse than with %%. For example, you had a 10% free memory threshold for 512 MB RAM, meaning 51.2 MB in absolute numbers.
Then hotplug turned off 256 MB; you must certainly update the %% threshold, because 10% is now 25.6 MB, which most likely will not be suitable for the different operating mode.
Using pages makes the calculations much simpler.
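
The arithmetic above, worked through in pages. This is a small sketch that
assumes 4 KiB pages; PAGE_SIZE_BYTES, mb_to_pages() and percent_to_pages()
are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES	4096ULL	/* assumed page size */

/* Convert a size in MB to a page count. */
static uint64_t mb_to_pages(uint64_t mb)
{
	return mb * 1024 * 1024 / PAGE_SIZE_BYTES;
}

/* Absolute page count that a percentage threshold stands for. */
static uint64_t percent_to_pages(uint64_t total_pages, unsigned int percent)
{
	assert(percent <= 100);

	return total_pages * percent / 100;
}
```

With 512 MB (131072 pages) a 10% threshold stands for 13107 pages; after
256 MB is removed the same 10% silently becomes 6553 pages, whereas a
threshold given in pages stays where the application put it.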

> 
> On Wed, Jan 18, 2012 at 11:06 AM,  <leonid.moiseichuk@nokia.com> wrote:
> > Also, looking on vmnotify_match I understand that events propagated to
> > user-space only in case threshold trigger change state from 0 to 1 but not
> > back, 1-> 0 is very useful event as well
(*)

> >
> > Would be possible to use for threshold pointed value(s) e.g. according to
> > enum zone_state_item, because kinds of memory to track could be
> different?
> > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE
> could be
> > interesting, not only free.
> 
> I don't think there's anything in the ABI that would prevent that.

If this statement also relates to my question (*), I have to point out the need to track attribute history; otherwise user-space will be constantly kicked with updates.

> I actually changed the ABI to look like this:
> 
> struct vmnotify_event {
>         /*
>          * Size of the struct for ABI extensibility.
>          */
>         __u32                   size;
> 
>         __u64                   attrs;
> 
>         __u64                   attr_values[];
> };
> 
> So userspace can decide which fields to include in notifications.

Good. But how can you provide the current status of attributes to user-space? The read() call needs to deliver all supported attr_values[] on demand.

> >> +
> >> +#ifdef CONFIG_SWAP
> >> +     si_swapinfo(&si);
> >> +     event.nr_swap_pages     = si.totalswap;
> >> +#endif
> >> +
> >
> > Why not to use global_page_state() directly? si_meminfo() and especial
> > si_swapinfo are quite expensive call.
> 
> Sure, we can do that. Feel free to send a patch :-).

I will once I see the code; from emails alone it is quite difficult to understand.
In the short term I need to focus on integrating the "memnotify" version internally, which already works for me and provides all the interfaces the n9 needs.
 
Btw, once the API starts to work with specified thresholds, it is logically no longer low_mem_notify; you need to invent some other name.

> No idea what happens. The sampling code is just a proof of concept thing and
> I expect it to be buggy as hell. :-)
> 
> 			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18  9:41             ` leonid.moiseichuk
@ 2012-01-18 10:40               ` Pekka Enberg
  2012-01-18 10:44                 ` leonid.moiseichuk
  0 siblings, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-18 10:40 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm, rhod,
	kosaki.motohiro

On Wed, Jan 18, 2012 at 11:41 AM,  <leonid.moiseichuk@nokia.com> wrote:
>> -----Original Message-----
>> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
>> Pekka Enberg
>> Sent: 18 January, 2012 11:16
> ...
>> > Would be possible to not use percents for thesholds? Accounting in pages
>> even
>> > not so difficult to user-space.
>>
>> How does that work with memory hotplug?
>
> Not worse than %%. For example you had 10% free memory threshold for 512 MB
> RAM meaning 51.2 MB in absolute number.  Then hotplug turned off 256 MB, you
> for sure must update threshold for %% because these 10% for 25.6 MB most
> likely will be not suitable for different operating mode.
> Using pages makes calculations must simpler.

Right. Does a threshold in percentages make any sense then? Is it enough to use
the number of free pages?

On Wed, Jan 18, 2012 at 11:06 AM,  <leonid.moiseichuk@nokia.com> wrote:
>> > Also, looking on vmnotify_match I understand that events propagated to
>> > user-space only in case threshold trigger change state from 0 to 1 but not
>> > back, 1-> 0 is very useful event as well
> (*)
>
>> >
>> > Would be possible to use for threshold pointed value(s) e.g. according to
>> > enum zone_state_item, because kinds of memory to track could be
>> different?
>> > E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE
>> could be
>> > interesting, not only free.
>>
>> I don't think there's anything in the ABI that would prevent that.
>
> If this statement also related my question (*)  I have to point need to track
> attributes history, otherwise user-space will be constantly kicked with
> updates.

Well sure, I think it makes sense to support state change to both directions.
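
A minimal sketch of reporting both transitions while keeping the history
Leonid asks for, so that userspace is only woken when the state actually
changes. The threshold is in pages; struct free_watch, enum edge and
free_watch_update() are hypothetical names, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum edge {
	EDGE_NONE,		/* state unchanged, nothing to report */
	EDGE_ENTER_LOW,		/* 0 -> 1: free memory fell to the threshold */
	EDGE_LEAVE_LOW,		/* 1 -> 0: free memory rose above it again */
};

struct free_watch {
	uint64_t threshold_pages;	/* threshold in pages, not percent */
	bool below;			/* last state reported to userspace */
};

/* Compare a new sample with the remembered state; report only changes. */
static enum edge free_watch_update(struct free_watch *w, uint64_t nr_free_pages)
{
	bool below_now;

	assert(w != NULL);

	below_now = nr_free_pages <= w->threshold_pages;
	if (below_now == w->below)
		return EDGE_NONE;

	w->below = below_now;
	return below_now ? EDGE_ENTER_LOW : EDGE_LEAVE_LOW;
}
```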

> When I see code because from emails it is quite difficult to understand.  For
> short-term I need to focus on integration "memnotify" version internally
> which is kind of work for me already and provides all required interfaces n9
> needs.

Sure. I'm only talking about mainline here.

> Btw, when API starts to work with pointed thresholds logically it is not
> anymore low_mem_notify, you need to invent some other name.

Definitely, it's about generic VM event notification now.

			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-18 10:40               ` Pekka Enberg
@ 2012-01-18 10:44                 ` leonid.moiseichuk
  2012-01-18 23:34                   ` Ronen Hod
  2012-01-19  7:34                   ` Pekka Enberg
  0 siblings, 2 replies; 62+ messages in thread
From: leonid.moiseichuk @ 2012-01-18 10:44 UTC (permalink / raw)
  To: penberg
  Cc: riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm, rhod,
	kosaki.motohiro

> -----Original Message-----
> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
> Pekka Enberg
> Sent: 18 January, 2012 12:40
...
> > Not worse than %%. For example you had 10% free memory threshold for
> > 512 MB RAM meaning 51.2 MB in absolute number.  Then hotplug turned
> > off 256 MB, you for sure must update threshold for %% because these
> > 10% for 25.6 MB most likely will be not suitable for different operating
> mode.
> > Using pages makes calculations must simpler.
> 
> Right. Does threshold in percentages make any sense then? Is it enough to
> use number of free pages?

Paul Mundt noticed that, and we stopped using percentages in 2006 for the n770 update.
He was right.
Percentages are useless and do not correlate with other kernel APIs like sysinfo().


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 2/3] vmscan hook
  2012-01-18  0:18           ` KAMEZAWA Hiroyuki
@ 2012-01-18 14:17             ` Rik van Riel
  2012-01-19  2:25               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2012-01-18 14:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, penberg, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jan 2012 08:08:01 +0900
> Minchan Kim<minchan@kernel.org>  wrote:
>
>>>>> 2. can't we measure page-in/page-out distance by recording something ?
>>>>
>>>> I can't understand your point. What's relation does it with swapout prevent?
>>>>
>>>
>>> If distance between pageout ->  pagein is short, it means thrashing.
>>> For example, recording the timestamp when the page(mapping, index) was
>>> paged-out, and check it at page-in.
>>
>> Our goal is prevent swapout. When we found thrashing, it's too late.
>
> If you want to prevent swap-out, don't swapon any. That's all.
> Then, you can check the number of FILE_CACHE and have threshold.

I think you are getting hung up on a word here.

As I understand it, the goal is to push out the point where
we start doing heavier swap IO, allowing us to overcommit
memory more heavily before things start really slowing down.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18  9:06         ` leonid.moiseichuk
  2012-01-18  9:15           ` Pekka Enberg
@ 2012-01-18 14:30           ` Rik van Riel
  2012-01-18 15:29             ` Pekka Enberg
  1 sibling, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2012-01-18 14:30 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: penberg, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm, rhod,
	kosaki.motohiro

On 01/18/2012 04:06 AM, leonid.moiseichuk@nokia.com wrote:

> Would be possible to use for threshold pointed value(s) e.g. according to enum zone_state_item, because kinds of memory to track could be different?
> E.g. to tracking paging activity NR_ACTIVE_ANON and NR_ACTIVE_FILE could be interesting, not only free.

That seems like a horrible idea, because there is no guarantee that
the kernel will continue to use NR_ACTIVE_ANON and NR_ACTIVE_FILE
internally in the future.

What is exported to userspace must be somewhat independent of the
specifics of how the kernel implements things internally.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18 14:30           ` Rik van Riel
@ 2012-01-18 15:29             ` Pekka Enberg
  0 siblings, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-18 15:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: leonid.moiseichuk, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes,
	mtosatti, akpm, rhod, kosaki.motohiro

On Wed, Jan 18, 2012 at 4:30 PM, Rik van Riel <riel@redhat.com> wrote:
> That seems like a horrible idea, because there is no guarantee that
> the kernel will continue to use NR_ACTIVE_ANON and NR_ACTIVE_FILE
> internally in the future.
>
> What is exported to userspace must be somewhat independent of the
> specifics of how the kernel implements things internally.

Exactly, that's what I'm also interested in.

                        Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18 10:44                 ` leonid.moiseichuk
@ 2012-01-18 23:34                   ` Ronen Hod
  2012-01-19  7:25                     ` Pekka Enberg
  2012-01-19  7:34                   ` Pekka Enberg
  1 sibling, 1 reply; 62+ messages in thread
From: Ronen Hod @ 2012-01-18 23:34 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: penberg, riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu,
	mel, rientjes, kosaki.motohiro, hannes, mtosatti, akpm,
	kosaki.motohiro

On 01/18/2012 12:44 PM, leonid.moiseichuk@nokia.com wrote:
>> -----Original Message-----
>> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
>> Pekka Enberg
>> Sent: 18 January, 2012 12:40
> ...
>>> Not worse than %%. For example you had 10% free memory threshold for
>>> 512 MB RAM meaning 51.2 MB in absolute number.  Then hotplug turned
>>> off 256 MB, you for sure must update threshold for %% because these
>>> 10% for 25.6 MB most likely will be not suitable for different operating
>> mode.
>>> Using pages makes calculations must simpler.
>> Right. Does threshold in percentages make any sense then? Is it enough to
>> use number of free pages?
> Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update.
> He was right.
> Percents are useless and do not correlate with other kernel APIs like sysinfo().

I believe that it will be best if the kernel publishes an ideal number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to work with since this is what applications do, they free pages. Applications will be able to refer to this number from their garbage collector, or before allocating memory also if they did not get a notification, and it is also useful if several applications free memory at the same time.

Ronen.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 2/3] vmscan hook
  2012-01-18 14:17             ` Rik van Riel
@ 2012-01-19  2:25               ` KAMEZAWA Hiroyuki
  2012-01-19 14:42                 ` Rik van Riel
  0 siblings, 1 reply; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-19  2:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, penberg, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Wed, 18 Jan 2012 09:17:17 -0500
Rik van Riel <riel@redhat.com> wrote:

> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote:
> > On Wed, 18 Jan 2012 08:08:01 +0900
> > Minchan Kim<minchan@kernel.org>  wrote:
> >
> >>>>> 2. can't we measure page-in/page-out distance by recording something ?
> >>>>
> >>>> I can't understand your point. What's relation does it with swapout prevent?
> >>>>
> >>>
> >>> If distance between pageout ->  pagein is short, it means thrashing.
> >>> For example, recording the timestamp when the page(mapping, index) was
> >>> paged-out, and check it at page-in.
> >>
> >> Our goal is prevent swapout. When we found thrashing, it's too late.
> >
> > If you want to prevent swap-out, don't swapon any. That's all.
> > Then, you can check the number of FILE_CACHE and have threshold.
> 
> I think you are getting hung up on a word here.
> 
> As I understand it, the goal is to push out the point where
> we start doing heavier swap IO, allowing us to overcommit
> memory more heavily before things start really slowing down.
> 

Yes.

Hmm, considering that the issue is slowdown,

time values such as

- 'cpu time used for memory reclaim'
- 'latency of page allocation'
- 'application execution speed' ?

may be a better score to watch than just the LRU stats.

Thanks,
-Kame
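
The page-out to page-in distance idea quoted above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not kernel code: the class name `RefaultTracker`, the `thrash_window` parameter, and the use of plain numeric timestamps are all assumptions made for the example.

```python
class RefaultTracker:
    """Remember when a page was paged out and flag quick page-ins.

    Hypothetical userspace model; names and the time unit are
    illustrative assumptions, not taken from any kernel patch.
    """

    def __init__(self, thrash_window):
        # Page-ins that follow a page-out within this window count as thrashing.
        self.thrash_window = thrash_window
        self.evicted_at = {}

    def page_out(self, mapping, index, now):
        # Record the eviction timestamp, keyed by (mapping, index).
        self.evicted_at[(mapping, index)] = now

    def page_in(self, mapping, index, now):
        # Return True when the page comes back shortly after it was evicted.
        evicted = self.evicted_at.pop((mapping, index), None)
        if evicted is None:
            return False  # never evicted, or already accounted for
        return now - evicted <= self.thrash_window
```

A short distance is only visible once the page has already been evicted and read back, which is the "too late" objection raised in the thread.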



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18 23:34                   ` Ronen Hod
@ 2012-01-19  7:25                     ` Pekka Enberg
  2012-01-19  9:05                       ` Ronen Hod
  0 siblings, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-19  7:25 UTC (permalink / raw)
  To: Ronen Hod
  Cc: leonid.moiseichuk, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes,
	mtosatti, akpm, kosaki.motohiro

On Thu, 19 Jan 2012, Ronen Hod wrote:
> I believe that it will be best if the kernel publishes an ideal 
> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to 
> work with since this is what applications do, they free pages. Applications 
> will be able to refer to this number from their garbage collector, or before 
> allocating memory also if they did not get a notification, and it is also 
> useful if several applications free memory at the same time.

Isn't

/proc/sys/vm/min_free_kbytes

pretty much just that?

 			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-18 10:44                 ` leonid.moiseichuk
  2012-01-18 23:34                   ` Ronen Hod
@ 2012-01-19  7:34                   ` Pekka Enberg
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-19  7:34 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm, rhod,
	kosaki.motohiro

On Wed, 18 Jan 2012, leonid.moiseichuk@nokia.com wrote:
> Paul Mundt noticed that and we stopped use percentage in 2006 for n770 update.
> He was right.
> Percents are useless and do not correlate with other kernel APIs like sysinfo().

I changed the code to use number of pages. Thanks!

 			Pekka
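
To make the percentage-versus-pages argument concrete, here is a small Python sketch of the arithmetic from Leonid's example (10% of 512 MB, then 256 MB after hotplug). The 4096-byte page size and the helper name `threshold_from_percent` are assumptions for illustration only.

```python
PAGE_SIZE = 4096          # assumed page size in bytes
MIB = 1024 * 1024


def threshold_from_percent(total_bytes, percent):
    """Free-memory threshold in bytes when expressed as a share of RAM."""
    return total_bytes * percent // 100


# The same 10% setting silently halves when hotplug removes 256 MB.
before_hotplug = threshold_from_percent(512 * MIB, 10)   # about 51.2 MB
after_hotplug = threshold_from_percent(256 * MIB, 10)    # about 25.6 MB

# A threshold stored as a page count does not change with total RAM.
threshold_pages = before_hotplug // PAGE_SIZE
```

With a page-based threshold the subscriber decides explicitly whether to change the number after hotplug, instead of having it rescaled as a side effect.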

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19  7:25                     ` Pekka Enberg
@ 2012-01-19  9:05                       ` Ronen Hod
  2012-01-19  9:10                         ` Pekka Enberg
  0 siblings, 1 reply; 62+ messages in thread
From: Ronen Hod @ 2012-01-19  9:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: leonid.moiseichuk, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes,
	mtosatti, akpm, kosaki.motohiro

On 01/19/2012 09:25 AM, Pekka Enberg wrote:
> On Thu, 19 Jan 2012, Ronen Hod wrote:
>> I believe that it will be best if the kernel publishes an ideal number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to work with since this is what applications do, they free pages. Applications will be able to refer to this number from their garbage collector, or before allocating memory also if they did not get a notification, and it is also useful if several applications free memory at the same time.
>
> Isn't
>
> /proc/sys/vm/min_free_kbytes
>
> pretty much just that?
>
>             Pekka

Would you suggest to use min_free_kbytes as the threshold for sending low_memory_notifications to applications, and separately as a target value for the applications' memory giveaway?

Thanks, Ronen.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19  9:05                       ` Ronen Hod
@ 2012-01-19  9:10                         ` Pekka Enberg
  2012-01-19  9:20                           ` Ronen Hod
  0 siblings, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-19  9:10 UTC (permalink / raw)
  To: Ronen Hod
  Cc: leonid.moiseichuk, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes,
	mtosatti, akpm, kosaki.motohiro

On Thu, Jan 19, 2012 at 11:05 AM, Ronen Hod <rhod@redhat.com> wrote:
>>> I believe that it will be best if the kernel publishes an ideal
>>> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to
>>> work with since this is what applications do, they free pages. Applications
>>> will be able to refer to this number from their garbage collector, or before
>>> allocating memory also if they did not get a notification, and it is also
>>> useful if several applications free memory at the same time.
>>
>> Isn't
>>
>> /proc/sys/vm/min_free_kbytes
>>
>> pretty much just that?
>
> Would you suggest to use min_free_kbytes as the threshold for sending
> low_memory_notifications to applications, and separately as a target value
> for the applications' memory giveaway?

I'm not saying that the kernel should use it directly but it seems
like the kind of "ideal number of free pages" threshold you're
suggesting. So userspace can read that value and use it as the "number
of free pages" threshold for VM events, no?

                        Pekka
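
A minimal sketch of what "userspace reads that value and uses it as the threshold" could look like, in Python. It only parses text in the formats of /proc/sys/vm/min_free_kbytes and /proc/meminfo; the function names and the 4096-byte default page size are assumptions, and whether min_free_kbytes is a good indicator is debated in this thread.

```python
def kb_to_pages(kbytes, page_size=4096):
    """Convert kilobytes to whole pages (page size is an assumption)."""
    return kbytes * 1024 // page_size


def parse_meminfo_kb(text, field):
    """Return the value in kB of one 'Name:   value kB' line from /proc/meminfo."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)


def below_threshold(meminfo_text, min_free_text, page_size=4096):
    """True when MemFree, in pages, is under min_free_kbytes, in pages."""
    free_pages = kb_to_pages(parse_meminfo_kb(meminfo_text, "MemFree"), page_size)
    threshold_pages = kb_to_pages(int(min_free_text.strip()), page_size)
    return free_pages < threshold_pages

# Typical use on a live system would pass the contents of
# /proc/meminfo and /proc/sys/vm/min_free_kbytes as the two strings.
```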

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19  9:10                         ` Pekka Enberg
@ 2012-01-19  9:20                           ` Ronen Hod
  2012-01-19 10:53                             ` leonid.moiseichuk
  0 siblings, 1 reply; 62+ messages in thread
From: Ronen Hod @ 2012-01-19  9:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: leonid.moiseichuk, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes,
	mtosatti, akpm, kosaki.motohiro

On 01/19/2012 11:10 AM, Pekka Enberg wrote:
> On Thu, Jan 19, 2012 at 11:05 AM, Ronen Hod<rhod@redhat.com>  wrote:
>>>> I believe that it will be best if the kernel publishes an ideal
>>>> number_of_free_pages (in /proc/meminfo or whatever). Such number is easy to
>>>> work with since this is what applications do, they free pages. Applications
>>>> will be able to refer to this number from their garbage collector, or before
>>>> allocating memory also if they did not get a notification, and it is also
>>>> useful if several applications free memory at the same time.
>>> Isn't
>>>
>>> /proc/sys/vm/min_free_kbytes
>>>
>>> pretty much just that?
>> Would you suggest to use min_free_kbytes as the threshold for sending
>> low_memory_notifications to applications, and separately as a target value
>> for the applications' memory giveaway?
> I'm not saying that the kernel should use it directly but it seems
> like the kind of "ideal number of free pages" threshold you're
> suggesting. So userspace can read that value and use it as the "number
> of free pages" threshold for VM events, no?

Yes, I like it. The rules of the game are simple and consistent all over, be it the alert threshold, voluntary polling by the apps, or concurrent work by several applications.
Well, as long as it provides a good indication for low_mem_pressure.

Thanks, Ronen.

>
>                          Pekka


^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-19  9:20                           ` Ronen Hod
@ 2012-01-19 10:53                             ` leonid.moiseichuk
  2012-01-19 11:07                               ` Pekka Enberg
  2012-01-24 15:38                               ` Marcelo Tosatti
  0 siblings, 2 replies; 62+ messages in thread
From: leonid.moiseichuk @ 2012-01-19 10:53 UTC (permalink / raw)
  To: rhod, penberg
  Cc: riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu, mel,
	rientjes, kosaki.motohiro, hannes, mtosatti, akpm,
	kosaki.motohiro

> -----Original Message-----
> From: ext Ronen Hod [mailto:rhod@redhat.com]
> Sent: 19 January, 2012 11:20
> To: Pekka Enberg
...
> >>> Isn't
> >>>
> >>> /proc/sys/vm/min_free_kbytes
> >>>
> >>> pretty much just that?
> >> Would you suggest to use min_free_kbytes as the threshold for sending
> >> low_memory_notifications to applications, and separately as a target
> >> value for the applications' memory giveaway?
> > I'm not saying that the kernel should use it directly but it seems
> > like the kind of "ideal number of free pages" threshold you're
> > suggesting. So userspace can read that value and use it as the "number
> > of free pages" threshold for VM events, no?
> 
> Yes, I like it. The rules of the game are simple and consistent all over, be it the
> alert threshold, voluntary polling by the apps, and for concurrent work by
> several applications.
> Well, as long as it provides a good indication for low_mem_pressure.

To me this does not make much sense. min_free_kbytes can be set from user space (or auto-tuned by the kernel) to keep some amount
of memory available for GFP_ATOMIC allocations. If free memory drops below that level, the kernel will reclaim memory from e.g. caches.

From a potential user's point of view, the proposed API lacks a number of things that would be nice to have implemented:
1. Rename this API from low_mem_pressure to something more related to notification and the memory situation in the system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here.
2. The API must use deferred timers to prevent use-time impact. A deferred timer is triggered only on a HW event or a non-deferrable timer, so if the device sleeps the timer might be skipped, and that is what user space expects.
3. The API should be tunable to propagate changes when the level goes Up or Down, maybe both ways.
4. To avoid triggering too many events it probably makes sense to filter according to the amount of change, but that is optional. If a subscriber sets the timer to 1s, the number of events should not be very big.
5. The API must provide an interface to request parameters, e.g. available swap or free memory, just to have some base.
6. I do not understand how work with attributes is performed ( ), but it makes sense to use a mask and fill the requested attributes using the mask and a callback table, i.e. if free pages are requested they are reported, otherwise not.
7. It would make sense to backport a couple of attributes from memnotify.c.

I can submit a couple of patches if some of the proposals look sane to everyone.
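
Points 3 and 4 of the list above (direction-aware notification and filtering by amount of change) can be sketched as follows. This is a hypothetical userspace model in Python; `LevelWatcher`, `notify_down`, `notify_up` and `min_delta` are names invented for the example and do not correspond to any existing interface.

```python
class LevelWatcher:
    """Report threshold crossings, optionally per direction and with a
    minimum change between accepted crossings (illustrative sketch)."""

    def __init__(self, threshold, notify_down=True, notify_up=False, min_delta=0):
        self.threshold = threshold
        self.notify_down = notify_down
        self.notify_up = notify_up
        self.min_delta = min_delta
        self.below = None          # unknown until the first sample
        self.last_crossing = None  # free pages at the last accepted crossing

    def sample(self, free_pages):
        """Feed one periodic sample; return "down", "up" or None."""
        now_below = free_pages < self.threshold
        if self.below is None:
            self.below = now_below  # first sample only sets the baseline
            return None
        if now_below == self.below:
            return None
        # Ignore crossings that moved too little since the last accepted one.
        if (self.last_crossing is not None
                and abs(free_pages - self.last_crossing) < self.min_delta):
            return None
        self.below = now_below
        self.last_crossing = free_pages
        direction = "down" if now_below else "up"
        wanted = self.notify_down if now_below else self.notify_up
        return direction if wanted else None
```

A subscriber that samples once per second and sets a reasonable `min_delta` would see only a handful of events even when free memory hovers around the threshold.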

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19 10:53                             ` leonid.moiseichuk
@ 2012-01-19 11:07                               ` Pekka Enberg
  2012-01-19 11:54                                 ` leonid.moiseichuk
  2012-01-24 15:38                               ` Marcelo Tosatti
  1 sibling, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-19 11:07 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: rhod, riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu,
	mel, rientjes, kosaki.motohiro, hannes, mtosatti, akpm,
	kosaki.motohiro

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> From potential user point of view the proposed API has number of lacks which
> would be nice to have implemented:

> 1. rename this API from low_mem_pressure to something more related to
> notification and memory situation in system: memory_pressure, memnotify,
> memory_level etc. The word "low" is misleading here

The thing is called vmevent:

http://git.kernel.org/?p=linux/kernel/git/penberg/linux.git;a=shortlog;h=refs/heads/vmevent/core

I haven't used "low mem" at all in the patches.

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> 2. API must use deferred timers to prevent use-time impact. Deferred timer
> will be triggered only in case HW event or non-deferrable timer, so if device
> sleeps timer might be skipped and that is what expected for user-space

I'm currently looking at the possibility of hooking VM events to perf which
also uses hrtimers. Can't we make hrtimers do the right thing?

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> 3. API should be tunable for propagate changes when level is Up or Down,
> maybe both ways.

Agreed.

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> 4. to avoid triggering too much events probably has sense to filter according
> to amount of change but that is optional. If subscriber set timer to 1s the
> amount of events should not be very big.

Agreed.

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> 5. API must provide interface to request parameters e.g. available swap or
> free memory just to have some base.

The current ABI already supports that. You can specify which attributes you're
interested in and they will be delivered as part of the event.

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> 6. I do not understand how work with attributes performed ( ) but it has
> sense to use mask and fill requested attributes using mask and callback table
> i.e. if free pages requested - they are reported, otherwise not.

That's how it works now in the git tree.

On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> 7. would have sense to backport couple of attributes from memnotify.c
>
> I can submit couple of patches if some of proposals looks sane for everyone.

Feel free to do that.

I'm currently looking at how to support Minchan's non-sampled events. It seems
to me integrating with perf would be nice because we could simply use
tracepoints for this.

			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-19 11:07                               ` Pekka Enberg
@ 2012-01-19 11:54                                 ` leonid.moiseichuk
  2012-01-19 11:59                                   ` Pekka Enberg
  2012-01-19 12:06                                   ` Pekka Enberg
  0 siblings, 2 replies; 62+ messages in thread
From: leonid.moiseichuk @ 2012-01-19 11:54 UTC (permalink / raw)
  To: penberg
  Cc: rhod, riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu,
	mel, rientjes, kosaki.motohiro, hannes, mtosatti, akpm,
	kosaki.motohiro

> -----Original Message-----
> From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext
> Pekka Enberg
> Sent: 19 January, 2012 13:08
...
> > 1. rename this API from low_mem_pressure to something more related to
> > notification and memory situation in system: memory_pressure,
> > memnotify, memory_level etc. The word "low" is misleading here
> 
> The thing is called vmevent:

Yes, I see it. But I was a bit confused by vmnotify_fops and was sure it was mapped through dev. Now it is an anonymous inode.

> 
> On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> > 2. API must use deferred timers to prevent use-time impact. Deferred
> > timer will be triggered only in case HW event or non-deferrable timer,
> > so if device sleeps timer might be skipped and that is what expected
> > for user-space
> 
> I'm currently looking at the possibility of hooking VM events to perf which
> also uses hrtimers. Can't we make hrtimers do the right thing?

I have no answer to this question. According to hrtimer_cpu_notify the CPU state is tracked, but a timer may set a HW event to wake up.
In that case use time will be affected, because you will have too many HW events and reasons to wake up.
At least powertop reports hrtimers in relation to <kernel core> as an activity source.

> 
> On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> > 3. API should be tunable for propagate changes when level is Up or
> > Down, maybe both ways.
> 
> Agreed.
> 
> On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> > 4. to avoid triggering too much events probably has sense to filter
> > according to amount of change but that is optional. If subscriber set
> > timer to 1s the amount of events should not be very big.
> 
> Agreed.
> 
> On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> > 5. API must provide interface to request parameters e.g. available
> > swap or free memory just to have some base.
> 
> The current ABI already supports that. You can specify which attributes
> you're interested in and they will be delivered as part of the event.

But you have a suspicious free_pages_threshold field in vmnotify.h.

> 
> On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
> > 6. I do not understand how work with attributes performed ( ) but it
> > has sense to use mask and fill requested attributes using mask and
> > callback table i.e. if free pages requested - they are reported, otherwise
> not.
> 
> That's how it works now in the git tree.

vmnotify.c has vmnotify_watch_event, which collects a fixed set of parameters.

> I'm currently looking at how to support Minchan's non-sampled events. It
> seems to me integrating with perf would be nice because we could simply
> use tracepoints for this.

If tracepoints do not jeopardize use time, it makes sense to do it.

> 
> 			Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19 11:54                                 ` leonid.moiseichuk
@ 2012-01-19 11:59                                   ` Pekka Enberg
  2012-01-19 12:06                                   ` Pekka Enberg
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-19 11:59 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: rhod, riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu,
	mel, rientjes, kosaki.motohiro, hannes, mtosatti, akpm,
	kosaki.motohiro

On Thu, Jan 19, 2012 at 1:54 PM,  <leonid.moiseichuk@nokia.com> wrote:
>> The current ABI already supports that. You can specify which attributes
>> you're interested in and they will be delivered as part of the event.
>
> But you have in vmnotify.h suspicious free_pages_threshold field.

Aah, I was actually talking about the events userspace _reads_.

The free_pages_threshold field is only used if the
VMEVENT_TYPE_FREE_THRESHOLD bit is set. It should be cleaned up a bit,
but in theory it supports watching other attributes as well. I've
postponed the cleanup until I've figured out whether we can use perf,
which would make the whole syscall go away.

                        Pekka

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19 11:54                                 ` leonid.moiseichuk
  2012-01-19 11:59                                   ` Pekka Enberg
@ 2012-01-19 12:06                                   ` Pekka Enberg
  1 sibling, 0 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-19 12:06 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: rhod, riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu,
	mel, rientjes, kosaki.motohiro, hannes, mtosatti, akpm,
	kosaki.motohiro

On Thu, Jan 19, 2012 at 1:54 PM,  <leonid.moiseichuk@nokia.com> wrote:
>> On Thu, Jan 19, 2012 at 12:53 PM,  <leonid.moiseichuk@nokia.com> wrote:
>> > 6. I do not understand how work with attributes performed ( ) but it
>> > has sense to use mask and fill requested attributes using mask and
>> > callback table i.e. if free pages requested - they are reported, otherwise
>> not.
>>
>> That's how it works now in the git tree.
>
> Vmnotify.c has vmnotify_watch_event which collects fixed set of parameters.

That would be a bug. We should check event_attrs like we do for NR_SWAP_PAGES.
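
The "mask plus callback table" approach discussed here can be modelled in a few lines of Python. The attribute bit names, the collector table and the `stats` dictionary are all hypothetical and chosen for illustration; they are not taken from the vmevent tree.

```python
# Hypothetical attribute bits (illustrative names only).
ATTR_NR_FREE_PAGES = 1 << 0
ATTR_NR_SWAP_PAGES = 1 << 1
ATTR_NR_AVAIL_PAGES = 1 << 2

# Callback table: one (name, collector) pair per attribute bit.
COLLECTORS = {
    ATTR_NR_FREE_PAGES: ("nr_free_pages", lambda s: s["free"]),
    ATTR_NR_SWAP_PAGES: ("nr_swap_pages", lambda s: s["swap"]),
    ATTR_NR_AVAIL_PAGES: ("nr_avail_pages", lambda s: s["free"] + s["reclaimable"]),
}


def build_event(event_attrs, stats):
    """Collect only the attributes whose bit is set in event_attrs."""
    event = {}
    for bit, (name, collect) in COLLECTORS.items():
        if event_attrs & bit:
            event[name] = collect(stats)
    return event
```

Because every attribute goes through the same mask check, adding a new one means adding a table entry rather than touching the event-building loop.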

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 2/3] vmscan hook
  2012-01-19  2:25               ` KAMEZAWA Hiroyuki
@ 2012-01-19 14:42                 ` Rik van Riel
  2012-01-20  0:24                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2012-01-19 14:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, penberg, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On 01/18/2012 09:25 PM, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jan 2012 09:17:17 -0500
> Rik van Riel<riel@redhat.com>  wrote:
>
>> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote:
>>> On Wed, 18 Jan 2012 08:08:01 +0900
>>> Minchan Kim<minchan@kernel.org>   wrote:
>>>
>>>>>>> 2. can't we measure page-in/page-out distance by recording something ?
>>>>>>
>>>>>> I can't understand your point. What's relation does it with swapout prevent?
>>>>>>
>>>>>
>>>>> If distance between pageout ->   pagein is short, it means thrashing.
>>>>> For example, recording the timestamp when the page(mapping, index) was
>>>>> paged-out, and check it at page-in.
>>>>
>>>> Our goal is prevent swapout. When we found thrashing, it's too late.
>>>
>>> If you want to prevent swap-out, don't swapon any. That's all.
>>> Then, you can check the number of FILE_CACHE and have threshold.
>>
>> I think you are getting hung up on a word here.
>>
>> As I understand it, the goal is to push out the point where
>> we start doing heavier swap IO, allowing us to overcommit
>> memory more heavily before things start really slowing down.
>>
>
> Yes.
>
> Hmm, considering that the issue is slow down,
>
> time values as
>
> - 'cpu time used for memory reclaim'
> - 'latency of page allocation'
> - 'application execution speed' ?
>
> may be a better score to see rather than just seeing lru's stat.

I believe those all qualify as "too late".

We want to prevent things from becoming bad, for as long
as we (easily) can.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 2/3] vmscan hook
  2012-01-19 14:42                 ` Rik van Riel
@ 2012-01-20  0:24                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-20  0:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, linux-mm, LKML, leonid.moiseichuk, penberg, mel,
	rientjes, KOSAKI Motohiro, Johannes Weiner, Marcelo Tosatti,
	Andrew Morton, Ronen Hod

On Thu, 19 Jan 2012 09:42:59 -0500
Rik van Riel <riel@redhat.com> wrote:

> On 01/18/2012 09:25 PM, KAMEZAWA Hiroyuki wrote:
> > On Wed, 18 Jan 2012 09:17:17 -0500
> > Rik van Riel<riel@redhat.com>  wrote:
> >
> >> On 01/17/2012 07:18 PM, KAMEZAWA Hiroyuki wrote:
> >>> On Wed, 18 Jan 2012 08:08:01 +0900
> >>> Minchan Kim<minchan@kernel.org>   wrote:
> >>>
> >>>>>>> 2. can't we measure page-in/page-out distance by recording something ?
> >>>>>>
> >>>>>> I can't understand your point. What's relation does it with swapout prevent?
> >>>>>>
> >>>>>
> >>>>> If distance between pageout ->   pagein is short, it means thrashing.
> >>>>> For example, recording the timestamp when the page(mapping, index) was
> >>>>> paged-out, and check it at page-in.
> >>>>
> >>>> Our goal is prevent swapout. When we found thrashing, it's too late.
> >>>
> >>> If you want to prevent swap-out, don't swapon any. That's all.
> >>> Then, you can check the number of FILE_CACHE and have threshold.
> >>
> >> I think you are getting hung up on a word here.
> >>
> >> As I understand it, the goal is to push out the point where
> >> we start doing heavier swap IO, allowing us to overcommit
> >> memory more heavily before things start really slowing down.
> >>
> >
> > Yes.
> >
> > Hmm, considering that the issue is slow down,
> >
> > time values as
> >
> > - 'cpu time used for memory reclaim'
> > - 'latency of page allocation'
> > - 'application execution speed' ?
> >
> > may be a better score to see rather than just seeing lru's stat.
> 
> I believe those all qualify as "too late".
> 
> We want to prevent things from becoming bad, for as long
> as we (easily) can.
> 

Hmm, then some threshold-notifier interface will be required.
Problem is how to know free + page_can_be_freed_without_risk.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-19 10:53                             ` leonid.moiseichuk
  2012-01-19 11:07                               ` Pekka Enberg
@ 2012-01-24 15:38                               ` Marcelo Tosatti
  2012-01-24 16:08                                 ` Ronen Hod
  2012-01-24 16:10                                 ` Pekka Enberg
  1 sibling, 2 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-24 15:38 UTC (permalink / raw)
  To: leonid.moiseichuk
  Cc: rhod, penberg, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes, akpm,
	kosaki.motohiro

On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
> > -----Original Message-----
> > From: ext Ronen Hod [mailto:rhod@redhat.com]
> > Sent: 19 January, 2012 11:20
> > To: Pekka Enberg
> ...
> > >>> Isn't
> > >>>
> > >>> /proc/sys/vm/min_free_kbytes
> > >>>
> > >>> pretty much just that?
> > >> Would you suggest to use min_free_kbytes as the threshold for sending
> > >> low_memory_notifications to applications, and separately as a target
> > >> value for the applications' memory giveaway?
> > > I'm not saying that the kernel should use it directly but it seems
> > > like the kind of "ideal number of free pages" threshold you're
> > > suggesting. So userspace can read that value and use it as the "number
> > > of free pages" threshold for VM events, no?
> > 
> > Yes, I like it. The rules of the game are simple and consistent all over, be it the
> > alert threshold, voluntary polling by the apps, and for concurrent work by
> > several applications.
> > Well, as long as it provides a good indication for low_mem_pressure.
> 
> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount 
> of memory available for GFP_ATOMIC allocations.  In case situation comes under pointed level kernel will reclaim memory from e.g. caches.
> 
> >From potential user point of view the proposed API has number of lacks which would be nice to have implemented:
> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here
> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space

Having userspace specify the "sample period" for low memory notification
makes no sense. The frequency of notifications is a function of the
memory pressure.

> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways. 


> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big.
> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base.

It would make the interface easier to use if it provided the number of
pages to free, in the notification (kernel can calculate that as the
delta between current_free_pages -> comfortable_free_pages relative to
process RSS).
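
One possible reading of this suggestion, sketched in Python: the system-wide deficit is the gap between the current and the comfortable number of free pages, and each process is asked for a share proportional to its RSS. The proportional split and all names below are assumptions; the mail does not specify the exact formula.

```python
def pages_to_free(current_free, comfortable_free, rss_pages, total_rss_pages):
    """Suggested number of pages one process should release.

    Assumption: the deficit is divided among processes in proportion
    to their RSS, rounding up so the shares cover the whole deficit.
    """
    deficit = max(0, comfortable_free - current_free)
    if deficit == 0 or total_rss_pages <= 0:
        return 0
    # Ceiling division keeps the sum of all shares at or above the deficit.
    return (deficit * rss_pages + total_rss_pages - 1) // total_rss_pages
```

For example, with 800 free pages, a comfortable level of 1000 and a process holding 300 of 1000 resident pages, the process would be asked for 60 pages.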

> 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not.
> 7. would have sense to backport couple of attributes from memnotify.c
> 
> I can submit couple of patches if some of proposals looks sane for everyone.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 18:51       ` Pekka Enberg
                           ` (2 preceding siblings ...)
  2012-01-18  9:06         ` leonid.moiseichuk
@ 2012-01-24 15:40         ` Marcelo Tosatti
  2012-01-24 16:01           ` Pekka Enberg
  2012-01-24 21:57         ` Jonathan Corbet
  4 siblings, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-24 15:40 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Rik van Riel, Minchan Kim, linux-mm, LKML, leonid.moiseichuk,
	kamezawa.hiroyu, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, Jan 17, 2012 at 08:51:13PM +0200, Pekka Enberg wrote:
> Hello,
> 
> Ok, so here's a proof of concept patch that implements sample-based
> per-process free threshold VM event watching using perf-like syscall
> ABI. I'd really like to see something like this that's much more
> extensible and clean than the /dev based ABIs that people have
> proposed so far.
> 
> 			Pekka

What is the practical advantage of a syscall, again?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 15:40         ` Marcelo Tosatti
@ 2012-01-24 16:01           ` Pekka Enberg
  2012-01-24 16:25             ` Arnd Bergmann
  0 siblings, 1 reply; 62+ messages in thread
From: Pekka Enberg @ 2012-01-24 16:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Rik van Riel, Minchan Kim, linux-mm, LKML, leonid.moiseichuk,
	kamezawa.hiroyu, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote:
> What is the practical advantage of a syscall, again?

Why do you ask? The advantage for this particular case is not needing to
add ioctls() for configuration and keeping the file read/write ABI
simple.

			Pekka


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 15:38                               ` Marcelo Tosatti
@ 2012-01-24 16:08                                 ` Ronen Hod
  2012-01-24 18:10                                   ` Marcelo Tosatti
  2012-01-24 16:10                                 ` Pekka Enberg
  1 sibling, 1 reply; 62+ messages in thread
From: Ronen Hod @ 2012-01-24 16:08 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: leonid.moiseichuk, penberg, riel, minchan, linux-mm,
	linux-kernel, kamezawa.hiroyu, mel, rientjes, kosaki.motohiro,
	hannes, akpm, kosaki.motohiro

On 01/24/2012 05:38 PM, Marcelo Tosatti wrote:
> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
>>> -----Original Message-----
>>> From: ext Ronen Hod [mailto:rhod@redhat.com]
>>> Sent: 19 January, 2012 11:20
>>> To: Pekka Enberg
>> ...
>>>>>> Isn't
>>>>>>
>>>>>> /proc/sys/vm/min_free_kbytes
>>>>>>
>>>>>> pretty much just that?
>>>>> Would you suggest to use min_free_kbytes as the threshold for sending
>>>>> low_memory_notifications to applications, and separately as a target
>>>>> value for the applications' memory giveaway?
>>>> I'm not saying that the kernel should use it directly but it seems
>>>> like the kind of "ideal number of free pages" threshold you're
>>>> suggesting. So userspace can read that value and use it as the "number
>>>> of free pages" threshold for VM events, no?
>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the
>>> alert threshold, voluntary polling by the apps, and for concurrent work by
>>> several applications.
>>> Well, as long as it provides a good indication for low_mem_pressure.
>> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount
>> of memory available for GFP_ATOMIC allocations.  In case situation comes under pointed level kernel will reclaim memory from e.g. caches.
>>
>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented:
>> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here
>> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space
> Having userspace specify the "sample period" for low memory notification
> makes no sense. The frequency of notifications is a function of the
> memory pressure.
>
>> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways.
>
>> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big.
>> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base.
> It would make the interface easier to use if it provided the number of
> pages to free, in the notification (kernel can calculate that as the
> delta between current_free_pages ->  comfortable_free_pages relative to
> process RSS).

If you rely on the notification's argument you lose several features:
  - Handling of notifications by several applications in parallel
  - Voluntary decisions by the application, such as cleanup or avoiding allocations, at the application's convenience.
  - Iterative release loops, until there are enough free pages.
I believe that the notification should only serve as a trigger to run the cleanup.

Ronen.

>
>> 6. I do not understand how work with attributes performed ( ) but it has sense to use mask and fill requested attributes using mask and callback table i.e. if free pages requested - they are reported, otherwise not.
>> 7. would have sense to backport couple of attributes from memnotify.c
>>
>> I can submit couple of patches if some of proposals looks sane for everyone.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 15:38                               ` Marcelo Tosatti
  2012-01-24 16:08                                 ` Ronen Hod
@ 2012-01-24 16:10                                 ` Pekka Enberg
  2012-01-24 18:29                                   ` Marcelo Tosatti
  2012-01-25  8:19                                   ` leonid.moiseichuk
  1 sibling, 2 replies; 62+ messages in thread
From: Pekka Enberg @ 2012-01-24 16:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: leonid.moiseichuk, rhod, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes, akpm,
	kosaki.motohiro

On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote:
> Having userspace specify the "sample period" for low memory notification
> makes no sense. The frequency of notifications is a function of the
> memory pressure.

Sure, it makes sense to autotune the sample period. I don't see the problem
with letting userspace decide it for itself if it wants to.

			Pekka


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-18  9:15           ` Pekka Enberg
  2012-01-18  9:41             ` leonid.moiseichuk
@ 2012-01-24 16:22             ` Arnd Bergmann
  1 sibling, 0 replies; 62+ messages in thread
From: Arnd Bergmann @ 2012-01-24 16:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: leonid.moiseichuk, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes,
	mtosatti, akpm, rhod, kosaki.motohiro

On Wednesday 18 January 2012, Pekka Enberg wrote:
> >> +struct vmnotify_event {
> >> +     /* Size of the struct for ABI extensibility. */
> >> +     __u32                   size;
> >> +
> >> +     __u64                   nr_avail_pages;
> >> +
> >> +     __u64                   nr_swap_pages;
> >> +
> >> +     __u64                   nr_free_pages;
> >> +};
> >
> > Two fields here most likely session-constant, (nr_avail_pages and
> > nr_swap_pages), seems not much sense to report them in every event.  If we
> > have memory/swap hotplug user-space can use sysinfo() call.
> 
> I actually changed the ABI to look like this:
> 
> struct vmnotify_event {
>         /*
>          * Size of the struct for ABI extensibility.
>          */
>         __u32                   size;
> 
>         __u64                   attrs;
> 
>         __u64                   attr_values[];
> };
> 
> So userspace can decide which fields to include in notifications.

Please make the first member a __u64 instead of a __u32. This will
avoid incompatibility between 32 and 64 bit processes, which have
different alignment rules on x86: x86-32 would implicitly pack the
struct while x86-64 would add padding with your layout.
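
The layout difference described above can be checked from userspace. The
following is only a sketch: it uses the <stdint.h> types in place of the
kernel's __u32/__u64, and the struct and function names are hypothetical,
not part of the posted patch. It compares the padding after a 32-bit first
member with the padding after a 64-bit one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Layout from the quoted patch: a 32-bit size followed by 64-bit fields.
 * The offset of attrs is 4 where 64-bit integers are 4-byte aligned
 * (x86-32) and 8 where they are 8-byte aligned (x86-64).
 */
struct event_u32_size {
	uint32_t size;
	uint64_t attrs;
	uint64_t attr_values[];
};

/* Suggested layout: a 64-bit first member leaves no room for padding. */
struct event_u64_size {
	uint64_t size;
	uint64_t attrs;
	uint64_t attr_values[];
};

/* With a 64-bit first member the offset is the same on both ABIs. */
static_assert(offsetof(struct event_u64_size, attrs) == 8,
	      "attrs must directly follow the 64-bit size field");

/* Bytes of padding the compiler inserted between size and attrs. */
size_t padding_after_u32_size(void)
{
	return offsetof(struct event_u32_size, attrs) - sizeof(uint32_t);
}

size_t padding_after_u64_size(void)
{
	return offsetof(struct event_u64_size, attrs) - sizeof(uint64_t);
}
```

Built as a 64-bit x86 program, padding_after_u32_size() should report 4
bytes, while a 32-bit x86 build should report 0; padding_after_u64_size()
is 0 in both cases, which is the point of the suggestion.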

	Arnd

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 16:01           ` Pekka Enberg
@ 2012-01-24 16:25             ` Arnd Bergmann
  2012-01-24 18:32               ` Marcelo Tosatti
  0 siblings, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2012-01-24 16:25 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Marcelo Tosatti, Rik van Riel, Minchan Kim, linux-mm, LKML,
	leonid.moiseichuk, kamezawa.hiroyu, mel, rientjes,
	KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Ronen Hod,
	KOSAKI Motohiro

On Tuesday 24 January 2012, Pekka Enberg wrote:
> On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote:
> > What is the practical advantage of a syscall, again?
> 
> Why do you ask? The advantage for this particular case is not needing to
> add ioctls() for configuration and keeping the file read/write ABI
> simple.

The two are obviously equivalent and there is no reason to avoid
ioctl in general. However I agree that the syscall would be better
in this case, because that is what we tend to use for core kernel
functionality, while character devices tend to be used for I/O device
drivers that need stuff like enumeration and permission management.

	Arnd

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 16:08                                 ` Ronen Hod
@ 2012-01-24 18:10                                   ` Marcelo Tosatti
  2012-01-25  8:52                                     ` Ronen Hod
  0 siblings, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-24 18:10 UTC (permalink / raw)
  To: Ronen Hod
  Cc: leonid.moiseichuk, penberg, riel, minchan, linux-mm,
	linux-kernel, kamezawa.hiroyu, mel, rientjes, kosaki.motohiro,
	hannes, akpm, kosaki.motohiro

On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote:
> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote:
> >On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
> >>>-----Original Message-----
> >>>From: ext Ronen Hod [mailto:rhod@redhat.com]
> >>>Sent: 19 January, 2012 11:20
> >>>To: Pekka Enberg
> >>...
> >>>>>>Isn't
> >>>>>>
> >>>>>>/proc/sys/vm/min_free_kbytes
> >>>>>>
> >>>>>>pretty much just that?
> >>>>>Would you suggest to use min_free_kbytes as the threshold for sending
> >>>>>low_memory_notifications to applications, and separately as a target
> >>>>>value for the applications' memory giveaway?
> >>>>I'm not saying that the kernel should use it directly but it seems
> >>>>like the kind of "ideal number of free pages" threshold you're
> >>>>suggesting. So userspace can read that value and use it as the "number
> >>>>of free pages" threshold for VM events, no?
> >>>Yes, I like it. The rules of the game are simple and consistent all over, be it the
> >>>alert threshold, voluntary polling by the apps, and for concurrent work by
> >>>several applications.
> >>>Well, as long as it provides a good indication for low_mem_pressure.
> >>For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount
> >>of memory available for GFP_ATOMIC allocations.  In case situation comes under pointed level kernel will reclaim memory from e.g. caches.
> >>
> >>From potential user point of view the proposed API has number of lacks which would be nice to have implemented:
> >>1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here
> >>2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space
> >Having userspace specify the "sample period" for low memory notification
> >makes no sense. The frequency of notifications is a function of the
> >memory pressure.
> >
> >>3. API should be tunable for propagate changes when level is Up or Down, maybe both ways.
> >
> >>4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big.
> >>5. API must provide interface to request parameters e.g. available swap or free memory just to have some base.
> >It would make the interface easier to use if it provided the number of
> >pages to free, in the notification (kernel can calculate that as the
> >delta between current_free_pages ->  comfortable_free_pages relative to
> >process RSS).
> 
> If you rely on the notification's argument you lose several features:
>  - Handling of notifications by several applications in parallel

Each application has its argument built in a custom fashion
(pages_to_free = delta between current_free_pages ->
comfortable_free_pages relative to process RSS), or something to that
effect. It is compatible with parallel notifications.
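
As an illustration only, one possible reading of that per-process argument
is a proportional split of the system-wide shortfall by RSS. The thread
does not define the formula, so the function name, the total_rss_pages
parameter and the proportional split below are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the share of the free-page shortfall assigned to
 * one process, proportional to its RSS. Page counts are assumed small
 * enough that shortfall * process_rss_pages fits in 64 bits.
 */
uint64_t pages_to_free_hint(uint64_t current_free_pages,
			    uint64_t comfortable_free_pages,
			    uint64_t process_rss_pages,
			    uint64_t total_rss_pages)
{
	uint64_t shortfall;

	assert(process_rss_pages <= total_rss_pages);

	/* No pressure, or nothing to apportion: no goal to report. */
	if (current_free_pages >= comfortable_free_pages ||
	    total_rss_pages == 0)
		return 0;

	shortfall = comfortable_free_pages - current_free_pages;

	return shortfall * process_rss_pages / total_rss_pages;
}
```

Because every process gets its own share computed from the same system-wide
delta, several processes can be notified in parallel without coordinating
with each other.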

>  - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience.

I am suggesting an additional field in the notification data so that the
freeing routine has a goal. But it is not mandatory.

> - Iterative release loops, until there are enough free pages.

What is the advantage versus releasing the necessary amount of
memory at a given moment?

> I believe that the notification should only serve as a trigger to run the cleanup.

Agree.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 16:10                                 ` Pekka Enberg
@ 2012-01-24 18:29                                   ` Marcelo Tosatti
  2012-01-25  8:19                                   ` leonid.moiseichuk
  1 sibling, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-24 18:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: leonid.moiseichuk, rhod, riel, minchan, linux-mm, linux-kernel,
	kamezawa.hiroyu, mel, rientjes, kosaki.motohiro, hannes, akpm,
	kosaki.motohiro

On Tue, Jan 24, 2012 at 06:10:40PM +0200, Pekka Enberg wrote:
> On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote:
> > Having userspace specify the "sample period" for low memory notification
> > makes no sense. The frequency of notifications is a function of the
> > memory pressure.
> 
> Sure, it makes sense to autotune sample period. I don't see the problem
> with letting userspace decide it for themselves if they want to.
> 
> 			Pekka

The application polls on a file descriptor, waiting for asynchronous
events: particular conditions of memory reclaim that require action.

These signalled conditions are not simply percentages of free memory,
but depend on the amount of freeable cache available, etc. Otherwise
applications could monitor /proc/meminfo and act on that.

What is the point of sampling in the interface as you have it?
The application can read() from the file descriptor to retrieve the
current status, if it wishes.
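
A minimal userspace sketch of that usage model, assuming only that the
notification interface hands back an ordinary pollable file descriptor.
The function name and the opaque event buffer are hypothetical; poll() and
read() are the standard POSIX calls.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the descriptor to become readable, then read
 * one event into buf. Returns the number of bytes read, 0 on timeout or
 * end of file, and -1 on error with errno set.
 */
ssize_t wait_for_vm_event(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready;

	assert(buf != NULL && len > 0);

	/* Restart the wait if a signal interrupts it. */
	do {
		ready = poll(&pfd, 1, timeout_ms);
	} while (ready < 0 && errno == EINTR);

	if (ready <= 0)
		return ready;	/* 0: timed out, -1: poll() failed */

	return read(fd, buf, len);
}
```

With this shape the application needs no sample period of its own: it
sleeps in poll() until the kernel signals a condition, and can still call
read() at any other time to query the current status.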

The objective in this argument is to make the API as simple and easy to
use as possible.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 16:25             ` Arnd Bergmann
@ 2012-01-24 18:32               ` Marcelo Tosatti
  0 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-24 18:32 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Pekka Enberg, Rik van Riel, Minchan Kim, linux-mm, LKML,
	leonid.moiseichuk, kamezawa.hiroyu, mel, rientjes,
	KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Ronen Hod,
	KOSAKI Motohiro

On Tue, Jan 24, 2012 at 04:25:55PM +0000, Arnd Bergmann wrote:
> On Tuesday 24 January 2012, Pekka Enberg wrote:
> > On Tue, 2012-01-24 at 13:40 -0200, Marcelo Tosatti wrote:
> > > What is the practical advantage of a syscall, again?
> > 
> > Why do you ask? The advantage for this particular case is not needing to
> > add ioctls() for configuration and keeping the file read/write ABI
> > simple.
> 
> The two are obviously equivalent and there is no reason to avoid
> ioctl in general. However I agree that the syscall would be better
> in this case, because that is what we tend to use for core kernel
> functionality, while character devices tend to be used for I/O device
> drivers that need stuff like enumeration and permission management.
> 
> 	Arnd

Makes sense.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-17 18:51       ` Pekka Enberg
                           ` (3 preceding siblings ...)
  2012-01-24 15:40         ` Marcelo Tosatti
@ 2012-01-24 21:57         ` Jonathan Corbet
  4 siblings, 0 replies; 62+ messages in thread
From: Jonathan Corbet @ 2012-01-24 21:57 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Rik van Riel, Minchan Kim, linux-mm, LKML, leonid.moiseichuk,
	kamezawa.hiroyu, mel, rientjes, KOSAKI Motohiro, Johannes Weiner,
	Marcelo Tosatti, Andrew Morton, Ronen Hod, KOSAKI Motohiro

On Tue, 17 Jan 2012 20:51:13 +0200 (EET)
Pekka Enberg <penberg@kernel.org> wrote:

> Ok, so here's a proof of concept patch that implements sample-based
> per-process free threshold VM event watching using perf-like syscall ABI. 
> I'd really like to see something like this that's much more extensible and 
> clean than the /dev based ABIs that people have proposed so far.

OK, so I'm slow, but better late than never.  I plead travel.

I guess the thing that surprises me is that nobody has said this yet: this
looks a lot like an event-reporting mechanism like perf.  Is there a reason
these can't be perf-style events integrated with all the rest?

> +struct vmnotify_config {
> +	/*
> +	 * Size of the struct for ABI extensibility.
> +	 */
> +	__u32		   size;
> +
> +	/*
> +	 * Notification type bitmask
> +	 */
> +	__u64			type;
> +
> +	/*
> +	 * Free memory threshold in percentages [1..99]
> +	 */
> +	__u32			free_threshold;

Is this an upper-bound threshold or a lower-bound threshold?  From your
example, it looks like "free_threshold" is "the amount of memory that is
not free", which seems confusing.

[...]

> new file mode 100644
> index 0000000..6800450
> --- /dev/null
> +++ b/mm/vmnotify.c
> @@ -0,0 +1,235 @@
> +#include <linux/anon_inodes.h>
> +#include <linux/vmnotify.h>
> +#include <linux/syscalls.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/swap.h>
> +
> +#define VMNOTIFY_MAX_FREE_THRESHOD	100

Did we run out of L's here? :)

> +static ssize_t vmnotify_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> +	struct vmnotify_watch *watch = file->private_data;
> +	int ret = 0;
> +
> +	mutex_lock(&watch->mutex);
> +
> +	if (!watch->pending)
> +		goto out_unlock;
> +
> +	if (copy_to_user(buf, &watch->event, sizeof(struct vmnotify_event))) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}
> +
> +	ret = watch->event.size;
> +
> +	watch->pending = false;
> +
> +out_unlock:
> +	mutex_unlock(&watch->mutex);
> +
> +	return ret;
> +}

So this is a nonblocking-only interface?  That may surprise some
developers.  You already have a wait queue, why not wait on it if need be?

> +static int vmnotify_copy_config(struct vmnotify_config __user *uconfig,
> +				struct vmnotify_config *config)
> +{
> +	int ret;
> +
> +	ret = copy_from_user(config, uconfig, sizeof(struct vmnotify_config));
> +	if (ret)
> +		return -EFAULT;
> +
> +	if (!config->type)
> +		return -EINVAL;
> +
> +	if (config->type & VMNOTIFY_TYPE_SAMPLE) {
> +		if (config->sample_period_ns < NSEC_PER_MSEC)
> +			return -EINVAL;
> +	}

What happens if the sample period is zero?

jon

^ permalink raw reply	[flat|nested] 62+ messages in thread

* RE: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 16:10                                 ` Pekka Enberg
  2012-01-24 18:29                                   ` Marcelo Tosatti
@ 2012-01-25  8:19                                   ` leonid.moiseichuk
  1 sibling, 0 replies; 62+ messages in thread
From: leonid.moiseichuk @ 2012-01-25  8:19 UTC (permalink / raw)
  To: penberg, mtosatti
  Cc: rhod, riel, minchan, linux-mm, linux-kernel, kamezawa.hiroyu,
	mel, rientjes, kosaki.motohiro, hannes, akpm, kosaki.motohiro

> -----Original Message-----
> From: ext Pekka Enberg [mailto:penberg@kernel.org]
> Sent: 24 January, 2012 18:11
> To: Marcelo Tosatti
....
> On Tue, 2012-01-24 at 13:38 -0200, Marcelo Tosatti wrote:
> > Having userspace specify the "sample period" for low memory
> > notification makes no sense. The frequency of notifications is a
> > function of the memory pressure.
> 
> Sure, it makes sense to autotune sample period. I don't see the problem
> with letting userspace decide it for themselves if they want to.
> 
> 			Pekka

Good point, but you must take into account that the reaction time in user
space depends on how the SW stack is organized. For some components a 1s
update time is good enough; for others it has to be 10ms. If VM changes are
reported too often, they make no sense for user space.

Thus, from a practical point of view, having a sampling period is not a bad idea.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-24 18:10                                   ` Marcelo Tosatti
@ 2012-01-25  8:52                                     ` Ronen Hod
  2012-01-25 10:12                                       ` Marcelo Tosatti
  0 siblings, 1 reply; 62+ messages in thread
From: Ronen Hod @ 2012-01-25  8:52 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: leonid.moiseichuk, penberg, riel, minchan, linux-mm,
	linux-kernel, kamezawa.hiroyu, mel, rientjes, kosaki.motohiro,
	hannes, akpm, kosaki.motohiro

On 01/24/2012 08:10 PM, Marcelo Tosatti wrote:
> On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote:
>> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote:
>>> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
>>>>> -----Original Message-----
>>>>> From: ext Ronen Hod [mailto:rhod@redhat.com]
>>>>> Sent: 19 January, 2012 11:20
>>>>> To: Pekka Enberg
>>>> ...
>>>>>>>> Isn't
>>>>>>>>
>>>>>>>> /proc/sys/vm/min_free_kbytes
>>>>>>>>
>>>>>>>> pretty much just that?
>>>>>>> Would you suggest to use min_free_kbytes as the threshold for sending
>>>>>>> low_memory_notifications to applications, and separately as a target
>>>>>>> value for the applications' memory giveaway?
>>>>>> I'm not saying that the kernel should use it directly but it seems
>>>>>> like the kind of "ideal number of free pages" threshold you're
>>>>>> suggesting. So userspace can read that value and use it as the "number
>>>>>> of free pages" threshold for VM events, no?
>>>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the
>>>>> alert threshold, voluntary polling by the apps, and for concurrent work by
>>>>> several applications.
>>>>> Well, as long as it provides a good indication for low_mem_pressure.
>>>> For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount
>>>> of memory available for GFP_ATOMIC allocations.  In case situation comes under pointed level kernel will reclaim memory from e.g. caches.
>>>>
>>>> From potential user point of view the proposed API has number of lacks which would be nice to have implemented:
>>>> 1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here
>>>> 2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space
>>> Having userspace specify the "sample period" for low memory notification
>>> makes no sense. The frequency of notifications is a function of the
>>> memory pressure.
>>>
>>>> 3. API should be tunable for propagate changes when level is Up or Down, maybe both ways.
>>>> 4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big.
>>>> 5. API must provide interface to request parameters e.g. available swap or free memory just to have some base.
>>> It would make the interface easier to use if it provided the number of
>>> pages to free, in the notification (kernel can calculate that as the
>>> delta between current_free_pages ->   comfortable_free_pages relative to
>>> process RSS).
>> If you rely on the notification's argument you lose several features:
>>   - Handling of notifications by several applications in parallel
> Each application has its argument built in a custom fashion
> (pages_to_free = delta between current_free_pages ->
> comfortable_free_pages relative to process RSS), or something to that
> effect. It is compatible with parallel notifications.

Not sure that I got it. Do you suggest asking all the applications to free,
say, 3% of their memory? Some may be able to free more, and some cannot free
any. Isn't it more practical to just notify them, and let each app contribute
its part to the global moving target?

>>   - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience.
> I am suggesting an additional field in the notification data so that the
> freeing routine has a goal. But it is not mandatory.

If you do want to support voluntary (notification-less) app decisions based
on the current status, then why not settle for this API and use the
notifications only to trigger this procedure?

>
>> - Iterative release loops, until there are enough free pages.
> What is the advantage versus releasing the necessary amount of
> memory in a given moment?

The cleanup logic may be unaware of the page-level effects of its allocations
and frees, especially when freeing complex internal data structures (such as
cached web pages). An iterative loop lets it keep freeing until things settle
down.
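
A sketch of such an iterative release loop, under stated assumptions: the
two callbacks stand in for "ask the system how many pages are free" and
"drop one batch of application cache", and the sim_cache model exists only
to exercise the loop. None of these names come from a real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callbacks supplied by the application; both are hypothetical. */
typedef uint64_t (*free_pages_fn)(void *ctx);	/* current free pages */
typedef int (*release_step_fn)(void *ctx);	/* 0 means nothing left */

/*
 * Release one batch at a time until the free-page count reaches the
 * target, the application has nothing left to drop, or max_steps is hit.
 * Returns the number of batches released.
 */
unsigned int release_until_settled(free_pages_fn free_pages,
				   release_step_fn release_step,
				   void *ctx, uint64_t target_free_pages,
				   unsigned int max_steps)
{
	unsigned int steps = 0;

	assert(free_pages != NULL && release_step != NULL);

	while (steps < max_steps && free_pages(ctx) < target_free_pages) {
		if (!release_step(ctx))
			break;	/* nothing more this application can give up */
		steps++;
	}
	return steps;
}

/* Toy model: each dropped batch returns a fixed number of pages. */
struct sim_cache {
	uint64_t free_pages;
	unsigned int batches_left;
	uint64_t pages_per_batch;
};

uint64_t sim_free_pages(void *ctx)
{
	return ((struct sim_cache *)ctx)->free_pages;
}

int sim_release_step(void *ctx)
{
	struct sim_cache *c = ctx;

	if (c->batches_left == 0)
		return 0;
	c->batches_left--;
	c->free_pages += c->pages_per_batch;
	return 1;
}
```

The loop re-reads the free-page count after every batch, so the application
never has to predict how many pages a given cleanup actually returns to the
system.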

Ronen.

>
>> I believe that the notification should only serve as a trigger to run the cleanup.
> Agree.
>
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-25  8:52                                     ` Ronen Hod
@ 2012-01-25 10:12                                       ` Marcelo Tosatti
  2012-01-25 10:48                                         ` Ronen Hod
  0 siblings, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-25 10:12 UTC (permalink / raw)
  To: Ronen Hod
  Cc: leonid.moiseichuk, penberg, riel, minchan, linux-mm,
	linux-kernel, kamezawa.hiroyu, mel, rientjes, kosaki.motohiro,
	hannes, akpm, kosaki.motohiro

On Wed, Jan 25, 2012 at 10:52:24AM +0200, Ronen Hod wrote:
> On 01/24/2012 08:10 PM, Marcelo Tosatti wrote:
> >On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote:
> >>On 01/24/2012 05:38 PM, Marcelo Tosatti wrote:
> >>>On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
> >>>>>-----Original Message-----
> >>>>>From: ext Ronen Hod [mailto:rhod@redhat.com]
> >>>>>Sent: 19 January, 2012 11:20
> >>>>>To: Pekka Enberg
> >>>>...
> >>>>>>>>Isn't
> >>>>>>>>
> >>>>>>>>/proc/sys/vm/min_free_kbytes
> >>>>>>>>
> >>>>>>>>pretty much just that?
> >>>>>>>Would you suggest to use min_free_kbytes as the threshold for sending
> >>>>>>>low_memory_notifications to applications, and separately as a target
> >>>>>>>value for the applications' memory giveaway?
> >>>>>>I'm not saying that the kernel should use it directly but it seems
> >>>>>>like the kind of "ideal number of free pages" threshold you're
> >>>>>>suggesting. So userspace can read that value and use it as the "number
> >>>>>>of free pages" threshold for VM events, no?
> >>>>>Yes, I like it. The rules of the game are simple and consistent all over, be it the
> >>>>>alert threshold, voluntary polling by the apps, and for concurrent work by
> >>>>>several applications.
> >>>>>Well, as long as it provides a good indication for low_mem_pressure.
> >>>>For me it doesn't look that have much sense. min_free_kbytes could be set from user-space (or auto-tuned by kernel) to keep some amount
> >>>>of memory available for GFP_ATOMIC allocations.  In case situation comes under pointed level kernel will reclaim memory from e.g. caches.
> >>>>
> >>>>From potential user point of view the proposed API has number of lacks which would be nice to have implemented:
> >>>>1. rename this API from low_mem_pressure to something more related to notification and memory situation in system: memory_pressure, memnotify, memory_level etc. The word "low" is misleading here
> >>>>2. API must use deferred timers to prevent use-time impact. Deferred timer will be triggered only in case HW event or non-deferrable timer, so if device sleeps timer might be skipped and that is what expected for user-space
> >>>Having userspace specify the "sample period" for low memory notification
> >>>makes no sense. The frequency of notifications is a function of the
> >>>memory pressure.
> >>>
> >>>>3. API should be tunable for propagate changes when level is Up or Down, maybe both ways.
> >>>>4. to avoid triggering too much events probably has sense to filter according to amount of change but that is optional. If subscriber set timer to 1s the amount of events should not be very big.
> >>>>5. API must provide interface to request parameters e.g. available swap or free memory just to have some base.
> >>>It would make the interface easier to use if it provided the number of
> >>>pages to free, in the notification (kernel can calculate that as the
> >>>delta between current_free_pages ->   comfortable_free_pages relative to
> >>>process RSS).
> >>If you rely on the notification's argument you lose several features:
> >>  - Handling of notifications by several applications in parallel
> >Each application has its argument built in a custom fashion
> >(pages_to_free = delta between current_free_pages ->
> >comfortable_free_pages relative to process RSS), or something to that
> >effect. It is compatible with parallel notifications.
> 
> Not sure that I got it. Do you suggest to ask all the applications to free say 3% of their memory?. 
> Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target?

The problem is, how is each process supposed to know how much memory
it should free for each notification received, that is, its part?

It's easier if there is a goal, a hint of how many pages the process
should release.

> >>  - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience.
> >I am suggesting an additional field in the notification data so that the
> >freeing routine has a goal. But it is not mandatory.
> 
> If you do want to support voluntary (notification less) app decisions, based on the current status, then why not satisfy with this API and only use the notifications to trigger this procedure?
> 
> >
> >>- Iterative release loops, until there are enough free pages.
> >What is the advantage versus releasing the necessary amount of
> >memory in a given moment?
> 
> The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down.
> 
> Ronen.
> 
> >
> >>I believe that the notification should only serve as a trigger to run the cleanup.
> >Agree.
> >
> >

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-25 10:12                                       ` Marcelo Tosatti
@ 2012-01-25 10:48                                         ` Ronen Hod
  2012-01-26 16:17                                           ` Marcelo Tosatti
  0 siblings, 1 reply; 62+ messages in thread
From: Ronen Hod @ 2012-01-25 10:48 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: leonid.moiseichuk, penberg, riel, minchan, linux-mm,
	linux-kernel, kamezawa.hiroyu, mel, rientjes, kosaki.motohiro,
	hannes, akpm, kosaki.motohiro

On 01/25/2012 12:12 PM, Marcelo Tosatti wrote:
> On Wed, Jan 25, 2012 at 10:52:24AM +0200, Ronen Hod wrote:
>> On 01/24/2012 08:10 PM, Marcelo Tosatti wrote:
>>> On Tue, Jan 24, 2012 at 06:08:31PM +0200, Ronen Hod wrote:
>>>> On 01/24/2012 05:38 PM, Marcelo Tosatti wrote:
>>>>> On Thu, Jan 19, 2012 at 10:53:29AM +0000, leonid.moiseichuk@nokia.com wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: ext Ronen Hod [mailto:rhod@redhat.com]
>>>>>>> Sent: 19 January, 2012 11:20
>>>>>>> To: Pekka Enberg
>>>>>> ...
>>>>>>>>>> Isn't
>>>>>>>>>>
>>>>>>>>>> /proc/sys/vm/min_free_kbytes
>>>>>>>>>>
>>>>>>>>>> pretty much just that?
>>>>>>>>> Would you suggest to use min_free_kbytes as the threshold for sending
>>>>>>>>> low_memory_notifications to applications, and separately as a target
>>>>>>>>> value for the applications' memory giveaway?
>>>>>>>> I'm not saying that the kernel should use it directly but it seems
>>>>>>>> like the kind of "ideal number of free pages" threshold you're
>>>>>>>> suggesting. So userspace can read that value and use it as the "number
>>>>>>>> of free pages" threshold for VM events, no?
>>>>>>> Yes, I like it. The rules of the game are simple and consistent all over, be it the
>>>>>>> alert threshold, voluntary polling by the apps, and for concurrent work by
>>>>>>> several applications.
>>>>>>> Well, as long as it provides a good indication for low_mem_pressure.
>>>>>> To me that doesn't make much sense. min_free_kbytes can be set from user space (or auto-tuned by the kernel) to keep some amount
>>>>>> of memory available for GFP_ATOMIC allocations. If free memory drops below that level, the kernel will reclaim memory from e.g. caches.
>>>>>>
>>>>>> From a potential user's point of view, the proposed API lacks a number of things that would be nice to have:
>>>>>> 1. Rename this API from low_mem_pressure to something more closely related to notification and the memory situation in the system: memory_pressure, memnotify, memory_level, etc. The word "low" is misleading here.
>>>>>> 2. The API must use deferrable timers to avoid an impact on use time. A deferrable timer fires only on a HW event or a non-deferrable timer, so if the device sleeps the timer might be skipped, which is what user space expects.
>>>>> Having userspace specify the "sample period" for low memory notification
>>>>> makes no sense. The frequency of notifications is a function of the
>>>>> memory pressure.
>>>>>
>>>>>> 3. The API should be tunable to propagate changes when the level goes up or down, maybe both ways.
>>>>>> 4. To avoid triggering too many events it probably makes sense to filter according to the amount of change, but that is optional. If a subscriber sets the timer to 1s, the number of events should not be very large.
>>>>>> 5. The API must provide an interface to request parameters, e.g. available swap or free memory, just to have some baseline.
>>>>> It would make the interface easier to use if it provided the number of
>>>>> pages to free, in the notification (kernel can calculate that as the
>>>>> delta between current_free_pages ->    comfortable_free_pages relative to
>>>>> process RSS).
>>>> If you rely on the notification's argument you lose several features:
>>>>   - Handling of notifications by several applications in parallel
>>> Each application has its argument built in a custom fashion
>>> (pages_to_free = delta between current_free_pages ->
>>> comfortable_free_pages relative to process RSS), or something to that
>>> effect. It is compatible with parallel notifications.
>> Not sure that I got it. Do you suggest asking all the applications to free, say, 3% of their memory?
>> Some may be able to free more, and some cannot free any. Isn't it more practical to just notify them, and let each app contribute its part to the global moving target?
> The problem is, how is each process supposed to know how much memory
> it should free for each notification received, that is, its part?
>
> It's easier if there is a goal, a hint of how many pages the process
> should release.

I have to agree.
Still, the amount of memory that an app should free per memory-pressure level can best be calculated inside the application (based on comfortable_free_pages relative to process RSS, as you suggested). Fairness is also an issue.
And if the memory pressure ends in the meantime, would you recommend that the application continue with its work?
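
To make the question concrete, here is a minimal sketch of an application-side release loop that re-checks the free-page count before every round and stops as soon as the pressure is gone. It is only an illustration, not code from the patch set: the names next_release_chunk, release_until_settled and struct release_ops, the max_chunk and max_rounds limits, and the idea of comparing free pages against a single target are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide how many pages to release in the next round.
 * Returns 0 when the loop should stop: either free memory is back at the
 * target (the pressure ended) or the application has nothing left to give.
 */
size_t next_release_chunk(size_t free_pages, size_t target_free_pages,
                          size_t releasable_pages, size_t max_chunk)
{
    if (free_pages >= target_free_pages)
        return 0;

    size_t chunk = target_free_pages - free_pages; /* remaining deficit */
    if (chunk > max_chunk)
        chunk = max_chunk;
    if (chunk > releasable_pages)
        chunk = releasable_pages;
    return chunk;
}

/* Hypothetical hooks supplied by the application. */
struct release_ops {
    size_t (*read_free_pages)(void *ctx);  /* e.g. parsed from /proc/meminfo */
    size_t (*releasable_pages)(void *ctx); /* what the caches can still drop */
    void (*release_pages)(void *ctx, size_t n);
};

/*
 * Release in small rounds, sampling the free-page count before each one.
 * max_rounds bounds the loop in case releasing has no visible effect.
 * Returns the total number of pages the application was asked to release.
 */
size_t release_until_settled(const struct release_ops *ops, void *ctx,
                             size_t target_free_pages, size_t max_chunk,
                             unsigned max_rounds)
{
    size_t total = 0;

    assert(ops && ops->read_free_pages && ops->releasable_pages &&
           ops->release_pages);

    for (unsigned round = 0; round < max_rounds; round++) {
        size_t n = next_release_chunk(ops->read_free_pages(ctx),
                                      target_free_pages,
                                      ops->releasable_pages(ctx),
                                      max_chunk);
        if (n == 0)
            break;
        ops->release_pages(ctx, n);
        total += n;
    }
    return total;
}
```

With a loop of this shape the application stops by itself once free memory is back above the target, whatever amount the notification originally suggested.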

Ronen.

>
>>>>   - Voluntary application's decisions, such as cleanup or avoiding allocations, at the application's convenience.
>>> I am suggesting an additional field in the notification data so that the
>>> freeing routine has a goal. But it is not mandatory.
>> If you do want to support voluntary (notification-less) app decisions based on the current status, then why not be satisfied with this API and use the notifications only to trigger this procedure?
>>
>>>> - Iterative release loops, until there are enough free pages.
>>> What is the advantage versus releasing the necessary amount of
>>> memory in a given moment?
>> The cleanup logic may be unaware of the page-level effects of its alloc and free, more so when freeing complex internal data structures (such as cached web pages), and this way you let it free until things settle down.
>>
>> Ronen.
>>
>>>> I believe that the notification should only serve as a trigger to run the cleanup.
>>> Agree.
>>>
>>>



* Re: [RFC 1/3] /dev/low_mem_notify
  2012-01-25 10:48                                         ` Ronen Hod
@ 2012-01-26 16:17                                           ` Marcelo Tosatti
  0 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-01-26 16:17 UTC (permalink / raw)
  To: Ronen Hod
  Cc: leonid.moiseichuk, penberg, riel, minchan, linux-mm,
	linux-kernel, kamezawa.hiroyu, mel, rientjes, kosaki.motohiro,
	hannes, akpm, kosaki.motohiro

> >it should free for each notification received, that is, its part?
> >
> >It's easier if there is a goal, a hint of how many pages the process
> >should release.
> 
> I have to agree.
> Still, the amount of memory that an app should free per memory-pressure-level can be best calculated inside the application (based on comfortable_free_pages relative to process RSS, as you suggested).

It is easier if the kernel calculates the target (the application is
free to ignore the hint, of course), because it depends on information 
not readily available in userspace.
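
As a rough illustration of such a hint, the sketch below splits the free-page deficit among processes in proportion to their RSS. This is only one possible reading of "relative to process RSS"; the function name pages_to_free_hint, the total_rss_pages parameter and the round-up policy are assumptions and not part of the RFC.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-process hint: this process's share of the free-page deficit,
 * proportional to its RSS. Returns 0 when there is no deficit.
 * Assumes deficit * process_rss_pages fits in 64 bits, which holds for
 * any realistic page counts.
 */
uint64_t pages_to_free_hint(uint64_t current_free_pages,
                            uint64_t comfortable_free_pages,
                            uint64_t process_rss_pages,
                            uint64_t total_rss_pages)
{
    if (current_free_pages >= comfortable_free_pages)
        return 0;
    if (process_rss_pages == 0 || total_rss_pages == 0)
        return 0;
    if (process_rss_pages > total_rss_pages)
        process_rss_pages = total_rss_pages;

    uint64_t deficit = comfortable_free_pages - current_free_pages;

    /* Round up so that small processes still get a nonzero hint. */
    uint64_t share = (deficit * process_rss_pages + total_rss_pages - 1) /
                     total_rss_pages;

    assert(share <= deficit);
    return share;
}
```

For example, with 1000 free pages, a comfortable level of 4000 and a process holding a quarter of the total RSS, the hint is 750 pages. The total RSS across processes is exactly the kind of input that is awkward to obtain from userspace.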

>  Fairness is also an issue.
> And, if in the meantime the memory pressure ended, would you recommend that the application will continue with its work?

There appears to be interest in an event to notify that higher levels
of memory are available (see Leonid's email).



Thread overview: 62+ messages
2012-01-17  8:13 [RFC 0/3] low memory notify Minchan Kim
2012-01-17  8:13 ` [RFC 1/3] /dev/low_mem_notify Minchan Kim
2012-01-17  9:27   ` Pekka Enberg
2012-01-17 16:35     ` Rik van Riel
2012-01-17 18:51       ` Pekka Enberg
2012-01-17 19:30         ` Rik van Riel
2012-01-17 19:49           ` Pekka Enberg
2012-01-17 19:54             ` Pekka Enberg
2012-01-17 19:57             ` Pekka Enberg
2012-01-17 23:20         ` Minchan Kim
2012-01-18  7:16           ` Pekka Enberg
2012-01-18  7:49             ` Minchan Kim
2012-01-18  9:06         ` leonid.moiseichuk
2012-01-18  9:15           ` Pekka Enberg
2012-01-18  9:41             ` leonid.moiseichuk
2012-01-18 10:40               ` Pekka Enberg
2012-01-18 10:44                 ` leonid.moiseichuk
2012-01-18 23:34                   ` Ronen Hod
2012-01-19  7:25                     ` Pekka Enberg
2012-01-19  9:05                       ` Ronen Hod
2012-01-19  9:10                         ` Pekka Enberg
2012-01-19  9:20                           ` Ronen Hod
2012-01-19 10:53                             ` leonid.moiseichuk
2012-01-19 11:07                               ` Pekka Enberg
2012-01-19 11:54                                 ` leonid.moiseichuk
2012-01-19 11:59                                   ` Pekka Enberg
2012-01-19 12:06                                   ` Pekka Enberg
2012-01-24 15:38                               ` Marcelo Tosatti
2012-01-24 16:08                                 ` Ronen Hod
2012-01-24 18:10                                   ` Marcelo Tosatti
2012-01-25  8:52                                     ` Ronen Hod
2012-01-25 10:12                                       ` Marcelo Tosatti
2012-01-25 10:48                                         ` Ronen Hod
2012-01-26 16:17                                           ` Marcelo Tosatti
2012-01-24 16:10                                 ` Pekka Enberg
2012-01-24 18:29                                   ` Marcelo Tosatti
2012-01-25  8:19                                   ` leonid.moiseichuk
2012-01-19  7:34                   ` Pekka Enberg
2012-01-24 16:22             ` Arnd Bergmann
2012-01-18 14:30           ` Rik van Riel
2012-01-18 15:29             ` Pekka Enberg
2012-01-24 15:40         ` Marcelo Tosatti
2012-01-24 16:01           ` Pekka Enberg
2012-01-24 16:25             ` Arnd Bergmann
2012-01-24 18:32               ` Marcelo Tosatti
2012-01-24 21:57         ` Jonathan Corbet
2012-01-17  9:45   ` Pekka Enberg
2012-01-17  8:13 ` [RFC 2/3] vmscan hook Minchan Kim
2012-01-17  8:39   ` KAMEZAWA Hiroyuki
2012-01-17  9:13     ` Minchan Kim
2012-01-17 10:05       ` KAMEZAWA Hiroyuki
2012-01-17 23:08         ` Minchan Kim
2012-01-18  0:18           ` KAMEZAWA Hiroyuki
2012-01-18 14:17             ` Rik van Riel
2012-01-19  2:25               ` KAMEZAWA Hiroyuki
2012-01-19 14:42                 ` Rik van Riel
2012-01-20  0:24                   ` KAMEZAWA Hiroyuki
2012-01-17  8:13 ` [RFC 3/3] test program Minchan Kim
2012-01-17 14:38 ` [RFC 0/3] low memory notify Colin Walters
2012-01-17 15:04   ` Pekka Enberg
2012-01-17 16:44   ` Rik van Riel
2012-01-17 17:16 ` Olof Johansson
