All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-04 10:04 ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

Currently get_user_pages() always tries to allocate pages from movable zone,
as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
of get_user_pages() is easy to pin user pages for a long time(for now we found
that pages pinned as aio ring pages is such case), which is fatal for memory
hotplug/remove framework.

So the 1st patch introduces a new library function called
get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

The 2nd patch gets around the aio ring pages can't be migrated bug caused by
get_user_pages() via using the new function. It only works when configed with
CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

Lin Feng (2):
  mm: hotplug: implement non-movable version of get_user_pages()
  fs/aio.c: use non-movable version of get_user_pages() to pin ring
    pages when support memory hotremove

 fs/aio.c               |  6 +++++
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 5 files changed, 83 insertions(+)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-04 10:04 ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

Currently get_user_pages() always tries to allocate pages from movable zone,
as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
of get_user_pages() is easy to pin user pages for a long time(for now we found
that pages pinned as aio ring pages is such case), which is fatal for memory
hotplug/remove framework.

So the 1st patch introduces a new library function called
get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

The 2nd patch gets around the aio ring pages can't be migrated bug caused by
get_user_pages() via using the new function. It only works when configed with
CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

Lin Feng (2):
  mm: hotplug: implement non-movable version of get_user_pages()
  fs/aio.c: use non-movable version of get_user_pages() to pin ring
    pages when support memory hotremove

 fs/aio.c               |  6 +++++
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 5 files changed, 83 insertions(+)

-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-04 10:04 ` Lin Feng
  (?)
@ 2013-02-04 10:04   ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

get_user_pages() always tries to allocate pages from movable zone, which is not
 reliable to memory hotremove framework in some case.

This patch introduces a new library function called get_user_pages_non_movable()
 to pin pages only from zone non-movable in memory.
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 4 files changed, 77 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66e2f7c..2a25d0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas);
+#endif
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 			struct page **pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..5db811e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
 	return (idx == ZONE_NORMAL);
 }
 
+static inline int is_movable(struct zone *zone)
+{
+	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
+}
 /**
  * is_highmem - helper function to quickly check if a struct zone is a 
  *              highmem zone or not.  This is an attempt to keep references
diff --git a/mm/memory.c b/mm/memory.c
index bb1369f..e3b8e19 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -58,6 +58,8 @@
 #include <linux/elf.h>
 #include <linux/gfp.h>
 #include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/mm_inline.h>
 #include <linux/string.h>
 
 #include <asm/io.h>
@@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from
+ * non-movable zone via additional page migration.
+ */
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int ret, i, isolate_err, migrate_pre_flag;
+	LIST_HEAD(pagelist);
+
+retry:
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+
+	isolate_err = 0;
+	migrate_pre_flag = 0;
+
+	for (i = 0; i < ret; i++) {
+		if (is_movable(page_zone(pages[i]))) {
+			if (!migrate_pre_flag) {
+				if (migrate_prep())
+					goto put_page;
+				migrate_pre_flag = 1;
+			}
+
+			if (!isolate_lru_page(pages[i])) {
+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+						 page_is_file_cache(pages[i]));
+				list_add_tail(&pages[i]->lru, &pagelist);
+			} else {
+				isolate_err = 1;
+				goto put_page;
+			}
+		}
+	}
+
+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+put_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	for (i = 0; i < ret; i++)
+		put_page(pages[i]);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully? */
+		if (!ret)
+			goto retry;
+	}
+
+	putback_lru_pages(&pagelist);
+	return 0;
+}
+EXPORT_SYMBOL(get_user_pages_non_movable);
+#endif
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..1b7bd17 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	return ret ? 0 : -EBUSY;
 }
 
+/**
+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden
+ */
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
+	if (unlikely(private != 0))
+		gfp_mask &= ~__GFP_MOVABLE;
 
 	return alloc_page(gfp_mask);
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-04 10:04   ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

get_user_pages() always tries to allocate pages from movable zone, which is not
 reliable to memory hotremove framework in some case.

This patch introduces a new library function called get_user_pages_non_movable()
 to pin pages only from zone non-movable in memory.
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 4 files changed, 77 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66e2f7c..2a25d0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas);
+#endif
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 			struct page **pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..5db811e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
 	return (idx == ZONE_NORMAL);
 }
 
+static inline int is_movable(struct zone *zone)
+{
+	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
+}
 /**
  * is_highmem - helper function to quickly check if a struct zone is a 
  *              highmem zone or not.  This is an attempt to keep references
diff --git a/mm/memory.c b/mm/memory.c
index bb1369f..e3b8e19 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -58,6 +58,8 @@
 #include <linux/elf.h>
 #include <linux/gfp.h>
 #include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/mm_inline.h>
 #include <linux/string.h>
 
 #include <asm/io.h>
@@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from
+ * non-movable zone via additional page migration.
+ */
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int ret, i, isolate_err, migrate_pre_flag;
+	LIST_HEAD(pagelist);
+
+retry:
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+
+	isolate_err = 0;
+	migrate_pre_flag = 0;
+
+	for (i = 0; i < ret; i++) {
+		if (is_movable(page_zone(pages[i]))) {
+			if (!migrate_pre_flag) {
+				if (migrate_prep())
+					goto put_page;
+				migrate_pre_flag = 1;
+			}
+
+			if (!isolate_lru_page(pages[i])) {
+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+						 page_is_file_cache(pages[i]));
+				list_add_tail(&pages[i]->lru, &pagelist);
+			} else {
+				isolate_err = 1;
+				goto put_page;
+			}
+		}
+	}
+
+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+put_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	for (i = 0; i < ret; i++)
+		put_page(pages[i]);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully? */
+		if (!ret)
+			goto retry;
+	}
+
+	putback_lru_pages(&pagelist);
+	return 0;
+}
+EXPORT_SYMBOL(get_user_pages_non_movable);
+#endif
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..1b7bd17 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	return ret ? 0 : -EBUSY;
 }
 
+/**
+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden
+ */
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
+	if (unlikely(private != 0))
+		gfp_mask &= ~__GFP_MOVABLE;
 
 	return alloc_page(gfp_mask);
 }
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-04 10:04   ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

get_user_pages() always tries to allocate pages from movable zone, which is not
 reliable to memory hotremove framework in some case.

This patch introduces a new library function called get_user_pages_non_movable()
 to pin pages only from zone non-movable in memory.
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 4 files changed, 77 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66e2f7c..2a25d0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas);
+#endif
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 			struct page **pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..5db811e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
 	return (idx == ZONE_NORMAL);
 }
 
+static inline int is_movable(struct zone *zone)
+{
+	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
+}
 /**
  * is_highmem - helper function to quickly check if a struct zone is a 
  *              highmem zone or not.  This is an attempt to keep references
diff --git a/mm/memory.c b/mm/memory.c
index bb1369f..e3b8e19 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -58,6 +58,8 @@
 #include <linux/elf.h>
 #include <linux/gfp.h>
 #include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/mm_inline.h>
 #include <linux/string.h>
 
 #include <asm/io.h>
@@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from
+ * non-movable zone via additional page migration.
+ */
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int ret, i, isolate_err, migrate_pre_flag;
+	LIST_HEAD(pagelist);
+
+retry:
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+
+	isolate_err = 0;
+	migrate_pre_flag = 0;
+
+	for (i = 0; i < ret; i++) {
+		if (is_movable(page_zone(pages[i]))) {
+			if (!migrate_pre_flag) {
+				if (migrate_prep())
+					goto put_page;
+				migrate_pre_flag = 1;
+			}
+
+			if (!isolate_lru_page(pages[i])) {
+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+						 page_is_file_cache(pages[i]));
+				list_add_tail(&pages[i]->lru, &pagelist);
+			} else {
+				isolate_err = 1;
+				goto put_page;
+			}
+		}
+	}
+
+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+put_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	for (i = 0; i < ret; i++)
+		put_page(pages[i]);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully? */
+		if (!ret)
+			goto retry;
+	}
+
+	putback_lru_pages(&pagelist);
+	return 0;
+}
+EXPORT_SYMBOL(get_user_pages_non_movable);
+#endif
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..1b7bd17 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	return ret ? 0 : -EBUSY;
 }
 
+/**
+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden
+ */
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
+	if (unlikely(private != 0))
+		gfp_mask &= ~__GFP_MOVABLE;
 
 	return alloc_page(gfp_mask);
 }
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
  2013-02-04 10:04 ` Lin Feng
  (?)
@ 2013-02-04 10:04   ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

This patch gets around the aio ring pages can't be migrated bug caused by
get_user_pages() via using the new function. It only works as configed with
CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 fs/aio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..0e9b30a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
 	}
 
 	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
+					info->mmap_base, nr_pages,
+					1, 0, info->ring_pages, NULL);
+#else
 	info->nr_pages = get_user_pages(current, ctx->mm,
 					info->mmap_base, nr_pages, 
 					1, 0, info->ring_pages, NULL);
+#endif
 	up_write(&ctx->mm->mmap_sem);
 
 	if (unlikely(info->nr_pages != nr_pages)) {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-04 10:04   ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

This patch gets around the aio ring pages can't be migrated bug caused by
get_user_pages() via using the new function. It only works as configed with
CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 fs/aio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..0e9b30a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
 	}
 
 	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
+					info->mmap_base, nr_pages,
+					1, 0, info->ring_pages, NULL);
+#else
 	info->nr_pages = get_user_pages(current, ctx->mm,
 					info->mmap_base, nr_pages, 
 					1, 0, info->ring_pages, NULL);
+#endif
 	up_write(&ctx->mm->mmap_sem);
 
 	if (unlikely(info->nr_pages != nr_pages)) {
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-04 10:04   ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
  To: akpm, mgorman, bcrl, viro
  Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
	isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
	linux-fsdevel, linux-kernel, Lin Feng

This patch gets around the aio ring pages can't be migrated bug caused by
get_user_pages() via using the new function. It only works as configed with
CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 fs/aio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..0e9b30a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
 	}
 
 	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
+					info->mmap_base, nr_pages,
+					1, 0, info->ring_pages, NULL);
+#else
 	info->nr_pages = get_user_pages(current, ctx->mm,
 					info->mmap_base, nr_pages, 
 					1, 0, info->ring_pages, NULL);
+#endif
 	up_write(&ctx->mm->mmap_sem);
 
 	if (unlikely(info->nr_pages != nr_pages)) {
-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
  2013-02-04 10:04   ` Lin Feng
  (?)
@ 2013-02-04 15:18     ` Jeff Moyer
  -1 siblings, 0 replies; 88+ messages in thread
From: Jeff Moyer @ 2013-02-04 15:18 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Lin Feng <linfeng@cn.fujitsu.com> writes:

> This patch gets around the aio ring pages can't be migrated bug caused by
> get_user_pages() via using the new function. It only works as configed with
> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
>
> Cc: Benjamin LaHaise <bcrl@kvack.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
> Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  fs/aio.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 71f613c..0e9b30a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>  	}
>  
>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
> +					info->mmap_base, nr_pages,
> +					1, 0, info->ring_pages, NULL);
> +#else
>  	info->nr_pages = get_user_pages(current, ctx->mm,
>  					info->mmap_base, nr_pages, 
>  					1, 0, info->ring_pages, NULL);
> +#endif

Can't you hide this in your 1/1 patch, by providing this function as
just a static inline wrapper around get_user_pages when
CONFIG_MEMORY_HOTREMOVE is not enabled?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-04 15:18     ` Jeff Moyer
  0 siblings, 0 replies; 88+ messages in thread
From: Jeff Moyer @ 2013-02-04 15:18 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Lin Feng <linfeng@cn.fujitsu.com> writes:

> This patch gets around the aio ring pages can't be migrated bug caused by
> get_user_pages() via using the new function. It only works as configed with
> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
>
> Cc: Benjamin LaHaise <bcrl@kvack.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
> Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  fs/aio.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 71f613c..0e9b30a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>  	}
>  
>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
> +					info->mmap_base, nr_pages,
> +					1, 0, info->ring_pages, NULL);
> +#else
>  	info->nr_pages = get_user_pages(current, ctx->mm,
>  					info->mmap_base, nr_pages, 
>  					1, 0, info->ring_pages, NULL);
> +#endif

Can't you hide this in your 1/1 patch, by providing this function as
just a static inline wrapper around get_user_pages when
CONFIG_MEMORY_HOTREMOVE is not enabled?

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-04 15:18     ` Jeff Moyer
  0 siblings, 0 replies; 88+ messages in thread
From: Jeff Moyer @ 2013-02-04 15:18 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Lin Feng <linfeng@cn.fujitsu.com> writes:

> This patch gets around the aio ring pages can't be migrated bug caused by
> get_user_pages() via using the new function. It only works as configed with
> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
>
> Cc: Benjamin LaHaise <bcrl@kvack.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
> Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
> ---
>  fs/aio.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 71f613c..0e9b30a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>  	}
>  
>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
> +					info->mmap_base, nr_pages,
> +					1, 0, info->ring_pages, NULL);
> +#else
>  	info->nr_pages = get_user_pages(current, ctx->mm,
>  					info->mmap_base, nr_pages, 
>  					1, 0, info->ring_pages, NULL);
> +#endif

Can't you hide this in your 1/1 patch, by providing this function as
just a static inline wrapper around get_user_pages when
CONFIG_MEMORY_HOTREMOVE is not enabled?

Cheers,
Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
  2013-02-04 15:18     ` Jeff Moyer
  (?)
@ 2013-02-04 23:02       ` Zach Brown
  -1 siblings, 0 replies; 88+ messages in thread
From: Zach Brown @ 2013-02-04 23:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Lin Feng, akpm, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

> > index 71f613c..0e9b30a 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
> >  	}
> >  
> >  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
> > +					info->mmap_base, nr_pages,
> > +					1, 0, info->ring_pages, NULL);
> > +#else
> >  	info->nr_pages = get_user_pages(current, ctx->mm,
> >  					info->mmap_base, nr_pages, 
> >  					1, 0, info->ring_pages, NULL);
> > +#endif
> 
> Can't you hide this in your 1/1 patch, by providing this function as
> just a static inline wrapper around get_user_pages when
> CONFIG_MEMORY_HOTREMOVE is not enabled?

Yes, please.  Having callers duplicate the call site for a single
optional boolean input is unacceptable.

But do we want another input argument as a name?  Should aio have been
using get_user_pages_fast()? (and so now _fast_non_movable?)

I wonder if it's time to offer the booleans as a _flags() variant, much
like the current internal flags for __get_user_pages().  The write and
force arguments are already booleans, we have a different fast api, and
now we're adding non-movable.  The NON_MOVABLE flag would be 0 without
MEMORY_HOTREMOVE, easy peasy.

Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
also be nice :).

No?

- z

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-04 23:02       ` Zach Brown
  0 siblings, 0 replies; 88+ messages in thread
From: Zach Brown @ 2013-02-04 23:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Lin Feng, akpm, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

> > index 71f613c..0e9b30a 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
> >  	}
> >  
> >  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
> > +					info->mmap_base, nr_pages,
> > +					1, 0, info->ring_pages, NULL);
> > +#else
> >  	info->nr_pages = get_user_pages(current, ctx->mm,
> >  					info->mmap_base, nr_pages, 
> >  					1, 0, info->ring_pages, NULL);
> > +#endif
> 
> Can't you hide this in your 1/1 patch, by providing this function as
> just a static inline wrapper around get_user_pages when
> CONFIG_MEMORY_HOTREMOVE is not enabled?

Yes, please.  Having callers duplicate the call site for a single
optional boolean input is unacceptable.

But do we want another input argument as a name?  Should aio have been
using get_user_pages_fast()? (and so now _fast_non_movable?)

I wonder if it's time to offer the booleans as a _flags() variant, much
like the current internal flags for __get_user_pages().  The write and
force arguments are already booleans, we have a different fast api, and
now we're adding non-movable.  The NON_MOVABLE flag would be 0 without
MEMORY_HOTREMOVE, easy peasy.

Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
also be nice :).

No?

- z

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-04 23:02       ` Zach Brown
  0 siblings, 0 replies; 88+ messages in thread
From: Zach Brown @ 2013-02-04 23:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Lin Feng, akpm, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

> > index 71f613c..0e9b30a 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
> >  	}
> >  
> >  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
> > +					info->mmap_base, nr_pages,
> > +					1, 0, info->ring_pages, NULL);
> > +#else
> >  	info->nr_pages = get_user_pages(current, ctx->mm,
> >  					info->mmap_base, nr_pages, 
> >  					1, 0, info->ring_pages, NULL);
> > +#endif
> 
> Can't you hide this in your 1/1 patch, by providing this function as
> just a static inline wrapper around get_user_pages when
> CONFIG_MEMORY_HOTREMOVE is not enabled?

Yes, please.  Having callers duplicate the call site for a single
optional boolean input is unacceptable.

But do we want another input argument as a name?  Should aio have been
using get_user_pages_fast()? (and so now _fast_non_movable?)

I wonder if it's time to offer the booleans as a _flags() variant, much
like the current internal flags for __get_user_pages().  The write and
force arguments are already booleans, we have a different fast api, and
now we're adding non-movable.  The NON_MOVABLE flag would be 0 without
MEMORY_HOTREMOVE, easy peasy.

Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
also be nice :).

No?

- z

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-04 10:04   ` Lin Feng
  (?)
@ 2013-02-05  0:06     ` Andrew Morton
  -1 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05  0:06 UTC (permalink / raw)
  To: Lin Feng
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel


melreadthis

On Mon, 4 Feb 2013 18:04:07 +0800
Lin Feng <linfeng@cn.fujitsu.com> wrote:

> get_user_pages() always tries to allocate pages from movable zone, which is not
>  reliable to memory hotremove framework in some case.
> 
> This patch introduces a new library function called get_user_pages_non_movable()
>  to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			struct page **pages, struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas);
> +#endif

The ifdefs aren't really needed here and I encourage people to omit
them.  This keeps the header files looking neater and reduces the
chances of things later breaking because we forgot to update some
CONFIG_foo logic in a header file.  The downside is that errors will be
revealed at link time rather than at compile time, but that's a pretty
small cost.

>  struct kvec;
>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>  			struct page **pages);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 73b64a3..5db811e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>  	return (idx == ZONE_NORMAL);
>  }
>  
> +static inline int is_movable(struct zone *zone)
> +{
> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> +}

A better name would be zone_is_movable().  We haven't been very
consistent about this in mmzone.h, but zone_is_foo() is pretty common.

And a neater implementation would be

	return zone_idx(zone) == ZONE_MOVABLE;

All of which made me look at mmzone.h, and what I saw wasn't very nice :(

Please review...


From: Andrew Morton <akpm@linux-foundation.org>
Subject: include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- use zone_idx() in is_highmem() to remove large amounts of silly fluff.

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
+++ a/include/linux/mmzone.h
@@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+	return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
 static inline int is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-		zone_movable_is_highmem());
+	enum zone_type idx = zone_idx(zone);
+
+	return idx == ZONE_HIGHMEM ||
+	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
 #else
 	return 0;
 #endif
_


> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -58,6 +58,8 @@
>  #include <linux/elf.h>
>  #include <linux/gfp.h>
>  #include <linux/migrate.h>
> +#include <linux/page-isolation.h>
> +#include <linux/mm_inline.h>
>  #include <linux/string.h>
>  
>  #include <asm/io.h>
> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/**
> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
> + * non-movable zone via additional page migration.
> + */

This needs a description of why the function exists - say something
about the requirements of memory hotplug.

Also a few words describing how the function works would be good.

> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas)
> +{
> +	int ret, i, isolate_err, migrate_pre_flag;
> +	LIST_HEAD(pagelist);
> +
> +retry:
> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> +				vmas);

We should handle (ret < 0) here.  At present the function will silently
convert an error return into "return 0", which is bad.  The function
does appear to otherwise do the right thing if get_user_pages() failed,
but only due to good luck.

> +	isolate_err = 0;
> +	migrate_pre_flag = 0;
> +
> +	for (i = 0; i < ret; i++) {
> +		if (is_movable(page_zone(pages[i]))) {
> +			if (!migrate_pre_flag) {
> +				if (migrate_prep())
> +					goto put_page;
> +				migrate_pre_flag = 1;
> +			}
> +
> +			if (!isolate_lru_page(pages[i])) {
> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> +						 page_is_file_cache(pages[i]));
> +				list_add_tail(&pages[i]->lru, &pagelist);
> +			} else {
> +				isolate_err = 1;
> +				goto put_page;
> +			}
> +		}
> +	}
> +
> +	/* All pages are non movable, we are done :) */
> +	if (i == ret && list_empty(&pagelist))
> +		return ret;
> +
> +put_page:
> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> +	for (i = 0; i < ret; i++)
> +		put_page(pages[i]);
> +
> +	if (migrate_pre_flag && !isolate_err) {
> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> +					false, MIGRATE_SYNC, MR_SYSCALL);
> +		/* Steal pages from non-movable zone successfully? */
> +		if (!ret)
> +			goto retry;

This is buggy.  migrate_pages() doesn't empty its `from' argument, so
page_list must be reinitialised here (or, better, at the start of the loop).

Mel, what's up with migrate_pages()?  Shouldn't it be removing the
pages from the list when MIGRATEPAGE_SUCCESS?  The use of
list_for_each_entry_safe() suggests we used to do that...

> +	}
> +
> +	putback_lru_pages(&pagelist);
> +	return 0;
> +}
> +EXPORT_SYMBOL(get_user_pages_non_movable);
> +#endif
> +

Generally, I'd like Mel to go through (the next version of) this patch
carefully, please.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05  0:06     ` Andrew Morton
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05  0:06 UTC (permalink / raw)
  To: Lin Feng
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel


melreadthis

On Mon, 4 Feb 2013 18:04:07 +0800
Lin Feng <linfeng@cn.fujitsu.com> wrote:

> get_user_pages() always tries to allocate pages from movable zone, which is not
>  reliable to memory hotremove framework in some case.
> 
> This patch introduces a new library function called get_user_pages_non_movable()
>  to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			struct page **pages, struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas);
> +#endif

The ifdefs aren't really needed here and I encourage people to omit
them.  This keeps the header files looking neater and reduces the
chances of things later breaking because we forgot to update some
CONFIG_foo logic in a header file.  The downside is that errors will be
revealed at link time rather than at compile time, but that's a pretty
small cost.

>  struct kvec;
>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>  			struct page **pages);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 73b64a3..5db811e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>  	return (idx == ZONE_NORMAL);
>  }
>  
> +static inline int is_movable(struct zone *zone)
> +{
> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> +}

A better name would be zone_is_movable().  We haven't been very
consistent about this in mmzone.h, but zone_is_foo() is pretty common.

And a neater implementation would be

	return zone_idx(zone) == ZONE_MOVABLE;

All of which made me look at mmzone.h, and what I saw wasn't very nice :(

Please review...


From: Andrew Morton <akpm@linux-foundation.org>
Subject: include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- use zone_idx() in is_highmem() to remove large amounts of silly fluff.

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
+++ a/include/linux/mmzone.h
@@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+	return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
 static inline int is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-		zone_movable_is_highmem());
+	enum zone_type idx = zone_idx(zone);
+
+	return idx == ZONE_HIGHMEM ||
+	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
 #else
 	return 0;
 #endif
_


> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -58,6 +58,8 @@
>  #include <linux/elf.h>
>  #include <linux/gfp.h>
>  #include <linux/migrate.h>
> +#include <linux/page-isolation.h>
> +#include <linux/mm_inline.h>
>  #include <linux/string.h>
>  
>  #include <asm/io.h>
> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/**
> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
> + * non-movable zone via additional page migration.
> + */

This needs a description of why the function exists - say something
about the requirements of memory hotplug.

Also a few words describing how the function works would be good.

> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas)
> +{
> +	int ret, i, isolate_err, migrate_pre_flag;
> +	LIST_HEAD(pagelist);
> +
> +retry:
> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> +				vmas);

We should handle (ret < 0) here.  At present the function will silently
convert an error return into "return 0", which is bad.  The function
does appear to otherwise do the right thing if get_user_pages() failed,
but only due to good luck.

> +	isolate_err = 0;
> +	migrate_pre_flag = 0;
> +
> +	for (i = 0; i < ret; i++) {
> +		if (is_movable(page_zone(pages[i]))) {
> +			if (!migrate_pre_flag) {
> +				if (migrate_prep())
> +					goto put_page;
> +				migrate_pre_flag = 1;
> +			}
> +
> +			if (!isolate_lru_page(pages[i])) {
> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> +						 page_is_file_cache(pages[i]));
> +				list_add_tail(&pages[i]->lru, &pagelist);
> +			} else {
> +				isolate_err = 1;
> +				goto put_page;
> +			}
> +		}
> +	}
> +
> +	/* All pages are non movable, we are done :) */
> +	if (i == ret && list_empty(&pagelist))
> +		return ret;
> +
> +put_page:
> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> +	for (i = 0; i < ret; i++)
> +		put_page(pages[i]);
> +
> +	if (migrate_pre_flag && !isolate_err) {
> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> +					false, MIGRATE_SYNC, MR_SYSCALL);
> +		/* Steal pages from non-movable zone successfully? */
> +		if (!ret)
> +			goto retry;

This is buggy.  migrate_pages() doesn't empty its `from' argument, so
page_list must be reinitialised here (or, better, at the start of the loop).

Mel, what's up with migrate_pages()?  Shouldn't it be removing the
pages from the list when MIGRATEPAGE_SUCCESS?  The use of
list_for_each_entry_safe() suggests we used to do that...

> +	}
> +
> +	putback_lru_pages(&pagelist);
> +	return 0;
> +}
> +EXPORT_SYMBOL(get_user_pages_non_movable);
> +#endif
> +

Generally, I'd like Mel to go through (the next version of) this patch
carefully, please.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05  0:06     ` Andrew Morton
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05  0:06 UTC (permalink / raw)
  To: Lin Feng
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel


melreadthis

On Mon, 4 Feb 2013 18:04:07 +0800
Lin Feng <linfeng@cn.fujitsu.com> wrote:

> get_user_pages() always tries to allocate pages from movable zone, which is not
>  reliable to memory hotremove framework in some case.
> 
> This patch introduces a new library function called get_user_pages_non_movable()
>  to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			struct page **pages, struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas);
> +#endif

The ifdefs aren't really needed here and I encourage people to omit
them.  This keeps the header files looking neater and reduces the
chances of things later breaking because we forgot to update some
CONFIG_foo logic in a header file.  The downside is that errors will be
revealed at link time rather than at compile time, but that's a pretty
small cost.

>  struct kvec;
>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>  			struct page **pages);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 73b64a3..5db811e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>  	return (idx == ZONE_NORMAL);
>  }
>  
> +static inline int is_movable(struct zone *zone)
> +{
> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> +}

A better name would be zone_is_movable().  We haven't been very
consistent about this in mmzone.h, but zone_is_foo() is pretty common.

And a neater implementation would be

	return zone_idx(zone) == ZONE_MOVABLE;

All of which made me look at mmzone.h, and what I saw wasn't very nice :(

Please review...


From: Andrew Morton <akpm@linux-foundation.org>
Subject: include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- use zone_idx() in is_highmem() to remove large amounts of silly fluff.

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
+++ a/include/linux/mmzone.h
@@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+	return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
 static inline int is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-		zone_movable_is_highmem());
+	enum zone_type idx = zone_idx(zone);
+
+	return idx == ZONE_HIGHMEM ||
+	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
 #else
 	return 0;
 #endif
_


> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -58,6 +58,8 @@
>  #include <linux/elf.h>
>  #include <linux/gfp.h>
>  #include <linux/migrate.h>
> +#include <linux/page-isolation.h>
> +#include <linux/mm_inline.h>
>  #include <linux/string.h>
>  
>  #include <asm/io.h>
> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/**
> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
> + * non-movable zone via additional page migration.
> + */

This needs a description of why the function exists - say something
about the requirements of memory hotplug.

Also a few words describing how the function works would be good.

> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> +		unsigned long start, int nr_pages, int write, int force,
> +		struct page **pages, struct vm_area_struct **vmas)
> +{
> +	int ret, i, isolate_err, migrate_pre_flag;
> +	LIST_HEAD(pagelist);
> +
> +retry:
> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> +				vmas);

We should handle (ret < 0) here.  At present the function will silently
convert an error return into "return 0", which is bad.  The function
does appear to otherwise do the right thing if get_user_pages() failed,
but only due to good luck.

> +	isolate_err = 0;
> +	migrate_pre_flag = 0;
> +
> +	for (i = 0; i < ret; i++) {
> +		if (is_movable(page_zone(pages[i]))) {
> +			if (!migrate_pre_flag) {
> +				if (migrate_prep())
> +					goto put_page;
> +				migrate_pre_flag = 1;
> +			}
> +
> +			if (!isolate_lru_page(pages[i])) {
> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> +						 page_is_file_cache(pages[i]));
> +				list_add_tail(&pages[i]->lru, &pagelist);
> +			} else {
> +				isolate_err = 1;
> +				goto put_page;
> +			}
> +		}
> +	}
> +
> +	/* All pages are non movable, we are done :) */
> +	if (i == ret && list_empty(&pagelist))
> +		return ret;
> +
> +put_page:
> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> +	for (i = 0; i < ret; i++)
> +		put_page(pages[i]);
> +
> +	if (migrate_pre_flag && !isolate_err) {
> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> +					false, MIGRATE_SYNC, MR_SYSCALL);
> +		/* Steal pages from non-movable zone successfully? */
> +		if (!ret)
> +			goto retry;

This is buggy.  migrate_pages() doesn't empty its `from' argument, so
page_list must be reinitialised here (or, better, at the start of the loop).

Mel, what's up with migrate_pages()?  Shouldn't it be removing the
pages from the list when MIGRATEPAGE_SUCCESS?  The use of
list_for_each_entry_safe() suggests we used to do that...

> +	}
> +
> +	putback_lru_pages(&pagelist);
> +	return 0;
> +}
> +EXPORT_SYMBOL(get_user_pages_non_movable);
> +#endif
> +

Generally, I'd like Mel to go through (the next version of) this patch
carefully, please.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05  0:06     ` Andrew Morton
@ 2013-02-05  0:18       ` Andrew Morton
  -1 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05  0:18 UTC (permalink / raw)
  To: Lin Feng, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel


Also...

On Mon, 4 Feb 2013 16:06:24 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > +put_page:
> > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> > +	for (i = 0; i < ret; i++)
> > +		put_page(pages[i]);

We can use release_pages() here.

release_pages() is designed to be more efficient when we're putting the
final reference to (most of) the pages.  It probably has little if any
benefit when putting still-in-use pages, as we're doing here.

But please consider...

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05  0:18       ` Andrew Morton
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05  0:18 UTC (permalink / raw)
  To: Lin Feng, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel


Also...

On Mon, 4 Feb 2013 16:06:24 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > +put_page:
> > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> > +	for (i = 0; i < ret; i++)
> > +		put_page(pages[i]);

We can use release_pages() here.

release_pages() is designed to be more efficient when we're putting the
final reference to (most of) the pages.  It probably has little if any
benefit when putting still-in-use pages, as we're doing here.

But please consider...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
  2013-02-04 10:04 ` Lin Feng
  (?)
@ 2013-02-05  0:58   ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  0:58 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hello,

On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> Currently get_user_pages() always tries to allocate pages from movable zone,
> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> of get_user_pages() is easy to pin user pages for a long time(for now we found
> that pages pinned as aio ring pages is such case), which is fatal for memory
> hotplug/remove framework.
> 
> So the 1st patch introduces a new library function called
> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> get_user_pages() via using the new function. It only works when configed with
> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

CMA has same issue but the problem is the driver developers or any subsystem
using GUP can't know their pages is in CMA area or not in advance.
So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
Even some driver module in embedded side doesn't open their source code.

I would like to make GUP smart so it allocates a page from non-movable/non-cma area
when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(

> 
> Lin Feng (2):
>   mm: hotplug: implement non-movable version of get_user_pages()
>   fs/aio.c: use non-movable version of get_user_pages() to pin ring
>     pages when support memory hotremove
> 
>  fs/aio.c               |  6 +++++
>  include/linux/mm.h     |  5 ++++
>  include/linux/mmzone.h |  4 ++++
>  mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_isolation.c    |  5 ++++
>  5 files changed, 83 insertions(+)
> 
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  0:58   ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  0:58 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hello,

On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> Currently get_user_pages() always tries to allocate pages from movable zone,
> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> of get_user_pages() is easy to pin user pages for a long time(for now we found
> that pages pinned as aio ring pages is such case), which is fatal for memory
> hotplug/remove framework.
> 
> So the 1st patch introduces a new library function called
> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> get_user_pages() via using the new function. It only works when configed with
> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

CMA has same issue but the problem is the driver developers or any subsystem
using GUP can't know their pages is in CMA area or not in advance.
So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
Even some driver module in embedded side doesn't open their source code.

I would like to make GUP smart so it allocates a page from non-movable/non-cma area
when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(

> 
> Lin Feng (2):
>   mm: hotplug: implement non-movable version of get_user_pages()
>   fs/aio.c: use non-movable version of get_user_pages() to pin ring
>     pages when support memory hotremove
> 
>  fs/aio.c               |  6 +++++
>  include/linux/mm.h     |  5 ++++
>  include/linux/mmzone.h |  4 ++++
>  mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_isolation.c    |  5 ++++
>  5 files changed, 83 insertions(+)
> 
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  0:58   ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  0:58 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hello,

On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> Currently get_user_pages() always tries to allocate pages from movable zone,
> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> of get_user_pages() is easy to pin user pages for a long time(for now we found
> that pages pinned as aio ring pages is such case), which is fatal for memory
> hotplug/remove framework.
> 
> So the 1st patch introduces a new library function called
> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> get_user_pages() via using the new function. It only works when configed with
> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().

CMA has same issue but the problem is the driver developers or any subsystem
using GUP can't know their pages is in CMA area or not in advance.
So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
Even some driver module in embedded side doesn't open their source code.

I would like to make GUP smart so it allocates a page from non-movable/non-cma area
when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(

> 
> Lin Feng (2):
>   mm: hotplug: implement non-movable version of get_user_pages()
>   fs/aio.c: use non-movable version of get_user_pages() to pin ring
>     pages when support memory hotremove
> 
>  fs/aio.c               |  6 +++++
>  include/linux/mm.h     |  5 ++++
>  include/linux/mmzone.h |  4 ++++
>  mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_isolation.c    |  5 ++++
>  5 files changed, 83 insertions(+)
> 
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05  0:06     ` Andrew Morton
  (?)
@ 2013-02-05  3:09       ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  3:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

Hi Andrew,

On 02/05/2013 08:06 AM, Andrew Morton wrote:
> 
> melreadthis
> 
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng <linfeng@cn.fujitsu.com> wrote:
> 
>> get_user_pages() always tries to allocate pages from movable zone, which is not
>>  reliable to memory hotremove framework in some case.
>>
>> This patch introduces a new library function called get_user_pages_non_movable()
>>  to pin pages only from zone non-movable in memory.
>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> non-movable zone via additional page migration.
>>
>> ...
>>
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  			struct page **pages, struct vm_area_struct **vmas);
>>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  			struct page **pages);
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>> +		unsigned long start, int nr_pages, int write, int force,
>> +		struct page **pages, struct vm_area_struct **vmas);
>> +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
OK, got it, thanks :)

> 
>>  struct kvec;
>>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>>  			struct page **pages);
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 73b64a3..5db811e 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>>  	return (idx == ZONE_NORMAL);
>>  }
>>  
>> +static inline int is_movable(struct zone *zone)
>> +{
>> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
>> +}
> 
> A better name would be zone_is_movable().  We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
Yes, zone_is_xxx() should be a better name, but there are some analogous
definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
for movable zone it will break such naming rules.
Should we update other ones in a separate patch later or just keep the old style?
 
> And a neater implementation would be
> 
> 	return zone_idx(zone) == ZONE_MOVABLE;
> 
OK. After your change, should we also update for other ones such as is_normal()..?

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng <linfeng@cn.fujitsu.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif
> _
> 
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -58,6 +58,8 @@
>>  #include <linux/elf.h>
>>  #include <linux/gfp.h>
>>  #include <linux/migrate.h>
>> +#include <linux/page-isolation.h>
>> +#include <linux/mm_inline.h>
>>  #include <linux/string.h>
>>  
>>  #include <asm/io.h>
>> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  }
>>  EXPORT_SYMBOL(get_user_pages);
>>  
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +/**
>> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> + * non-movable zone via additional page migration.
>> + */
> 
> This needs a description of why the function exists - say something
> about the requirements of memory hotplug.
> 
> Also a few words describing how the function works would be good.
OK, I will add them in next version.

> 
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>> +		unsigned long start, int nr_pages, int write, int force,
>> +		struct page **pages, struct vm_area_struct **vmas)
>> +{
>> +	int ret, i, isolate_err, migrate_pre_flag;
>> +	LIST_HEAD(pagelist);
>> +
>> +retry:
>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>> +				vmas);
> 
> We should handle (ret < 0) here.  At present the function will silently
> convert an error return into "return 0", which is bad.  The function
> does appear to otherwise do the right thing if get_user_pages() failed,
> but only due to good luck.
Sorry, I forgot this, thanks.

> 
>> +	isolate_err = 0;
>> +	migrate_pre_flag = 0;
>> +
>> +	for (i = 0; i < ret; i++) {
>> +		if (is_movable(page_zone(pages[i]))) {
>> +			if (!migrate_pre_flag) {
>> +				if (migrate_prep())
>> +					goto put_page;
>> +				migrate_pre_flag = 1;
>> +			}
>> +
>> +			if (!isolate_lru_page(pages[i])) {
>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>> +						 page_is_file_cache(pages[i]));
>> +				list_add_tail(&pages[i]->lru, &pagelist);
>> +			} else {
>> +				isolate_err = 1;
>> +				goto put_page;
>> +			}
>> +		}
>> +	}
>> +
>> +	/* All pages are non movable, we are done :) */
>> +	if (i == ret && list_empty(&pagelist))
>> +		return ret;
>> +
>> +put_page:
>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>> +	for (i = 0; i < ret; i++)
>> +		put_page(pages[i]);
>> +
>> +	if (migrate_pre_flag && !isolate_err) {
>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>> +		/* Steal pages from non-movable zone successfully? */
>> +		if (!ret)
>> +			goto retry;
> 
> This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).
migrate_pages()
  list_for_each_entry_safe()
     unmap_and_move()
       if (rc != -EAGAIN) {
         list_del(&page->lru);
       }
IUUC, the page migrated successfully will be remove from the `from` list here.
So if all pages having been migrated successlly, the list will be empty, and we goto retry.
Otherwise the rest of the immigrated pages will be handled later by putback_lru_pages(&pagelist),
and the function just return.

Also there are many places using such logic:
1274                         nr_failed = migrate_pages(&pagelist, new_vma_page,
1275                                                 (unsigned long)vma,
1276                                                 false, MIGRATE_SYNC,
1277                                                 MR_MEMPOLICY_MBIND);
1278                         if (nr_failed)
1279                                 putback_lru_pages(&pagelist);
Since we traverse the list and handle each page separately, It's likely that we
have migrated some page successfully before we encounter a failure.
If the pagelist is still carrying the successfully migrated pages when migrate_pages()
return, such code is bugyy.

> 
> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
> 
>> +	}
>> +
>> +	putback_lru_pages(&pagelist);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(get_user_pages_non_movable);
>> +#endif
>> +
> 
> Generally, I'd like Mel to go through (the next version of) this patch
> carefully, please.
> 
> 

On 02/05/2013 08:18 AM, Andrew Morton wrote:> On Mon, 4 Feb 2013 16:06:24 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>>> > > +put_page:
>>> > > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> > > +	for (i = 0; i < ret; i++)
>>> > > +		put_page(pages[i]);
> We can use release_pages() here.
> 
> release_pages() is designed to be more efficient when we're putting the
> final reference to (most of) the pages.  It probably has little if any
> benefit when putting still-in-use pages, as we're doing here.
> 
> But please consider...
OK, I will try :)

> 

thanks,
linfeng


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05  3:09       ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  3:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

Hi Andrew,

On 02/05/2013 08:06 AM, Andrew Morton wrote:
> 
> melreadthis
> 
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng <linfeng@cn.fujitsu.com> wrote:
> 
>> get_user_pages() always tries to allocate pages from movable zone, which is not
>>  reliable to memory hotremove framework in some case.
>>
>> This patch introduces a new library function called get_user_pages_non_movable()
>>  to pin pages only from zone non-movable in memory.
>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> non-movable zone via additional page migration.
>>
>> ...
>>
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  			struct page **pages, struct vm_area_struct **vmas);
>>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  			struct page **pages);
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>> +		unsigned long start, int nr_pages, int write, int force,
>> +		struct page **pages, struct vm_area_struct **vmas);
>> +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
OK, got it, thanks :)

> 
>>  struct kvec;
>>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>>  			struct page **pages);
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 73b64a3..5db811e 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>>  	return (idx == ZONE_NORMAL);
>>  }
>>  
>> +static inline int is_movable(struct zone *zone)
>> +{
>> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
>> +}
> 
> A better name would be zone_is_movable().  We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
Yes, zone_is_xxx() should be a better name, but there are some analogous
definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
for movable zone it will break such naming rules.
Should we update other ones in a separate patch later or just keep the old style?
 
> And a neater implementation would be
> 
> 	return zone_idx(zone) == ZONE_MOVABLE;
> 
OK. After your change, should we also update for other ones such as is_normal()..?

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng <linfeng@cn.fujitsu.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif
> _
> 
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -58,6 +58,8 @@
>>  #include <linux/elf.h>
>>  #include <linux/gfp.h>
>>  #include <linux/migrate.h>
>> +#include <linux/page-isolation.h>
>> +#include <linux/mm_inline.h>
>>  #include <linux/string.h>
>>  
>>  #include <asm/io.h>
>> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  }
>>  EXPORT_SYMBOL(get_user_pages);
>>  
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +/**
>> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> + * non-movable zone via additional page migration.
>> + */
> 
> This needs a description of why the function exists - say something
> about the requirements of memory hotplug.
> 
> Also a few words describing how the function works would be good.
OK, I will add them in next version.

> 
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>> +		unsigned long start, int nr_pages, int write, int force,
>> +		struct page **pages, struct vm_area_struct **vmas)
>> +{
>> +	int ret, i, isolate_err, migrate_pre_flag;
>> +	LIST_HEAD(pagelist);
>> +
>> +retry:
>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>> +				vmas);
> 
> We should handle (ret < 0) here.  At present the function will silently
> convert an error return into "return 0", which is bad.  The function
> does appear to otherwise do the right thing if get_user_pages() failed,
> but only due to good luck.
Sorry, I forgot this, thanks.

> 
>> +	isolate_err = 0;
>> +	migrate_pre_flag = 0;
>> +
>> +	for (i = 0; i < ret; i++) {
>> +		if (is_movable(page_zone(pages[i]))) {
>> +			if (!migrate_pre_flag) {
>> +				if (migrate_prep())
>> +					goto put_page;
>> +				migrate_pre_flag = 1;
>> +			}
>> +
>> +			if (!isolate_lru_page(pages[i])) {
>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>> +						 page_is_file_cache(pages[i]));
>> +				list_add_tail(&pages[i]->lru, &pagelist);
>> +			} else {
>> +				isolate_err = 1;
>> +				goto put_page;
>> +			}
>> +		}
>> +	}
>> +
>> +	/* All pages are non movable, we are done :) */
>> +	if (i == ret && list_empty(&pagelist))
>> +		return ret;
>> +
>> +put_page:
>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>> +	for (i = 0; i < ret; i++)
>> +		put_page(pages[i]);
>> +
>> +	if (migrate_pre_flag && !isolate_err) {
>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>> +		/* Steal pages from non-movable zone successfully? */
>> +		if (!ret)
>> +			goto retry;
> 
> This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).
migrate_pages()
  list_for_each_entry_safe()
     unmap_and_move()
       if (rc != -EAGAIN) {
         list_del(&page->lru);
       }
IUUC, the page migrated successfully will be remove from the `from` list here.
So if all pages having been migrated successlly, the list will be empty, and we goto retry.
Otherwise the rest of the immigrated pages will be handled later by putback_lru_pages(&pagelist),
and the function just return.

Also there are many places using such logic:
1274                         nr_failed = migrate_pages(&pagelist, new_vma_page,
1275                                                 (unsigned long)vma,
1276                                                 false, MIGRATE_SYNC,
1277                                                 MR_MEMPOLICY_MBIND);
1278                         if (nr_failed)
1279                                 putback_lru_pages(&pagelist);
Since we traverse the list and handle each page separately, It's likely that we
have migrated some page successfully before we encounter a failure.
If the pagelist is still carrying the successfully migrated pages when migrate_pages()
return, such code is bugyy.

> 
> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
> 
>> +	}
>> +
>> +	putback_lru_pages(&pagelist);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(get_user_pages_non_movable);
>> +#endif
>> +
> 
> Generally, I'd like Mel to go through (the next version of) this patch
> carefully, please.
> 
> 

On 02/05/2013 08:18 AM, Andrew Morton wrote:> On Mon, 4 Feb 2013 16:06:24 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>>> > > +put_page:
>>> > > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> > > +	for (i = 0; i < ret; i++)
>>> > > +		put_page(pages[i]);
> We can use release_pages() here.
> 
> release_pages() is designed to be more efficient when we're putting the
> final reference to (most of) the pages.  It probably has little if any
> benefit when putting still-in-use pages, as we're doing here.
> 
> But please consider...
OK, I will try :)

> 

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05  3:09       ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  3:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

Hi Andrew,

On 02/05/2013 08:06 AM, Andrew Morton wrote:
> 
> melreadthis
> 
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng <linfeng@cn.fujitsu.com> wrote:
> 
>> get_user_pages() always tries to allocate pages from movable zone, which is not
>>  reliable to memory hotremove framework in some case.
>>
>> This patch introduces a new library function called get_user_pages_non_movable()
>>  to pin pages only from zone non-movable in memory.
>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> non-movable zone via additional page migration.
>>
>> ...
>>
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  			struct page **pages, struct vm_area_struct **vmas);
>>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  			struct page **pages);
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>> +		unsigned long start, int nr_pages, int write, int force,
>> +		struct page **pages, struct vm_area_struct **vmas);
>> +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
OK, got it, thanks :)

> 
>>  struct kvec;
>>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>>  			struct page **pages);
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 73b64a3..5db811e 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>>  	return (idx == ZONE_NORMAL);
>>  }
>>  
>> +static inline int is_movable(struct zone *zone)
>> +{
>> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
>> +}
> 
> A better name would be zone_is_movable().  We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
Yes, zone_is_xxx() should be a better name, but there are some analogous
definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
for movable zone it will break such naming rules.
Should we update other ones in a separate patch later or just keep the old style?
 
> And a neater implementation would be
> 
> 	return zone_idx(zone) == ZONE_MOVABLE;
> 
OK. After your change, should we also update for other ones such as is_normal()..?

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng <linfeng@cn.fujitsu.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif
> _
> 
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -58,6 +58,8 @@
>>  #include <linux/elf.h>
>>  #include <linux/gfp.h>
>>  #include <linux/migrate.h>
>> +#include <linux/page-isolation.h>
>> +#include <linux/mm_inline.h>
>>  #include <linux/string.h>
>>  
>>  #include <asm/io.h>
>> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  }
>>  EXPORT_SYMBOL(get_user_pages);
>>  
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +/**
>> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> + * non-movable zone via additional page migration.
>> + */
> 
> This needs a description of why the function exists - say something
> about the requirements of memory hotplug.
> 
> Also a few words describing how the function works would be good.
OK, I will add them in next version.

> 
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>> +		unsigned long start, int nr_pages, int write, int force,
>> +		struct page **pages, struct vm_area_struct **vmas)
>> +{
>> +	int ret, i, isolate_err, migrate_pre_flag;
>> +	LIST_HEAD(pagelist);
>> +
>> +retry:
>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>> +				vmas);
> 
> We should handle (ret < 0) here.  At present the function will silently
> convert an error return into "return 0", which is bad.  The function
> does appear to otherwise do the right thing if get_user_pages() failed,
> but only due to good luck.
Sorry, I forgot this, thanks.

> 
>> +	isolate_err = 0;
>> +	migrate_pre_flag = 0;
>> +
>> +	for (i = 0; i < ret; i++) {
>> +		if (is_movable(page_zone(pages[i]))) {
>> +			if (!migrate_pre_flag) {
>> +				if (migrate_prep())
>> +					goto put_page;
>> +				migrate_pre_flag = 1;
>> +			}
>> +
>> +			if (!isolate_lru_page(pages[i])) {
>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>> +						 page_is_file_cache(pages[i]));
>> +				list_add_tail(&pages[i]->lru, &pagelist);
>> +			} else {
>> +				isolate_err = 1;
>> +				goto put_page;
>> +			}
>> +		}
>> +	}
>> +
>> +	/* All pages are non movable, we are done :) */
>> +	if (i == ret && list_empty(&pagelist))
>> +		return ret;
>> +
>> +put_page:
>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>> +	for (i = 0; i < ret; i++)
>> +		put_page(pages[i]);
>> +
>> +	if (migrate_pre_flag && !isolate_err) {
>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>> +		/* Steal pages from non-movable zone successfully? */
>> +		if (!ret)
>> +			goto retry;
> 
> This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).
migrate_pages()
  list_for_each_entry_safe()
     unmap_and_move()
       if (rc != -EAGAIN) {
         list_del(&page->lru);
       }
IUUC, the page migrated successfully will be remove from the `from` list here.
So if all pages having been migrated successlly, the list will be empty, and we goto retry.
Otherwise the rest of the immigrated pages will be handled later by putback_lru_pages(&pagelist),
and the function just return.

Also there are many places using such logic:
1274                         nr_failed = migrate_pages(&pagelist, new_vma_page,
1275                                                 (unsigned long)vma,
1276                                                 false, MIGRATE_SYNC,
1277                                                 MR_MEMPOLICY_MBIND);
1278                         if (nr_failed)
1279                                 putback_lru_pages(&pagelist);
Since we traverse the list and handle each page separately, It's likely that we
have migrated some page successfully before we encounter a failure.
If the pagelist is still carrying the successfully migrated pages when migrate_pages()
return, such code is bugyy.

> 
> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
> 
>> +	}
>> +
>> +	putback_lru_pages(&pagelist);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(get_user_pages_non_movable);
>> +#endif
>> +
> 
> Generally, I'd like Mel to go through (the next version of) this patch
> carefully, please.
> 
> 

On 02/05/2013 08:18 AM, Andrew Morton wrote:> On Mon, 4 Feb 2013 16:06:24 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>>> > > +put_page:
>>> > > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> > > +	for (i = 0; i < ret; i++)
>>> > > +		put_page(pages[i]);
> We can use release_pages() here.
> 
> release_pages() is designed to be more efficient when we're putting the
> final reference to (most of) the pages.  It probably has little if any
> benefit when putting still-in-use pages, as we're doing here.
> 
> But please consider...
OK, I will try :)

> 

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
  2013-02-05  0:58   ` Minchan Kim
@ 2013-02-05  4:42     ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  4:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Minchan,

On 02/05/2013 08:58 AM, Minchan Kim wrote:
> Hello,
> 
> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
>> Currently get_user_pages() always tries to allocate pages from movable zone,
>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
>> of get_user_pages() is easy to pin user pages for a long time(for now we found
>> that pages pinned as aio ring pages is such case), which is fatal for memory
>> hotplug/remove framework.
>>
>> So the 1st patch introduces a new library function called
>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> non-movable zone via additional page migration.
>>
>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
>> get_user_pages() via using the new function. It only works when configed with
>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> 
> CMA has same issue but the problem is the driver developers or any subsystem
> using GUP can't know their pages is in CMA area or not in advance.
> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> Even some driver module in embedded side doesn't open their source code.
Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
of GUP, they will release the pinned pages immediately and to such users they should get
a good performance, using the old style interface is a smart way. And we had better just
deal with the cases we have to by using the new interface.
  
> 
> I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(
As I debuged the get_user_pages(), I found that some pages is already there and may be
allocated before we call get_user_pages(). __get_user_pages() have following logic to
handle such case.
1786                         while (!(page = follow_page(vma, start, foll_flags))) {
1787                                 int ret;
To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
as smart as we want :( , so I worked out the migration approach to get around and 
avoid messing up the current code :)

thanks,
linfeng 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  4:42     ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  4:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Minchan,

On 02/05/2013 08:58 AM, Minchan Kim wrote:
> Hello,
> 
> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
>> Currently get_user_pages() always tries to allocate pages from movable zone,
>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
>> of get_user_pages() is easy to pin user pages for a long time(for now we found
>> that pages pinned as aio ring pages is such case), which is fatal for memory
>> hotplug/remove framework.
>>
>> So the 1st patch introduces a new library function called
>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> non-movable zone via additional page migration.
>>
>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
>> get_user_pages() via using the new function. It only works when configed with
>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> 
> CMA has same issue but the problem is the driver developers or any subsystem
> using GUP can't know their pages is in CMA area or not in advance.
> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> Even some driver module in embedded side doesn't open their source code.
Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
of GUP, they will release the pinned pages immediately and to such users they should get
a good performance, using the old style interface is a smart way. And we had better just
deal with the cases we have to by using the new interface.
  
> 
> I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(
As I debuged the get_user_pages(), I found that some pages is already there and may be
allocated before we call get_user_pages(). __get_user_pages() have following logic to
handle such case.
1786                         while (!(page = follow_page(vma, start, foll_flags))) {
1787                                 int ret;
To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
as smart as we want :( , so I worked out the migration approach to get around and 
avoid messing up the current code :)

thanks,
linfeng 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
  2013-02-04 15:18     ` Jeff Moyer
@ 2013-02-05  5:06       ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  5:06 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

Hi Jeff,

On 02/04/2013 11:18 PM, Jeff Moyer wrote:
>> ---
>>  fs/aio.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 71f613c..0e9b30a 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>>  	}
>>  
>>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
>> +					info->mmap_base, nr_pages,
>> +					1, 0, info->ring_pages, NULL);
>> +#else
>>  	info->nr_pages = get_user_pages(current, ctx->mm,
>>  					info->mmap_base, nr_pages, 
>>  					1, 0, info->ring_pages, NULL);
>> +#endif
> 
> Can't you hide this in your 1/1 patch, by providing this function as
> just a static inline wrapper around get_user_pages when
> CONFIG_MEMORY_HOTREMOVE is not enabled?
Good idea, it makes the callers more neatly :)

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-05  5:06       ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  5:06 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

Hi Jeff,

On 02/04/2013 11:18 PM, Jeff Moyer wrote:
>> ---
>>  fs/aio.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 71f613c..0e9b30a 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>>  	}
>>  
>>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
>> +					info->mmap_base, nr_pages,
>> +					1, 0, info->ring_pages, NULL);
>> +#else
>>  	info->nr_pages = get_user_pages(current, ctx->mm,
>>  					info->mmap_base, nr_pages, 
>>  					1, 0, info->ring_pages, NULL);
>> +#endif
> 
> Can't you hide this in your 1/1 patch, by providing this function as
> just a static inline wrapper around get_user_pages when
> CONFIG_MEMORY_HOTREMOVE is not enabled?
Good idea, it makes the callers more neatly :)

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
  2013-02-05  4:42     ` Lin Feng
  (?)
@ 2013-02-05  5:25       ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  5:25 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Lin,

On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
> Hi Minchan,
> 
> On 02/05/2013 08:58 AM, Minchan Kim wrote:
> > Hello,
> > 
> > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> >> Currently get_user_pages() always tries to allocate pages from movable zone,
> >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> >> of get_user_pages() is easy to pin user pages for a long time(for now we found
> >> that pages pinned as aio ring pages is such case), which is fatal for memory
> >> hotplug/remove framework.
> >>
> >> So the 1st patch introduces a new library function called
> >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> >> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> >> non-movable zone via additional page migration.
> >>
> >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> >> get_user_pages() via using the new function. It only works when configed with
> >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> > 
> > CMA has same issue but the problem is the driver developers or any subsystem
> > using GUP can't know their pages is in CMA area or not in advance.
> > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> > Even some driver module in embedded side doesn't open their source code.
> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
> of GUP, they will release the pinned pages immediately and to such users they should get
> a good performance, using the old style interface is a smart way. And we had better just
> deal with the cases we have to by using the new interface.

Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
in advance they will use pinned pages long time or release in short time because it depends
on some event like user's response which is very not predetermined.
So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
failing migration by page count and then, investigate they are really using GUP and it's
REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?

Yes. it could be done but it would be rather trobulesome job.

>   
> > 
> > I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(
> As I debuged the get_user_pages(), I found that some pages is already there and may be
> allocated before we call get_user_pages(). __get_user_pages() have following logic to
> handle such case.
> 1786                         while (!(page = follow_page(vma, start, foll_flags))) {
> 1787                                 int ret;
> To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
> as smart as we want :( , so I worked out the migration approach to get around and 
> avoid messing up the current code :)

I didn't look at your code in detail yet but What's wrong following this?

Change curret GUP with old_GUP.

#ifdef CONFIG_MIGRATE_ISOLATE

int get_user_pages_non_movable()
{
        ..
        old_get_user_pages()
        ..
}

int get_user_pages()
{
        return get_user_pages_non_movable();
}
#else
int get_user_pages()
{
        return old_get_user_pages()
}
#endif

> 
> thanks,
> linfeng 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  5:25       ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  5:25 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Lin,

On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
> Hi Minchan,
> 
> On 02/05/2013 08:58 AM, Minchan Kim wrote:
> > Hello,
> > 
> > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> >> Currently get_user_pages() always tries to allocate pages from movable zone,
> >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> >> of get_user_pages() is easy to pin user pages for a long time(for now we found
> >> that pages pinned as aio ring pages is such case), which is fatal for memory
> >> hotplug/remove framework.
> >>
> >> So the 1st patch introduces a new library function called
> >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> >> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> >> non-movable zone via additional page migration.
> >>
> >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> >> get_user_pages() via using the new function. It only works when configed with
> >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> > 
> > CMA has same issue but the problem is the driver developers or any subsystem
> > using GUP can't know their pages is in CMA area or not in advance.
> > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> > Even some driver module in embedded side doesn't open their source code.
> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
> of GUP, they will release the pinned pages immediately and to such users they should get
> a good performance, using the old style interface is a smart way. And we had better just
> deal with the cases we have to by using the new interface.

Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
in advance they will use pinned pages long time or release in short time because it depends
on some event like user's response which is very not predetermined.
So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
failing migration by page count and then, investigate they are really using GUP and it's
REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?

Yes. it could be done but it would be rather trobulesome job.

>   
> > 
> > I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(
> As I debuged the get_user_pages(), I found that some pages is already there and may be
> allocated before we call get_user_pages(). __get_user_pages() have following logic to
> handle such case.
> 1786                         while (!(page = follow_page(vma, start, foll_flags))) {
> 1787                                 int ret;
> To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
> as smart as we want :( , so I worked out the migration approach to get around and 
> avoid messing up the current code :)

I didn't look at your code in detail yet but What's wrong following this?

Change curret GUP with old_GUP.

#ifdef CONFIG_MIGRATE_ISOLATE

int get_user_pages_non_movable()
{
        ..
        old_get_user_pages()
        ..
}

int get_user_pages()
{
        return get_user_pages_non_movable();
}
#else
int get_user_pages()
{
        return old_get_user_pages()
}
#endif

> 
> thanks,
> linfeng 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  5:25       ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  5:25 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Lin,

On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
> Hi Minchan,
> 
> On 02/05/2013 08:58 AM, Minchan Kim wrote:
> > Hello,
> > 
> > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> >> Currently get_user_pages() always tries to allocate pages from movable zone,
> >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> >> of get_user_pages() is easy to pin user pages for a long time(for now we found
> >> that pages pinned as aio ring pages is such case), which is fatal for memory
> >> hotplug/remove framework.
> >>
> >> So the 1st patch introduces a new library function called
> >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> >> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> >> non-movable zone via additional page migration.
> >>
> >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> >> get_user_pages() via using the new function. It only works when configed with
> >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> > 
> > CMA has same issue but the problem is the driver developers or any subsystem
> > using GUP can't know their pages is in CMA area or not in advance.
> > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> > Even some driver module in embedded side doesn't open their source code.
> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
> of GUP, they will release the pinned pages immediately and to such users they should get
> a good performance, using the old style interface is a smart way. And we had better just
> deal with the cases we have to by using the new interface.

Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
in advance they will use pinned pages long time or release in short time because it depends
on some event like user's response which is very not predetermined.
So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
failing migration by page count and then, investigate they are really using GUP and it's
REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?

Yes. it could be done but it would be rather trobulesome job.

>   
> > 
> > I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(
> As I debuged the get_user_pages(), I found that some pages is already there and may be
> allocated before we call get_user_pages(). __get_user_pages() have following logic to
> handle such case.
> 1786                         while (!(page = follow_page(vma, start, foll_flags))) {
> 1787                                 int ret;
> To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
> as smart as we want :( , so I worked out the migration approach to get around and 
> avoid messing up the current code :)

I didn't look at your code in detail yet but What's wrong following this?

Change curret GUP with old_GUP.

#ifdef CONFIG_MIGRATE_ISOLATE

int get_user_pages_non_movable()
{
        ..
        old_get_user_pages()
        ..
}

int get_user_pages()
{
        return get_user_pages_non_movable();
}
#else
int get_user_pages()
{
        return old_get_user_pages()
}
#endif

> 
> thanks,
> linfeng 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
  2013-02-04 23:02       ` Zach Brown
  (?)
@ 2013-02-05  5:35         ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  5:35 UTC (permalink / raw)
  To: Zach Brown
  Cc: Jeff Moyer, akpm, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel, Tang chen, Gu Zheng

Hi Zach,

On 02/05/2013 07:02 AM, Zach Brown wrote:
>>> index 71f613c..0e9b30a 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>>>  	}
>>>  
>>>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
>>> +					info->mmap_base, nr_pages,
>>> +					1, 0, info->ring_pages, NULL);
>>> +#else
>>>  	info->nr_pages = get_user_pages(current, ctx->mm,
>>>  					info->mmap_base, nr_pages, 
>>>  					1, 0, info->ring_pages, NULL);
>>> +#endif
>>
>> Can't you hide this in your 1/1 patch, by providing this function as
>> just a static inline wrapper around get_user_pages when
>> CONFIG_MEMORY_HOTREMOVE is not enabled?
> 
> Yes, please.  Having callers duplicate the call site for a single
> optional boolean input is unacceptable.
I will deal with it in next version :)

> 
> But do we want another input argument as a name?  Should aio have been
> using get_user_pages_fast()? (and so now _fast_non_movable?)
> 
> I wonder if it's time to offer the booleans as a _flags() variant, much
> like the current internal flags for __get_user_pages().  The write and
> force arguments are already booleans, we have a different fast api, and
> now we're adding non-movable.  The NON_MOVABLE flag would be 0 without
> MEMORY_HOTREMOVE, easy peasy.
As my next reply-mail mentioned, IIUC in GUP case additional flags seems doesn't work,
 I abstract here:

As I debuged the get_user_pages(), I found that some pages is already there and may be
allocated before we call get_user_pages(). __get_user_pages() have following logic to
handle such case.
1786                         while (!(page = follow_page(vma, start, foll_flags))) {
1787                                 int ret;
To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
as smart as we want  , so I worked out the migration approach to get around and 
avoid messing up the current code.

And even worse we have already got *8* arguments...Maybe we have to rework the boolean 
arguments into bit flags... It seems not a little work :(
> 
> Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
> also be nice :).
Agree, maybe we could handle them later :)

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-05  5:35         ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  5:35 UTC (permalink / raw)
  To: Zach Brown
  Cc: Jeff Moyer, akpm, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel, Tang chen, Gu Zheng

Hi Zach,

On 02/05/2013 07:02 AM, Zach Brown wrote:
>>> index 71f613c..0e9b30a 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>>>  	}
>>>  
>>>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
>>> +					info->mmap_base, nr_pages,
>>> +					1, 0, info->ring_pages, NULL);
>>> +#else
>>>  	info->nr_pages = get_user_pages(current, ctx->mm,
>>>  					info->mmap_base, nr_pages, 
>>>  					1, 0, info->ring_pages, NULL);
>>> +#endif
>>
>> Can't you hide this in your 1/1 patch, by providing this function as
>> just a static inline wrapper around get_user_pages when
>> CONFIG_MEMORY_HOTREMOVE is not enabled?
> 
> Yes, please.  Having callers duplicate the call site for a single
> optional boolean input is unacceptable.
I will deal with it in next version :)

> 
> But do we want another input argument as a name?  Should aio have been
> using get_user_pages_fast()? (and so now _fast_non_movable?)
> 
> I wonder if it's time to offer the booleans as a _flags() variant, much
> like the current internal flags for __get_user_pages().  The write and
> force arguments are already booleans, we have a different fast api, and
> now we're adding non-movable.  The NON_MOVABLE flag would be 0 without
> MEMORY_HOTREMOVE, easy peasy.
As my next reply-mail mentioned, IIUC in GUP case additional flags seems doesn't work,
 I abstract here:

As I debuged the get_user_pages(), I found that some pages is already there and may be
allocated before we call get_user_pages(). __get_user_pages() have following logic to
handle such case.
1786                         while (!(page = follow_page(vma, start, foll_flags))) {
1787                                 int ret;
To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
as smart as we want  , so I worked out the migration approach to get around and 
avoid messing up the current code.

And even worse we have already got *8* arguments...Maybe we have to rework the boolean 
arguments into bit flags... It seems not a little work :(
> 
> Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
> also be nice :).
Agree, maybe we could handle them later :)

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove
@ 2013-02-05  5:35         ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  5:35 UTC (permalink / raw)
  To: Zach Brown
  Cc: Jeff Moyer, akpm, mgorman, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel, Tang chen, Gu Zheng

Hi Zach,

On 02/05/2013 07:02 AM, Zach Brown wrote:
>>> index 71f613c..0e9b30a 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx)
>>>  	}
>>>  
>>>  	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
>>> +					info->mmap_base, nr_pages,
>>> +					1, 0, info->ring_pages, NULL);
>>> +#else
>>>  	info->nr_pages = get_user_pages(current, ctx->mm,
>>>  					info->mmap_base, nr_pages, 
>>>  					1, 0, info->ring_pages, NULL);
>>> +#endif
>>
>> Can't you hide this in your 1/1 patch, by providing this function as
>> just a static inline wrapper around get_user_pages when
>> CONFIG_MEMORY_HOTREMOVE is not enabled?
> 
> Yes, please.  Having callers duplicate the call site for a single
> optional boolean input is unacceptable.
I will deal with it in next version :)

> 
> But do we want another input argument as a name?  Should aio have been
> using get_user_pages_fast()? (and so now _fast_non_movable?)
> 
> I wonder if it's time to offer the booleans as a _flags() variant, much
> like the current internal flags for __get_user_pages().  The write and
> force arguments are already booleans, we have a different fast api, and
> now we're adding non-movable.  The NON_MOVABLE flag would be 0 without
> MEMORY_HOTREMOVE, easy peasy.
As my next reply-mail mentioned, IIUC in GUP case additional flags seems doesn't work,
 I abstract here:

As I debuged the get_user_pages(), I found that some pages is already there and may be
allocated before we call get_user_pages(). __get_user_pages() have following logic to
handle such case.
1786                         while (!(page = follow_page(vma, start, foll_flags))) {
1787                                 int ret;
To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
as smart as we want  , so I worked out the migration approach to get around and 
avoid messing up the current code.

And even worse we have already got *8* arguments...Maybe we have to rework the boolean 
arguments into bit flags... It seems not a little work :(
> 
> Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
> also be nice :).
Agree, maybe we could handle them later :)

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
  2013-02-05  5:25       ` Minchan Kim
@ 2013-02-05  6:18         ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  6:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel



On 02/05/2013 01:25 PM, Minchan Kim wrote:
> Hi Lin,
> 
> On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
>> Hi Minchan,
>>
>> On 02/05/2013 08:58 AM, Minchan Kim wrote:
>>> Hello,
>>>
>>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
>>>> Currently get_user_pages() always tries to allocate pages from movable zone,
>>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
>>>> of get_user_pages() is easy to pin user pages for a long time(for now we found
>>>> that pages pinned as aio ring pages is such case), which is fatal for memory
>>>> hotplug/remove framework.
>>>>
>>>> So the 1st patch introduces a new library function called
>>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
>>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>>>> non-movable zone via additional page migration.
>>>>
>>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
>>>> get_user_pages() via using the new function. It only works when configed with
>>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
>>>
>>> CMA has same issue but the problem is the driver developers or any subsystem
>>> using GUP can't know their pages is in CMA area or not in advance.
>>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
>>> Even some driver module in embedded side doesn't open their source code.
>> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
>> of GUP, they will release the pinned pages immediately and to such users they should get
>> a good performance, using the old style interface is a smart way. And we had better just
>> deal with the cases we have to by using the new interface.
> 
> Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
> immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
> by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
> in advance they will use pinned pages long time or release in short time because it depends
> on some event like user's response which is very not predetermined.
> So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> failing migration by page count and then, investigate they are really using GUP and it's
> REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
> 
> Yes. it could be done but it would be rather trobulesome job.
Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is 
a big job.
If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the
*non_movable* version of GUP.
But in some server environment we want to keep the performance but also want to use hotremove
feature in case. Maybe patch the place as we need is a trade off for such support.

Mel also said in the last discussion:

On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote:
>> Well, that's a fairly low-level implementation detail.  A more typical
>> approach would be to add a new get_user_pages_non_movable() or such. 
>> That would probably have the same signature as get_user_pages(), with
>> one additional argument.  Then get_user_pages() becomes a one-line
>> wrapper which passes in a particular value of that argument.
>>
> 
> That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE
> allocations.  That will impact THP availability by increasing the number
> of MIGRATE_UNMOVABLE blocks that exist and it would hit every user --
> not just those that care about ZONE_MOVABLE.
> 
> I'm likely to NAK such a patch if it's only about node hot-remove because
> it's much more of a corner case than wanting to use THP.
> 
> I would prefer if get_user_pages() checked if the page it was about to
> pin was in ZONE_MOVABLE and if so, migrate it at that point before it's
> pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability
> if that's what they want. The CMA people might also want to take
> advantage of this if the page happened to be in the MIGRATE_CMA
> pageblock.
> 

So it may not a good idea that we all fall into calling the *non_movable* version of
GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?

> #ifdef CONFIG_MIGRATE_ISOLATE
> 
> int get_user_pages_non_movable()
> {
>         ..
>         old_get_user_pages()
>         ..
> }
> 
> int get_user_pages()
> {
>         return get_user_pages_non_movable();
> }
> #else
> int get_user_pages()
> {
>         return old_get_user_pages()
> }
> #endif


thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  6:18         ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  6:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel



On 02/05/2013 01:25 PM, Minchan Kim wrote:
> Hi Lin,
> 
> On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
>> Hi Minchan,
>>
>> On 02/05/2013 08:58 AM, Minchan Kim wrote:
>>> Hello,
>>>
>>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
>>>> Currently get_user_pages() always tries to allocate pages from movable zone,
>>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
>>>> of get_user_pages() is easy to pin user pages for a long time(for now we found
>>>> that pages pinned as aio ring pages is such case), which is fatal for memory
>>>> hotplug/remove framework.
>>>>
>>>> So the 1st patch introduces a new library function called
>>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
>>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>>>> non-movable zone via additional page migration.
>>>>
>>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
>>>> get_user_pages() via using the new function. It only works when configed with
>>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
>>>
>>> CMA has same issue but the problem is the driver developers or any subsystem
>>> using GUP can't know their pages is in CMA area or not in advance.
>>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
>>> Even some driver module in embedded side doesn't open their source code.
>> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
>> of GUP, they will release the pinned pages immediately and to such users they should get
>> a good performance, using the old style interface is a smart way. And we had better just
>> deal with the cases we have to by using the new interface.
> 
> Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
> immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
> by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
> in advance they will use pinned pages long time or release in short time because it depends
> on some event like user's response which is very not predetermined.
> So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> failing migration by page count and then, investigate they are really using GUP and it's
> REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
> 
> Yes. it could be done but it would be rather trobulesome job.
Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is 
a big job.
If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the
*non_movable* version of GUP.
But in some server environment we want to keep the performance but also want to use hotremove
feature in case. Maybe patch the place as we need is a trade off for such support.

Mel also said in the last discussion:

On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote:
>> Well, that's a fairly low-level implementation detail.  A more typical
>> approach would be to add a new get_user_pages_non_movable() or such. 
>> That would probably have the same signature as get_user_pages(), with
>> one additional argument.  Then get_user_pages() becomes a one-line
>> wrapper which passes in a particular value of that argument.
>>
> 
> That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE
> allocations.  That will impact THP availability by increasing the number
> of MIGRATE_UNMOVABLE blocks that exist and it would hit every user --
> not just those that care about ZONE_MOVABLE.
> 
> I'm likely to NAK such a patch if it's only about node hot-remove because
> it's much more of a corner case than wanting to use THP.
> 
> I would prefer if get_user_pages() checked if the page it was about to
> pin was in ZONE_MOVABLE and if so, migrate it at that point before it's
> pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability
> if that's what they want. The CMA people might also want to take
> advantage of this if the page happened to be in the MIGRATE_CMA
> pageblock.
> 

So it may not a good idea that we all fall into calling the *non_movable* version of
GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?

> #ifdef CONFIG_MIGRATE_ISOLATE
> 
> int get_user_pages_non_movable()
> {
>         ..
>         old_get_user_pages()
>         ..
> }
> 
> int get_user_pages()
> {
>         return get_user_pages_non_movable();
> }
> #else
> int get_user_pages()
> {
>         return old_get_user_pages()
> }
> #endif


thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
  2013-02-05  6:18         ` Lin Feng
  (?)
@ 2013-02-05  7:45           ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  7:45 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote:
> 
> 
> On 02/05/2013 01:25 PM, Minchan Kim wrote:
> > Hi Lin,
> > 
> > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
> >> Hi Minchan,
> >>
> >> On 02/05/2013 08:58 AM, Minchan Kim wrote:
> >>> Hello,
> >>>
> >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> >>>> Currently get_user_pages() always tries to allocate pages from movable zone,
> >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found
> >>>> that pages pinned as aio ring pages is such case), which is fatal for memory
> >>>> hotplug/remove framework.
> >>>>
> >>>> So the 1st patch introduces a new library function called
> >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> >>>> non-movable zone via additional page migration.
> >>>>
> >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> >>>> get_user_pages() via using the new function. It only works when configed with
> >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> >>>
> >>> CMA has same issue but the problem is the driver developers or any subsystem
> >>> using GUP can't know their pages is in CMA area or not in advance.
> >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> >>> Even some driver module in embedded side doesn't open their source code.
> >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
> >> of GUP, they will release the pinned pages immediately and to such users they should get
> >> a good performance, using the old style interface is a smart way. And we had better just
> >> deal with the cases we have to by using the new interface.
> > 
> > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
> > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
> > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
> > in advance they will use pinned pages long time or release in short time because it depends
> > on some event like user's response which is very not predetermined.
> > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> > failing migration by page count and then, investigate they are really using GUP and it's
> > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
> > 
> > Yes. it could be done but it would be rather trobulesome job.
> Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is 
> a big job.
> If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the
> *non_movable* version of GUP.
> But in some server environment we want to keep the performance but also want to use hotremove
> feature in case. Maybe patch the place as we need is a trade off for such support.
> 
> Mel also said in the last discussion:
> 
> On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote:
> >> Well, that's a fairly low-level implementation detail.  A more typical
> >> approach would be to add a new get_user_pages_non_movable() or such. 
> >> That would probably have the same signature as get_user_pages(), with
> >> one additional argument.  Then get_user_pages() becomes a one-line
> >> wrapper which passes in a particular value of that argument.
> >>
> > 
> > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE
> > allocations.  That will impact THP availability by increasing the number
> > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user --
> > not just those that care about ZONE_MOVABLE.
> > 
> > I'm likely to NAK such a patch if it's only about node hot-remove because
> > it's much more of a corner case than wanting to use THP.
> > 
> > I would prefer if get_user_pages() checked if the page it was about to
> > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's
> > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability
> > if that's what they want. The CMA people might also want to take
> > advantage of this if the page happened to be in the MIGRATE_CMA
> > pageblock.
> > 
> 
> So it may not a good idea that we all fall into calling the *non_movable* version of
> GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?

Frankly speaking, I can't understand Mel's comment.
AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
then migrate it out of movable zone and get_page again.
That's exactly what I want. It doesn't introduce GUP_NM.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  7:45           ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  7:45 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote:
> 
> 
> On 02/05/2013 01:25 PM, Minchan Kim wrote:
> > Hi Lin,
> > 
> > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
> >> Hi Minchan,
> >>
> >> On 02/05/2013 08:58 AM, Minchan Kim wrote:
> >>> Hello,
> >>>
> >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> >>>> Currently get_user_pages() always tries to allocate pages from movable zone,
> >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found
> >>>> that pages pinned as aio ring pages is such case), which is fatal for memory
> >>>> hotplug/remove framework.
> >>>>
> >>>> So the 1st patch introduces a new library function called
> >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> >>>> non-movable zone via additional page migration.
> >>>>
> >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> >>>> get_user_pages() via using the new function. It only works when configed with
> >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> >>>
> >>> CMA has same issue but the problem is the driver developers or any subsystem
> >>> using GUP can't know their pages is in CMA area or not in advance.
> >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> >>> Even some driver module in embedded side doesn't open their source code.
> >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
> >> of GUP, they will release the pinned pages immediately and to such users they should get
> >> a good performance, using the old style interface is a smart way. And we had better just
> >> deal with the cases we have to by using the new interface.
> > 
> > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
> > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
> > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
> > in advance they will use pinned pages long time or release in short time because it depends
> > on some event like user's response which is very not predetermined.
> > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> > failing migration by page count and then, investigate they are really using GUP and it's
> > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
> > 
> > Yes. it could be done but it would be rather trobulesome job.
> Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is 
> a big job.
> If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the
> *non_movable* version of GUP.
> But in some server environment we want to keep the performance but also want to use hotremove
> feature in case. Maybe patch the place as we need is a trade off for such support.
> 
> Mel also said in the last discussion:
> 
> On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote:
> >> Well, that's a fairly low-level implementation detail.  A more typical
> >> approach would be to add a new get_user_pages_non_movable() or such. 
> >> That would probably have the same signature as get_user_pages(), with
> >> one additional argument.  Then get_user_pages() becomes a one-line
> >> wrapper which passes in a particular value of that argument.
> >>
> > 
> > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE
> > allocations.  That will impact THP availability by increasing the number
> > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user --
> > not just those that care about ZONE_MOVABLE.
> > 
> > I'm likely to NAK such a patch if it's only about node hot-remove because
> > it's much more of a corner case than wanting to use THP.
> > 
> > I would prefer if get_user_pages() checked if the page it was about to
> > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's
> > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability
> > if that's what they want. The CMA people might also want to take
> > advantage of this if the page happened to be in the MIGRATE_CMA
> > pageblock.
> > 
> 
> So it may not a good idea that we all fall into calling the *non_movable* version of
> GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?

Frankly speaking, I can't understand Mel's comment.
AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
then migrate it out of movable zone and get_page again.
That's exactly what I want. It doesn't introduce GUP_NM.
-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  7:45           ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2013-02-05  7:45 UTC (permalink / raw)
  To: Lin Feng
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote:
> 
> 
> On 02/05/2013 01:25 PM, Minchan Kim wrote:
> > Hi Lin,
> > 
> > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote:
> >> Hi Minchan,
> >>
> >> On 02/05/2013 08:58 AM, Minchan Kim wrote:
> >>> Hello,
> >>>
> >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote:
> >>>> Currently get_user_pages() always tries to allocate pages from movable zone,
> >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
> >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found
> >>>> that pages pinned as aio ring pages is such case), which is fatal for memory
> >>>> hotplug/remove framework.
> >>>>
> >>>> So the 1st patch introduces a new library function called
> >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
> >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> >>>> non-movable zone via additional page migration.
> >>>>
> >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by
> >>>> get_user_pages() via using the new function. It only works when configed with
> >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages().
> >>>
> >>> CMA has same issue but the problem is the driver developers or any subsystem
> >>> using GUP can't know their pages is in CMA area or not in advance.
> >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well?
> >>> Even some driver module in embedded side doesn't open their source code.
> >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users
> >> of GUP, they will release the pinned pages immediately and to such users they should get
> >> a good performance, using the old style interface is a smart way. And we had better just
> >> deal with the cases we have to by using the new interface.
> > 
> > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages
> > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power
> > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know
> > in advance they will use pinned pages long time or release in short time because it depends
> > on some event like user's response which is very not predetermined.
> > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> > failing migration by page count and then, investigate they are really using GUP and it's
> > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
> > 
> > Yes. it could be done but it would be rather trobulesome job.
> Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is 
> a big job.
> If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the
> *non_movable* version of GUP.
> But in some server environment we want to keep the performance but also want to use hotremove
> feature in case. Maybe patch the place as we need is a trade off for such support.
> 
> Mel also said in the last discussion:
> 
> On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote:
> >> Well, that's a fairly low-level implementation detail.  A more typical
> >> approach would be to add a new get_user_pages_non_movable() or such. 
> >> That would probably have the same signature as get_user_pages(), with
> >> one additional argument.  Then get_user_pages() becomes a one-line
> >> wrapper which passes in a particular value of that argument.
> >>
> > 
> > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE
> > allocations.  That will impact THP availability by increasing the number
> > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user --
> > not just those that care about ZONE_MOVABLE.
> > 
> > I'm likely to NAK such a patch if it's only about node hot-remove because
> > it's much more of a corner case than wanting to use THP.
> > 
> > I would prefer if get_user_pages() checked if the page it was about to
> > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's
> > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability
> > if that's what they want. The CMA people might also want to take
> > advantage of this if the page happened to be in the MIGRATE_CMA
> > pageblock.
> > 
> 
> So it may not a good idea that we all fall into calling the *non_movable* version of
> GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?

Frankly speaking, I can't understand Mel's comment.
AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
then migrate it out of movable zone and get_page again.
That's exactly what I want. It doesn't introduce GUP_NM.
-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
  2013-02-05  7:45           ` Minchan Kim
  (?)
@ 2013-02-05  8:27             ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Minchan,

On 02/05/2013 03:45 PM, Minchan Kim wrote:
>> So it may not a good idea that we all fall into calling the *non_movable* version of
>> > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?
> Frankly speaking, I can't understand Mel's comment.
> AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
> then migrate it out of movable zone and get_page again.
> That's exactly what I want. It doesn't introduce GUP_NM.
Since an long time pin or not is an unpredictable behave except you know what the caller
wants to do. We have to check every time we call GUP, and GUP may need another parameter
 to teach itself to make the right decision? We have already got *8* parameters :(

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  8:27             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Minchan,

On 02/05/2013 03:45 PM, Minchan Kim wrote:
>> So it may not a good idea that we all fall into calling the *non_movable* version of
>> > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?
> Frankly speaking, I can't understand Mel's comment.
> AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
> then migrate it out of movable zone and get_page again.
> That's exactly what I want. It doesn't introduce GUP_NM.
Since an long time pin or not is an unpredictable behave except you know what the caller
wants to do. We have to check every time we call GUP, and GUP may need another parameter
 to teach itself to make the right decision? We have already got *8* parameters :(

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-05  8:27             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-05  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu,
	linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Minchan,

On 02/05/2013 03:45 PM, Minchan Kim wrote:
>> So it may not a good idea that we all fall into calling the *non_movable* version of
>> > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?
> Frankly speaking, I can't understand Mel's comment.
> AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
> then migrate it out of movable zone and get_page again.
> That's exactly what I want. It doesn't introduce GUP_NM.
Since an long time pin or not is an unpredictable behave except you know what the caller
wants to do. We have to check every time we call GUP, and GUP may need another parameter
 to teach itself to make the right decision? We have already got *8* parameters :(

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05  0:06     ` Andrew Morton
  (?)
@ 2013-02-05 11:57       ` Mel Gorman
  -1 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-05 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng <linfeng@cn.fujitsu.com> wrote:
> 
> > get_user_pages() always tries to allocate pages from movable zone, which is not
> >  reliable to memory hotremove framework in some case.
> > 
> > This patch introduces a new library function called get_user_pages_non_movable()
> >  to pin pages only from zone non-movable in memory.
> > It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > non-movable zone via additional page migration.
> > 
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  			struct page **pages, struct vm_area_struct **vmas);
> >  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  			struct page **pages);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> > +		unsigned long start, int nr_pages, int write, int force,
> > +		struct page **pages, struct vm_area_struct **vmas);
> > +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
> 

As an aside, if ifdefs *have* to be used then it often better to have a
pattern like

#ifdef CONFIG_MEMORY_HOTREMOVE
int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, int nr_pages, int write, int force,
		struct page **pages, struct vm_area_struct **vmas);
#else
static inline get_user_pages_non_movable(...)
{
	get_user_pages(...)
}
#endif

It eliminates the need for #ifdefs in the C file that calls
get_user_pages_non_movable().

> >  struct kvec;
> >  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >  			struct page **pages);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 73b64a3..5db811e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >  	return (idx == ZONE_NORMAL);
> >  }
> >  
> > +static inline int is_movable(struct zone *zone)
> > +{
> > +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> > +}
> 
> A better name would be zone_is_movable(). 

He is matching the names of the is_normal(), is_dma32() and is_dma() helpers.
Unfortunately I would expect a name like is_movable() to return true if the
page had either been allocated with __GFP_MOVABLE or if it was checking
migrate can currently handle the page. is_movable() is indeed a terrible
name. They should be renamed or deleted in preparation -- see below.

> We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
> And a neater implementation would be
> 
> 	return zone_idx(zone) == ZONE_MOVABLE;
> 

Yes.

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng <linfeng@cn.fujitsu.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif

*blinks*. Ok, while I think your version looks nicer, it is reverting
ddc81ed2 (remove sparse warning for mmzone.h). According to my version of
gcc at least, your patch reintroduces the sar and sparse complains if run as

make ARCH=i386 C=2 CF=-Wsparse-all 

.... this?

From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- implement zone_is_idx in an attempt to explain what the fluff in
  is_highmem is for

- rename is_highmem to zone_is_highmem as preparation for
  zone_is_movable()

- implement zone_is_movable as a helper for places that care about
  ZONE_MOVABLE

- Remove is_normal, is_dma32 and is_dma because apparently no one cares

- convert users of zone_idx(zone) == ZONE_FOO to appropriate
  zone_is_foo() helper

- Use is_highmem_idx() instead of is_highmem for the PageHighMem
  implementation. Should be functionally equivalent because we
  only care about the zones index, not exactly what zone it is

[akpm@linux-foundation.org: zone_idx() reimplementation]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/s390/mm/init.c        |    2 +-
 arch/tile/mm/init.c        |    5 +----
 arch/x86/mm/highmem_32.c   |    2 +-
 include/linux/mmzone.h     |   49 +++++++++++++++-----------------------------
 include/linux/page-flags.h |    2 +-
 kernel/power/snapshot.c    |   12 +++++------
 mm/page_alloc.c            |    6 +++---
 7 files changed, 30 insertions(+), 48 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 81e596c..c26c321 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 	for_each_zone(zone) {
-		if (zone_idx(zone) != ZONE_MOVABLE) {
+		if (!zone_is_movable(zone)) {
 			/* Add range within existing zone limits */
 			zone_start_pfn = zone->zone_start_pfn;
 			zone_end_pfn = zone->zone_start_pfn +
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..8287749 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void)
 	for_each_zone(z) {
 		unsigned long start, end;
 		int nid = z->zone_pgdat->node_id;
-#ifdef CONFIG_HIGHMEM
-		int idx = zone_idx(z);
-#endif
 
 		start = z->zone_start_pfn;
 		end = start + z->spanned_pages;
@@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void)
 		start = max(start, max_low_pfn);
 
 #ifdef CONFIG_HIGHMEM
-		if (idx == ZONE_HIGHMEM)
+		if (zone_is_highmem(z))
 			totalhigh_pages += z->spanned_pages;
 #endif
 		if (kdata_huge) {
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 6f31ee5..63020e7 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void)
 	for_each_zone(zone) {
 		unsigned long zone_start_pfn, zone_end_pfn;
 
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			continue;
 
 		zone_start_pfn = zone->zone_start_pfn;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..5ecef75 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; };
 unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 #endif
 
+static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
+{
+	/* This mess avoids a potentially expensive pointer subtraction. */
+	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
+	return zone_off == idx * sizeof(*zone);
+}
+
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+	return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx)
 #endif
 }
 
-static inline int is_normal_idx(enum zone_type idx)
-{
-	return (idx == ZONE_NORMAL);
-}
-
 /**
- * is_highmem - helper function to quickly check if a struct zone is a 
+ * zone_is_highmem - helper function to quickly check if a struct zone is a 
  *              highmem zone or not.  This is an attempt to keep references
  *              to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum.
  * @zone - pointer to struct zone variable
  */
-static inline int is_highmem(struct zone *zone)
+static inline bool zone_is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-		zone_movable_is_highmem());
-#else
-	return 0;
-#endif
-}
-
-static inline int is_normal(struct zone *zone)
-{
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
-}
-
-static inline int is_dma32(struct zone *zone)
-{
-#ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_is_idx(zone, ZONE_HIGHMEM) ||
+	       (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem());
 #else
 	return 0;
 #endif
 }
 
-static inline int is_dma(struct zone *zone)
+static inline bool zone_is_movable(struct zone *zone)
 {
-#ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
-#else
-	return 0;
-#endif
+	return zone_is_idx(zone, ZONE_MOVABLE);
 }
 
 /* These two functions are used to setup the per zone pages min values */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..2f541f1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
  * Must use a macro here due to header dependency issues. page_zone() is not
  * available at this point.
  */
-#define PageHighMem(__p) is_highmem(page_zone(__p))
+#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p))
 #else
 PAGEFLAG_FALSE(HighMem)
 #endif
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 0de2857..cbf0da9 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void)
 	unsigned int cnt = 0;
 
 	for_each_populated_zone(zone)
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			cnt += zone_page_state(zone, NR_FREE_PAGES);
 
 	return cnt;
@@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void)
 	for_each_populated_zone(zone) {
 		unsigned long pfn, max_zone_pfn;
 
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			continue;
 
 		mark_free_pages(zone);
@@ -943,7 +943,7 @@ static unsigned int count_data_pages(void)
 	unsigned int n = 0;
 
 	for_each_populated_zone(zone) {
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			continue;
 
 		mark_free_pages(zone);
@@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page)
 static inline struct page *
 page_is_saveable(struct zone *zone, unsigned long pfn)
 {
-	return is_highmem(zone) ?
+	return zone_is_highmem(zone) ?
 		saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn);
 }
 
@@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void)
 	size = 0;
 	for_each_populated_zone(zone) {
 		size += snapshot_additional_pages(zone);
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			highmem += zone_page_state(zone, NR_FREE_PAGES);
 		else
 			count += zone_page_state(zone, NR_FREE_PAGES);
@@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
 	unsigned int free = alloc_normal;
 
 	for_each_populated_zone(zone)
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			free += zone_page_state(zone, NR_FREE_PAGES);
 
 	nr_pages += count_pages_for_highmem(nr_highmem);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e208f0..9558341 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void)
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			lowmem_pages += zone->present_pages;
 	}
 
@@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone)) {
+		if (zone_is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
 			 * need highmem pages, so cap pages_min to a small
@@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count)
 	 * For avoiding noise data, lru_add_drain_all() should be called
 	 * If ZONE_MOVABLE, the zone never contains unmovable pages
 	 */
-	if (zone_idx(zone) == ZONE_MOVABLE)
+	if (zone_is_movable(zone))
 		return false;
 	mt = get_pageblock_migratetype(page);
 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))

> _
> 
> 
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -58,6 +58,8 @@
> >  #include <linux/elf.h>
> >  #include <linux/gfp.h>
> >  #include <linux/migrate.h>
> > +#include <linux/page-isolation.h>
> > +#include <linux/mm_inline.h>
> >  #include <linux/string.h>
> >  
> >  #include <asm/io.h>
> > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  }
> >  EXPORT_SYMBOL(get_user_pages);
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +/**
> > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > + * non-movable zone via additional page migration.
> > + */
> 
> This needs a description of why the function exists - say something
> about the requirements of memory hotplug.
> 
> Also a few words describing how the function works would be good.
> 
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> > +		unsigned long start, int nr_pages, int write, int force,
> > +		struct page **pages, struct vm_area_struct **vmas)
> > +{
> > +	int ret, i, isolate_err, migrate_pre_flag;

migrate_pre_flag should be bool and the name is a bit
unusual. migrate_prepped?

> > +	LIST_HEAD(pagelist);
> > +
> > +retry:
> > +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> > +				vmas);
> 
> We should handle (ret < 0) here.  At present the function will silently
> convert an error return into "return 0", which is bad.  The function
> does appear to otherwise do the right thing if get_user_pages() failed,
> but only due to good luck.
> 

A BUG_ON check if we retry more than once wouldn't hurt either. It requires
a broken implementation of alloc_migrate_target() but someone might "fix"
it and miss this.

A minor aside, it would be nice if we exited at this point if there was
no populated ZONE_MOVABLE in the system. There is a movable_zone global
variable already. If it was forced to be MAX_NR_ZONES before the call to
find_usable_zone_for_movable() in memory initialisation we should be able
to make a cheap check for it here.

> > +	isolate_err = 0;
> > +	migrate_pre_flag = 0;
> > +
> > +	for (i = 0; i < ret; i++) {
> > +		if (is_movable(page_zone(pages[i]))) {

This is a relatively expensive lookup. If my patch above is used then
you can add a is_movable_idx helper and then this becomes

if (is_movable_idx(page_zonenum(pages[i])))

which is cheaper.

> > +			if (!migrate_pre_flag) {
> > +				if (migrate_prep())
> > +					goto put_page;

CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
return failure. You could make this BUG_ON(migrate_prep()), remove this
goto and the migrate_pre_flag check below becomes redundant.

> > +				migrate_pre_flag = 1;
> > +			}
> > +
> > +			if (!isolate_lru_page(pages[i])) {
> > +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > +						 page_is_file_cache(pages[i]));
> > +				list_add_tail(&pages[i]->lru, &pagelist);
> > +			} else {
> > +				isolate_err = 1;
> > +				goto put_page;
> > +			}

isolate_lru_page() takes the LRU lock every time. If
get_user_pages_non_movable is used heavily then you may encounter lock
contention problems. Batching this lock would be a separate patch and should
not be necessary yet but you should at least comment on it as a reminder.

I think that a goto could also have been avoided here if you used break. The
i == ret check below would be false and it would just fall through.
Overall the flow would be a bit easier to follow.

Why list_add_tail()? I don't think it's wrong but it's unusual to see
list_add_tail() when list_add() is enough.

> > +		}
> > +	}
> > +
> > +	/* All pages are non movable, we are done :) */
> > +	if (i == ret && list_empty(&pagelist))
> > +		return ret;
> > +
> > +put_page:
> > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> > +	for (i = 0; i < ret; i++)
> > +		put_page(pages[i]);
> > +

release_pages.

That comment is insufficient. There are non-obvious consequences to this
logic. We are dropping pins on all pages regardless of what zone they
are in. If the subsequent migration fails then we end up returning 0
with no pages pinned. The user-visible effect is that io_setup() fails
for non-obvious reasons. It will return EAGAIN to userspace which will be
interpreted as "The specified nr_events exceeds the user's limit of available
events.". The application will either fail or potentially infinite loop
if the developer interpreted EAGAIN as "try again" as opposed to "this is
a permanent failure".

Is that deliberate? Is it really preferable that AIO can fail to setup
and the application exit just in case we want to hot-remove memory later?
Should a failed migration generate a WARN_ON at least?

I would think that it's better to WARN_ON_ONCE if migration fails but pin
the pages as requested. If a future memory hot-remove operation fails
then the warning will indicate why but applications will not fail as a
result. It's a little clumsy but the memory hot-remove failure message
could list what applications have pinned the pages that cannot be removed
so the administrator has the option of force-killing the application. It
is possible to discover what application is pinning a page from userspace
but it would involve an expensive search with /proc/kpagemap

> > +	if (migrate_pre_flag && !isolate_err) {
> > +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> > +					false, MIGRATE_SYNC, MR_SYSCALL);

The conversion of alloc_migrate_target is a bit problematic. It strips
the __GFP_MOVABLE flag and the consequence of this is that it converts
those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
number of pages, particularly if the number of get_user_pages_non_movable()
increases for short-lived pins like direct IO.

One way around this is to add a high_zoneidx parameter to
__alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
as the high_zoneidx and create a new migration allocation function that
passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE

> > +		/* Steal pages from non-movable zone successfully? */
> > +		if (!ret)
> > +			goto retry;
> 
> This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).
> 

page_list should be empty if ret == 0.

> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
> 

On successful migration, the pages are put back on the LRU at the end of
unmap_and_move(). On migration failure the caller is responsible for putting
the pages back on the LRU with putback_lru_pages() which happens below.

> > +	}
> > +
> > +	putback_lru_pages(&pagelist);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(get_user_pages_non_movable);
> > +#endif
> > +
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05 11:57       ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-05 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng <linfeng@cn.fujitsu.com> wrote:
> 
> > get_user_pages() always tries to allocate pages from movable zone, which is not
> >  reliable to memory hotremove framework in some case.
> > 
> > This patch introduces a new library function called get_user_pages_non_movable()
> >  to pin pages only from zone non-movable in memory.
> > It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > non-movable zone via additional page migration.
> > 
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  			struct page **pages, struct vm_area_struct **vmas);
> >  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  			struct page **pages);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> > +		unsigned long start, int nr_pages, int write, int force,
> > +		struct page **pages, struct vm_area_struct **vmas);
> > +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
> 

As an aside, if ifdefs *have* to be used then it often better to have a
pattern like

#ifdef CONFIG_MEMORY_HOTREMOVE
int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, int nr_pages, int write, int force,
		struct page **pages, struct vm_area_struct **vmas);
#else
static inline get_user_pages_non_movable(...)
{
	get_user_pages(...)
}
#endif

It eliminates the need for #ifdefs in the C file that calls
get_user_pages_non_movable().

> >  struct kvec;
> >  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >  			struct page **pages);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 73b64a3..5db811e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >  	return (idx == ZONE_NORMAL);
> >  }
> >  
> > +static inline int is_movable(struct zone *zone)
> > +{
> > +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> > +}
> 
> A better name would be zone_is_movable(). 

He is matching the names of the is_normal(), is_dma32() and is_dma() helpers.
Unfortunately I would expect a name like is_movable() to return true if the
page had either been allocated with __GFP_MOVABLE or if it was checking
migrate can currently handle the page. is_movable() is indeed a terrible
name. They should be renamed or deleted in preparation -- see below.

> We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
> And a neater implementation would be
> 
> 	return zone_idx(zone) == ZONE_MOVABLE;
> 

Yes.

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng <linfeng@cn.fujitsu.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif

*blinks*. Ok, while I think your version looks nicer, it is reverting
ddc81ed2 (remove sparse warning for mmzone.h). According to my version of
gcc at least, your patch reintroduces the sar and sparse complains if run as

make ARCH=i386 C=2 CF=-Wsparse-all 

.... this?

From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- implement zone_is_idx in an attempt to explain what the fluff in
  is_highmem is for

- rename is_highmem to zone_is_highmem as preparation for
  zone_is_movable()

- implement zone_is_movable as a helper for places that care about
  ZONE_MOVABLE

- Remove is_normal, is_dma32 and is_dma because apparently no one cares

- convert users of zone_idx(zone) == ZONE_FOO to appropriate
  zone_is_foo() helper

- Use is_highmem_idx() instead of is_highmem for the PageHighMem
  implementation. Should be functionally equivalent because we
  only care about the zones index, not exactly what zone it is

[akpm@linux-foundation.org: zone_idx() reimplementation]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/s390/mm/init.c        |    2 +-
 arch/tile/mm/init.c        |    5 +----
 arch/x86/mm/highmem_32.c   |    2 +-
 include/linux/mmzone.h     |   49 +++++++++++++++-----------------------------
 include/linux/page-flags.h |    2 +-
 kernel/power/snapshot.c    |   12 +++++------
 mm/page_alloc.c            |    6 +++---
 7 files changed, 30 insertions(+), 48 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 81e596c..c26c321 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 	for_each_zone(zone) {
-		if (zone_idx(zone) != ZONE_MOVABLE) {
+		if (!zone_is_movable(zone)) {
 			/* Add range within existing zone limits */
 			zone_start_pfn = zone->zone_start_pfn;
 			zone_end_pfn = zone->zone_start_pfn +
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..8287749 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void)
 	for_each_zone(z) {
 		unsigned long start, end;
 		int nid = z->zone_pgdat->node_id;
-#ifdef CONFIG_HIGHMEM
-		int idx = zone_idx(z);
-#endif
 
 		start = z->zone_start_pfn;
 		end = start + z->spanned_pages;
@@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void)
 		start = max(start, max_low_pfn);
 
 #ifdef CONFIG_HIGHMEM
-		if (idx == ZONE_HIGHMEM)
+		if (zone_is_highmem(z))
 			totalhigh_pages += z->spanned_pages;
 #endif
 		if (kdata_huge) {
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 6f31ee5..63020e7 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void)
 	for_each_zone(zone) {
 		unsigned long zone_start_pfn, zone_end_pfn;
 
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			continue;
 
 		zone_start_pfn = zone->zone_start_pfn;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..5ecef75 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; };
 unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 #endif
 
+static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
+{
+	/* This mess avoids a potentially expensive pointer subtraction. */
+	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
+	return zone_off == idx * sizeof(*zone);
+}
+
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+	return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx)
 #endif
 }
 
-static inline int is_normal_idx(enum zone_type idx)
-{
-	return (idx == ZONE_NORMAL);
-}
-
 /**
- * is_highmem - helper function to quickly check if a struct zone is a 
+ * zone_is_highmem - helper function to quickly check if a struct zone is a 
  *              highmem zone or not.  This is an attempt to keep references
  *              to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum.
  * @zone - pointer to struct zone variable
  */
-static inline int is_highmem(struct zone *zone)
+static inline bool zone_is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-		zone_movable_is_highmem());
-#else
-	return 0;
-#endif
-}
-
-static inline int is_normal(struct zone *zone)
-{
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
-}
-
-static inline int is_dma32(struct zone *zone)
-{
-#ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_is_idx(zone, ZONE_HIGHMEM) ||
+	       (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem());
 #else
 	return 0;
 #endif
 }
 
-static inline int is_dma(struct zone *zone)
+static inline bool zone_is_movable(struct zone *zone)
 {
-#ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
-#else
-	return 0;
-#endif
+	return zone_is_idx(zone, ZONE_MOVABLE);
 }
 
 /* These two functions are used to setup the per zone pages min values */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..2f541f1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
  * Must use a macro here due to header dependency issues. page_zone() is not
  * available at this point.
  */
-#define PageHighMem(__p) is_highmem(page_zone(__p))
+#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p))
 #else
 PAGEFLAG_FALSE(HighMem)
 #endif
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 0de2857..cbf0da9 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void)
 	unsigned int cnt = 0;
 
 	for_each_populated_zone(zone)
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			cnt += zone_page_state(zone, NR_FREE_PAGES);
 
 	return cnt;
@@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void)
 	for_each_populated_zone(zone) {
 		unsigned long pfn, max_zone_pfn;
 
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			continue;
 
 		mark_free_pages(zone);
@@ -943,7 +943,7 @@ static unsigned int count_data_pages(void)
 	unsigned int n = 0;
 
 	for_each_populated_zone(zone) {
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			continue;
 
 		mark_free_pages(zone);
@@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page)
 static inline struct page *
 page_is_saveable(struct zone *zone, unsigned long pfn)
 {
-	return is_highmem(zone) ?
+	return zone_is_highmem(zone) ?
 		saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn);
 }
 
@@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void)
 	size = 0;
 	for_each_populated_zone(zone) {
 		size += snapshot_additional_pages(zone);
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			highmem += zone_page_state(zone, NR_FREE_PAGES);
 		else
 			count += zone_page_state(zone, NR_FREE_PAGES);
@@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
 	unsigned int free = alloc_normal;
 
 	for_each_populated_zone(zone)
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			free += zone_page_state(zone, NR_FREE_PAGES);
 
 	nr_pages += count_pages_for_highmem(nr_highmem);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e208f0..9558341 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void)
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			lowmem_pages += zone->present_pages;
 	}
 
@@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone)) {
+		if (zone_is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
 			 * need highmem pages, so cap pages_min to a small
@@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count)
 	 * For avoiding noise data, lru_add_drain_all() should be called
 	 * If ZONE_MOVABLE, the zone never contains unmovable pages
 	 */
-	if (zone_idx(zone) == ZONE_MOVABLE)
+	if (zone_is_movable(zone))
 		return false;
 	mt = get_pageblock_migratetype(page);
 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))

> _
> 
> 
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -58,6 +58,8 @@
> >  #include <linux/elf.h>
> >  #include <linux/gfp.h>
> >  #include <linux/migrate.h>
> > +#include <linux/page-isolation.h>
> > +#include <linux/mm_inline.h>
> >  #include <linux/string.h>
> >  
> >  #include <asm/io.h>
> > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  }
> >  EXPORT_SYMBOL(get_user_pages);
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +/**
> > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > + * non-movable zone via additional page migration.
> > + */
> 
> This needs a description of why the function exists - say something
> about the requirements of memory hotplug.
> 
> Also a few words describing how the function works would be good.
> 
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> > +		unsigned long start, int nr_pages, int write, int force,
> > +		struct page **pages, struct vm_area_struct **vmas)
> > +{
> > +	int ret, i, isolate_err, migrate_pre_flag;

migrate_pre_flag should be bool and the name is a bit
unusual. migrate_prepped?

> > +	LIST_HEAD(pagelist);
> > +
> > +retry:
> > +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> > +				vmas);
> 
> We should handle (ret < 0) here.  At present the function will silently
> convert an error return into "return 0", which is bad.  The function
> does appear to otherwise do the right thing if get_user_pages() failed,
> but only due to good luck.
> 

A BUG_ON check if we retry more than once wouldn't hurt either. It requires
a broken implementation of alloc_migrate_target() but someone might "fix"
it and miss this.

A minor aside, it would be nice if we exited at this point if there was
no populated ZONE_MOVABLE in the system. There is a movable_zone global
variable already. If it was forced to be MAX_NR_ZONES before the call to
find_usable_zone_for_movable() in memory initialisation we should be able
to make a cheap check for it here.

> > +	isolate_err = 0;
> > +	migrate_pre_flag = 0;
> > +
> > +	for (i = 0; i < ret; i++) {
> > +		if (is_movable(page_zone(pages[i]))) {

This is a relatively expensive lookup. If my patch above is used then
you can add a is_movable_idx helper and then this becomes

if (is_movable_idx(page_zonenum(pages[i])))

which is cheaper.

> > +			if (!migrate_pre_flag) {
> > +				if (migrate_prep())
> > +					goto put_page;

CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
return failure. You could make this BUG_ON(migrate_prep()), remove this
goto and the migrate_pre_flag check below becomes redundant.

> > +				migrate_pre_flag = 1;
> > +			}
> > +
> > +			if (!isolate_lru_page(pages[i])) {
> > +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > +						 page_is_file_cache(pages[i]));
> > +				list_add_tail(&pages[i]->lru, &pagelist);
> > +			} else {
> > +				isolate_err = 1;
> > +				goto put_page;
> > +			}

isolate_lru_page() takes the LRU lock every time. If
get_user_pages_non_movable is used heavily then you may encounter lock
contention problems. Batching this lock would be a separate patch and should
not be necessary yet but you should at least comment on it as a reminder.

I think that a goto could also have been avoided here if you used break. The
i == ret check below would be false and it would just fall through.
Overall the flow would be a bit easier to follow.

Why list_add_tail()? I don't think it's wrong but it's unusual to see
list_add_tail() when list_add() is enough.

> > +		}
> > +	}
> > +
> > +	/* All pages are non movable, we are done :) */
> > +	if (i == ret && list_empty(&pagelist))
> > +		return ret;
> > +
> > +put_page:
> > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> > +	for (i = 0; i < ret; i++)
> > +		put_page(pages[i]);
> > +

release_pages.

That comment is insufficient. There are non-obvious consequences to this
logic. We are dropping pins on all pages regardless of what zone they
are in. If the subsequent migration fails then we end up returning 0
with no pages pinned. The user-visible effect is that io_setup() fails
for non-obvious reasons. It will return EAGAIN to userspace which will be
interpreted as "The specified nr_events exceeds the user's limit of available
events.". The application will either fail or potentially infinite loop
if the developer interpreted EAGAIN as "try again" as opposed to "this is
a permanent failure".

Is that deliberate? Is it really preferable that AIO can fail to setup
and the application exit just in case we want to hot-remove memory later?
Should a failed migration generate a WARN_ON at least?

I would think that it's better to WARN_ON_ONCE if migration fails but pin
the pages as requested. If a future memory hot-remove operation fails
then the warning will indicate why but applications will not fail as a
result. It's a little clumsy but the memory hot-remove failure message
could list what applications have pinned the pages that cannot be removed
so the administrator has the option of force-killing the application. It
is possible to discover what application is pinning a page from userspace
but it would involve an expensive search with /proc/kpagemap

> > +	if (migrate_pre_flag && !isolate_err) {
> > +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> > +					false, MIGRATE_SYNC, MR_SYSCALL);

The conversion of alloc_migrate_target is a bit problematic. It strips
the __GFP_MOVABLE flag and the consequence of this is that it converts
those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
number of pages, particularly if the number of get_user_pages_non_movable()
increases for short-lived pins like direct IO.

One way around this is to add a high_zoneidx parameter to
__alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
as the high_zoneidx and create a new migration allocation function that
passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE

> > +		/* Steal pages from non-movable zone successfully? */
> > +		if (!ret)
> > +			goto retry;
> 
> This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).
> 

page_list should be empty if ret == 0.

> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
> 

On successful migration, the pages are put back on the LRU at the end of
unmap_and_move(). On migration failure the caller is responsible for putting
the pages back on the LRU with putback_lru_pages() which happens below.

> > +	}
> > +
> > +	putback_lru_pages(&pagelist);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(get_user_pages_non_movable);
> > +#endif
> > +
> 

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05 11:57       ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-05 11:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng <linfeng@cn.fujitsu.com> wrote:
> 
> > get_user_pages() always tries to allocate pages from movable zone, which is not
> >  reliable to memory hotremove framework in some case.
> > 
> > This patch introduces a new library function called get_user_pages_non_movable()
> >  to pin pages only from zone non-movable in memory.
> > It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > non-movable zone via additional page migration.
> > 
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  			struct page **pages, struct vm_area_struct **vmas);
> >  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  			struct page **pages);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> > +		unsigned long start, int nr_pages, int write, int force,
> > +		struct page **pages, struct vm_area_struct **vmas);
> > +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
> 

As an aside, if ifdefs *have* to be used then it often better to have a
pattern like

#ifdef CONFIG_MEMORY_HOTREMOVE
int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, int nr_pages, int write, int force,
		struct page **pages, struct vm_area_struct **vmas);
#else
static inline get_user_pages_non_movable(...)
{
	get_user_pages(...)
}
#endif

It eliminates the need for #ifdefs in the C file that calls
get_user_pages_non_movable().

> >  struct kvec;
> >  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >  			struct page **pages);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 73b64a3..5db811e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >  	return (idx == ZONE_NORMAL);
> >  }
> >  
> > +static inline int is_movable(struct zone *zone)
> > +{
> > +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> > +}
> 
> A better name would be zone_is_movable(). 

He is matching the names of the is_normal(), is_dma32() and is_dma() helpers.
Unfortunately I would expect a name like is_movable() to return true if the
page had either been allocated with __GFP_MOVABLE or if it was checking
migrate can currently handle the page. is_movable() is indeed a terrible
name. They should be renamed or deleted in preparation -- see below.

> We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
> And a neater implementation would be
> 
> 	return zone_idx(zone) == ZONE_MOVABLE;
> 

Yes.

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng <linfeng@cn.fujitsu.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif

*blinks*. Ok, while I think your version looks nicer, it is reverting
ddc81ed2 (remove sparse warning for mmzone.h). According to my version of
gcc at least, your patch reintroduces the sar and sparse complains if run as

make ARCH=i386 C=2 CF=-Wsparse-all 

.... this?

From: Mel Gorman <mgorman@suse.de>
Subject: [PATCH] include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- implement zone_is_idx in an attempt to explain what the fluff in
  is_highmem is for

- rename is_highmem to zone_is_highmem as preparation for
  zone_is_movable()

- implement zone_is_movable as a helper for places that care about
  ZONE_MOVABLE

- Remove is_normal, is_dma32 and is_dma because apparently no one cares

- convert users of zone_idx(zone) == ZONE_FOO to appropriate
  zone_is_foo() helper

- Use is_highmem_idx() instead of is_highmem for the PageHighMem
  implementation. Should be functionally equivalent because we
  only care about the zones index, not exactly what zone it is

[akpm@linux-foundation.org: zone_idx() reimplementation]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/s390/mm/init.c        |    2 +-
 arch/tile/mm/init.c        |    5 +----
 arch/x86/mm/highmem_32.c   |    2 +-
 include/linux/mmzone.h     |   49 +++++++++++++++-----------------------------
 include/linux/page-flags.h |    2 +-
 kernel/power/snapshot.c    |   12 +++++------
 mm/page_alloc.c            |    6 +++---
 7 files changed, 30 insertions(+), 48 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 81e596c..c26c321 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 	for_each_zone(zone) {
-		if (zone_idx(zone) != ZONE_MOVABLE) {
+		if (!zone_is_movable(zone)) {
 			/* Add range within existing zone limits */
 			zone_start_pfn = zone->zone_start_pfn;
 			zone_end_pfn = zone->zone_start_pfn +
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..8287749 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void)
 	for_each_zone(z) {
 		unsigned long start, end;
 		int nid = z->zone_pgdat->node_id;
-#ifdef CONFIG_HIGHMEM
-		int idx = zone_idx(z);
-#endif
 
 		start = z->zone_start_pfn;
 		end = start + z->spanned_pages;
@@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void)
 		start = max(start, max_low_pfn);
 
 #ifdef CONFIG_HIGHMEM
-		if (idx == ZONE_HIGHMEM)
+		if (zone_is_highmem(z))
 			totalhigh_pages += z->spanned_pages;
 #endif
 		if (kdata_huge) {
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 6f31ee5..63020e7 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void)
 	for_each_zone(zone) {
 		unsigned long zone_start_pfn, zone_end_pfn;
 
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			continue;
 
 		zone_start_pfn = zone->zone_start_pfn;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..5ecef75 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; };
 unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 #endif
 
+static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
+{
+	/* This mess avoids a potentially expensive pointer subtraction. */
+	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
+	return zone_off == idx * sizeof(*zone);
+}
+
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+	return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx)
 #endif
 }
 
-static inline int is_normal_idx(enum zone_type idx)
-{
-	return (idx == ZONE_NORMAL);
-}
-
 /**
- * is_highmem - helper function to quickly check if a struct zone is a 
+ * zone_is_highmem - helper function to quickly check if a struct zone is a 
  *              highmem zone or not.  This is an attempt to keep references
  *              to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum.
  * @zone - pointer to struct zone variable
  */
-static inline int is_highmem(struct zone *zone)
+static inline bool zone_is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-		zone_movable_is_highmem());
-#else
-	return 0;
-#endif
-}
-
-static inline int is_normal(struct zone *zone)
-{
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
-}
-
-static inline int is_dma32(struct zone *zone)
-{
-#ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_is_idx(zone, ZONE_HIGHMEM) ||
+	       (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem());
 #else
 	return 0;
 #endif
 }
 
-static inline int is_dma(struct zone *zone)
+static inline bool zone_is_movable(struct zone *zone)
 {
-#ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
-#else
-	return 0;
-#endif
+	return zone_is_idx(zone, ZONE_MOVABLE);
 }
 
 /* These two functions are used to setup the per zone pages min values */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..2f541f1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim)		/* Reminder to do async read-ahead */
  * Must use a macro here due to header dependency issues. page_zone() is not
  * available at this point.
  */
-#define PageHighMem(__p) is_highmem(page_zone(__p))
+#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p))
 #else
 PAGEFLAG_FALSE(HighMem)
 #endif
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 0de2857..cbf0da9 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void)
 	unsigned int cnt = 0;
 
 	for_each_populated_zone(zone)
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			cnt += zone_page_state(zone, NR_FREE_PAGES);
 
 	return cnt;
@@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void)
 	for_each_populated_zone(zone) {
 		unsigned long pfn, max_zone_pfn;
 
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			continue;
 
 		mark_free_pages(zone);
@@ -943,7 +943,7 @@ static unsigned int count_data_pages(void)
 	unsigned int n = 0;
 
 	for_each_populated_zone(zone) {
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			continue;
 
 		mark_free_pages(zone);
@@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page)
 static inline struct page *
 page_is_saveable(struct zone *zone, unsigned long pfn)
 {
-	return is_highmem(zone) ?
+	return zone_is_highmem(zone) ?
 		saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn);
 }
 
@@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void)
 	size = 0;
 	for_each_populated_zone(zone) {
 		size += snapshot_additional_pages(zone);
-		if (is_highmem(zone))
+		if (zone_is_highmem(zone))
 			highmem += zone_page_state(zone, NR_FREE_PAGES);
 		else
 			count += zone_page_state(zone, NR_FREE_PAGES);
@@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
 	unsigned int free = alloc_normal;
 
 	for_each_populated_zone(zone)
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			free += zone_page_state(zone, NR_FREE_PAGES);
 
 	nr_pages += count_pages_for_highmem(nr_highmem);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e208f0..9558341 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void)
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!zone_is_highmem(zone))
 			lowmem_pages += zone->present_pages;
 	}
 
@@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->present_pages;
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone)) {
+		if (zone_is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
 			 * need highmem pages, so cap pages_min to a small
@@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count)
 	 * For avoiding noise data, lru_add_drain_all() should be called
 	 * If ZONE_MOVABLE, the zone never contains unmovable pages
 	 */
-	if (zone_idx(zone) == ZONE_MOVABLE)
+	if (zone_is_movable(zone))
 		return false;
 	mt = get_pageblock_migratetype(page);
 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))

> _
> 
> 
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -58,6 +58,8 @@
> >  #include <linux/elf.h>
> >  #include <linux/gfp.h>
> >  #include <linux/migrate.h>
> > +#include <linux/page-isolation.h>
> > +#include <linux/mm_inline.h>
> >  #include <linux/string.h>
> >  
> >  #include <asm/io.h>
> > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  }
> >  EXPORT_SYMBOL(get_user_pages);
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +/**
> > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > + * non-movable zone via additional page migration.
> > + */
> 
> This needs a description of why the function exists - say something
> about the requirements of memory hotplug.
> 
> Also a few words describing how the function works would be good.
> 
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> > +		unsigned long start, int nr_pages, int write, int force,
> > +		struct page **pages, struct vm_area_struct **vmas)
> > +{
> > +	int ret, i, isolate_err, migrate_pre_flag;

migrate_pre_flag should be bool and the name is a bit
unusual. migrate_prepped?

> > +	LIST_HEAD(pagelist);
> > +
> > +retry:
> > +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> > +				vmas);
> 
> We should handle (ret < 0) here.  At present the function will silently
> convert an error return into "return 0", which is bad.  The function
> does appear to otherwise do the right thing if get_user_pages() failed,
> but only due to good luck.
> 

A BUG_ON check if we retry more than once wouldn't hurt either. It requires
a broken implementation of alloc_migrate_target() but someone might "fix"
it and miss this.

A minor aside, it would be nice if we exited at this point if there was
no populated ZONE_MOVABLE in the system. There is a movable_zone global
variable already. If it was forced to be MAX_NR_ZONES before the call to
find_usable_zone_for_movable() in memory initialisation we should be able
to make a cheap check for it here.

> > +	isolate_err = 0;
> > +	migrate_pre_flag = 0;
> > +
> > +	for (i = 0; i < ret; i++) {
> > +		if (is_movable(page_zone(pages[i]))) {

This is a relatively expensive lookup. If my patch above is used then
you can add a is_movable_idx helper and then this becomes

if (is_movable_idx(page_zonenum(pages[i])))

which is cheaper.

> > +			if (!migrate_pre_flag) {
> > +				if (migrate_prep())
> > +					goto put_page;

CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
return failure. You could make this BUG_ON(migrate_prep()), remove this
goto and the migrate_pre_flag check below becomes redundant.

> > +				migrate_pre_flag = 1;
> > +			}
> > +
> > +			if (!isolate_lru_page(pages[i])) {
> > +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > +						 page_is_file_cache(pages[i]));
> > +				list_add_tail(&pages[i]->lru, &pagelist);
> > +			} else {
> > +				isolate_err = 1;
> > +				goto put_page;
> > +			}

isolate_lru_page() takes the LRU lock every time. If
get_user_pages_non_movable is used heavily then you may encounter lock
contention problems. Batching this lock would be a separate patch and should
not be necessary yet but you should at least comment on it as a reminder.

I think that a goto could also have been avoided here if you used break. The
i == ret check below would be false and it would just fall through.
Overall the flow would be a bit easier to follow.

Why list_add_tail()? I don't think it's wrong but it's unusual to see
list_add_tail() when list_add() is enough.

> > +		}
> > +	}
> > +
> > +	/* All pages are non movable, we are done :) */
> > +	if (i == ret && list_empty(&pagelist))
> > +		return ret;
> > +
> > +put_page:
> > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> > +	for (i = 0; i < ret; i++)
> > +		put_page(pages[i]);
> > +

release_pages.

That comment is insufficient. There are non-obvious consequences to this
logic. We are dropping pins on all pages regardless of what zone they
are in. If the subsequent migration fails then we end up returning 0
with no pages pinned. The user-visible effect is that io_setup() fails
for non-obvious reasons. It will return EAGAIN to userspace which will be
interpreted as "The specified nr_events exceeds the user's limit of available
events.". The application will either fail or potentially infinite loop
if the developer interpreted EAGAIN as "try again" as opposed to "this is
a permanent failure".

Is that deliberate? Is it really preferable that AIO can fail to setup
and the application exit just in case we want to hot-remove memory later?
Should a failed migration generate a WARN_ON at least?

I would think that it's better to WARN_ON_ONCE if migration fails but pin
the pages as requested. If a future memory hot-remove operation fails
then the warning will indicate why but applications will not fail as a
result. It's a little clumsy but the memory hot-remove failure message
could list what applications have pinned the pages that cannot be removed
so the administrator has the option of force-killing the application. It
is possible to discover what application is pinning a page from userspace
but it would involve an expensive search with /proc/kpagemap

> > +	if (migrate_pre_flag && !isolate_err) {
> > +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> > +					false, MIGRATE_SYNC, MR_SYSCALL);

The conversion of alloc_migrate_target is a bit problematic. It strips
the __GFP_MOVABLE flag and the consequence of this is that it converts
those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
number of pages, particularly if the number of get_user_pages_non_movable()
increases for short-lived pins like direct IO.

One way around this is to add a high_zoneidx parameter to
__alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
as the high_zoneidx and create a new migration allocation function that
passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE

> > +		/* Steal pages from non-movable zone successfully? */
> > +		if (!ret)
> > +			goto retry;
> 
> This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).
> 

page_list should be empty if ret == 0.

> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
> 

On successful migration, the pages are put back on the LRU at the end of
unmap_and_move(). On migration failure the caller is responsible for putting
the pages back on the LRU with putback_lru_pages() which happens below.

> > +	}
> > +
> > +	putback_lru_pages(&pagelist);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(get_user_pages_non_movable);
> > +#endif
> > +
> 

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05 11:57       ` Mel Gorman
  (?)
@ 2013-02-05 13:32         ` Mel Gorman
  -1 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-05 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
> 
> > > +				migrate_pre_flag = 1;
> > > +			}
> > > +
> > > +			if (!isolate_lru_page(pages[i])) {
> > > +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > > +						 page_is_file_cache(pages[i]));
> > > +				list_add_tail(&pages[i]->lru, &pagelist);
> > > +			} else {
> > > +				isolate_err = 1;
> > > +				goto put_page;
> > > +			}
> 
> isolate_lru_page() takes the LRU lock every time.

Credit to Michal Hocko for bringing this up but with the number of
other issues I missed that this is also broken with respect to huge page
handling. hugetlbfs pages will not be on the LRU so the isolation will mess
up and the migration has to be handled differently.  Ordinarily hugetlbfs
pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
encounters a hugetlbfs page, it'll just blow up.

The other is that this almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok
because the caller only cares about the physical address. Migration will
also split a hugepage if it receives it but you are potentially adding
tail pages to a list here and then migrating them. The split of the first
page will get very confused. I'm not exactly sure what the result will be
but it won't be pretty.

Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
during testing?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05 13:32         ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-05 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
> 
> > > +				migrate_pre_flag = 1;
> > > +			}
> > > +
> > > +			if (!isolate_lru_page(pages[i])) {
> > > +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > > +						 page_is_file_cache(pages[i]));
> > > +				list_add_tail(&pages[i]->lru, &pagelist);
> > > +			} else {
> > > +				isolate_err = 1;
> > > +				goto put_page;
> > > +			}
> 
> isolate_lru_page() takes the LRU lock every time.

Credit to Michal Hocko for bringing this up but with the number of
other issues I missed that this is also broken with respect to huge page
handling. hugetlbfs pages will not be on the LRU so the isolation will mess
up and the migration has to be handled differently.  Ordinarily hugetlbfs
pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
encounters a hugetlbfs page, it'll just blow up.

The other is that this almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok
because the caller only cares about the physical address. Migration will
also split a hugepage if it receives it but you are potentially adding
tail pages to a list here and then migrating them. The split of the first
page will get very confused. I'm not exactly sure what the result will be
but it won't be pretty.

Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
during testing?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05 13:32         ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-05 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
> 
> > > +				migrate_pre_flag = 1;
> > > +			}
> > > +
> > > +			if (!isolate_lru_page(pages[i])) {
> > > +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > > +						 page_is_file_cache(pages[i]));
> > > +				list_add_tail(&pages[i]->lru, &pagelist);
> > > +			} else {
> > > +				isolate_err = 1;
> > > +				goto put_page;
> > > +			}
> 
> isolate_lru_page() takes the LRU lock every time.

Credit to Michal Hocko for bringing this up but with the number of
other issues I missed that this is also broken with respect to huge page
handling. hugetlbfs pages will not be on the LRU so the isolation will mess
up and the migration has to be handled differently.  Ordinarily hugetlbfs
pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
encounters a hugetlbfs page, it'll just blow up.

The other is that this almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok
because the caller only cares about the physical address. Migration will
also split a hugepage if it receives it but you are potentially adding
tail pages to a list here and then migrating them. The split of the first
page will get very confused. I'm not exactly sure what the result will be
but it won't be pretty.

Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
during testing?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05  3:09       ` Lin Feng
  (?)
@ 2013-02-05 21:13         ` Andrew Morton
  -1 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05 21:13 UTC (permalink / raw)
  To: Lin Feng
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

On Tue, 05 Feb 2013 11:09:27 +0800
Lin Feng <linfeng@cn.fujitsu.com> wrote:

> > 
> >>  struct kvec;
> >>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >>  			struct page **pages);
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 73b64a3..5db811e 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >>  	return (idx == ZONE_NORMAL);
> >>  }
> >>  
> >> +static inline int is_movable(struct zone *zone)
> >> +{
> >> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> >> +}
> > 
> > A better name would be zone_is_movable().  We haven't been very
> > consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> > 
> Yes, zone_is_xxx() should be a better name, but there are some analogous
> definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
> for movable zone it will break such naming rules.
> Should we update other ones in a separate patch later or just keep the old style?

I do think the old names were poorly chosen.  Yes, we could fix them up
sometime but it's hardly a pressing issue.

> > And a neater implementation would be
> > 
> > 	return zone_idx(zone) == ZONE_MOVABLE;
> > 
> OK. After your change, should we also update for other ones such as is_normal()..?

Sure.

From: Andrew Morton <akpm@linux-foundation.org>
Subject: include-linux-mmzoneh-cleanups-fix

use zone_idx() some more, further simplify is_highmem()

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix
+++ a/include/linux/mmzone.h
@@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon
  */
 static inline int is_highmem(struct zone *zone)
 {
-#ifdef CONFIG_HIGHMEM
-	enum zone_type idx = zone_idx(zone);
-
-	return idx == ZONE_HIGHMEM ||
-	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
-#else
-	return 0;
-#endif
+	return is_highmem_idx(zone_idx(zone));
 }
 
 static inline int is_normal(struct zone *zone)
 {
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+	return zone_idx(zone) == ZONE_NORMAL;
 }
 
 static inline int is_dma32(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_idx(zone) == ZONE_DMA32;
 #else
 	return 0;
 #endif
@@ -886,7 +879,7 @@ static inline int is_dma32(struct zone *
 static inline int is_dma(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+	return zone_idx(zone) == ZONE_DMA;
 #else
 	return 0;
 #endif
_

> >> +	/* All pages are non movable, we are done :) */
> >> +	if (i == ret && list_empty(&pagelist))
> >> +		return ret;
> >> +
> >> +put_page:
> >> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> >> +	for (i = 0; i < ret; i++)
> >> +		put_page(pages[i]);
> >> +
> >> +	if (migrate_pre_flag && !isolate_err) {
> >> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >> +					false, MIGRATE_SYNC, MR_SYSCALL);
> >> +		/* Steal pages from non-movable zone successfully? */
> >> +		if (!ret)
> >> +			goto retry;
> > 
> > This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> > page_list must be reinitialised here (or, better, at the start of the loop).
> migrate_pages()
>   list_for_each_entry_safe()
>      unmap_and_move()
>        if (rc != -EAGAIN) {
>          list_del(&page->lru);
>        }

ah, OK, there it is.



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05 21:13         ` Andrew Morton
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05 21:13 UTC (permalink / raw)
  To: Lin Feng
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

On Tue, 05 Feb 2013 11:09:27 +0800
Lin Feng <linfeng@cn.fujitsu.com> wrote:

> > 
> >>  struct kvec;
> >>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >>  			struct page **pages);
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 73b64a3..5db811e 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >>  	return (idx == ZONE_NORMAL);
> >>  }
> >>  
> >> +static inline int is_movable(struct zone *zone)
> >> +{
> >> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> >> +}
> > 
> > A better name would be zone_is_movable().  We haven't been very
> > consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> > 
> Yes, zone_is_xxx() should be a better name, but there are some analogous
> definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
> for movable zone it will break such naming rules.
> Should we update other ones in a separate patch later or just keep the old style?

I do think the old names were poorly chosen.  Yes, we could fix them up
sometime but it's hardly a pressing issue.

> > And a neater implementation would be
> > 
> > 	return zone_idx(zone) == ZONE_MOVABLE;
> > 
> OK. After your change, should we also update for other ones such as is_normal()..?

Sure.

From: Andrew Morton <akpm@linux-foundation.org>
Subject: include-linux-mmzoneh-cleanups-fix

use zone_idx() some more, further simplify is_highmem()

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix
+++ a/include/linux/mmzone.h
@@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon
  */
 static inline int is_highmem(struct zone *zone)
 {
-#ifdef CONFIG_HIGHMEM
-	enum zone_type idx = zone_idx(zone);
-
-	return idx == ZONE_HIGHMEM ||
-	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
-#else
-	return 0;
-#endif
+	return is_highmem_idx(zone_idx(zone));
 }
 
 static inline int is_normal(struct zone *zone)
 {
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+	return zone_idx(zone) == ZONE_NORMAL;
 }
 
 static inline int is_dma32(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_idx(zone) == ZONE_DMA32;
 #else
 	return 0;
 #endif
@@ -886,7 +879,7 @@ static inline int is_dma32(struct zone *
 static inline int is_dma(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+	return zone_idx(zone) == ZONE_DMA;
 #else
 	return 0;
 #endif
_

> >> +	/* All pages are non movable, we are done :) */
> >> +	if (i == ret && list_empty(&pagelist))
> >> +		return ret;
> >> +
> >> +put_page:
> >> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> >> +	for (i = 0; i < ret; i++)
> >> +		put_page(pages[i]);
> >> +
> >> +	if (migrate_pre_flag && !isolate_err) {
> >> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >> +					false, MIGRATE_SYNC, MR_SYSCALL);
> >> +		/* Steal pages from non-movable zone successfully? */
> >> +		if (!ret)
> >> +			goto retry;
> > 
> > This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> > page_list must be reinitialised here (or, better, at the start of the loop).
> migrate_pages()
>   list_for_each_entry_safe()
>      unmap_and_move()
>        if (rc != -EAGAIN) {
>          list_del(&page->lru);
>        }

ah, OK, there it is.


--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-05 21:13         ` Andrew Morton
  0 siblings, 0 replies; 88+ messages in thread
From: Andrew Morton @ 2013-02-05 21:13 UTC (permalink / raw)
  To: Lin Feng
  Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel,
	Tang chen, Gu Zheng

On Tue, 05 Feb 2013 11:09:27 +0800
Lin Feng <linfeng@cn.fujitsu.com> wrote:

> > 
> >>  struct kvec;
> >>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >>  			struct page **pages);
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 73b64a3..5db811e 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >>  	return (idx == ZONE_NORMAL);
> >>  }
> >>  
> >> +static inline int is_movable(struct zone *zone)
> >> +{
> >> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> >> +}
> > 
> > A better name would be zone_is_movable().  We haven't been very
> > consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> > 
> Yes, zone_is_xxx() should be a better name, but there are some analogous
> definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
> for movable zone it will break such naming rules.
> Should we update other ones in a separate patch later or just keep the old style?

I do think the old names were poorly chosen.  Yes, we could fix them up
sometime but it's hardly a pressing issue.

> > And a neater implementation would be
> > 
> > 	return zone_idx(zone) == ZONE_MOVABLE;
> > 
> OK. After your change, should we also update for other ones such as is_normal()..?

Sure.

From: Andrew Morton <akpm@linux-foundation.org>
Subject: include-linux-mmzoneh-cleanups-fix

use zone_idx() some more, further simplify is_highmem()

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix
+++ a/include/linux/mmzone.h
@@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon
  */
 static inline int is_highmem(struct zone *zone)
 {
-#ifdef CONFIG_HIGHMEM
-	enum zone_type idx = zone_idx(zone);
-
-	return idx == ZONE_HIGHMEM ||
-	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
-#else
-	return 0;
-#endif
+	return is_highmem_idx(zone_idx(zone));
 }
 
 static inline int is_normal(struct zone *zone)
 {
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+	return zone_idx(zone) == ZONE_NORMAL;
 }
 
 static inline int is_dma32(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_idx(zone) == ZONE_DMA32;
 #else
 	return 0;
 #endif
@@ -886,7 +879,7 @@ static inline int is_dma32(struct zone *
 static inline int is_dma(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+	return zone_idx(zone) == ZONE_DMA;
 #else
 	return 0;
 #endif
_

> >> +	/* All pages are non movable, we are done :) */
> >> +	if (i == ret && list_empty(&pagelist))
> >> +		return ret;
> >> +
> >> +put_page:
> >> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> >> +	for (i = 0; i < ret; i++)
> >> +		put_page(pages[i]);
> >> +
> >> +	if (migrate_pre_flag && !isolate_err) {
> >> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >> +					false, MIGRATE_SYNC, MR_SYSCALL);
> >> +		/* Steal pages from non-movable zone successfully? */
> >> +		if (!ret)
> >> +			goto retry;
> > 
> > This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> > page_list must be reinitialised here (or, better, at the start of the loop).
> migrate_pages()
>   list_for_each_entry_safe()
>      unmap_and_move()
>        if (rc != -EAGAIN) {
>          list_del(&page->lru);
>        }

ah, OK, there it is.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05 11:57       ` Mel Gorman
  (?)
@ 2013-02-06  2:26         ` Michel Lespinasse
  -1 siblings, 0 replies; 88+ messages in thread
From: Michel Lespinasse @ 2013-02-06  2:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Just nitpicking, but:

On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote:
> +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> +{
> +       /* This mess avoids a potentially expensive pointer subtraction. */
> +       int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> +       return zone_off == idx * sizeof(*zone);
> +}

Maybe:
return zone == zone->zone_pgdat->node_zones + idx;
?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-06  2:26         ` Michel Lespinasse
  0 siblings, 0 replies; 88+ messages in thread
From: Michel Lespinasse @ 2013-02-06  2:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Just nitpicking, but:

On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote:
> +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> +{
> +       /* This mess avoids a potentially expensive pointer subtraction. */
> +       int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> +       return zone_off == idx * sizeof(*zone);
> +}

Maybe:
return zone == zone->zone_pgdat->node_zones + idx;
?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-06  2:26         ` Michel Lespinasse
  0 siblings, 0 replies; 88+ messages in thread
From: Michel Lespinasse @ 2013-02-06  2:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Just nitpicking, but:

On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote:
> +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> +{
> +       /* This mess avoids a potentially expensive pointer subtraction. */
> +       int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> +       return zone_off == idx * sizeof(*zone);
> +}

Maybe:
return zone == zone->zone_pgdat->node_zones + idx;
?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-06  2:26         ` Michel Lespinasse
@ 2013-02-06 10:41           ` Mel Gorman
  -1 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-06 10:41 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote:
> Just nitpicking, but:
> 
> On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote:
> > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> > +{
> > +       /* This mess avoids a potentially expensive pointer subtraction. */
> > +       int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> > +       return zone_off == idx * sizeof(*zone);
> > +}
> 
> Maybe:
> return zone == zone->zone_pgdat->node_zones + idx;
> ?
> 

Not a nit at all. Yours is more readable but it generates more code. A
single line function that uses the helper generates 0x3f bytes of code
(mostly function entry/exit) with your version and 0x39 bytes with mine. The
difference in efficiency is marginal as your version uses lea to multiply
by a constant but it's still slightly heavier.

The old code is fractionally better, your version is more readable so it's
up to Andrew really. Right now I think he's gone with his own version with
zone_idx() in the name of readability whatever sparse has to say about
the matter.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-06 10:41           ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-06 10:41 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote:
> Just nitpicking, but:
> 
> On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote:
> > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> > +{
> > +       /* This mess avoids a potentially expensive pointer subtraction. */
> > +       int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> > +       return zone_off == idx * sizeof(*zone);
> > +}
> 
> Maybe:
> return zone == zone->zone_pgdat->node_zones + idx;
> ?
> 

Not a nit at all. Yours is more readable but it generates more code. A
single line function that uses the helper generates 0x3f bytes of code
(mostly function entry/exit) with your version and 0x39 bytes with mine. The
difference in efficiency is marginal as your version uses lea to multiply
by a constant but it's still slightly heavier.

The old code is fractionally better, your version is more readable so it's
up to Andrew really. Right now I think he's gone with his own version with
zone_idx() in the name of readability whatever sparse has to say about
the matter.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05 11:57       ` Mel Gorman
  (?)
@ 2013-02-18 10:34         ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-18 10:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Mel,

See below.

On 02/05/2013 07:57 PM, Mel Gorman wrote:
> On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
>> The ifdefs aren't really needed here and I encourage people to omit
>> them.  This keeps the header files looking neater and reduces the
>> chances of things later breaking because we forgot to update some
>> CONFIG_foo logic in a header file.  The downside is that errors will be
>> revealed at link time rather than at compile time, but that's a pretty
>> small cost.
>>
> 
> As an aside, if ifdefs *have* to be used then it often better to have a
> pattern like
> 
> #ifdef CONFIG_MEMORY_HOTREMOVE
> int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> 		unsigned long start, int nr_pages, int write, int force,
> 		struct page **pages, struct vm_area_struct **vmas);
> #else
> static inline get_user_pages_non_movable(...)
> {
> 	get_user_pages(...)
> }
> #endif
> 
> It eliminates the need for #ifdefs in the C file that calls
> get_user_pages_non_movable().
Thanks for pointing out :)

>>> +
>>> +retry:
>>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>>> +				vmas);
>>
>> We should handle (ret < 0) here.  At present the function will silently
>> convert an error return into "return 0", which is bad.  The function
>> does appear to otherwise do the right thing if get_user_pages() failed,
>> but only due to good luck.
>>
> 
> A BUG_ON check if we retry more than once wouldn't hurt either. It requires
> a broken implementation of alloc_migrate_target() but someone might "fix"
> it and miss this.
Agree.

> 
> A minor aside, it would be nice if we exited at this point if there was
> no populated ZONE_MOVABLE in the system. There is a movable_zone global
> variable already. If it was forced to be MAX_NR_ZONES before the call to
> find_usable_zone_for_movable() in memory initialisation we should be able
> to make a cheap check for it here.
OK.

>>> +			if (!migrate_pre_flag) {
>>> +				if (migrate_prep())
>>> +					goto put_page;
> 
> CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> return failure. You could make this BUG_ON(migrate_prep()), remove this
> goto and the migrate_pre_flag check below becomes redundant.
Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check
will call/check migrate_prep() every time we isolate a single page, do we have
to do so?
I mean that for a group of pages it's sufficient to call migrate_prep() only once.
There are some other migrate-relative codes using this scheme. 
So I get confused, do I miss something or the the performance cost is little? :(

> 
>>> +				migrate_pre_flag = 1;
>>> +			}
>>> +
>>> +			if (!isolate_lru_page(pages[i])) {
>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +						 page_is_file_cache(pages[i]));
>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>> +			} else {
>>> +				isolate_err = 1;
>>> +				goto put_page;
>>> +			}
> 
> isolate_lru_page() takes the LRU lock every time. If
> get_user_pages_non_movable is used heavily then you may encounter lock
> contention problems. Batching this lock would be a separate patch and should
> not be necessary yet but you should at least comment on it as a reminder.
Farsighted, should improve it in the future.

> 
> I think that a goto could also have been avoided here if you used break. The
> i == ret check below would be false and it would just fall through.
> Overall the flow would be a bit easier to follow.
Yes, I noticed this before. I thought using goto could save some micro ops
(here is only the i == ret check) though lower the readability but still be clearly. 

Also there are some other place in current kernel facing such performance/readability 
race condition, but it seems that the developers prefer readability, why? While I
think performance is fatal to kernel..

> 
> Why list_add_tail()? I don't think it's wrong but it's unusual to see
> list_add_tail() when list_add() is enough.
Sorry, not intentional, just steal from other codes without thinking more ;-)

> 
>>> +		}
>>> +	}
>>> +
>>> +	/* All pages are non movable, we are done :) */
>>> +	if (i == ret && list_empty(&pagelist))
>>> +		return ret;
>>> +
>>> +put_page:
>>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> +	for (i = 0; i < ret; i++)
>>> +		put_page(pages[i]);
>>> +
> 
> release_pages.
> 
> That comment is insufficient. There are non-obvious consequences to this
> logic. We are dropping pins on all pages regardless of what zone they
> are in. If the subsequent migration fails then we end up returning 0
> with no pages pinned. The user-visible effect is that io_setup() fails
> for non-obvious reasons. It will return EAGAIN to userspace which will be
> interpreted as "The specified nr_events exceeds the user's limit of available
> events.". The application will either fail or potentially infinite loop
> if the developer interpreted EAGAIN as "try again" as opposed to "this is
> a permanent failure".
Yes, I missed such things.

> 
> Is that deliberate? Is it really preferable that AIO can fail to setup
> and the application exit just in case we want to hot-remove memory later?
> Should a failed migration generate a WARN_ON at least?
> 
> I would think that it's better to WARN_ON_ONCE if migration fails but pin
> the pages as requested. If a future memory hot-remove operation fails
> then the warning will indicate why but applications will not fail as a
I'm so agree, I will keep the pin state alive and issue a WARN_ON_ONCE message
in such case.

> result. It's a little clumsy but the memory hot-remove failure message
> could list what applications have pinned the pages that cannot be removed
> so the administrator has the option of force-killing the application. It
> is possible to discover what application is pinning a page from userspace
> but it would involve an expensive search with /proc/kpagemap
> 
>>> +	if (migrate_pre_flag && !isolate_err) {
>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> 
> The conversion of alloc_migrate_target is a bit problematic. It strips
> the __GFP_MOVABLE flag and the consequence of this is that it converts
> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> number of pages, particularly if the number of get_user_pages_non_movable()
> increases for short-lived pins like direct IO.
Sorry, I don't quite understand here neither. If we use the following new 
migration allocation function as you said, the increasing number of 
get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
pages. What's the difference, do I miss something?

> 
> One way around this is to add a high_zoneidx parameter to
> __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
> inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
> as the high_zoneidx and create a new migration allocation function that
> passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
> happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE


On 02/05/2013 09:32 PM, Mel Gorman wrote:
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.
> 
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok
> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
> 
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?
Sorry, I hadn't considered THP and hugepage yet, I will deal such cases based on your 
comments later.

thanks a lot for your review :)
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-18 10:34         ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-18 10:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Mel,

See below.

On 02/05/2013 07:57 PM, Mel Gorman wrote:
> On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
>> The ifdefs aren't really needed here and I encourage people to omit
>> them.  This keeps the header files looking neater and reduces the
>> chances of things later breaking because we forgot to update some
>> CONFIG_foo logic in a header file.  The downside is that errors will be
>> revealed at link time rather than at compile time, but that's a pretty
>> small cost.
>>
> 
> As an aside, if ifdefs *have* to be used then it often better to have a
> pattern like
> 
> #ifdef CONFIG_MEMORY_HOTREMOVE
> int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> 		unsigned long start, int nr_pages, int write, int force,
> 		struct page **pages, struct vm_area_struct **vmas);
> #else
> static inline get_user_pages_non_movable(...)
> {
> 	get_user_pages(...)
> }
> #endif
> 
> It eliminates the need for #ifdefs in the C file that calls
> get_user_pages_non_movable().
Thanks for pointing out :)

>>> +
>>> +retry:
>>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>>> +				vmas);
>>
>> We should handle (ret < 0) here.  At present the function will silently
>> convert an error return into "return 0", which is bad.  The function
>> does appear to otherwise do the right thing if get_user_pages() failed,
>> but only due to good luck.
>>
> 
> A BUG_ON check if we retry more than once wouldn't hurt either. It requires
> a broken implementation of alloc_migrate_target() but someone might "fix"
> it and miss this.
Agree.

> 
> A minor aside, it would be nice if we exited at this point if there was
> no populated ZONE_MOVABLE in the system. There is a movable_zone global
> variable already. If it was forced to be MAX_NR_ZONES before the call to
> find_usable_zone_for_movable() in memory initialisation we should be able
> to make a cheap check for it here.
OK.

>>> +			if (!migrate_pre_flag) {
>>> +				if (migrate_prep())
>>> +					goto put_page;
> 
> CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> return failure. You could make this BUG_ON(migrate_prep()), remove this
> goto and the migrate_pre_flag check below becomes redundant.
Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check
will call/check migrate_prep() every time we isolate a single page, do we have
to do so?
I mean that for a group of pages it's sufficient to call migrate_prep() only once.
There are some other migrate-relative codes using this scheme. 
So I get confused, do I miss something or the the performance cost is little? :(

> 
>>> +				migrate_pre_flag = 1;
>>> +			}
>>> +
>>> +			if (!isolate_lru_page(pages[i])) {
>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +						 page_is_file_cache(pages[i]));
>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>> +			} else {
>>> +				isolate_err = 1;
>>> +				goto put_page;
>>> +			}
> 
> isolate_lru_page() takes the LRU lock every time. If
> get_user_pages_non_movable is used heavily then you may encounter lock
> contention problems. Batching this lock would be a separate patch and should
> not be necessary yet but you should at least comment on it as a reminder.
Farsighted, should improve it in the future.

> 
> I think that a goto could also have been avoided here if you used break. The
> i == ret check below would be false and it would just fall through.
> Overall the flow would be a bit easier to follow.
Yes, I noticed this before. I thought using goto could save some micro ops
(here is only the i == ret check) though lower the readability but still be clearly. 

Also there are some other place in current kernel facing such performance/readability 
race condition, but it seems that the developers prefer readability, why? While I
think performance is fatal to kernel..

> 
> Why list_add_tail()? I don't think it's wrong but it's unusual to see
> list_add_tail() when list_add() is enough.
Sorry, not intentional, just steal from other codes without thinking more ;-)

> 
>>> +		}
>>> +	}
>>> +
>>> +	/* All pages are non movable, we are done :) */
>>> +	if (i == ret && list_empty(&pagelist))
>>> +		return ret;
>>> +
>>> +put_page:
>>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> +	for (i = 0; i < ret; i++)
>>> +		put_page(pages[i]);
>>> +
> 
> release_pages.
> 
> That comment is insufficient. There are non-obvious consequences to this
> logic. We are dropping pins on all pages regardless of what zone they
> are in. If the subsequent migration fails then we end up returning 0
> with no pages pinned. The user-visible effect is that io_setup() fails
> for non-obvious reasons. It will return EAGAIN to userspace which will be
> interpreted as "The specified nr_events exceeds the user's limit of available
> events.". The application will either fail or potentially infinite loop
> if the developer interpreted EAGAIN as "try again" as opposed to "this is
> a permanent failure".
Yes, I missed such things.

> 
> Is that deliberate? Is it really preferable that AIO can fail to setup
> and the application exit just in case we want to hot-remove memory later?
> Should a failed migration generate a WARN_ON at least?
> 
> I would think that it's better to WARN_ON_ONCE if migration fails but pin
> the pages as requested. If a future memory hot-remove operation fails
> then the warning will indicate why but applications will not fail as a
I'm so agree, I will keep the pin state alive and issue a WARN_ON_ONCE message
in such case.

> result. It's a little clumsy but the memory hot-remove failure message
> could list what applications have pinned the pages that cannot be removed
> so the administrator has the option of force-killing the application. It
> is possible to discover what application is pinning a page from userspace
> but it would involve an expensive search with /proc/kpagemap
> 
>>> +	if (migrate_pre_flag && !isolate_err) {
>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> 
> The conversion of alloc_migrate_target is a bit problematic. It strips
> the __GFP_MOVABLE flag and the consequence of this is that it converts
> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> number of pages, particularly if the number of get_user_pages_non_movable()
> increases for short-lived pins like direct IO.
Sorry, I don't quite understand here neither. If we use the following new 
migration allocation function as you said, the increasing number of 
get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
pages. What's the difference, do I miss something?

> 
> One way around this is to add a high_zoneidx parameter to
> __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
> inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
> as the high_zoneidx and create a new migration allocation function that
> passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
> happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE


On 02/05/2013 09:32 PM, Mel Gorman wrote:
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.
> 
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok
> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
> 
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?
Sorry, I hadn't considered THP and hugepage yet, I will deal such cases based on your 
comments later.

thanks a lot for your review :)
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-18 10:34         ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-18 10:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Mel,

See below.

On 02/05/2013 07:57 PM, Mel Gorman wrote:
> On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
>> The ifdefs aren't really needed here and I encourage people to omit
>> them.  This keeps the header files looking neater and reduces the
>> chances of things later breaking because we forgot to update some
>> CONFIG_foo logic in a header file.  The downside is that errors will be
>> revealed at link time rather than at compile time, but that's a pretty
>> small cost.
>>
> 
> As an aside, if ifdefs *have* to be used then it often better to have a
> pattern like
> 
> #ifdef CONFIG_MEMORY_HOTREMOVE
> int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> 		unsigned long start, int nr_pages, int write, int force,
> 		struct page **pages, struct vm_area_struct **vmas);
> #else
> static inline get_user_pages_non_movable(...)
> {
> 	get_user_pages(...)
> }
> #endif
> 
> It eliminates the need for #ifdefs in the C file that calls
> get_user_pages_non_movable().
Thanks for pointing out :)

>>> +
>>> +retry:
>>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>>> +				vmas);
>>
>> We should handle (ret < 0) here.  At present the function will silently
>> convert an error return into "return 0", which is bad.  The function
>> does appear to otherwise do the right thing if get_user_pages() failed,
>> but only due to good luck.
>>
> 
> A BUG_ON check if we retry more than once wouldn't hurt either. It requires
> a broken implementation of alloc_migrate_target() but someone might "fix"
> it and miss this.
Agree.

> 
> A minor aside, it would be nice if we exited at this point if there was
> no populated ZONE_MOVABLE in the system. There is a movable_zone global
> variable already. If it was forced to be MAX_NR_ZONES before the call to
> find_usable_zone_for_movable() in memory initialisation we should be able
> to make a cheap check for it here.
OK.

>>> +			if (!migrate_pre_flag) {
>>> +				if (migrate_prep())
>>> +					goto put_page;
> 
> CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> return failure. You could make this BUG_ON(migrate_prep()), remove this
> goto and the migrate_pre_flag check below becomes redundant.
Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check
will call/check migrate_prep() every time we isolate a single page, do we have
to do so?
I mean that for a group of pages it's sufficient to call migrate_prep() only once.
There are some other migrate-relative codes using this scheme. 
So I get confused, do I miss something or the the performance cost is little? :(

> 
>>> +				migrate_pre_flag = 1;
>>> +			}
>>> +
>>> +			if (!isolate_lru_page(pages[i])) {
>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +						 page_is_file_cache(pages[i]));
>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>> +			} else {
>>> +				isolate_err = 1;
>>> +				goto put_page;
>>> +			}
> 
> isolate_lru_page() takes the LRU lock every time. If
> get_user_pages_non_movable is used heavily then you may encounter lock
> contention problems. Batching this lock would be a separate patch and should
> not be necessary yet but you should at least comment on it as a reminder.
Farsighted, should improve it in the future.

> 
> I think that a goto could also have been avoided here if you used break. The
> i == ret check below would be false and it would just fall through.
> Overall the flow would be a bit easier to follow.
Yes, I noticed this before. I thought using goto could save some micro ops
(here is only the i == ret check) though lower the readability but still be clearly. 

Also there are some other place in current kernel facing such performance/readability 
race condition, but it seems that the developers prefer readability, why? While I
think performance is fatal to kernel..

> 
> Why list_add_tail()? I don't think it's wrong but it's unusual to see
> list_add_tail() when list_add() is enough.
Sorry, not intentional, just steal from other codes without thinking more ;-)

> 
>>> +		}
>>> +	}
>>> +
>>> +	/* All pages are non movable, we are done :) */
>>> +	if (i == ret && list_empty(&pagelist))
>>> +		return ret;
>>> +
>>> +put_page:
>>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> +	for (i = 0; i < ret; i++)
>>> +		put_page(pages[i]);
>>> +
> 
> release_pages.
> 
> That comment is insufficient. There are non-obvious consequences to this
> logic. We are dropping pins on all pages regardless of what zone they
> are in. If the subsequent migration fails then we end up returning 0
> with no pages pinned. The user-visible effect is that io_setup() fails
> for non-obvious reasons. It will return EAGAIN to userspace which will be
> interpreted as "The specified nr_events exceeds the user's limit of available
> events.". The application will either fail or potentially infinite loop
> if the developer interpreted EAGAIN as "try again" as opposed to "this is
> a permanent failure".
Yes, I missed such things.

> 
> Is that deliberate? Is it really preferable that AIO can fail to setup
> and the application exit just in case we want to hot-remove memory later?
> Should a failed migration generate a WARN_ON at least?
> 
> I would think that it's better to WARN_ON_ONCE if migration fails but pin
> the pages as requested. If a future memory hot-remove operation fails
> then the warning will indicate why but applications will not fail as a
I'm so agree, I will keep the pin state alive and issue a WARN_ON_ONCE message
in such case.

> result. It's a little clumsy but the memory hot-remove failure message
> could list what applications have pinned the pages that cannot be removed
> so the administrator has the option of force-killing the application. It
> is possible to discover what application is pinning a page from userspace
> but it would involve an expensive search with /proc/kpagemap
> 
>>> +	if (migrate_pre_flag && !isolate_err) {
>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> 
> The conversion of alloc_migrate_target is a bit problematic. It strips
> the __GFP_MOVABLE flag and the consequence of this is that it converts
> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> number of pages, particularly if the number of get_user_pages_non_movable()
> increases for short-lived pins like direct IO.
Sorry, I don't quite understand here neither. If we use the following new 
migration allocation function as you said, the increasing number of 
get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
pages. What's the difference, do I miss something?

> 
> One way around this is to add a high_zoneidx parameter to
> __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
> inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
> as the high_zoneidx and create a new migration allocation function that
> passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
> happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE


On 02/05/2013 09:32 PM, Mel Gorman wrote:
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.
> 
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok
> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
> 
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?
Sorry, I hadn't considered THP and hugepage yet, I will deal such cases based on your 
comments later.

thanks a lot for your review :)
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-18 10:34         ` Lin Feng
  (?)
@ 2013-02-18 15:17           ` Mel Gorman
  -1 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-18 15:17 UTC (permalink / raw)
  To: Lin Feng
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote:
> >>> +			if (!migrate_pre_flag) {
> >>> +				if (migrate_prep())
> >>> +					goto put_page;
> > 
> > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> > return failure. You could make this BUG_ON(migrate_prep()), remove this
> > goto and the migrate_pre_flag check below becomes redundant.
> Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check
> will call/check migrate_prep() every time we isolate a single page, do we have
> to do so?

I was referring to the migrate_pre_flag check further down and I'm not
suggesting it be removed from here as you do want to call migrate_prep()
only once. However, as it'll never return failure for any kernel
configuration that allows memory hot-remove, this goto can be
removed and the flow simplified.

> > 
> >>> +				migrate_pre_flag = 1;
> >>> +			}
> >>> +
> >>> +			if (!isolate_lru_page(pages[i])) {
> >>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> >>> +						 page_is_file_cache(pages[i]));
> >>> +				list_add_tail(&pages[i]->lru, &pagelist);
> >>> +			} else {
> >>> +				isolate_err = 1;
> >>> +				goto put_page;
> >>> +			}
> > 
> > isolate_lru_page() takes the LRU lock every time. If
> > get_user_pages_non_movable is used heavily then you may encounter lock
> > contention problems. Batching this lock would be a separate patch and should
> > not be necessary yet but you should at least comment on it as a reminder.
> Farsighted, should improve it in the future.
> 
> > 
> > I think that a goto could also have been avoided here if you used break. The
> > i == ret check below would be false and it would just fall through.
> > Overall the flow would be a bit easier to follow.
> Yes, I noticed this before. I thought using goto could save some micro ops
> (here is only the i == ret check) though lower the readability but still be clearly. 
> 
> Also there are some other place in current kernel facing such performance/readability 
> race condition, but it seems that the developers prefer readability, why? While I
> think performance is fatal to kernel..
> 

Memory hot-remove and page migration are not performance critical paths.
For page migration, the cost will be likely dominated by the page copy but
it's also possible the cost will be dominated by locking. The difference
between a break and goto will not even be measurable.

> > <SNIP>
> >
> > result. It's a little clumsy but the memory hot-remove failure message
> > could list what applications have pinned the pages that cannot be removed
> > so the administrator has the option of force-killing the application. It
> > is possible to discover what application is pinning a page from userspace
> > but it would involve an expensive search with /proc/kpagemap
> > 
> >>> +	if (migrate_pre_flag && !isolate_err) {
> >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> > 
> > The conversion of alloc_migrate_target is a bit problematic. It strips
> > the __GFP_MOVABLE flag and the consequence of this is that it converts
> > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> > number of pages, particularly if the number of get_user_pages_non_movable()
> > increases for short-lived pins like direct IO.
>
> Sorry, I don't quite understand here neither. If we use the following new 
> migration allocation function as you said, the increasing number of 
> get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
> pages. What's the difference, do I miss something?
> 

The replacement function preserves the __GFP_MOVABLE flag. It cannot use
ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
other movable pages.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-18 15:17           ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-18 15:17 UTC (permalink / raw)
  To: Lin Feng
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote:
> >>> +			if (!migrate_pre_flag) {
> >>> +				if (migrate_prep())
> >>> +					goto put_page;
> > 
> > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> > return failure. You could make this BUG_ON(migrate_prep()), remove this
> > goto and the migrate_pre_flag check below becomes redundant.
> Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check
> will call/check migrate_prep() every time we isolate a single page, do we have
> to do so?

I was referring to the migrate_pre_flag check further down and I'm not
suggesting it be removed from here as you do want to call migrate_prep()
only once. However, as it'll never return failure for any kernel
configuration that allows memory hot-remove, this goto can be
removed and the flow simplified.

> > 
> >>> +				migrate_pre_flag = 1;
> >>> +			}
> >>> +
> >>> +			if (!isolate_lru_page(pages[i])) {
> >>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> >>> +						 page_is_file_cache(pages[i]));
> >>> +				list_add_tail(&pages[i]->lru, &pagelist);
> >>> +			} else {
> >>> +				isolate_err = 1;
> >>> +				goto put_page;
> >>> +			}
> > 
> > isolate_lru_page() takes the LRU lock every time. If
> > get_user_pages_non_movable is used heavily then you may encounter lock
> > contention problems. Batching this lock would be a separate patch and should
> > not be necessary yet but you should at least comment on it as a reminder.
> Farsighted, should improve it in the future.
> 
> > 
> > I think that a goto could also have been avoided here if you used break. The
> > i == ret check below would be false and it would just fall through.
> > Overall the flow would be a bit easier to follow.
> Yes, I noticed this before. I thought using goto could save some micro ops
> (here is only the i == ret check) though lower the readability but still be clearly. 
> 
> Also there are some other place in current kernel facing such performance/readability 
> race condition, but it seems that the developers prefer readability, why? While I
> think performance is fatal to kernel..
> 

Memory hot-remove and page migration are not performance critical paths.
For page migration, the cost will be likely dominated by the page copy but
it's also possible the cost will be dominated by locking. The difference
between a break and goto will not even be measurable.

> > <SNIP>
> >
> > result. It's a little clumsy but the memory hot-remove failure message
> > could list what applications have pinned the pages that cannot be removed
> > so the administrator has the option of force-killing the application. It
> > is possible to discover what application is pinning a page from userspace
> > but it would involve an expensive search with /proc/kpagemap
> > 
> >>> +	if (migrate_pre_flag && !isolate_err) {
> >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> > 
> > The conversion of alloc_migrate_target is a bit problematic. It strips
> > the __GFP_MOVABLE flag and the consequence of this is that it converts
> > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> > number of pages, particularly if the number of get_user_pages_non_movable()
> > increases for short-lived pins like direct IO.
>
> Sorry, I don't quite understand here neither. If we use the following new 
> migration allocation function as you said, the increasing number of 
> get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
> pages. What's the difference, do I miss something?
> 

The replacement function preserves the __GFP_MOVABLE flag. It cannot use
ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
other movable pages.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-18 15:17           ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-18 15:17 UTC (permalink / raw)
  To: Lin Feng
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote:
> >>> +			if (!migrate_pre_flag) {
> >>> +				if (migrate_prep())
> >>> +					goto put_page;
> > 
> > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> > return failure. You could make this BUG_ON(migrate_prep()), remove this
> > goto and the migrate_pre_flag check below becomes redundant.
> Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check
> will call/check migrate_prep() every time we isolate a single page, do we have
> to do so?

I was referring to the migrate_pre_flag check further down and I'm not
suggesting it be removed from here as you do want to call migrate_prep()
only once. However, as it'll never return failure for any kernel
configuration that allows memory hot-remove, this goto can be
removed and the flow simplified.

> > 
> >>> +				migrate_pre_flag = 1;
> >>> +			}
> >>> +
> >>> +			if (!isolate_lru_page(pages[i])) {
> >>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> >>> +						 page_is_file_cache(pages[i]));
> >>> +				list_add_tail(&pages[i]->lru, &pagelist);
> >>> +			} else {
> >>> +				isolate_err = 1;
> >>> +				goto put_page;
> >>> +			}
> > 
> > isolate_lru_page() takes the LRU lock every time. If
> > get_user_pages_non_movable is used heavily then you may encounter lock
> > contention problems. Batching this lock would be a separate patch and should
> > not be necessary yet but you should at least comment on it as a reminder.
> Farsighted, should improve it in the future.
> 
> > 
> > I think that a goto could also have been avoided here if you used break. The
> > i == ret check below would be false and it would just fall through.
> > Overall the flow would be a bit easier to follow.
> Yes, I noticed this before. I thought using goto could save some micro ops
> (here is only the i == ret check) though lower the readability but still be clearly. 
> 
> Also there are some other place in current kernel facing such performance/readability 
> race condition, but it seems that the developers prefer readability, why? While I
> think performance is fatal to kernel..
> 

Memory hot-remove and page migration are not performance critical paths.
For page migration, the cost will be likely dominated by the page copy but
it's also possible the cost will be dominated by locking. The difference
between a break and goto will not even be measurable.

> > <SNIP>
> >
> > result. It's a little clumsy but the memory hot-remove failure message
> > could list what applications have pinned the pages that cannot be removed
> > so the administrator has the option of force-killing the application. It
> > is possible to discover what application is pinning a page from userspace
> > but it would involve an expensive search with /proc/kpagemap
> > 
> >>> +	if (migrate_pre_flag && !isolate_err) {
> >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> > 
> > The conversion of alloc_migrate_target is a bit problematic. It strips
> > the __GFP_MOVABLE flag and the consequence of this is that it converts
> > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> > number of pages, particularly if the number of get_user_pages_non_movable()
> > increases for short-lived pins like direct IO.
>
> Sorry, I don't quite understand here neither. If we use the following new 
> migration allocation function as you said, the increasing number of 
> get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
> pages. What's the difference, do I miss something?
> 

The replacement function preserves the __GFP_MOVABLE flag. It cannot use
ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
other movable pages.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-18 15:17           ` Mel Gorman
@ 2013-02-19  9:55             ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-19  9:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Mel,

On 02/18/2013 11:17 PM, Mel Gorman wrote:
>>> > > <SNIP>
>>> > >
>>> > > result. It's a little clumsy but the memory hot-remove failure message
>>> > > could list what applications have pinned the pages that cannot be removed
>>> > > so the administrator has the option of force-killing the application. It
>>> > > is possible to discover what application is pinning a page from userspace
>>> > > but it would involve an expensive search with /proc/kpagemap
>>> > > 
>>>>> > >>> +	if (migrate_pre_flag && !isolate_err) {
>>>>> > >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>>>> > >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>>> > > 
>>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
>>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
>>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
>>> > > number of pages, particularly if the number of get_user_pages_non_movable()
>>> > > increases for short-lived pins like direct IO.
>> >
>> > Sorry, I don't quite understand here neither. If we use the following new 
>> > migration allocation function as you said, the increasing number of 
>> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
>> > pages. What's the difference, do I miss something?
>> > 
> The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> other movable pages.

Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.".

Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so
that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little
change?

Does that approach work? Otherwise I have to follow your suggestion.

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-19  9:55             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-19  9:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Mel,

On 02/18/2013 11:17 PM, Mel Gorman wrote:
>>> > > <SNIP>
>>> > >
>>> > > result. It's a little clumsy but the memory hot-remove failure message
>>> > > could list what applications have pinned the pages that cannot be removed
>>> > > so the administrator has the option of force-killing the application. It
>>> > > is possible to discover what application is pinning a page from userspace
>>> > > but it would involve an expensive search with /proc/kpagemap
>>> > > 
>>>>> > >>> +	if (migrate_pre_flag && !isolate_err) {
>>>>> > >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>>>> > >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>>> > > 
>>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
>>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
>>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
>>> > > number of pages, particularly if the number of get_user_pages_non_movable()
>>> > > increases for short-lived pins like direct IO.
>> >
>> > Sorry, I don't quite understand here neither. If we use the following new 
>> > migration allocation function as you said, the increasing number of 
>> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
>> > pages. What's the difference, do I miss something?
>> > 
> The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> other movable pages.

Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.".

Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so
that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little
change?

Does that approach work? Otherwise I have to follow your suggestion.

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-19  9:55             ` Lin Feng
  (?)
@ 2013-02-19 10:34               ` Mel Gorman
  -1 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-19 10:34 UTC (permalink / raw)
  To: Lin Feng
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote:
> Hi Mel,
> 
> On 02/18/2013 11:17 PM, Mel Gorman wrote:
> >>> > > <SNIP>
> >>> > >
> >>> > > result. It's a little clumsy but the memory hot-remove failure message
> >>> > > could list what applications have pinned the pages that cannot be removed
> >>> > > so the administrator has the option of force-killing the application. It
> >>> > > is possible to discover what application is pinning a page from userspace
> >>> > > but it would involve an expensive search with /proc/kpagemap
> >>> > > 
> >>>>> > >>> +	if (migrate_pre_flag && !isolate_err) {
> >>>>> > >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>>>> > >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> >>> > > 
> >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
> >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
> >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> >>> > > number of pages, particularly if the number of get_user_pages_non_movable()
> >>> > > increases for short-lived pins like direct IO.
> >> >
> >> > Sorry, I don't quite understand here neither. If we use the following new 
> >> > migration allocation function as you said, the increasing number of 
> >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
> >> > pages. What's the difference, do I miss something?
> >> > 
> > The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> > other movable pages.
> 
> Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.".
> 

Update the comment and describe the exception then.

> Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so
> that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little
> change?
> 

It should work (double check gfp_zone) but then the allocation cannot
use the highmem zone. If you can be 100% certain that zone will not exist
be populated then it will work as expected but it's a hack and should be
commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set
to enforce it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-19 10:34               ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-19 10:34 UTC (permalink / raw)
  To: Lin Feng
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote:
> Hi Mel,
> 
> On 02/18/2013 11:17 PM, Mel Gorman wrote:
> >>> > > <SNIP>
> >>> > >
> >>> > > result. It's a little clumsy but the memory hot-remove failure message
> >>> > > could list what applications have pinned the pages that cannot be removed
> >>> > > so the administrator has the option of force-killing the application. It
> >>> > > is possible to discover what application is pinning a page from userspace
> >>> > > but it would involve an expensive search with /proc/kpagemap
> >>> > > 
> >>>>> > >>> +	if (migrate_pre_flag && !isolate_err) {
> >>>>> > >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>>>> > >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> >>> > > 
> >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
> >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
> >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> >>> > > number of pages, particularly if the number of get_user_pages_non_movable()
> >>> > > increases for short-lived pins like direct IO.
> >> >
> >> > Sorry, I don't quite understand here neither. If we use the following new 
> >> > migration allocation function as you said, the increasing number of 
> >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
> >> > pages. What's the difference, do I miss something?
> >> > 
> > The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> > other movable pages.
> 
> Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.".
> 

Update the comment and describe the exception then.

> Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so
> that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little
> change?
> 

It should work (double check gfp_zone) but then the allocation cannot
use the highmem zone. If you can be 100% certain that zone will not exist
be populated then it will work as expected but it's a hack and should be
commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set
to enforce it.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-19 10:34               ` Mel Gorman
  0 siblings, 0 replies; 88+ messages in thread
From: Mel Gorman @ 2013-02-19 10:34 UTC (permalink / raw)
  To: Lin Feng
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote:
> Hi Mel,
> 
> On 02/18/2013 11:17 PM, Mel Gorman wrote:
> >>> > > <SNIP>
> >>> > >
> >>> > > result. It's a little clumsy but the memory hot-remove failure message
> >>> > > could list what applications have pinned the pages that cannot be removed
> >>> > > so the administrator has the option of force-killing the application. It
> >>> > > is possible to discover what application is pinning a page from userspace
> >>> > > but it would involve an expensive search with /proc/kpagemap
> >>> > > 
> >>>>> > >>> +	if (migrate_pre_flag && !isolate_err) {
> >>>>> > >>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>>>> > >>> +					false, MIGRATE_SYNC, MR_SYSCALL);
> >>> > > 
> >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
> >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
> >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> >>> > > number of pages, particularly if the number of get_user_pages_non_movable()
> >>> > > increases for short-lived pins like direct IO.
> >> >
> >> > Sorry, I don't quite understand here neither. If we use the following new 
> >> > migration allocation function as you said, the increasing number of 
> >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE
> >> > pages. What's the difference, do I miss something?
> >> > 
> > The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> > other movable pages.
> 
> Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.".
> 

Update the comment and describe the exception then.

> Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so
> that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little
> change?
> 

It should work (double check gfp_zone) but then the allocation cannot
use the highmem zone. If you can be 100% certain that zone will not exist
be populated then it will work as expected but it's a hack and should be
commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set
to enforce it.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05 13:32         ` Mel Gorman
@ 2013-02-19 13:37           ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-19 13:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Mel,

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
>>
>>>> +				migrate_pre_flag = 1;
>>>> +			}
>>>> +
>>>> +			if (!isolate_lru_page(pages[i])) {
>>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>>> +						 page_is_file_cache(pages[i]));
>>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>>> +			} else {
>>>> +				isolate_err = 1;
>>>> +				goto put_page;
>>>> +			}
>>
>> isolate_lru_page() takes the LRU lock every time.
> 
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.

I look into the migrate_huge_page() codes find that if we support the hugetlbfs
non movable migration, we have to invent another alloc_huge_page_node_nonmovable() 
or such allocate interface, which cost is large(exploding the codes and great impact
on current alloc_huge_page_node()) but gains little, I think that pinning hugepage
is a corner case. 

So can we skip over hugepage without migration but give some WARN_ON() info, is
it acceptable?

> 
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

I can't find codes doing such things :(, could you please point me out?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
> 
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?

I checked my config file that both CONFIG options aboved are enabled. However it was 
only be tested by two services invoking io_setup(), it works fine..

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-19 13:37           ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-19 13:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Mel,

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
>>
>>>> +				migrate_pre_flag = 1;
>>>> +			}
>>>> +
>>>> +			if (!isolate_lru_page(pages[i])) {
>>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>>> +						 page_is_file_cache(pages[i]));
>>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>>> +			} else {
>>>> +				isolate_err = 1;
>>>> +				goto put_page;
>>>> +			}
>>
>> isolate_lru_page() takes the LRU lock every time.
> 
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.

I look into the migrate_huge_page() codes find that if we support the hugetlbfs
non movable migration, we have to invent another alloc_huge_page_node_nonmovable() 
or such allocate interface, which cost is large(exploding the codes and great impact
on current alloc_huge_page_node()) but gains little, I think that pinning hugepage
is a corner case. 

So can we skip over hugepage without migration but give some WARN_ON() info, is
it acceptable?

> 
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

I can't find codes doing such things :(, could you please point me out?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
> 
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?

I checked my config file that both CONFIG options aboved are enabled. However it was 
only be tested by two services invoking io_setup(), it works fine..

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-19 13:37           ` Lin Feng
  (?)
@ 2013-02-20  2:34             ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20  2:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel



On 02/19/2013 09:37 PM, Lin Feng wrote:
>> > 
>> > The other is that this almost certainly broken for transhuge page
>> > handling. gup returns the head and tail pages and ordinarily this is ok
> I can't find codes doing such things :(, could you please point me out?
> 
Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
flee...

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20  2:34             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20  2:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel



On 02/19/2013 09:37 PM, Lin Feng wrote:
>> > 
>> > The other is that this almost certainly broken for transhuge page
>> > handling. gup returns the head and tail pages and ordinarily this is ok
> I can't find codes doing such things :(, could you please point me out?
> 
Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
flee...

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20  2:34             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20  2:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu,
	minchan, riel, rientjes, isimatu.yasuaki, wency, laijs,
	jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel



On 02/19/2013 09:37 PM, Lin Feng wrote:
>> > 
>> > The other is that this almost certainly broken for transhuge page
>> > handling. gup returns the head and tail pages and ordinarily this is ok
> I can't find codes doing such things :(, could you please point me out?
> 
Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
flee...

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20  2:34             ` Lin Feng
                               ` (2 preceding siblings ...)
  (?)
@ 2013-02-20  2:44             ` Wanpeng Li
  -1 siblings, 0 replies; 88+ messages in thread
From: Wanpeng Li @ 2013-02-20  2:44 UTC (permalink / raw)
  To: Lin Feng
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote:
>
>
>On 02/19/2013 09:37 PM, Lin Feng wrote:
>>> > 
>>> > The other is that this almost certainly broken for transhuge page
>>> > handling. gup returns the head and tail pages and ordinarily this is ok
>> I can't find codes doing such things :(, could you please point me out?
>> 
>Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
>flee...

According to the compound page, the first page of compound page is
called head page, other sub pages are called tail pages.

Regards,
Wanpeng Li 

>
>thanks,
>linfeng
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20  2:34             ` Lin Feng
  (?)
  (?)
@ 2013-02-20  2:44             ` Wanpeng Li
  -1 siblings, 0 replies; 88+ messages in thread
From: Wanpeng Li @ 2013-02-20  2:44 UTC (permalink / raw)
  To: Lin Feng
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote:
>
>
>On 02/19/2013 09:37 PM, Lin Feng wrote:
>>> > 
>>> > The other is that this almost certainly broken for transhuge page
>>> > handling. gup returns the head and tail pages and ordinarily this is ok
>> I can't find codes doing such things :(, could you please point me out?
>> 
>Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
>flee...

According to the compound page, the first page of compound page is
called head page, other sub pages are called tail pages.

Regards,
Wanpeng Li 

>
>thanks,
>linfeng
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20  2:34             ` Lin Feng
                               ` (3 preceding siblings ...)
  (?)
@ 2013-02-20  2:44             ` Wanpeng Li
  2013-02-20  2:59                 ` Lin Feng
  -1 siblings, 1 reply; 88+ messages in thread
From: Wanpeng Li @ 2013-02-20  2:44 UTC (permalink / raw)
  To: Lin Feng
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote:
>
>
>On 02/19/2013 09:37 PM, Lin Feng wrote:
>>> > 
>>> > The other is that this almost certainly broken for transhuge page
>>> > handling. gup returns the head and tail pages and ordinarily this is ok
>> I can't find codes doing such things :(, could you please point me out?
>> 
>Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
>flee...

According to the compound page, the first page of compound page is
called head page, other sub pages are called tail pages.

Regards,
Wanpeng Li 

>
>thanks,
>linfeng
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20  2:44             ` Wanpeng Li
@ 2013-02-20  2:59                 ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20  2:59 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Wanpeng,

On 02/20/2013 10:44 AM, Wanpeng Li wrote:
>> Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
>> >flee...
> According to the compound page, the first page of compound page is
> called head page, other sub pages are called tail pages.
> 
> Regards,
> Wanpeng Li 
> 
Yes, you are right, thanks for explaining.
I thought it just means only the last one of the compound pages..

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20  2:59                 ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20  2:59 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Wanpeng,

On 02/20/2013 10:44 AM, Wanpeng Li wrote:
>> Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
>> >flee...
> According to the compound page, the first page of compound page is
> called head page, other sub pages are called tail pages.
> 
> Regards,
> Wanpeng Li 
> 
Yes, you are right, thanks for explaining.
I thought it just means only the last one of the compound pages..

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-05 13:32         ` Mel Gorman
@ 2013-02-20  9:58           ` Simon Jeons
  -1 siblings, 0 replies; 88+ messages in thread
From: Simon Jeons @ 2013-02-20  9:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
>>>> +				migrate_pre_flag = 1;
>>>> +			}
>>>> +
>>>> +			if (!isolate_lru_page(pages[i])) {
>>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>>> +						 page_is_file_cache(pages[i]));
>>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>>> +			} else {
>>>> +				isolate_err = 1;
>>>> +				goto put_page;
>>>> +			}
>> isolate_lru_page() takes the LRU lock every time.
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.

As you said, hugetlbfs pages are not in LRU list, then how can encounter 
a hugetlbfs page and blow up?

>
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

When need gup thp? in kvm case?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
>
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20  9:58           ` Simon Jeons
  0 siblings, 0 replies; 88+ messages in thread
From: Simon Jeons @ 2013-02-20  9:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
>>>> +				migrate_pre_flag = 1;
>>>> +			}
>>>> +
>>>> +			if (!isolate_lru_page(pages[i])) {
>>>> +				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>>> +						 page_is_file_cache(pages[i]));
>>>> +				list_add_tail(&pages[i]->lru, &pagelist);
>>>> +			} else {
>>>> +				isolate_err = 1;
>>>> +				goto put_page;
>>>> +			}
>> isolate_lru_page() takes the LRU lock every time.
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.

As you said, hugetlbfs pages are not in LRU list, then how can encounter 
a hugetlbfs page and blow up?

>
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

When need gup thp? in kvm case?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
>
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20  9:58           ` Simon Jeons
  (?)
@ 2013-02-20 10:23             ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20 10:23 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Simon,

On 02/20/2013 05:58 PM, Simon Jeons wrote:
> 
>>
>> The other is that this almost certainly broken for transhuge page
>> handling. gup returns the head and tail pages and ordinarily this is ok
> 
> When need gup thp? in kvm case?

gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
We can't expect the pages have been allocated for user address space are thp or 
normal page. So we need to deal with them and I think it have nothing to do with kvm.

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20 10:23             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20 10:23 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Simon,

On 02/20/2013 05:58 PM, Simon Jeons wrote:
> 
>>
>> The other is that this almost certainly broken for transhuge page
>> handling. gup returns the head and tail pages and ordinarily this is ok
> 
> When need gup thp? in kvm case?

gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
We can't expect the pages have been allocated for user address space are thp or 
normal page. So we need to deal with them and I think it have nothing to do with kvm.

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20 10:23             ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20 10:23 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

Hi Simon,

On 02/20/2013 05:58 PM, Simon Jeons wrote:
> 
>>
>> The other is that this almost certainly broken for transhuge page
>> handling. gup returns the head and tail pages and ordinarily this is ok
> 
> When need gup thp? in kvm case?

gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
We can't expect the pages have been allocated for user address space are thp or 
normal page. So we need to deal with them and I think it have nothing to do with kvm.

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20 10:23             ` Lin Feng
@ 2013-02-20 11:31               ` Simon Jeons
  -1 siblings, 0 replies; 88+ messages in thread
From: Simon Jeons @ 2013-02-20 11:31 UTC (permalink / raw)
  To: Lin Feng
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On 02/20/2013 06:23 PM, Lin Feng wrote:
> Hi Simon,
>
> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>> The other is that this almost certainly broken for transhuge page
>>> handling. gup returns the head and tail pages and ordinarily this is ok
>> When need gup thp? in kvm case?
> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
> We can't expect the pages have been allocated for user address space are thp or
> normal page. So we need to deal with them and I think it have nothing to do with kvm.

Ok, I'm curious about userspace process call which funtion(will call 
gup) to pin pages except make_pages_present?

>
> thanks,
> linfeng


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20 11:31               ` Simon Jeons
  0 siblings, 0 replies; 88+ messages in thread
From: Simon Jeons @ 2013-02-20 11:31 UTC (permalink / raw)
  To: Lin Feng
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel

On 02/20/2013 06:23 PM, Lin Feng wrote:
> Hi Simon,
>
> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>> The other is that this almost certainly broken for transhuge page
>>> handling. gup returns the head and tail pages and ordinarily this is ok
>> When need gup thp? in kvm case?
> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
> We can't expect the pages have been allocated for user address space are thp or
> normal page. So we need to deal with them and I think it have nothing to do with kvm.

Ok, I'm curious about userspace process call which funtion(will call 
gup) to pin pages except make_pages_present?

>
> thanks,
> linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-20 11:31               ` Simon Jeons
  (?)
@ 2013-02-20 11:54                 ` Lin Feng
  -1 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20 11:54 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel



On 02/20/2013 07:31 PM, Simon Jeons wrote:
> On 02/20/2013 06:23 PM, Lin Feng wrote:
>> Hi Simon,
>>
>> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>>> The other is that this almost certainly broken for transhuge page
>>>> handling. gup returns the head and tail pages and ordinarily this is ok
>>> When need gup thp? in kvm case?
>> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
>> We can't expect the pages have been allocated for user address space are thp or
>> normal page. So we need to deal with them and I think it have nothing to do with kvm.
> 
> Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present?
No, userspace process doesn't pin any pages directly but through some syscalls like io_setup() indirectly
for other purpose because kernel can't pagefault and it have to keep the page alive.
Kernel wants to communicate with the userspace such as to notify some events so it need some sort of buffer
that both Kernel and User space can both access, which leads to so called pin pages by gup.  

thanks,
linfeng

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20 11:54                 ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20 11:54 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel



On 02/20/2013 07:31 PM, Simon Jeons wrote:
> On 02/20/2013 06:23 PM, Lin Feng wrote:
>> Hi Simon,
>>
>> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>>> The other is that this almost certainly broken for transhuge page
>>>> handling. gup returns the head and tail pages and ordinarily this is ok
>>> When need gup thp? in kvm case?
>> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
>> We can't expect the pages have been allocated for user address space are thp or
>> normal page. So we need to deal with them and I think it have nothing to do with kvm.
> 
> Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present?
No, userspace process doesn't pin any pages directly but through some syscalls like io_setup() indirectly
for other purpose because kernel can't pagefault and it have to keep the page alive.
Kernel wants to communicate with the userspace such as to notify some events so it need some sort of buffer
that both Kernel and User space can both access, which leads to so called pin pages by gup.  

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
@ 2013-02-20 11:54                 ` Lin Feng
  0 siblings, 0 replies; 88+ messages in thread
From: Lin Feng @ 2013-02-20 11:54 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken,
	kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency,
	laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel,
	linux-kernel



On 02/20/2013 07:31 PM, Simon Jeons wrote:
> On 02/20/2013 06:23 PM, Lin Feng wrote:
>> Hi Simon,
>>
>> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>>> The other is that this almost certainly broken for transhuge page
>>>> handling. gup returns the head and tail pages and ordinarily this is ok
>>> When need gup thp? in kvm case?
>> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory.
>> We can't expect the pages have been allocated for user address space are thp or
>> normal page. So we need to deal with them and I think it have nothing to do with kvm.
> 
> Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present?
No, userspace process doesn't pin any pages directly but through some syscalls like io_setup() indirectly
for other purpose because kernel can't pagefault and it have to keep the page alive.
Kernel wants to communicate with the userspace such as to notify some events so it need some sort of buffer
that both Kernel and User space can both access, which leads to so called pin pages by gup.  

thanks,
linfeng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2013-02-20 12:26 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-04 10:04 [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Lin Feng
2013-02-04 10:04 ` Lin Feng
2013-02-04 10:04 ` [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Lin Feng
2013-02-04 10:04   ` Lin Feng
2013-02-04 10:04   ` Lin Feng
2013-02-05  0:06   ` Andrew Morton
2013-02-05  0:06     ` Andrew Morton
2013-02-05  0:06     ` Andrew Morton
2013-02-05  0:18     ` Andrew Morton
2013-02-05  0:18       ` Andrew Morton
2013-02-05  3:09     ` Lin Feng
2013-02-05  3:09       ` Lin Feng
2013-02-05  3:09       ` Lin Feng
2013-02-05 21:13       ` Andrew Morton
2013-02-05 21:13         ` Andrew Morton
2013-02-05 21:13         ` Andrew Morton
2013-02-05 11:57     ` Mel Gorman
2013-02-05 11:57       ` Mel Gorman
2013-02-05 11:57       ` Mel Gorman
2013-02-05 13:32       ` Mel Gorman
2013-02-05 13:32         ` Mel Gorman
2013-02-05 13:32         ` Mel Gorman
2013-02-19 13:37         ` Lin Feng
2013-02-19 13:37           ` Lin Feng
2013-02-20  2:34           ` Lin Feng
2013-02-20  2:34             ` Lin Feng
2013-02-20  2:34             ` Lin Feng
2013-02-20  2:44             ` Wanpeng Li
2013-02-20  2:44             ` Wanpeng Li
2013-02-20  2:44             ` Wanpeng Li
2013-02-20  2:59               ` Lin Feng
2013-02-20  2:59                 ` Lin Feng
2013-02-20  9:58         ` Simon Jeons
2013-02-20  9:58           ` Simon Jeons
2013-02-20 10:23           ` Lin Feng
2013-02-20 10:23             ` Lin Feng
2013-02-20 10:23             ` Lin Feng
2013-02-20 11:31             ` Simon Jeons
2013-02-20 11:31               ` Simon Jeons
2013-02-20 11:54               ` Lin Feng
2013-02-20 11:54                 ` Lin Feng
2013-02-20 11:54                 ` Lin Feng
2013-02-06  2:26       ` Michel Lespinasse
2013-02-06  2:26         ` Michel Lespinasse
2013-02-06  2:26         ` Michel Lespinasse
2013-02-06 10:41         ` Mel Gorman
2013-02-06 10:41           ` Mel Gorman
2013-02-18 10:34       ` Lin Feng
2013-02-18 10:34         ` Lin Feng
2013-02-18 10:34         ` Lin Feng
2013-02-18 15:17         ` Mel Gorman
2013-02-18 15:17           ` Mel Gorman
2013-02-18 15:17           ` Mel Gorman
2013-02-19  9:55           ` Lin Feng
2013-02-19  9:55             ` Lin Feng
2013-02-19 10:34             ` Mel Gorman
2013-02-19 10:34               ` Mel Gorman
2013-02-19 10:34               ` Mel Gorman
2013-02-04 10:04 ` [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Lin Feng
2013-02-04 10:04   ` Lin Feng
2013-02-04 10:04   ` Lin Feng
2013-02-04 15:18   ` Jeff Moyer
2013-02-04 15:18     ` Jeff Moyer
2013-02-04 15:18     ` Jeff Moyer
2013-02-04 23:02     ` Zach Brown
2013-02-04 23:02       ` Zach Brown
2013-02-04 23:02       ` Zach Brown
2013-02-05  5:35       ` Lin Feng
2013-02-05  5:35         ` Lin Feng
2013-02-05  5:35         ` Lin Feng
2013-02-05  5:06     ` Lin Feng
2013-02-05  5:06       ` Lin Feng
2013-02-05  0:58 ` [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Minchan Kim
2013-02-05  0:58   ` Minchan Kim
2013-02-05  0:58   ` Minchan Kim
2013-02-05  4:42   ` Lin Feng
2013-02-05  4:42     ` Lin Feng
2013-02-05  5:25     ` Minchan Kim
2013-02-05  5:25       ` Minchan Kim
2013-02-05  5:25       ` Minchan Kim
2013-02-05  6:18       ` Lin Feng
2013-02-05  6:18         ` Lin Feng
2013-02-05  7:45         ` Minchan Kim
2013-02-05  7:45           ` Minchan Kim
2013-02-05  7:45           ` Minchan Kim
2013-02-05  8:27           ` Lin Feng
2013-02-05  8:27             ` Lin Feng
2013-02-05  8:27             ` Lin Feng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.