linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/10] Hugepage migration (v5)
@ 2010-09-08  1:19 Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Hi,

This is the 5th version of "hugepage migration" set.

Changes from v4 (mostly refactoring):
- remove unnecessary might_sleep() [3/10]
- define migrate_huge_pages() from copy of migrate_pages() [4/10]
- soft_offline_page() branches off to hugepage path. [8/10]

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault()
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-20 10:47   ` Mel Gorman
  2010-09-08  1:19 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

The PageHWPoison() check is there to keep a hwpoisoned page that remains in
the pagecache from being mapped into a process, so it should be done in the
"found in pagecache" branch, not in the common path.
Otherwise, metadata corruption occurs if a memory failure happens between
alloc_huge_page() and lock_page(), because the page fault fails while leaving
metadata changes behind (refcount, mapcount, etc.).

This patch moves the check into the "found in pagecache" branch and fixes the problem.

ChangeLog since v2:
- remove retry check in "new allocation" path.
- make description more detailed
- change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/hugetlb.c |   21 +++++++++------------
 1 files changed, 9 insertions(+), 12 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index cc5be78..6871b41 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2518,22 +2518,19 @@ retry:
 			hugepage_add_new_anon_rmap(page, vma, address);
 		}
 	} else {
+		/*
+		 * If memory error occurs between mmap() and fault, some process
+		 * don't have hwpoisoned swap entry for errored virtual address.
+		 * So we need to block hugepage fault by PG_hwpoison bit check.
+		 */
+		if (unlikely(PageHWPoison(page))) {
+			ret = VM_FAULT_HWPOISON;
+			goto backout_unlocked;
+		}
 		page_dup_rmap(page);
 	}
 
 	/*
-	 * Since memory error handler replaces pte into hwpoison swap entry
-	 * at the time of error handling, a process which reserved but not have
-	 * the mapping to the error hugepage does not have hwpoison swap entry.
-	 * So we need to block accesses from such a process by checking
-	 * PG_hwpoison bit here.
-	 */
-	if (unlikely(PageHWPoison(page))) {
-		ret = VM_FAULT_HWPOISON;
-		goto backout_unlocked;
-	}
-
-	/*
 	 * If we are going to COW a private mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that
 	 * any allocations necessary to record that reservation occur outside
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-20 10:59   ` Mel Gorman
  2010-09-22 21:05   ` Christoph Lameter
  2010-09-08  1:19 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

The existing hugepage allocation functions cannot be used to allocate a
hugepage for page migration, because page migration can happen asynchronously
with respect to the running processes, and page migration users have to call
the allocation function with physical addresses (not virtual addresses) as
arguments.
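
For illustration only, a minimal sketch of the intended call pattern (the
helper below is hypothetical and not part of this patch): a migration-side
caller starts from the physical page, derives its node, and asks for a
replacement hugepage on that node.

static struct page *alloc_replacement_hugepage(struct page *old_hpage)
{
	struct page *head = compound_head(old_hpage);

	/* no vma or virtual address is needed, only the physical page */
	return alloc_huge_page_node(page_hstate(head), page_to_nid(head));
}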

ChangeLog since v3:
- unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 include/linux/hugetlb.h |    3 ++
 mm/hugetlb.c            |   79 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 57 insertions(+), 25 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index f479700..0b73c53 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+struct page *alloc_huge_page_node(struct hstate *h, int nid);
+
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
@@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
 
 #else
 struct hstate {};
+#define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 6871b41..f526228 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -466,11 +466,23 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	h->free_huge_pages_node[nid]++;
 }
 
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	if (list_empty(&h->hugepage_freelists[nid]))
+		return NULL;
+	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+	list_del(&page->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	int nid;
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
@@ -496,19 +508,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		nid = zone_to_nid(zone);
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-
-			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(h, vma);
-
-			break;
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+			page = dequeue_huge_page_node(h, zone_to_nid(zone));
+			if (page) {
+				if (!avoid_reserve)
+					decrement_hugepage_resv_vma(h, vma);
+				break;
+			}
 		}
 	}
 err:
@@ -770,11 +776,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 {
 	struct page *page;
-	unsigned int nid;
+	unsigned int r_nid;
 
 	if (h->order >= MAX_ORDER)
 		return NULL;
@@ -812,9 +817,14 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
-					__GFP_REPEAT|__GFP_NOWARN,
-					huge_page_order(h));
+	if (nid == NUMA_NO_NODE)
+		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+				   __GFP_REPEAT|__GFP_NOWARN,
+				   huge_page_order(h));
+	else
+		page = alloc_pages_exact_node(nid,
+			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
 
 	if (page && arch_prepare_hugepage(page)) {
 		__free_pages(page, huge_page_order(h));
@@ -829,13 +839,13 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		 */
 		put_page_testzero(page);
 		VM_BUG_ON(page_count(page));
-		nid = page_to_nid(page);
+		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
 		 * We incremented the global counters already
 		 */
-		h->nr_huge_pages_node[nid]++;
-		h->surplus_huge_pages_node[nid]++;
+		h->nr_huge_pages_node[r_nid]++;
+		h->surplus_huge_pages_node[r_nid]++;
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
 		h->nr_huge_pages--;
@@ -848,6 +858,25 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 }
 
 /*
+ * This allocation function is useful in the context where vma is irrelevant.
+ * E.g. soft-offlining uses this function because it only cares physical
+ * address of error page.
+ */
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	spin_lock(&hugetlb_lock);
+	page = dequeue_huge_page_node(h, nid);
+	spin_unlock(&hugetlb_lock);
+
+	if (!page)
+		page = alloc_buddy_huge_page(h, nid);
+
+	return page;
+}
+
+/*
  * Increase the hugetlb pool such that it can accomodate a reservation
  * of size 'delta'.
  */
@@ -871,7 +900,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NULL, 0);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -1052,7 +1081,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(h, vma, addr);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-20 11:03   ` Mel Gorman
  2010-09-23 16:21   ` Christoph Lameter
  2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch introduces hugepage copy functions that take only the destination
and source hugepages as arguments, for later use by hugepage migration.
The old ones are renamed from copy_{gigantic,huge}_page() to
copy_user_{gigantic,huge}_page().
This naming convention is consistent with the one between copy_highpage()
and copy_user_highpage().
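
Purely as an illustrative sketch (the variable names are hypothetical, not
part of this patch), the two variants are meant to be used like this:

	/* migration path: no vma or faulting address is available */
	copy_huge_page(new_hpage, old_hpage);

	/* hugetlb COW fault path: the user copy still takes vma and address */
	copy_user_huge_page(new_page, old_page, address, vma);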

ChangeLog since v4:
- add blank line between local declaration and code
- remove unnecessary might_sleep()

ChangeLog since v2:
- change copy_huge_page() from macro to inline dummy function
  to avoid compile warning when !CONFIG_HUGETLB_PAGE.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |    4 ++++
 mm/hugetlb.c            |   45 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 0b73c53..9e51f77 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -44,6 +44,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 void __isolate_hwpoisoned_huge_page(struct page *page);
+void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -102,6 +103,9 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
 #define __isolate_hwpoisoned_huge_page(page)	0
+static inline void copy_huge_page(struct page *dst, struct page *src)
+{
+}
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index f526228..351f8d1 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -423,14 +423,14 @@ static void clear_huge_page(struct page *page,
 	}
 }
 
-static void copy_gigantic_page(struct page *dst, struct page *src,
+static void copy_user_gigantic_page(struct page *dst, struct page *src,
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
 	struct hstate *h = hstate_vma(vma);
 	struct page *dst_base = dst;
 	struct page *src_base = src;
-	might_sleep();
+
 	for (i = 0; i < pages_per_huge_page(h); ) {
 		cond_resched();
 		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
@@ -440,14 +440,15 @@ static void copy_gigantic_page(struct page *dst, struct page *src,
 		src = mem_map_next(src, src_base, i);
 	}
 }
-static void copy_huge_page(struct page *dst, struct page *src,
+
+static void copy_user_huge_page(struct page *dst, struct page *src,
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
 	struct hstate *h = hstate_vma(vma);
 
 	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src, addr, vma);
+		copy_user_gigantic_page(dst, src, addr, vma);
 		return;
 	}
 
@@ -458,6 +459,40 @@ static void copy_huge_page(struct page *dst, struct page *src,
 	}
 }
 
+static void copy_gigantic_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < pages_per_huge_page(h); ) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void copy_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+
+	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
+		copy_gigantic_page(dst, src);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page(h); i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -2415,7 +2450,7 @@ retry_avoidcopy:
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 
-	copy_huge_page(new_page, old_page, address, vma);
+	copy_user_huge_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	/*
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-20 11:10   ` Mel Gorman
  2010-09-23 16:52   ` Christoph Lameter
  2010-09-08  1:19 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoisoned_huge_page() Naoya Horiguchi
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends the page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining, which is
triggered by corrected memory errors (added by a later patch in this series).
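
For illustration, a minimal caller sketch of the new interface (both helper
functions here are hypothetical; the real in-tree user added later in this
series is soft_offline_huge_page()):

/* new_page_t callback: allocate a destination hugepage on the same node */
static struct page *new_hugepage(struct page *hpage, unsigned long private,
				 int **result)
{
	struct page *head = compound_head(hpage);

	return alloc_huge_page_node(page_hstate(head), page_to_nid(head));
}

static int migrate_one_hugepage(struct page *hpage)
{
	LIST_HEAD(pagelist);

	/* the caller is assumed to already hold a reference on hpage */
	list_add(&hpage->lru, &pagelist);
	return migrate_huge_pages(&pagelist, new_hugepage, 0, 0);
}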

Todo:
- There are other users of page migration, such as memory policy,
  memory hotplug and memory compaction.
  They are not ready for hugepage support yet.

ChangeLog since v4:
- define migrate_huge_pages()
- remove changes on isolation/putback_lru_page()

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 fs/hugetlbfs/inode.c    |   15 +++
 include/linux/migrate.h |   16 +++
 mm/hugetlb.c            |   18 ++++-
 mm/migrate.c            |  232 +++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 262 insertions(+), 19 deletions(-)

diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
index 6e5bd42..1f7ca50 100644
--- v2.6.36-rc2/fs/hugetlbfs/inode.c
+++ v2.6.36-rc2/fs/hugetlbfs/inode.c
@@ -31,6 +31,7 @@
 #include <linux/statfs.h>
 #include <linux/security.h>
 #include <linux/magic.h>
+#include <linux/migrate.h>
 
 #include <asm/uaccess.h>
 
@@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 	return 0;
 }
 
+static int hugetlbfs_migrate_page(struct address_space *mapping,
+				struct page *newpage, struct page *page)
+{
+	int rc;
+
+	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
+	if (rc)
+		return rc;
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
@@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
 	.write_begin	= hugetlbfs_write_begin,
 	.write_end	= hugetlbfs_write_end,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
+	.migratepage    = hugetlbfs_migrate_page,
 };
 
 
diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
index 7238231..3c1941e 100644
--- v2.6.36-rc2/include/linux/migrate.h
+++ v2.6.36-rc2/include/linux/migrate.h
@@ -14,6 +14,8 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, int offlining);
+extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+			unsigned long private, int offlining);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -23,12 +25,17 @@ extern int migrate_prep_local(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page);
 #else
 #define PAGE_MIGRATION 0
 
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, int offlining) { return -ENOSYS; }
+static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+		unsigned long private, int offlining) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
@@ -40,6 +47,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
 	return -ENOSYS;
 }
 
+static inline void migrate_page_copy(struct page *newpage,
+				     struct page *page) {}
+
+static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 351f8d1..55f3e2d 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2217,6 +2217,19 @@ nomem:
 	return -ENOMEM;
 }
 
+static int is_hugetlb_entry_migration(pte_t pte)
+{
+	swp_entry_t swp;
+
+	if (huge_pte_none(pte) || pte_present(pte))
+		return 0;
+	swp = pte_to_swp_entry(pte);
+	if (non_swap_entry(swp) && is_migration_entry(swp)) {
+		return 1;
+	} else
+		return 0;
+}
+
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;
@@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		if (unlikely(is_hugetlb_entry_migration(entry))) {
+			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			return 0;
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON;
 	}
 
diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
index 38e7cad..55dbc45 100644
--- v2.6.36-rc2/mm/migrate.c
+++ v2.6.36-rc2/mm/migrate.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/memcontrol.h>
 #include <linux/syscalls.h>
+#include <linux/hugetlb.h>
 #include <linux/gfp.h>
 
 #include "internal.h"
@@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte_t *ptep, pte;
  	spinlock_t *ptl;
 
- 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
-		goto out;
+	if (unlikely(PageHuge(new))) {
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep)
+			goto out;
+		ptl = &mm->page_table_lock;
+	} else {
+		pgd = pgd_offset(mm, addr);
+		if (!pgd_present(*pgd))
+			goto out;
 
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		goto out;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto out;
 
-	pmd = pmd_offset(pud, addr);
-	if (!pmd_present(*pmd))
-		goto out;
+		pmd = pmd_offset(pud, addr);
+		if (!pmd_present(*pmd))
+			goto out;
 
-	ptep = pte_offset_map(pmd, addr);
+		ptep = pte_offset_map(pmd, addr);
 
-	if (!is_swap_pte(*ptep)) {
-		pte_unmap(ptep);
-		goto out;
- 	}
+		if (!is_swap_pte(*ptep)) {
+			pte_unmap(ptep);
+			goto out;
+		}
+
+		ptl = pte_lockptr(mm, pmd);
+	}
 
- 	ptl = pte_lockptr(mm, pmd);
  	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
@@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
+	if (PageHuge(new))
+		pte = pte_mkhuge(pte);
 	flush_cache_page(vma, addr, pte_pfn(pte));
 	set_pte_at(mm, addr, ptep, pte);
 
-	if (PageAnon(new))
+	if (PageHuge(new)) {
+		if (PageAnon(new))
+			hugepage_add_anon_rmap(new, vma, addr);
+		else
+			page_dup_rmap(new);
+	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr);
 	else
 		page_add_file_rmap(new);
@@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 }
 
 /*
+ * The expected number of remaining references is the same as that
+ * of migrate_page_move_mapping().
+ */
+int migrate_huge_page_move_mapping(struct address_space *mapping,
+				   struct page *newpage, struct page *page)
+{
+	int expected_count;
+	void **pslot;
+
+	if (!mapping) {
+		if (page_count(page) != 1)
+			return -EAGAIN;
+		return 0;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree,
+					page_index(page));
+
+	expected_count = 2 + page_has_private(page);
+	if (page_count(page) != expected_count ||
+	    (struct page *)radix_tree_deref_slot(pslot) != page) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	if (!page_freeze_refs(page, expected_count)) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	get_page(newpage);
+
+	radix_tree_replace_slot(pslot, newpage);
+
+	page_unfreeze_refs(page, expected_count);
+
+	__put_page(page);
+
+	spin_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
+/*
  * Copy the page to its new location
  */
-static void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	copy_highpage(newpage, page);
+	if (PageHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
 
 	if (PageError(page))
 		SetPageError(newpage);
@@ -724,6 +788,92 @@ move_newpage:
 }
 
 /*
+ * Counterpart of unmap_and_move_page() for hugepage migration.
+ *
+ * This function doesn't wait the completion of hugepage I/O
+ * because there is no race between I/O and migration for hugepage.
+ * Note that currently hugepage I/O occurs only in direct I/O
+ * where no lock is held and PG_writeback is irrelevant,
+ * and writeback status of all subpages are counted in the reference
+ * count of the head page (i.e. if all subpages of a 2MB hugepage are
+ * under direct I/O, the reference of the head page is 512 and a bit more.)
+ * This means that when we try to migrate hugepage whose subpages are
+ * doing direct I/O, some references remain after try_to_unmap() and
+ * hugepage migration fails without data corruption.
+ *
+ * There is also no race when direct I/O is issued on the page under migration,
+ * because then pte is replaced with migration swap entry and direct I/O code
+ * will wait in the page fault for migration to complete.
+ */
+static int unmap_and_move_huge_page(new_page_t get_new_page,
+				unsigned long private, struct page *hpage,
+				int force, int offlining)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *new_hpage = get_new_page(hpage, private, &result);
+	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
+
+	if (!new_hpage)
+		return -ENOMEM;
+
+	rc = -EAGAIN;
+
+	if (!trylock_page(hpage)) {
+		if (!force)
+			goto out;
+		lock_page(hpage);
+	}
+
+	if (PageAnon(hpage)) {
+		rcu_read_lock();
+		rcu_locked = 1;
+
+		if (page_mapped(hpage)) {
+			anon_vma = page_anon_vma(hpage);
+			atomic_inc(&anon_vma->external_refcount);
+		}
+	}
+
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+	if (!page_mapped(hpage))
+		rc = move_to_new_page(new_hpage, hpage, 1);
+
+	if (rc)
+		remove_migration_ptes(hpage, hpage);
+
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+					    &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
+	if (rcu_locked)
+		rcu_read_unlock();
+out:
+	unlock_page(hpage);
+
+	if (rc != -EAGAIN) {
+		list_del(&hpage->lru);
+		put_page(hpage);
+	}
+
+	put_page(new_hpage);
+
+	if (result) {
+		if (rc)
+			*result = rc;
+		else
+			*result = page_to_nid(new_hpage);
+	}
+	return rc;
+}
+
+/*
  * migrate_pages
  *
  * The function takes one list of pages to migrate and a function
@@ -788,6 +938,52 @@ out:
 	return nr_failed + retry;
 }
 
+int migrate_huge_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private, int offlining)
+{
+	int retry = 1;
+	int nr_failed = 0;
+	int pass = 0;
+	struct page *page;
+	struct page *page2;
+	int rc;
+
+	for (pass = 0; pass < 10 && retry; pass++) {
+		retry = 0;
+
+		list_for_each_entry_safe(page, page2, from, lru) {
+			cond_resched();
+
+			rc = unmap_and_move_huge_page(get_new_page,
+					private, page, pass > 2, offlining);
+
+			switch(rc) {
+			case -ENOMEM:
+				goto out;
+			case -EAGAIN:
+				retry++;
+				break;
+			case 0:
+				break;
+			default:
+				/* Permanent failure */
+				nr_failed++;
+				break;
+			}
+		}
+	}
+	rc = 0;
+out:
+
+	list_for_each_entry_safe(page, page2, from, lru)
+		put_page(page);
+
+	if (rc)
+		return rc;
+
+	return nr_failed + retry;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoisoned_huge_page()
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-23 16:54   ` Christoph Lameter
  2010-09-08  1:19 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This check is necessary to avoid a race between dequeue and allocation,
which can cause a free hugepage to be dequeued twice and make the kernel unstable.
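
For illustration, the kind of interleaving this check closes (a sketch, not
a real trace):

/*
 * CPU0 (memory failure)                 CPU1 (hugepage allocation)
 * ---------------------                 --------------------------
 * me_huge_page()
 *                                       dequeue_huge_page_node()
 *                                         -> page leaves the freelist
 * dequeue_hwpoisoned_huge_page()
 *   list_del(&hpage->lru)               <- second dequeue of the same page
 *
 * With is_hugepage_on_freelist() checked under hugetlb_lock, the memory
 * failure side now notices the page is already gone and returns -EBUSY.
 */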

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   29 +++++++++++++++++++++++++----
 mm/memory-failure.c     |    6 ++++--
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 9e51f77..796f30e 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -43,7 +43,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-void __isolate_hwpoisoned_huge_page(struct page *page);
+int dequeue_hwpoisoned_huge_page(struct page *page);
 void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
@@ -102,7 +102,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
-#define __isolate_hwpoisoned_huge_page(page)	0
+#define dequeue_hwpoisoned_huge_page(page)	0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 55f3e2d..8948abc 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2953,18 +2953,39 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	hugetlb_acct_memory(h, -(chg - freed));
 }
 
+/* Should be called in hugetlb_lock */
+static int is_hugepage_on_freelist(struct page *hpage)
+{
+	struct page *page;
+	struct page *tmp;
+	struct hstate *h = page_hstate(hpage);
+	int nid = page_to_nid(hpage);
+
+	list_for_each_entry_safe(page, tmp, &h->hugepage_freelists[nid], lru)
+		if (page == hpage)
+			return 1;
+	return 0;
+}
+
+#ifdef CONFIG_MEMORY_FAILURE
 /*
  * This function is called from memory failure code.
  * Assume the caller holds page lock of the head page.
  */
-void __isolate_hwpoisoned_huge_page(struct page *hpage)
+int dequeue_hwpoisoned_huge_page(struct page *hpage)
 {
 	struct hstate *h = page_hstate(hpage);
 	int nid = page_to_nid(hpage);
+	int ret = -EBUSY;
 
 	spin_lock(&hugetlb_lock);
-	list_del(&hpage->lru);
-	h->free_huge_pages--;
-	h->free_huge_pages_node[nid]--;
+	if (is_hugepage_on_freelist(hpage)) {
+		list_del(&hpage->lru);
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ret = 0;
+	}
 	spin_unlock(&hugetlb_lock);
+	return ret;
 }
+#endif
diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index 9c26eec..c67f801 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -698,6 +698,7 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
+	int res = 0;
 	struct page *hpage = compound_head(p);
 	/*
 	 * We can safely recover from error on free or reserved (i.e.
@@ -710,8 +711,9 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 	 * so there is no race between isolation and mapping/unmapping.
 	 */
 	if (!(page_mapping(hpage) || PageAnon(hpage))) {
-		__isolate_hwpoisoned_huge_page(hpage);
-		return RECOVERED;
+		res = dequeue_hwpoisoned_huge_page(hpage);
+		if (!res)
+			return RECOVERED;
 	}
 	return DELAYED;
 }
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoisoned_huge_page() Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-23 17:12   ` Christoph Lameter
  2010-09-08  1:19 ` [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED Naoya Horiguchi
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently alloc_huge_page() raises the page refcount outside hugetlb_lock,
which causes a race when dequeue_hwpoisoned_huge_page() runs concurrently
with alloc_huge_page().
To avoid it, this patch moves set_page_refcounted() inside hugetlb_lock.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/hugetlb.c |   35 +++++++++++++----------------------
 1 files changed, 13 insertions(+), 22 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 8948abc..adb5dfa 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -509,6 +509,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 		return NULL;
 	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
 	list_del(&page->lru);
+	set_page_refcounted(page);
 	h->free_huge_pages--;
 	h->free_huge_pages_node[nid]--;
 	return page;
@@ -868,12 +869,6 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
-		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
-		 */
-		put_page_testzero(page);
-		VM_BUG_ON(page_count(page));
 		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
@@ -936,16 +931,13 @@ retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
-		if (!page) {
+		if (!page)
 			/*
 			 * We were not able to allocate enough pages to
 			 * satisfy the entire reservation so we free what
 			 * we've allocated so far.
 			 */
-			spin_lock(&hugetlb_lock);
-			needed = 0;
 			goto free;
-		}
 
 		list_add(&page->lru, &surplus_list);
 	}
@@ -972,31 +964,31 @@ retry:
 	needed += allocated;
 	h->resv_huge_pages += delta;
 	ret = 0;
-free:
+
+	spin_unlock(&hugetlb_lock);
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 		if ((--needed) < 0)
 			break;
 		list_del(&page->lru);
+		/*
+		 * This page is now managed by the hugetlb allocator and has
+		 * no users -- drop the buddy allocator's reference.
+		 */
+		put_page_testzero(page);
+		VM_BUG_ON(page_count(page));
 		enqueue_huge_page(h, page);
 	}
 
 	/* Free unnecessary surplus pages to the buddy allocator */
+free:
 	if (!list_empty(&surplus_list)) {
-		spin_unlock(&hugetlb_lock);
 		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 			list_del(&page->lru);
-			/*
-			 * The page has a reference count of zero already, so
-			 * call free_huge_page directly instead of using
-			 * put_page.  This must be done with hugetlb_lock
-			 * unlocked which is safe because free_huge_page takes
-			 * hugetlb_lock before deciding how to free the page.
-			 */
-			free_huge_page(page);
+			put_page(page);
 		}
-		spin_lock(&hugetlb_lock);
 	}
+	spin_lock(&hugetlb_lock);
 
 	return ret;
 }
@@ -1123,7 +1115,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		}
 	}
 
-	set_page_refcounted(page);
 	set_page_private(page, (unsigned long) mapping);
 
 	vma_commit_reservation(h, vma, addr);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently, error recovery for a free hugepage works only in the MF_COUNT_INCREASED case.
This patch enables the !MF_COUNT_INCREASED case as well.

Free hugepages can be handled directly by alloc_huge_page() and
dequeue_hwpoisoned_huge_page(), and both of them are protected
by hugetlb_lock, so there is no race between them.

Note that this patch defines the refcount of a HWPoisoned hugepage
dequeued from the freelist to be 1, instead of the previous 0, so that
we can avoid a race between unpoison and memory failure on a free hugepage.
This is reasonable because, unlike free buddy pages, a free hugepage is
still governed by hugetlbfs even after error handling finishes.
It also makes the unpoison code added in a later patch cleaner.
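
For illustration, the refcount convention this establishes (a summary sketch
of the description above, not new behavior):

/*
 * hugepage state                                          page_count()
 * --------------                                          ------------
 * free, sitting on the hugepage freelist                        0
 * hwpoisoned and removed from the freelist by
 * dequeue_hwpoisoned_huge_page()                                1
 *
 * so the get_page_unless_zero() in unpoison_memory() can tell a genuinely
 * free page (count 0) from a poisoned hugepage that error handling still
 * holds a reference on (count 1).
 */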

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/hugetlb.c        |    1 +
 mm/memory-failure.c |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index adb5dfa..79a049a 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2972,6 +2972,7 @@ int dequeue_hwpoisoned_huge_page(struct page *hpage)
 	spin_lock(&hugetlb_lock);
 	if (is_hugepage_on_freelist(hpage)) {
 		list_del(&hpage->lru);
+		set_page_refcounted(hpage);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
 		ret = 0;
diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index c67f801..dfeb8b8 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -983,7 +983,10 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * We need/can do nothing about count=0 pages.
 	 * 1) it's a free page, and therefore in safe hand:
 	 *    prep_new_page() will be the gate keeper.
-	 * 2) it's part of a non-compound high order page.
+	 * 2) it's a free hugepage, which is also safe:
+	 *    an affected hugepage will be dequeued from hugepage freelist,
+	 *    so there's no concern about reusing it ever after.
+	 * 3) it's part of a non-compound high order page.
 	 *    Implies some kernel user: cannot stop them from
 	 *    R/W the page; let's pray that the page has been
 	 *    used and will be freed some time later.
@@ -995,6 +998,24 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, "free buddy", DELAYED);
 			return 0;
+		} else if (PageHuge(hpage)) {
+			/*
+			 * Check "just unpoisoned", "filter hit", and
+			 * "race with other subpage."
+			 */
+			lock_page_nosync(hpage);
+			if (!PageHWPoison(hpage)
+			    || (hwpoison_filter(p) && TestClearPageHWPoison(p))
+			    || (p != hpage && TestSetPageHWPoison(hpage))) {
+				atomic_long_sub(nr_pages, &mce_bad_pages);
+				return 0;
+			}
+			set_page_hwpoison_huge_page(hpage);
+			res = dequeue_hwpoisoned_huge_page(hpage);
+			action_result(pfn, "free huge",
+				      res ? IGNORED : DELAYED);
+			unlock_page(hpage);
+			return res;
 		} else {
 			action_result(pfn, "high order kernel", IGNORED);
 			return -EBUSY;
@@ -1156,6 +1177,16 @@ int unpoison_memory(unsigned long pfn)
 	nr_pages = 1 << compound_order(page);
 
 	if (!get_page_unless_zero(page)) {
+		/*
+		 * Since HWPoisoned hugepage should have non-zero refcount,
+		 * race between memory failure and unpoison seems to happen.
+		 * In such case unpoison fails and memory failure runs
+		 * to the end.
+		 */
+		if (PageHuge(page)) {
+			pr_debug("MCE: Memory failure is now running on free hugepage %#lx\n", pfn);
+			return 0;
+		}
 		if (TestClearPageHWPoison(p))
 			atomic_long_sub(nr_pages, &mce_bad_pages);
 		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 09/10] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends the soft offlining framework to support hugepages.
When corrected memory errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating its data onto another hugepage
and disabling the original (possibly half-broken) one.
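
For illustration only, a userspace sketch of one way to exercise this path,
via the hwpoison madvise injector (MADV_SOFT_OFFLINE; needs CAP_SYS_ADMIN and
a kernel with memory failure support). The 2MB hugepage size, the fallback
defines and the assumption that poking one base page offlines the whole
hugepage are simplifications of this sketch, not guarantees of this patch:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101		/* asm-generic value, if libc lacks it */
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000		/* asm-generic value, if libc lacks it */
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB hugepages */

int main(void)
{
	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;	/* fault the hugepage in */

	/* soft-offline the hugepage backing the first base page */
	if (madvise(p, 4096, MADV_SOFT_OFFLINE))
		perror("madvise(MADV_SOFT_OFFLINE)");

	return 0;
}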

ChangeLog since v4:
- branch soft_offline_page() for hugepage

ChangeLog since v3:
- remove comment about "ToDo: hugepage soft-offline"

ChangeLog since v2:
- move refcount handling into isolate_lru_page()

ChangeLog since v1:
- add double check in isolating hwpoisoned hugepage
- define free/non-free checker for hugepage
- postpone calling put_page() for hugepage in soft_offline_page()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/memory-failure.c |   59 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 55 insertions(+), 4 deletions(-)

diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index dfeb8b8..1d0392d 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -693,8 +693,6 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
  * Issues:
  * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
  *   To narrow down kill region to one page, we need to break up pmd.
- * - To support soft-offlining for hugepage, we need to support hugepage
- *   migration.
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
@@ -1220,7 +1218,11 @@ EXPORT_SYMBOL(unpoison_memory);
 static struct page *new_page(struct page *p, unsigned long private, int **x)
 {
 	int nid = page_to_nid(p);
-	return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+	if (PageHuge(p))
+		return alloc_huge_page_node(page_hstate(compound_head(p)),
+						   nid);
+	else
+		return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
@@ -1248,8 +1250,15 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
 	 * was free.
 	 */
 	set_migratetype_isolate(p);
+	/*
+	 * When the target page is a free hugepage, just remove it
+	 * from free hugepage list.
+	 */
 	if (!get_page_unless_zero(compound_head(p))) {
-		if (is_free_buddy_page(p)) {
+		if (PageHuge(p)) {
+			pr_debug("get_any_page: %#lx free huge page\n", pfn);
+			ret = dequeue_hwpoisoned_huge_page(compound_head(p));
+		} else if (is_free_buddy_page(p)) {
 			pr_debug("get_any_page: %#lx free buddy page\n", pfn);
 			/* Set hwpoison bit while page is still isolated */
 			SetPageHWPoison(p);
@@ -1268,6 +1277,45 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
 	return ret;
 }
 
+static int soft_offline_huge_page(struct page *page, int flags)
+{
+	int ret;
+	unsigned long pfn = page_to_pfn(page);
+	struct page *hpage = compound_head(page);
+	LIST_HEAD(pagelist);
+
+	ret = get_any_page(page, pfn, flags);
+	if (ret < 0)
+		return ret;
+	if (ret == 0)
+		goto done;
+
+	if (PageHWPoison(hpage)) {
+		put_page(hpage);
+		pr_debug("soft offline: %#lx hugepage already poisoned\n", pfn);
+		return -EBUSY;
+	}
+
+	/* Keep page count to indicate a given hugepage is isolated. */
+
+	list_add(&hpage->lru, &pagelist);
+	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0);
+	if (ret) {
+		pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
+			 pfn, ret, page->flags);
+		if (ret > 0)
+			ret = -EIO;
+		return ret;
+	}
+done:
+	if (!PageHWPoison(hpage))
+		atomic_long_add(1 << compound_order(hpage), &mce_bad_pages);
+	set_page_hwpoison_huge_page(hpage);
+	dequeue_hwpoisoned_huge_page(hpage);
+	/* keep elevated page count for bad page */
+	return ret;
+}
+
 /**
  * soft_offline_page - Soft offline a page.
  * @page: page to offline
@@ -1295,6 +1343,9 @@ int soft_offline_page(struct page *page, int flags)
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
 
+	if (PageHuge(page))
+		return soft_offline_huge_page(page, flags);
+
 	ret = get_any_page(page, pfn, flags);
 	if (ret < 0)
 		return ret;
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/10] HWPOISON, hugetlb: fix unpoison for hugepage
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (7 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-08  1:19 ` [PATCH 10/10] page-types.c: fix name of unpoison interface Naoya Horiguchi
  2010-09-09 10:33 ` [PATCH 0/10] Hugepage migration (v5) Andi Kleen
  10 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently, unpoisoning hugepages doesn't work correctly because
clearing PG_HWPoison is done outside the if (TestClearPageHWPoison) block.
This patch fixes that.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/memory-failure.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index 1d0392d..483a59f 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -1202,9 +1202,9 @@ int unpoison_memory(unsigned long pfn)
 		pr_debug("MCE: Software-unpoisoned page %#lx\n", pfn);
 		atomic_long_sub(nr_pages, &mce_bad_pages);
 		freeit = 1;
+		if (PageHuge(page))
+			clear_page_hwpoison_huge_page(page);
 	}
-	if (PageHuge(p))
-		clear_page_hwpoison_huge_page(page);
 	unlock_page(page);
 
 	put_page(page);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/10] page-types.c: fix name of unpoison interface
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (8 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 09/10] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-09 10:33 ` [PATCH 0/10] Hugepage migration (v5) Andi Kleen
  10 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

debugfs:hwpoison/renew-pfn is the old name of the unpoison interface;
it is now called unpoison-pfn. This patch updates page-types.c accordingly.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/page-types.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git v2.6.36-rc2/Documentation/vm/page-types.c v2.6.36-rc2/Documentation/vm/page-types.c
index ccd951f..cc96ee2 100644
--- v2.6.36-rc2/Documentation/vm/page-types.c
+++ v2.6.36-rc2/Documentation/vm/page-types.c
@@ -478,7 +478,7 @@ static void prepare_hwpoison_fd(void)
 	}
 
 	if (opt_unpoison && !hwpoison_forget_fd) {
-		sprintf(buf, "%s/renew-pfn", hwpoison_debug_fs);
+		sprintf(buf, "%s/unpoison-pfn", hwpoison_debug_fs);
 		hwpoison_forget_fd = checked_open(buf, O_WRONLY);
 	}
 }
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/10] Hugepage migration (v5)
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
                   ` (9 preceding siblings ...)
  2010-09-08  1:19 ` [PATCH 10/10] page-types.c: fix name of unpoison interface Naoya Horiguchi
@ 2010-09-09 10:33 ` Andi Kleen
  2010-09-09 22:56   ` Naoya Horiguchi
  2010-09-20 11:14   ` Mel Gorman
  10 siblings, 2 replies; 36+ messages in thread
From: Andi Kleen @ 2010-09-09 10:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed,  8 Sep 2010 10:19:31 +0900
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Hi,
> 
> This is the 5th version of "hugepage migration" set.
> 
> Changes from v4 (mostly refactoring):
> - remove unnecessary might_sleep() [3/10]
> - define migrate_huge_pages() from copy of migrate_pages() [4/10]
> - soft_offline_page() branches off to hugepage path. [8/10]

I went over this patchkit again and it all looks good to me.
I plan to merge it through my hwpoison tree.

As far as I understand all earlier comments have been addressed
with this revision, correct?

Thanks for your work, this is very good.

But I would like to have some Acks from Christoph for the
page migration changes and from Mel for the hugetlb changes
outside memory-failure.c. Are the patches ok for you two? 
Can I have your Acked-by or Reviewed-by? 

Any other comments would be welcome too.

I am considering fast-tracking 10/10 (the page-types fix). 

I think the other bug fixes in the series are only for bugs added
earlier in the series, correct?

Thanks,
-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/10] Hugepage migration (v5)
  2010-09-09 10:33 ` [PATCH 0/10] Hugepage migration (v5) Andi Kleen
@ 2010-09-09 22:56   ` Naoya Horiguchi
  2010-09-20 11:14   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-09 22:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Thu, Sep 09, 2010 at 12:33:06PM +0200, Andi Kleen wrote:
> 
> I went over this patchkit again and it all looks good to me.
> I plan to merge it through my hwpoison tree.

Thank you.

> As far as I understand all earlier comments have been addressed
> with this revision, correct?

Yes. I've addressed all the comments I received.

> I am considering to fast track 10/10 (the page-types fix). 
> 
> I think the other bug fixes in the series are only for bugs added
> earlier in the series, correct?

Correct.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault()
  2010-09-08  1:19 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
@ 2010-09-20 10:47   ` Mel Gorman
  2010-09-22 20:41     ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2010-09-20 10:47 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 08, 2010 at 10:19:32AM +0900, Naoya Horiguchi wrote:
> Since the PageHWPoison() check is for avoiding hwpoisoned page remained
> in pagecache mapping to the process, it should be done in "found in pagecache"
> branch, not in the common path.
> Otherwise, metadata corruption occurs if memory failure happens between
> alloc_huge_page() and lock_page() because page fault fails with metadata
> changes remained (such as refcount, mapcount, etc.)
> 
> This patch moves the check to "found in pagecache" branch and fix the problem.
> 
> ChangeLog since v2:
> - remove retry check in "new allocation" path.
> - make description more detailed
> - change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Seems reasonable.

Acked-by: Mel Gorman <mel@csn.ul.ie>

> ---
>  mm/hugetlb.c |   21 +++++++++------------
>  1 files changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
> index cc5be78..6871b41 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -2518,22 +2518,19 @@ retry:
>  			hugepage_add_new_anon_rmap(page, vma, address);
>  		}
>  	} else {
> +		/*
> +		 * If memory error occurs between mmap() and fault, some process
> +		 * don't have hwpoisoned swap entry for errored virtual address.
> +		 * So we need to block hugepage fault by PG_hwpoison bit check.
> +		 */
> +		if (unlikely(PageHWPoison(page))) {
> +			ret = VM_FAULT_HWPOISON;
> +			goto backout_unlocked;
> +		}
>  		page_dup_rmap(page);
>  	}
>  
>  	/*
> -	 * Since memory error handler replaces pte into hwpoison swap entry
> -	 * at the time of error handling, a process which reserved but not have
> -	 * the mapping to the error hugepage does not have hwpoison swap entry.
> -	 * So we need to block accesses from such a process by checking
> -	 * PG_hwpoison bit here.
> -	 */
> -	if (unlikely(PageHWPoison(page))) {
> -		ret = VM_FAULT_HWPOISON;
> -		goto backout_unlocked;
> -	}
> -
> -	/*
>  	 * If we are going to COW a private mapping later, we examine the
>  	 * pending reservations for this page now. This will ensure that
>  	 * any allocations necessary to record that reservation occur outside
> -- 
> 1.7.2.2
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-08  1:19 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
@ 2010-09-20 10:59   ` Mel Gorman
  2010-09-22  4:41     ` Naoya Horiguchi
  2010-09-22 21:05   ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2010-09-20 10:59 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 08, 2010 at 10:19:33AM +0900, Naoya Horiguchi wrote:
> We can't use existing hugepage allocation functions to allocate hugepage
> for page migration, because page migration can happen asynchronously with
> the running processes and page migration users should call the allocation
> function with physical addresses (not virtual addresses) as arguments.
> 
> ChangeLog since v3:
> - unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()
> 
> ChangeLog since v2:
> - remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)
> 
> ChangeLog since v1:
> - add comment on top of alloc_huge_page_no_vma()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> ---
>  include/linux/hugetlb.h |    3 ++
>  mm/hugetlb.c            |   79 ++++++++++++++++++++++++++++++++---------------
>  2 files changed, 57 insertions(+), 25 deletions(-)
> 
> diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
> index f479700..0b73c53 100644
> --- v2.6.36-rc2/include/linux/hugetlb.h
> +++ v2.6.36-rc2/include/linux/hugetlb.h
> @@ -228,6 +228,8 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>  
> +struct page *alloc_huge_page_node(struct hstate *h, int nid);
> +
>  /* arch callback */
>  int __init alloc_bootmem_huge_page(struct hstate *h);
>  
> @@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
>  
>  #else
>  struct hstate {};
> +#define alloc_huge_page_node(h, nid) NULL
>  #define alloc_bootmem_huge_page(h) NULL
>  #define hstate_file(f) NULL
>  #define hstate_vma(v) NULL
> diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
> index 6871b41..f526228 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -466,11 +466,23 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
>  	h->free_huge_pages_node[nid]++;
>  }
>  
> +static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> +{
> +	struct page *page;
> +
> +	if (list_empty(&h->hugepage_freelists[nid]))
> +		return NULL;
> +	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
> +	list_del(&page->lru);
> +	h->free_huge_pages--;
> +	h->free_huge_pages_node[nid]--;
> +	return page;
> +}
> +

Ok, this is fine because all you are doing is splitting
dequeue_huge_page_vma()

>  static struct page *dequeue_huge_page_vma(struct hstate *h,
>  				struct vm_area_struct *vma,
>  				unsigned long address, int avoid_reserve)
>  {
> -	int nid;
>  	struct page *page = NULL;
>  	struct mempolicy *mpol;
>  	nodemask_t *nodemask;
> @@ -496,19 +508,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  						MAX_NR_ZONES - 1, nodemask) {
> -		nid = zone_to_nid(zone);
> -		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
> -		    !list_empty(&h->hugepage_freelists[nid])) {
> -			page = list_entry(h->hugepage_freelists[nid].next,
> -					  struct page, lru);
> -			list_del(&page->lru);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[nid]--;
> -
> -			if (!avoid_reserve)
> -				decrement_hugepage_resv_vma(h, vma);
> -
> -			break;
> +		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
> +			page = dequeue_huge_page_node(h, zone_to_nid(zone));
> +			if (page) {
> +				if (!avoid_reserve)
> +					decrement_hugepage_resv_vma(h, vma);
> +				break;
> +			}
>  		}
>  	}

This looks functionally equivalent so no problems there.

>  err:
> @@ -770,11 +776,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  	return ret;
>  }
>  
> -static struct page *alloc_buddy_huge_page(struct hstate *h,
> -			struct vm_area_struct *vma, unsigned long address)
> +static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>  {
>  	struct page *page;
> -	unsigned int nid;
> +	unsigned int r_nid;
>  

Why the rename, just to avoid changing the value of a function parameter?

>  	if (h->order >= MAX_ORDER)
>  		return NULL;
> @@ -812,9 +817,14 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  	}
>  	spin_unlock(&hugetlb_lock);
>  
> -	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> -					__GFP_REPEAT|__GFP_NOWARN,
> -					huge_page_order(h));
> +	if (nid == NUMA_NO_NODE)
> +		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> +				   __GFP_REPEAT|__GFP_NOWARN,
> +				   huge_page_order(h));
> +	else
> +		page = alloc_pages_exact_node(nid,
> +			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> +			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
>  

Why not just call alloc_pages_node()?

>  	if (page && arch_prepare_hugepage(page)) {
>  		__free_pages(page, huge_page_order(h));
> @@ -829,13 +839,13 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  		 */
>  		put_page_testzero(page);
>  		VM_BUG_ON(page_count(page));
> -		nid = page_to_nid(page);
> +		r_nid = page_to_nid(page);
>  		set_compound_page_dtor(page, free_huge_page);
>  		/*
>  		 * We incremented the global counters already
>  		 */
> -		h->nr_huge_pages_node[nid]++;
> -		h->surplus_huge_pages_node[nid]++;
> +		h->nr_huge_pages_node[r_nid]++;
> +		h->surplus_huge_pages_node[r_nid]++;
>  		__count_vm_event(HTLB_BUDDY_PGALLOC);
>  	} else {
>  		h->nr_huge_pages--;
> @@ -848,6 +858,25 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  }
>  
>  /*
> + * This allocation function is useful in the context where vma is irrelevant.
> + * E.g. soft-offlining uses this function because it only cares physical
> + * address of error page.
> + */
> +struct page *alloc_huge_page_node(struct hstate *h, int nid)
> +{
> +	struct page *page;
> +
> +	spin_lock(&hugetlb_lock);
> +	page = dequeue_huge_page_node(h, nid);
> +	spin_unlock(&hugetlb_lock);
> +
> +	if (!page)
> +		page = alloc_buddy_huge_page(h, nid);
> +
> +	return page;
> +}
> +
> +/*
>   * Increase the hugetlb pool such that it can accomodate a reservation
>   * of size 'delta'.
>   */
> @@ -871,7 +900,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
>  retry:
>  	spin_unlock(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
> -		page = alloc_buddy_huge_page(h, NULL, 0);
> +		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>  		if (!page) {
>  			/*
>  			 * We were not able to allocate enough pages to
> @@ -1052,7 +1081,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	spin_unlock(&hugetlb_lock);
>  
>  	if (!page) {
> -		page = alloc_buddy_huge_page(h, vma, addr);
> +		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>  		if (!page) {
>  			hugetlb_put_quota(inode->i_mapping, chg);
>  			return ERR_PTR(-VM_FAULT_SIGBUS);

Mostly it looks ok, the only snag I see is why you didn't use
alloc_pages_node().

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-08  1:19 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
@ 2010-09-20 11:03   ` Mel Gorman
  2010-09-20 11:15     ` Andi Kleen
  2010-09-23 16:21   ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2010-09-20 11:03 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 08, 2010 at 10:19:34AM +0900, Naoya Horiguchi wrote:
> This patch modifies hugepage copy functions to have only destination
> and source hugepages as arguments for later use.
> The old ones are renamed from copy_{gigantic,huge}_page() to
> copy_user_{gigantic,huge}_page().
> This naming convention is consistent with that between copy_highpage()
> and copy_user_highpage().
> 
> ChangeLog since v4:
> - add blank line between local declaration and code
> - remove unnecessary might_sleep()
> 
> ChangeLog since v2:
> - change copy_huge_page() from macro to inline dummy function
>   to avoid compile warning when !CONFIG_HUGETLB_PAGE.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h |    4 ++++
>  mm/hugetlb.c            |   45 ++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
> index 0b73c53..9e51f77 100644
> --- v2.6.36-rc2/include/linux/hugetlb.h
> +++ v2.6.36-rc2/include/linux/hugetlb.h
> @@ -44,6 +44,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>  						int acctflags);
>  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>  void __isolate_hwpoisoned_huge_page(struct page *page);
> +void copy_huge_page(struct page *dst, struct page *src);
>  
>  extern unsigned long hugepages_treat_as_movable;
>  extern const unsigned long hugetlb_zero, hugetlb_infinity;
> @@ -102,6 +103,9 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
>  #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
>  #define huge_pte_offset(mm, address)	0
>  #define __isolate_hwpoisoned_huge_page(page)	0
> +static inline void copy_huge_page(struct page *dst, struct page *src)
> +{
> +}
>  
>  #define hugetlb_change_protection(vma, address, end, newprot)
>  
> diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
> index f526228..351f8d1 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -423,14 +423,14 @@ static void clear_huge_page(struct page *page,
>  	}
>  }
>  
> -static void copy_gigantic_page(struct page *dst, struct page *src,
> +static void copy_user_gigantic_page(struct page *dst, struct page *src,
>  			   unsigned long addr, struct vm_area_struct *vma)
>  {
>  	int i;
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *dst_base = dst;
>  	struct page *src_base = src;
> -	might_sleep();
> +

Why is this check removed?

>  	for (i = 0; i < pages_per_huge_page(h); ) {
>  		cond_resched();
>  		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
> @@ -440,14 +440,15 @@ static void copy_gigantic_page(struct page *dst, struct page *src,
>  		src = mem_map_next(src, src_base, i);
>  	}
>  }
> -static void copy_huge_page(struct page *dst, struct page *src,
> +
> +static void copy_user_huge_page(struct page *dst, struct page *src,
>  			   unsigned long addr, struct vm_area_struct *vma)
>  {
>  	int i;
>  	struct hstate *h = hstate_vma(vma);
>  
>  	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
> -		copy_gigantic_page(dst, src, addr, vma);
> +		copy_user_gigantic_page(dst, src, addr, vma);
>  		return;
>  	}
>  
> @@ -458,6 +459,40 @@ static void copy_huge_page(struct page *dst, struct page *src,
>  	}
>  }
>  
> +static void copy_gigantic_page(struct page *dst, struct page *src)
> +{
> +	int i;
> +	struct hstate *h = page_hstate(src);
> +	struct page *dst_base = dst;
> +	struct page *src_base = src;
> +
> +	for (i = 0; i < pages_per_huge_page(h); ) {
> +		cond_resched();

Should this function not have a might_sleep() check too?

> +		copy_highpage(dst, src);
> +
> +		i++;
> +		dst = mem_map_next(dst, dst_base, i);
> +		src = mem_map_next(src, src_base, i);
> +	}
> +}
> +
> +void copy_huge_page(struct page *dst, struct page *src)
> +{
> +	int i;
> +	struct hstate *h = page_hstate(src);
> +
> +	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
> +		copy_gigantic_page(dst, src);
> +		return;
> +	}
> +
> +	might_sleep();
> +	for (i = 0; i < pages_per_huge_page(h); i++) {
> +		cond_resched();
> +		copy_highpage(dst + i, src + i);
> +	}
> +}
> +
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>  	int nid = page_to_nid(page);
> @@ -2415,7 +2450,7 @@ retry_avoidcopy:
>  	if (unlikely(anon_vma_prepare(vma)))
>  		return VM_FAULT_OOM;
>  
> -	copy_huge_page(new_page, old_page, address, vma);
> +	copy_user_huge_page(new_page, old_page, address, vma);
>  	__SetPageUptodate(new_page);
>  
>  	/*

Other than the removal of the might_sleep() check, this looks ok too.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
@ 2010-09-20 11:10   ` Mel Gorman
  2010-09-22  4:59     ` Naoya Horiguchi
  2010-09-23 16:52   ` Christoph Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2010-09-20 11:10 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 08, 2010 at 10:19:35AM +0900, Naoya Horiguchi wrote:
> This patch extends page migration code to support hugepage migration.
> One of the potential users of this feature is soft offlining which
> is triggered by memory corrected errors (added by the next patch.)
> 
> Todo:
> - there are other users of page migration such as memory policy,
>   memory hotplug and memocy compaction.
>   They are not ready for hugepage support for now.
> 
> ChangeLog since v4:
> - define migrate_huge_pages()
> - remove changes on isolation/putback_lru_page()
> 
> ChangeLog since v2:
> - refactor isolate/putback_lru_page() to handle hugepage
> - add comment about race on unmap_and_move_huge_page()
> 
> ChangeLog since v1:
> - divide migration code path for hugepage
> - define routine checking migration swap entry for hugetlb
> - replace "goto" with "if/else" in remove_migration_pte()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> ---
>  fs/hugetlbfs/inode.c    |   15 +++
>  include/linux/migrate.h |   16 +++
>  mm/hugetlb.c            |   18 ++++-
>  mm/migrate.c            |  232 +++++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 262 insertions(+), 19 deletions(-)
> 
> diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
> index 6e5bd42..1f7ca50 100644
> --- v2.6.36-rc2/fs/hugetlbfs/inode.c
> +++ v2.6.36-rc2/fs/hugetlbfs/inode.c
> @@ -31,6 +31,7 @@
>  #include <linux/statfs.h>
>  #include <linux/security.h>
>  #include <linux/magic.h>
> +#include <linux/migrate.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
>  	return 0;
>  }
>  
> +static int hugetlbfs_migrate_page(struct address_space *mapping,
> +				struct page *newpage, struct page *page)
> +{
> +	int rc;
> +
> +	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> +	if (rc)
> +		return rc;
> +	migrate_page_copy(newpage, page);
> +
> +	return 0;
> +}
> +
>  static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>  	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> @@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
>  	.write_begin	= hugetlbfs_write_begin,
>  	.write_end	= hugetlbfs_write_end,
>  	.set_page_dirty	= hugetlbfs_set_page_dirty,
> +	.migratepage    = hugetlbfs_migrate_page,
>  };
>  
>  
> diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
> index 7238231..3c1941e 100644
> --- v2.6.36-rc2/include/linux/migrate.h
> +++ v2.6.36-rc2/include/linux/migrate.h
> @@ -14,6 +14,8 @@ extern int migrate_page(struct address_space *,
>  			struct page *, struct page *);
>  extern int migrate_pages(struct list_head *l, new_page_t x,
>  			unsigned long private, int offlining);
> +extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> +			unsigned long private, int offlining);
>  
>  extern int fail_migrate_page(struct address_space *,
>  			struct page *, struct page *);
> @@ -23,12 +25,17 @@ extern int migrate_prep_local(void);
>  extern int migrate_vmas(struct mm_struct *mm,
>  		const nodemask_t *from, const nodemask_t *to,
>  		unsigned long flags);
> +extern void migrate_page_copy(struct page *newpage, struct page *page);
> +extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				  struct page *newpage, struct page *page);
>  #else
>  #define PAGE_MIGRATION 0
>  
>  static inline void putback_lru_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t x,
>  		unsigned long private, int offlining) { return -ENOSYS; }
> +static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> +		unsigned long private, int offlining) { return -ENOSYS; }
>  
>  static inline int migrate_prep(void) { return -ENOSYS; }
>  static inline int migrate_prep_local(void) { return -ENOSYS; }
> @@ -40,6 +47,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
>  	return -ENOSYS;
>  }
>  
> +static inline void migrate_page_copy(struct page *newpage,
> +				     struct page *page) {}
> +
> +extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				  struct page *newpage, struct page *page)
> +{
> +	return -ENOSYS;
> +}
> +
>  /* Possible settings for the migrate_page() method in address_operations */
>  #define migrate_page NULL
>  #define fail_migrate_page NULL
> diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
> index 351f8d1..55f3e2d 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -2217,6 +2217,19 @@ nomem:
>  	return -ENOMEM;
>  }
>  
> +static int is_hugetlb_entry_migration(pte_t pte)
> +{
> +	swp_entry_t swp;
> +
> +	if (huge_pte_none(pte) || pte_present(pte))
> +		return 0;
> +	swp = pte_to_swp_entry(pte);
> +	if (non_swap_entry(swp) && is_migration_entry(swp)) {
> +		return 1;
> +	} else
> +		return 0;
> +}
> +
>  static int is_hugetlb_entry_hwpoisoned(pte_t pte)
>  {
>  	swp_entry_t swp;
> @@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	ptep = huge_pte_offset(mm, address);
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> +		if (unlikely(is_hugetlb_entry_migration(entry))) {
> +			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			return 0;
> +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON;
>  	}
>  
> diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
> index 38e7cad..55dbc45 100644
> --- v2.6.36-rc2/mm/migrate.c
> +++ v2.6.36-rc2/mm/migrate.c
> @@ -32,6 +32,7 @@
>  #include <linux/security.h>
>  #include <linux/memcontrol.h>
>  #include <linux/syscalls.h>
> +#include <linux/hugetlb.h>
>  #include <linux/gfp.h>
>  
>  #include "internal.h"
> @@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte_t *ptep, pte;
>   	spinlock_t *ptl;
>  
> - 	pgd = pgd_offset(mm, addr);
> -	if (!pgd_present(*pgd))
> -		goto out;
> +	if (unlikely(PageHuge(new))) {
> +		ptep = huge_pte_offset(mm, addr);
> +		if (!ptep)
> +			goto out;
> +		ptl = &mm->page_table_lock;
> +	} else {
> +		pgd = pgd_offset(mm, addr);
> +		if (!pgd_present(*pgd))
> +			goto out;
>  
> -	pud = pud_offset(pgd, addr);
> -	if (!pud_present(*pud))
> -		goto out;
> +		pud = pud_offset(pgd, addr);
> +		if (!pud_present(*pud))
> +			goto out;
>  

Why are the changes to the rest of the walkers necessary? Instead, why
did you not identify which PTL lock you needed and then goto the point
where spin_lock(ptl) is called? Similar to what page_check_address()
does for example.

Otherwise, I did not spot anything that was obviously wrong.

> -	pmd = pmd_offset(pud, addr);
> -	if (!pmd_present(*pmd))
> -		goto out;
> +		pmd = pmd_offset(pud, addr);
> +		if (!pmd_present(*pmd))
> +			goto out;
>  
> -	ptep = pte_offset_map(pmd, addr);
> +		ptep = pte_offset_map(pmd, addr);
>  
> -	if (!is_swap_pte(*ptep)) {
> -		pte_unmap(ptep);
> -		goto out;
> - 	}
> +		if (!is_swap_pte(*ptep)) {
> +			pte_unmap(ptep);
> +			goto out;
> +		}
> +
> +		ptl = pte_lockptr(mm, pmd);
> +	}
>  
> - 	ptl = pte_lockptr(mm, pmd);
>   	spin_lock(ptl);
>  	pte = *ptep;
>  	if (!is_swap_pte(pte))
> @@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (is_write_migration_entry(entry))
>  		pte = pte_mkwrite(pte);
> +	if (PageHuge(new))
> +		pte = pte_mkhuge(pte);
>  	flush_cache_page(vma, addr, pte_pfn(pte));
>  	set_pte_at(mm, addr, ptep, pte);
>  
> -	if (PageAnon(new))
> +	if (PageHuge(new)) {
> +		if (PageAnon(new))
> +			hugepage_add_anon_rmap(new, vma, addr);
> +		else
> +			page_dup_rmap(new);
> +	} else if (PageAnon(new))
>  		page_add_anon_rmap(new, vma, addr);
>  	else
>  		page_add_file_rmap(new);
> @@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>  }
>  
>  /*
> + * The expected number of remaining references is the same as that
> + * of migrate_page_move_mapping().
> + */
> +int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				   struct page *newpage, struct page *page)
> +{
> +	int expected_count;
> +	void **pslot;
> +
> +	if (!mapping) {
> +		if (page_count(page) != 1)
> +			return -EAGAIN;
> +		return 0;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +
> +	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> +					page_index(page));
> +
> +	expected_count = 2 + page_has_private(page);
> +	if (page_count(page) != expected_count ||
> +	    (struct page *)radix_tree_deref_slot(pslot) != page) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	if (!page_freeze_refs(page, expected_count)) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	get_page(newpage);
> +
> +	radix_tree_replace_slot(pslot, newpage);
> +
> +	page_unfreeze_refs(page, expected_count);
> +
> +	__put_page(page);
> +
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return 0;
> +}
> +
> +/*
>   * Copy the page to its new location
>   */
> -static void migrate_page_copy(struct page *newpage, struct page *page)
> +void migrate_page_copy(struct page *newpage, struct page *page)
>  {
> -	copy_highpage(newpage, page);
> +	if (PageHuge(page))
> +		copy_huge_page(newpage, page);
> +	else
> +		copy_highpage(newpage, page);
>  
>  	if (PageError(page))
>  		SetPageError(newpage);
> @@ -724,6 +788,92 @@ move_newpage:
>  }
>  
>  /*
> + * Counterpart of unmap_and_move_page() for hugepage migration.
> + *
> + * This function doesn't wait the completion of hugepage I/O
> + * because there is no race between I/O and migration for hugepage.
> + * Note that currently hugepage I/O occurs only in direct I/O
> + * where no lock is held and PG_writeback is irrelevant,
> + * and writeback status of all subpages are counted in the reference
> + * count of the head page (i.e. if all subpages of a 2MB hugepage are
> + * under direct I/O, the reference of the head page is 512 and a bit more.)
> + * This means that when we try to migrate hugepage whose subpages are
> + * doing direct I/O, some references remain after try_to_unmap() and
> + * hugepage migration fails without data corruption.
> + *
> + * There is also no race when direct I/O is issued on the page under migration,
> + * because then pte is replaced with migration swap entry and direct I/O code
> + * will wait in the page fault for migration to complete.
> + */
> +static int unmap_and_move_huge_page(new_page_t get_new_page,
> +				unsigned long private, struct page *hpage,
> +				int force, int offlining)
> +{
> +	int rc = 0;
> +	int *result = NULL;
> +	struct page *new_hpage = get_new_page(hpage, private, &result);
> +	int rcu_locked = 0;
> +	struct anon_vma *anon_vma = NULL;
> +
> +	if (!new_hpage)
> +		return -ENOMEM;
> +
> +	rc = -EAGAIN;
> +
> +	if (!trylock_page(hpage)) {
> +		if (!force)
> +			goto out;
> +		lock_page(hpage);
> +	}
> +
> +	if (PageAnon(hpage)) {
> +		rcu_read_lock();
> +		rcu_locked = 1;
> +
> +		if (page_mapped(hpage)) {
> +			anon_vma = page_anon_vma(hpage);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
> +	}
> +
> +	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +
> +	if (!page_mapped(hpage))
> +		rc = move_to_new_page(new_hpage, hpage, 1);
> +
> +	if (rc)
> +		remove_migration_ptes(hpage, hpage);
> +
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
> +					    &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);
> +	}
> +
> +	if (rcu_locked)
> +		rcu_read_unlock();
> +out:
> +	unlock_page(hpage);
> +
> +	if (rc != -EAGAIN) {
> +		list_del(&hpage->lru);
> +		put_page(hpage);
> +	}
> +
> +	put_page(new_hpage);
> +
> +	if (result) {
> +		if (rc)
> +			*result = rc;
> +		else
> +			*result = page_to_nid(new_hpage);
> +	}
> +	return rc;
> +}
> +
> +/*
>   * migrate_pages
>   *
>   * The function takes one list of pages to migrate and a function
> @@ -788,6 +938,52 @@ out:
>  	return nr_failed + retry;
>  }
>  
> +int migrate_huge_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private, int offlining)
> +{
> +	int retry = 1;
> +	int nr_failed = 0;
> +	int pass = 0;
> +	struct page *page;
> +	struct page *page2;
> +	int rc;
> +
> +	for (pass = 0; pass < 10 && retry; pass++) {
> +		retry = 0;
> +
> +		list_for_each_entry_safe(page, page2, from, lru) {
> +			cond_resched();
> +
> +			rc = unmap_and_move_huge_page(get_new_page,
> +					private, page, pass > 2, offlining);
> +
> +			switch(rc) {
> +			case -ENOMEM:
> +				goto out;
> +			case -EAGAIN:
> +				retry++;
> +				break;
> +			case 0:
> +				break;
> +			default:
> +				/* Permanent failure */
> +				nr_failed++;
> +				break;
> +			}
> +		}
> +	}
> +	rc = 0;
> +out:
> +
> +	list_for_each_entry_safe(page, page2, from, lru)
> +		put_page(page);
> +
> +	if (rc)
> +		return rc;
> +
> +	return nr_failed + retry;
> +}
> +
>  #ifdef CONFIG_NUMA
>  /*
>   * Move a list of individual pages
> -- 
> 1.7.2.2
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/10] Hugepage migration (v5)
  2010-09-09 10:33 ` [PATCH 0/10] Hugepage migration (v5) Andi Kleen
  2010-09-09 22:56   ` Naoya Horiguchi
@ 2010-09-20 11:14   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2010-09-20 11:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Naoya Horiguchi, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Thu, Sep 09, 2010 at 12:33:06PM +0200, Andi Kleen wrote:
> On Wed,  8 Sep 2010 10:19:31 +0900
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Hi,
> > 
> > This is the 5th version of "hugepage migration" set.
> > 
> > Changes from v4 (mostly refactoring):
> > - remove unnecessary might_sleep() [3/10]
> > - define migrate_huge_pages() from copy of migrate_pages() [4/10]
> > - soft_offline_page() branches off to hugepage path. [8/10]
> 
> I went over this patchkit again and it all looks good to me.
> I plan to merge it through my hwpoison tree.
> 
> As far as I understand all earlier comments have been addressed
> with this revision, correct?
> 
> Thanks for your work, this is very good.
> 
> But I would like to have some Acks from Christoph for the
> page migration changes and from Mel for the hugetlb changes
> outside memory-failures.c. Are the patches ok for you two? 
> Can I have your Acked-by or Reviewed-by? 
> 

Sorry for taking so long to get back. I was snowed under by other work.
I've reviewed the bulk of the hugetlb changes that affect common paths.
There are a few small queries there but they are very minor. Once covered,
feel free to add my Acked-by. I didn't get the chance to actually test the
patches but they look ok.

> Any other comments would be welcome too.
> 
> I am considering to fast track 10/10 (the page-types fix). 
> 
> I think the other bug fixes in the series are only for bugs added
> earlier in the series, correct?
> 

That is what it looked like to me.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-20 11:03   ` Mel Gorman
@ 2010-09-20 11:15     ` Andi Kleen
  2010-09-20 11:18       ` Mel Gorman
  0 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2010-09-20 11:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Christoph Lameter,
	Wu Fengguang, Jun'ichi Nomura, linux-mm, LKML


>> +static void copy_gigantic_page(struct page *dst, struct page *src)
>> +{
>> +	int i;
>> +	struct hstate *h = page_hstate(src);
>> +	struct page *dst_base = dst;
>> +	struct page *src_base = src;
>> +
>> +	for (i = 0; i < pages_per_huge_page(h); ) {
>> +		cond_resched();
>
> Should this function not have a might_sleep() check too?

cond_resched() implies might_sleep() I believe. I think
that answers the earlier question too because that function
calls this.
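
For reference, the definition as I recall it from include/linux/sched.h
in this era is roughly (quoting from memory, so treat the exact body as
an approximation):

	#define cond_resched() ({			\
		__might_sleep(__FILE__, __LINE__, 0);	\
		_cond_resched();			\
	})

so every cond_resched() call site already gets the might_sleep()
debug check before the actual reschedule.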

>
> Other than the removal of the might_sleep() check, this looks ok too.

Can I assume an Ack?

Thanks,
-Andi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-20 11:15     ` Andi Kleen
@ 2010-09-20 11:18       ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2010-09-20 11:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Naoya Horiguchi, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Mon, Sep 20, 2010 at 01:15:44PM +0200, Andi Kleen wrote:
> 
> >> +static void copy_gigantic_page(struct page *dst, struct page *src)
> >> +{
> >> +	int i;
> >> +	struct hstate *h = page_hstate(src);
> >> +	struct page *dst_base = dst;
> >> +	struct page *src_base = src;
> >> +
> >> +	for (i = 0; i < pages_per_huge_page(h); ) {
> >> +		cond_resched();
> >
> > Should this function not have a might_sleep() check too?
> 
> cond_resched() implies might_sleep() I believe. I think
> that answers the earlier question too because that function
> calls this.
> 

You're right, cond_resched() calls might_sleep() so the additional check
is redundant.

> >
> > Other than the removal of the might_sleep() check, this looks ok too.
> 
> Can I assume an Ack?
> 

Yes.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-20 10:59   ` Mel Gorman
@ 2010-09-22  4:41     ` Naoya Horiguchi
  2010-09-22  8:37       ` Mel Gorman
  0 siblings, 1 reply; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-22  4:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Hi,

Thank you for your review.

On Mon, Sep 20, 2010 at 11:59:16AM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 10:19:33AM +0900, Naoya Horiguchi wrote:
...
> > @@ -770,11 +776,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> >  	return ret;
> >  }
> >  
> > -static struct page *alloc_buddy_huge_page(struct hstate *h,
> > -			struct vm_area_struct *vma, unsigned long address)
> > +static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> >  {
> >  	struct page *page;
> > -	unsigned int nid;
> > +	unsigned int r_nid;
> >  
> 
> Why the rename, just to avoid changing the value of a function parameter?

I think it's better to give the simple name to the function parameter
than to the internal variable, because the parameter gets more
attention from other developers than the internal variable does.

> >  	if (h->order >= MAX_ORDER)
> >  		return NULL;
> > @@ -812,9 +817,14 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
> >  	}
> >  	spin_unlock(&hugetlb_lock);
> >  
> > -	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> > -					__GFP_REPEAT|__GFP_NOWARN,
> > -					huge_page_order(h));
> > +	if (nid == NUMA_NO_NODE)
> > +		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> > +				   __GFP_REPEAT|__GFP_NOWARN,
> > +				   huge_page_order(h));
> > +	else
> > +		page = alloc_pages_exact_node(nid,
> > +			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> > +			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
> >  
> 
> Why not just call alloc_pages_node()?

Ah, we can bring together these two allocate functions.
I'll do it.
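
For reference, this works because (if I remember include/linux/gfp.h
correctly, so please double-check the exact body) alloc_pages_node()
treats a negative node id as the current node:

	static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
	{
		/* Unknown node is current node */
		if (nid < 0)
			nid = numa_node_id();

		return __alloc_pages(gfp_mask, order,
				     node_zonelist(nid, gfp_mask));
	}

so passing NUMA_NO_NODE keeps the old alloc_pages() behaviour, while a
real node id makes the allocation prefer that node.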

Here is a revised patch.

Thanks,
Naoya Horiguchi
---
Date: Wed, 22 Sep 2010 13:18:54 +0900
Subject: [PATCH 02/10] hugetlb: add allocate function for hugepage migration

We can't use existing hugepage allocation functions to allocate hugepage
for page migration, because page migration can happen asynchronously with
the running processes and page migration users should call the allocation
function with physical addresses (not virtual addresses) as arguments.

ChangeLog since v3:
- unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()
- bring together branched allocate functions

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 include/linux/hugetlb.h |    3 ++
 mm/hugetlb.c            |   70 +++++++++++++++++++++++++++++++---------------
 2 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f479700..0b73c53 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+struct page *alloc_huge_page_node(struct hstate *h, int nid);
+
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
@@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
 
 #else
 struct hstate {};
+#define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6871b41..cb32a32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -466,11 +466,23 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	h->free_huge_pages_node[nid]++;
 }
 
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	if (list_empty(&h->hugepage_freelists[nid]))
+		return NULL;
+	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+	list_del(&page->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	int nid;
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
@@ -496,19 +508,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		nid = zone_to_nid(zone);
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-
-			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(h, vma);
-
-			break;
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+			page = dequeue_huge_page_node(h, zone_to_nid(zone));
+			if (page) {
+				if (!avoid_reserve)
+					decrement_hugepage_resv_vma(h, vma);
+				break;
+			}
 		}
 	}
 err:
@@ -770,11 +776,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 {
 	struct page *page;
-	unsigned int nid;
+	unsigned int r_nid;
 
 	if (h->order >= MAX_ORDER)
 		return NULL;
@@ -812,7 +817,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+	page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|
 					__GFP_REPEAT|__GFP_NOWARN,
 					huge_page_order(h));
 
@@ -829,13 +834,13 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		 */
 		put_page_testzero(page);
 		VM_BUG_ON(page_count(page));
-		nid = page_to_nid(page);
+		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
 		 * We incremented the global counters already
 		 */
-		h->nr_huge_pages_node[nid]++;
-		h->surplus_huge_pages_node[nid]++;
+		h->nr_huge_pages_node[r_nid]++;
+		h->surplus_huge_pages_node[r_nid]++;
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
 		h->nr_huge_pages--;
@@ -848,6 +853,25 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 }
 
 /*
+ * This allocation function is useful in the context where vma is irrelevant.
+ * E.g. soft-offlining uses this function because it only cares physical
+ * address of error page.
+ */
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	spin_lock(&hugetlb_lock);
+	page = dequeue_huge_page_node(h, nid);
+	spin_unlock(&hugetlb_lock);
+
+	if (!page)
+		page = alloc_buddy_huge_page(h, nid);
+
+	return page;
+}
+
+/*
  * Increase the hugetlb pool such that it can accomodate a reservation
  * of size 'delta'.
  */
@@ -871,7 +895,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NULL, 0);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -1052,7 +1076,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(h, vma, addr);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-20 11:10   ` Mel Gorman
@ 2010-09-22  4:59     ` Naoya Horiguchi
  2010-09-22  8:40       ` Mel Gorman
  0 siblings, 1 reply; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-22  4:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Mon, Sep 20, 2010 at 12:10:55PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 10:19:35AM +0900, Naoya Horiguchi wrote:
...
> > @@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  	pte_t *ptep, pte;
> >   	spinlock_t *ptl;
> >  
> > - 	pgd = pgd_offset(mm, addr);
> > -	if (!pgd_present(*pgd))
> > -		goto out;
> > +	if (unlikely(PageHuge(new))) {
> > +		ptep = huge_pte_offset(mm, addr);
> > +		if (!ptep)
> > +			goto out;
> > +		ptl = &mm->page_table_lock;
> > +	} else {
> > +		pgd = pgd_offset(mm, addr);
> > +		if (!pgd_present(*pgd))
> > +			goto out;
> >  
> > -	pud = pud_offset(pgd, addr);
> > -	if (!pud_present(*pud))
> > -		goto out;
> > +		pud = pud_offset(pgd, addr);
> > +		if (!pud_present(*pud))
> > +			goto out;
> >  
> 
> > Why are the changes to the rest of the walkers necessary? Instead, why
> did you not identify which PTL lock you needed and then goto the point
> where spin_lock(ptl) is called? Similar to what page_check_address()
> does for example.

This is because Andi-san commented that we should avoid using goto
statements.  But honestly I'm not sure which way is clearer.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-22  4:41     ` Naoya Horiguchi
@ 2010-09-22  8:37       ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2010-09-22  8:37 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 22, 2010 at 01:41:51PM +0900, Naoya Horiguchi wrote:
> Hi,
> 
> Thank you for your review.
> 
> On Mon, Sep 20, 2010 at 11:59:16AM +0100, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 10:19:33AM +0900, Naoya Horiguchi wrote:
> ...
> > > @@ -770,11 +776,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> > >  	return ret;
> > >  }
> > >  
> > > -static struct page *alloc_buddy_huge_page(struct hstate *h,
> > > -			struct vm_area_struct *vma, unsigned long address)
> > > +static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> > >  {
> > >  	struct page *page;
> > > -	unsigned int nid;
> > > +	unsigned int r_nid;
> > >  
> > 
> > Why the rename, just to avoid changing the value of a function parameter?
> 
> I think it's better to give the simple name to the function parameter
> than to the internal variable, because the parameter gets more
> attention from other developers than the internal variable does.
> 

Ok.

> > >  	if (h->order >= MAX_ORDER)
> > >  		return NULL;
> > > @@ -812,9 +817,14 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
> > >  	}
> > >  	spin_unlock(&hugetlb_lock);
> > >  
> > > -	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> > > -					__GFP_REPEAT|__GFP_NOWARN,
> > > -					huge_page_order(h));
> > > +	if (nid == NUMA_NO_NODE)
> > > +		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> > > +				   __GFP_REPEAT|__GFP_NOWARN,
> > > +				   huge_page_order(h));
> > > +	else
> > > +		page = alloc_pages_exact_node(nid,
> > > +			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> > > +			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
> > >  
> > 
> > Why not just call alloc_pages_node()?
> 
> Ah, we can bring together these two allocate functions.
> I'll do it.
> 
> Here is a revised patch.
> 
> Thanks,
> Naoya Horiguchi
> ---
> Date: Wed, 22 Sep 2010 13:18:54 +0900
> Subject: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
> 
> We can't use existing hugepage allocation functions to allocate hugepage
> for page migration, because page migration can happen asynchronously with
> the running processes and page migration users should call the allocation
> function with physical addresses (not virtual addresses) as arguments.
> 
> ChangeLog since v3:
> - unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()
> - bring together branched allocate functions
> 
> ChangeLog since v2:
> - remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)
> 
> ChangeLog since v1:
> - add comment on top of alloc_huge_page_no_vma()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-22  4:59     ` Naoya Horiguchi
@ 2010-09-22  8:40       ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2010-09-22  8:40 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 22, 2010 at 01:59:07PM +0900, Naoya Horiguchi wrote:
> On Mon, Sep 20, 2010 at 12:10:55PM +0100, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 10:19:35AM +0900, Naoya Horiguchi wrote:
> ...
> > > @@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > >  	pte_t *ptep, pte;
> > >   	spinlock_t *ptl;
> > >  
> > > - 	pgd = pgd_offset(mm, addr);
> > > -	if (!pgd_present(*pgd))
> > > -		goto out;
> > > +	if (unlikely(PageHuge(new))) {
> > > +		ptep = huge_pte_offset(mm, addr);
> > > +		if (!ptep)
> > > +			goto out;
> > > +		ptl = &mm->page_table_lock;
> > > +	} else {
> > > +		pgd = pgd_offset(mm, addr);
> > > +		if (!pgd_present(*pgd))
> > > +			goto out;
> > >  
> > > -	pud = pud_offset(pgd, addr);
> > > -	if (!pud_present(*pud))
> > > -		goto out;
> > > +		pud = pud_offset(pgd, addr);
> > > +		if (!pud_present(*pud))
> > > +			goto out;
> > >  
> > 
> > > Why are the changes to the rest of the walkers necessary? Instead, why
> > did you not identify which PTL lock you needed and then goto the point
> > where spin_lock(ptl) is called? Similar to what page_check_address()
> > does for example.
> 
> This is because Andi-san commented that we should avoid using goto
> statements.  But honestly I'm not sure which way is clearer.
> 

Ok, personally I would prefer to have matched page_check_address() so we use
similar code patterns when the intention is the same, but as it seems
functionally fine, I'm not going to fight over it either.
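
Something along these lines is what I had in mind, purely as an
untested sketch (the label name is made up):

	if (unlikely(PageHuge(new))) {
		ptep = huge_pte_offset(mm, addr);
		if (!ptep)
			goto out;
		ptl = &mm->page_table_lock;
		goto found;
	}

	/* ... existing pgd/pud/pmd walk unchanged ... */

	ptl = pte_lockptr(mm, pmd);
found:
	spin_lock(ptl);

i.e. only the lookup differs and the locking path stays shared, so the
normal walkers would not need to be reindented.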

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault()
  2010-09-20 10:47   ` Mel Gorman
@ 2010-09-22 20:41     ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-09-22 20:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML


Reviewed-by: Christoph Lameter <cl@linux.com>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-08  1:19 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
  2010-09-20 10:59   ` Mel Gorman
@ 2010-09-22 21:05   ` Christoph Lameter
  2010-09-23  8:49     ` Mel Gorman
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-09-22 21:05 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML, tony.luck

On Wed, 8 Sep 2010, Naoya Horiguchi wrote:

> We can't use existing hugepage allocation functions to allocate hugepage
> for page migration, because page migration can happen asynchronously with
> the running processes and page migration users should call the allocation
> function with physical addresses (not virtual addresses) as arguments.

Ummm... Some arches like IA64 need huge pages fixed at certain virtual
addresses in which only huge pages exist. A vma is needed in order to be
able to assign a proper virtual address to the page.

How does that work with transparent huge pages anyways?

This looks like it's going to break IA64 hugepage support for good. Maybe
that's okay given the reduced significance of IA64? Certainly would
simplify the code.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-22 21:05   ` Christoph Lameter
@ 2010-09-23  8:49     ` Mel Gorman
  2010-09-23 16:02       ` Christoph Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2010-09-23  8:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML, tony.luck

On Wed, Sep 22, 2010 at 04:05:47PM -0500, Christoph Lameter wrote:
> On Wed, 8 Sep 2010, Naoya Horiguchi wrote:
> 
> > We can't use existing hugepage allocation functions to allocate hugepage
> > for page migration, because page migration can happen asynchronously with
> > the running processes and page migration users should call the allocation
> > function with physical addresses (not virtual addresses) as arguments.
> 
> Ummm... Some arches like IA64 need huge pages fixed at certain virtual
> addresses in which only huge pages exist. A vma is needed in order to be
> able to assign proper virtual address to the page.
> 

Are you sure about this case? The virtual address of the page being migrated
should not change, only the physical address.

> How does that work with transparent huge pages anyways?
> 

IA-64 doesn't support transparent huge pages. Even if it did, this
change is about hugetlbfs, not transparent huge page support.

> This looks like it's going to break IA64 hugepage support for good.

How?

> Maybe
> that's okay given the reduced significance of IA64? Certainly would
> simplify the code.
> 

Currently I'm not seeing how IA-64 gets broken.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-23  8:49     ` Mel Gorman
@ 2010-09-23 16:02       ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-09-23 16:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Naoya Horiguchi, Andi Kleen, Andrew Morton, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML, tony.luck

On Thu, 23 Sep 2010, Mel Gorman wrote:

> Are you sure about this case? The virtual address of the page being migrated
> should not changed, only the physical address.

Right. I see that he just extracts a portion of the function. Semantics
of the alloc functions are indeed preserved.

Reviewed-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-08  1:19 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
  2010-09-20 11:03   ` Mel Gorman
@ 2010-09-23 16:21   ` Christoph Lameter
  2010-09-24  3:24     ` Naoya Horiguchi
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-09-23 16:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, 8 Sep 2010, Naoya Horiguchi wrote:

> This patch modifies hugepage copy functions to have only destination
> and source hugepages as arguments for later use.
> The old ones are renamed from copy_{gigantic,huge}_page() to
> copy_user_{gigantic,huge}_page().
> This naming convention is consistent with that between copy_highpage()
> and copy_user_highpage().

Looking at copy_user_highpage(): The vma parameter does not seem to be
used anywhere anymore? The vaddr is used on arches that have virtual
caching.

Maybe removing the vma parameter would allow simplifying the hugetlb
code?
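
The generic !__HAVE_ARCH_COPY_USER_HIGHPAGE version is, as far as I
remember (so check the exact body against the tree), something like:

	static inline void copy_user_highpage(struct page *to, struct page *from,
		unsigned long vaddr, struct vm_area_struct *vma)
	{
		char *vfrom, *vto;

		vfrom = kmap_atomic(from, KM_USER0);
		vto = kmap_atomic(to, KM_USER1);
		copy_user_page(vto, vfrom, vaddr, to);
		kunmap_atomic(vfrom, KM_USER0);
		kunmap_atomic(vto, KM_USER1);
	}

where vma is never touched; only arch-specific overrides could want it.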

Reviewed-by: Christoph Lameter <cl@linux.com>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
  2010-09-20 11:10   ` Mel Gorman
@ 2010-09-23 16:52   ` Christoph Lameter
  2010-09-24  5:58     ` Naoya Horiguchi
  1 sibling, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-09-23 16:52 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, 8 Sep 2010, Naoya Horiguchi wrote:


> +static int hugetlbfs_migrate_page(struct address_space *mapping,
> +				struct page *newpage, struct page *page)
> +{
> +	int rc;
> +
> +	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> +	if (rc)
> +		return rc;
> +	migrate_page_copy(newpage, page);
> +
> +	return 0;
> +}
> +
>  static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>  	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> @@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
>  	.write_begin	= hugetlbfs_write_begin,
>  	.write_end	= hugetlbfs_write_end,
>  	.set_page_dirty	= hugetlbfs_set_page_dirty,
> +	.migratepage    = hugetlbfs_migrate_page,
>  };

Very straightforward conversion of innermost piece to huge pages. Good.

If migrate_page_move_mapping() handled huge pages like migrate_page_copy()
apparently does (a bit surprising), then we could save ourselves this
function?

> index 351f8d1..55f3e2d 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -2217,6 +2217,19 @@ nomem:
>  	return -ENOMEM;
>  }
>
> +static int is_hugetlb_entry_migration(pte_t pte)
> +{
> +	swp_entry_t swp;
> +
> +	if (huge_pte_none(pte) || pte_present(pte))
> +		return 0;
> +	swp = pte_to_swp_entry(pte);
> +	if (non_swap_entry(swp) && is_migration_entry(swp)) {
> +		return 1;
> +	} else
> +		return 0;
> +}

Ok, that implies that to some extent swap must be supported for this case
in the core vm?

> @@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	ptep = huge_pte_offset(mm, address);
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> +		if (unlikely(is_hugetlb_entry_migration(entry))) {
> +			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			return 0;
> +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON;

Now we have here in hugetlb_fault() a copy of the migration wait logic
from do_swap_page(). Hopefully the rest of the VM cannot inadvertently
encounter such a migration entry elsewhere?
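
The do_swap_page() logic I mean is roughly (from memory, details
elided):

	entry = pte_to_swp_entry(orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			migration_entry_wait(mm, pmd, address);
		} else if (is_hwpoison_entry(entry)) {
			ret = VM_FAULT_HWPOISON;
		}
		...
		goto out;
	}

so the hugetlb variant mirrors it, with the (pmd_t *) cast standing in
for the real pmd pointer.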


> diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c

> @@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (is_write_migration_entry(entry))
>  		pte = pte_mkwrite(pte);
> +	if (PageHuge(new))
> +		pte = pte_mkhuge(pte);
>  	flush_cache_page(vma, addr, pte_pfn(pte));
>  	set_pte_at(mm, addr, ptep, pte);
>
> -	if (PageAnon(new))
> +	if (PageHuge(new)) {
> +		if (PageAnon(new))
> +			hugepage_add_anon_rmap(new, vma, addr);
> +		else
> +			page_dup_rmap(new);

dup_rmap? What are non-anon huge pages used for? That is not a file-backed
huge page, right?


> @@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>  }
>
>  /*
> + * The expected number of remaining references is the same as that
> + * of migrate_page_move_mapping().
> + */
> +int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				   struct page *newpage, struct page *page)
> +{
> +	int expected_count;
> +	void **pslot;
> +
> +	if (!mapping) {
> +		if (page_count(page) != 1)
> +			return -EAGAIN;
> +		return 0;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +
> +	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> +					page_index(page));
> +
> +	expected_count = 2 + page_has_private(page);
> +	if (page_count(page) != expected_count ||
> +	    (struct page *)radix_tree_deref_slot(pslot) != page) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	if (!page_freeze_refs(page, expected_count)) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	get_page(newpage);
> +
> +	radix_tree_replace_slot(pslot, newpage);
> +
> +	page_unfreeze_refs(page, expected_count);
> +
> +	__put_page(page);
> +
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return 0;
> +}

That's a pretty accurate copy of migrate_page_move_mapping(). Why are the
counter updates missing at the end? This also suggests that the two
functions could be merged into one.
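
The tail I am referring to in migrate_page_move_mapping() is roughly
(from memory, so verify against the tree):

	__dec_zone_page_state(page, NR_FILE_PAGES);
	__inc_zone_page_state(newpage, NR_FILE_PAGES);
	if (PageSwapBacked(page)) {
		__dec_zone_page_state(page, NR_SHMEM);
		__inc_zone_page_state(newpage, NR_SHMEM);
	}

	spin_unlock_irq(&mapping->tree_lock);

which has no equivalent in the hugepage version above.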


> @@ -724,6 +788,92 @@ move_newpage:
>  }
>
>  /*
> + * Counterpart of unmap_and_move_page() for hugepage migration.
> + *
> + * This function doesn't wait the completion of hugepage I/O
> + * because there is no race between I/O and migration for hugepage.
> + * Note that currently hugepage I/O occurs only in direct I/O
> + * where no lock is held and PG_writeback is irrelevant,
> + * and writeback status of all subpages are counted in the reference
> + * count of the head page (i.e. if all subpages of a 2MB hugepage are
> + * under direct I/O, the reference of the head page is 512 and a bit more.)
> + * This means that when we try to migrate hugepage whose subpages are
> + * doing direct I/O, some references remain after try_to_unmap() and
> + * hugepage migration fails without data corruption.
> + *
> + * There is also no race when direct I/O is issued on the page under migration,
> + * because then pte is replaced with migration swap entry and direct I/O code
> + * will wait in the page fault for migration to complete.
> + */
> +static int unmap_and_move_huge_page(new_page_t get_new_page,
> +				unsigned long private, struct page *hpage,
> +				int force, int offlining)
> +{
> +	int rc = 0;
> +	int *result = NULL;
> +	struct page *new_hpage = get_new_page(hpage, private, &result);
> +	int rcu_locked = 0;
> +	struct anon_vma *anon_vma = NULL;
> +
> +	if (!new_hpage)
> +		return -ENOMEM;
> +
> +	rc = -EAGAIN;
> +
> +	if (!trylock_page(hpage)) {
> +		if (!force)
> +			goto out;
> +		lock_page(hpage);
> +	}
> +
> +	if (PageAnon(hpage)) {
> +		rcu_read_lock();
> +		rcu_locked = 1;
> +
> +		if (page_mapped(hpage)) {
> +			anon_vma = page_anon_vma(hpage);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
> +	}
> +
> +	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +
> +	if (!page_mapped(hpage))
> +		rc = move_to_new_page(new_hpage, hpage, 1);
> +
> +	if (rc)
> +		remove_migration_ptes(hpage, hpage);
> +
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
> +					    &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);

Hmmm.. The anon_vma dropping looks different? Why can't we use
drop_anon_vma() like in unmap_and_move()? Also we do not take the root lock
like drop_anon_vma() does.

> @@ -788,6 +938,52 @@ out:
>  	return nr_failed + retry;
>  }
>
> +int migrate_huge_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private, int offlining)
> +{
> +	int retry = 1;
> +	int nr_failed = 0;
> +	int pass = 0;
> +	struct page *page;
> +	struct page *page2;
> +	int rc;
> +
> +	for (pass = 0; pass < 10 && retry; pass++) {
> +		retry = 0;
> +
> +		list_for_each_entry_safe(page, page2, from, lru) {
> +			cond_resched();
> +
> +			rc = unmap_and_move_huge_page(get_new_page,
> +					private, page, pass > 2, offlining);
> +
> +			switch(rc) {
> +			case -ENOMEM:
> +				goto out;
> +			case -EAGAIN:
> +				retry++;
> +				break;
> +			case 0:
> +				break;
> +			default:
> +				/* Permanent failure */
> +				nr_failed++;
> +				break;
> +			}
> +		}
> +	}
> +	rc = 0;
> +out:
> +
> +	list_for_each_entry_safe(page, page2, from, lru)
> +		put_page(page);
> +
> +	if (rc)
> +		return rc;
> +
> +	return nr_failed + retry;
> +}

A copy of migrate_pages(), with putback_lru_pages() omitted, as also proposed
for upstream in a recent discussion.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
  2010-09-08  1:19 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() Naoya Horiguchi
@ 2010-09-23 16:54   ` Christoph Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2010-09-23 16:54 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML


Reviewed-by: Christoph Lameter <cl@linux.com>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
  2010-09-08  1:19 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
@ 2010-09-23 17:12   ` Christoph Lameter
  2010-09-24  6:47     ` Naoya Horiguchi
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2010-09-23 17:12 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, 8 Sep 2010, Naoya Horiguchi wrote:

> Currently alloc_huge_page() raises the page refcount outside hugetlb_lock,
> but this causes a race when dequeue_hwpoison_huge_page() runs concurrently
> with alloc_huge_page().
> To avoid it, this patch moves set_page_refcounted() inside hugetlb_lock.

Reviewed-by: Christoph Lameter <cl@linux.com>

One wonders, though, how many other of these huge races are still there.

"Normal" page migration is based on LRU isolation and therefore does not
suffer from these problems on allocation, since the page is not yet on the
LRU. LRU isolation is also a known issue, since memory reclaim does the same
thing. This protection goes away if one goes directly to a page without going
through the LRU. That should create more races...


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-23 16:21   ` Christoph Lameter
@ 2010-09-24  3:24     ` Naoya Horiguchi
  0 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-24  3:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Thu, Sep 23, 2010 at 11:21:25AM -0500, Christoph Lameter wrote:
> On Wed, 8 Sep 2010, Naoya Horiguchi wrote:
> 
> > This patch modifies hugepage copy functions to have only destination
> > and source hugepages as arguments for later use.
> > The old ones are renamed from copy_{gigantic,huge}_page() to
> > copy_user_{gigantic,huge}_page().
> > This naming convention is consistent with that between copy_highpage()
> > and copy_user_highpage().
> 
> Looking at copy_user_highpage(): The vma parameter does not seem to be
> used anywhere anymore? The vaddr is used on arches that have virtual
> caching.
> 
> Maybe removing the vma parameter would allow us to simplify the hugetlb
> code?

That's right.
I'll do this cleanup (although it may be separate from this patchset).

> Reviewed-by: Christoph Lameter <cl@linux.com>

Thank you.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-23 16:52   ` Christoph Lameter
@ 2010-09-24  5:58     ` Naoya Horiguchi
  0 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-24  5:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Thank you for your review.

On Thu, Sep 23, 2010 at 11:52:26AM -0500, Christoph Lameter wrote:
> On Wed, 8 Sep 2010, Naoya Horiguchi wrote:
> 
> 
> > +static int hugetlbfs_migrate_page(struct address_space *mapping,
> > +				struct page *newpage, struct page *page)
> > +{
> > +	int rc;
> > +
> > +	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > +	if (rc)
> > +		return rc;
> > +	migrate_page_copy(newpage, page);
> > +
> > +	return 0;
> > +}
> > +
> >  static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> >  {
> >  	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> > @@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
> >  	.write_begin	= hugetlbfs_write_begin,
> >  	.write_end	= hugetlbfs_write_end,
> >  	.set_page_dirty	= hugetlbfs_set_page_dirty,
> > +	.migratepage    = hugetlbfs_migrate_page,
> >  };
> 
> Very straightforward conversion of the innermost piece to huge pages. Good.
> 
> If migrate_page_move_mapping() handled huge pages the way migrate_page_copy()
> apparently does (a bit surprising), then we could save ourselves this
> function?

Yes, that sounds nice.
I'll do it in the next version.
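
If migrate_page_move_mapping() handles hugepages itself, the hugetlbfs
callback can go away and the aops can point at the generic helper.
Roughly (a hypothetical sketch only, member names taken from the existing
hugetlbfs_aops):

	/* Hypothetical end state: no hugetlbfs_migrate_page() needed,
	 * the generic migrate_page() does move_mapping + copy. */
	static const struct address_space_operations hugetlbfs_aops = {
		.write_begin	= hugetlbfs_write_begin,
		.write_end	= hugetlbfs_write_end,
		.set_page_dirty	= hugetlbfs_set_page_dirty,
		.migratepage	= migrate_page,
	};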

> > index 351f8d1..55f3e2d 100644
> > --- v2.6.36-rc2/mm/hugetlb.c
> > +++ v2.6.36-rc2/mm/hugetlb.c
> > @@ -2217,6 +2217,19 @@ nomem:
> >  	return -ENOMEM;
> >  }
> >
> > +static int is_hugetlb_entry_migration(pte_t pte)
> > +{
> > +	swp_entry_t swp;
> > +
> > +	if (huge_pte_none(pte) || pte_present(pte))
> > +		return 0;
> > +	swp = pte_to_swp_entry(pte);
> > +	if (non_swap_entry(swp) && is_migration_entry(swp)) {
> > +		return 1;
> > +	} else
> > +		return 0;
> > +}
> 
> Ok, that implies that to some extent swap must be supported for this case
> in the core VM?

Yes.
Currently hugepages do not support swapping in/out,
but they do support migration entries and hwpoison entries
(both of which use the swap entry format).

> > @@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	ptep = huge_pte_offset(mm, address);
> >  	if (ptep) {
> >  		entry = huge_ptep_get(ptep);
> > -		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> > +		if (unlikely(is_hugetlb_entry_migration(entry))) {
> > +			migration_entry_wait(mm, (pmd_t *)ptep, address);
> > +			return 0;
> > +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> >  			return VM_FAULT_HWPOISON;
> 
> Now we have here in hugetlb_fault() a copy of the migration wait logic
> from do_swap_page(). Hope the rest of the VM cannot inadvertently
> encounter such a migration entry elsewhere?

No, because is_hugetlb_entry_migration() and is_hugetlb_entry_hwpoisoned()
can return true only when (!huge_pte_none(pte) && !pte_present(pte)) is true,
which means the entry is a swap entry.
Currently a hugepage swap entry can only be created in the page migration
context and in the memory failure context, without any exceptions.

If hugepage swapping becomes available, we can insert an additional if()
to branch to something like hugetlb_swap_page().
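
For reference, the hwpoison counterpart already follows the same pattern;
roughly (quoted from memory, for illustration only):

	/* Like is_hugetlb_entry_migration(), this only looks at entries
	 * that are neither none nor present, i.e. swap-format entries. */
	static int is_hugetlb_entry_hwpoisoned(pte_t pte)
	{
		swp_entry_t swp;

		if (huge_pte_none(pte) || pte_present(pte))
			return 0;
		swp = pte_to_swp_entry(pte);
		if (non_swap_entry(swp) && is_hwpoison_entry(swp))
			return 1;
		else
			return 0;
	}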


> 
> > diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
> 
> > @@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> >  	if (is_write_migration_entry(entry))
> >  		pte = pte_mkwrite(pte);
> > +	if (PageHuge(new))
> > +		pte = pte_mkhuge(pte);
> >  	flush_cache_page(vma, addr, pte_pfn(pte));
> >  	set_pte_at(mm, addr, ptep, pte);
> >
> > -	if (PageAnon(new))
> > +	if (PageHuge(new)) {
> > +		if (PageAnon(new))
> > +			hugepage_add_anon_rmap(new, vma, addr);
> > +		else
> > +			page_dup_rmap(new);
> 
> page_dup_rmap()? What are non-anon huge pages used for? That is not a file-backed
> huge page, right?

Yes, it's for file-backed huge pages.

This page_dup_rmap() is a hugepage variant of page_add_file_rmap().
Currently we don't account hugepages in zone_page_state nor in the memory
cgroup, so only an atomic mapcount increment is needed, and page_dup_rmap()
does exactly that.
I think it's OK because try_to_unmap_one() calls page_remove_rmap() both for
file-mapped pages and for anonymous pages ("rmap" is a common term for both).
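
For illustration, the difference boils down to roughly this (2.6.36-era
rmap code, quoted from memory, so treat it as a sketch):

	/* page_dup_rmap(): just bump the mapcount, no statistics. */
	static inline void page_dup_rmap(struct page *page)
	{
		atomic_inc(&page->_mapcount);
	}

	/* page_add_file_rmap(): mapcount plus zone/memcg accounting,
	 * which hugepages currently don't participate in. */
	void page_add_file_rmap(struct page *page)
	{
		if (atomic_inc_and_test(&page->_mapcount)) {
			__inc_zone_page_state(page, NR_FILE_MAPPED);
			mem_cgroup_update_file_mapped(page, 1);
		}
	}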


> > @@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> >  }
> >
> >  /*
> > + * The expected number of remaining references is the same as that
> > + * of migrate_page_move_mapping().
> > + */
> > +int migrate_huge_page_move_mapping(struct address_space *mapping,
> > +				   struct page *newpage, struct page *page)
> > +{
> > +	int expected_count;
> > +	void **pslot;
> > +
> > +	if (!mapping) {
> > +		if (page_count(page) != 1)
> > +			return -EAGAIN;
> > +		return 0;
> > +	}
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +
> > +	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> > +					page_index(page));
> > +
> > +	expected_count = 2 + page_has_private(page);
> > +	if (page_count(page) != expected_count ||
> > +	    (struct page *)radix_tree_deref_slot(pslot) != page) {
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	if (!page_freeze_refs(page, expected_count)) {
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	get_page(newpage);
> > +
> > +	radix_tree_replace_slot(pslot, newpage);
> > +
> > +	page_unfreeze_refs(page, expected_count);
> > +
> > +	__put_page(page);
> > +
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	return 0;
> > +}
> 
> That's a pretty accurate copy of migrate_page_move_mapping(). Why are the
> counter updates missing at the end? This also suggests that the two functions
> could be merged into one.

Because hugepages are not counted in the zone page statistics anywhere else.
But it surely looks better to merge the two functions by adding an if ()
around the counter updates.
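
For example, the tail of a merged migrate_page_move_mapping() could look
roughly like this (a sketch only, reusing the counter updates the existing
function already does for regular pages):

	/* Keep the zone statistics updates for regular pages, but skip
	 * them for hugepages, which are not accounted there. */
	if (!PageHuge(page)) {
		__dec_zone_page_state(page, NR_FILE_PAGES);
		__inc_zone_page_state(newpage, NR_FILE_PAGES);
		if (PageSwapBacked(page)) {
			__dec_zone_page_state(page, NR_SHMEM);
			__inc_zone_page_state(newpage, NR_SHMEM);
		}
	}
	spin_unlock_irq(&mapping->tree_lock);
	return 0;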


> > @@ -724,6 +788,92 @@ move_newpage:
> >  }
> >
> >  /*
> > + * Counterpart of unmap_and_move_page() for hugepage migration.
> > + *
> > + * This function doesn't wait the completion of hugepage I/O
> > + * because there is no race between I/O and migration for hugepage.
> > + * Note that currently hugepage I/O occurs only in direct I/O
> > + * where no lock is held and PG_writeback is irrelevant,
> > + * and writeback status of all subpages are counted in the reference
> > + * count of the head page (i.e. if all subpages of a 2MB hugepage are
> > + * under direct I/O, the reference of the head page is 512 and a bit more.)
> > + * This means that when we try to migrate hugepage whose subpages are
> > + * doing direct I/O, some references remain after try_to_unmap() and
> > + * hugepage migration fails without data corruption.
> > + *
> > + * There is also no race when direct I/O is issued on the page under migration,
> > + * because then pte is replaced with migration swap entry and direct I/O code
> > + * will wait in the page fault for migration to complete.
> > + */
> > +static int unmap_and_move_huge_page(new_page_t get_new_page,
> > +				unsigned long private, struct page *hpage,
> > +				int force, int offlining)
> > +{
> > +	int rc = 0;
> > +	int *result = NULL;
> > +	struct page *new_hpage = get_new_page(hpage, private, &result);
> > +	int rcu_locked = 0;
> > +	struct anon_vma *anon_vma = NULL;
> > +
> > +	if (!new_hpage)
> > +		return -ENOMEM;
> > +
> > +	rc = -EAGAIN;
> > +
> > +	if (!trylock_page(hpage)) {
> > +		if (!force)
> > +			goto out;
> > +		lock_page(hpage);
> > +	}
> > +
> > +	if (PageAnon(hpage)) {
> > +		rcu_read_lock();
> > +		rcu_locked = 1;
> > +
> > +		if (page_mapped(hpage)) {
> > +			anon_vma = page_anon_vma(hpage);
> > +			atomic_inc(&anon_vma->external_refcount);
> > +		}
> > +	}
> > +
> > +	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> > +
> > +	if (!page_mapped(hpage))
> > +		rc = move_to_new_page(new_hpage, hpage, 1);
> > +
> > +	if (rc)
> > +		remove_migration_ptes(hpage, hpage);
> > +
> > +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
> > +					    &anon_vma->lock)) {
> > +		int empty = list_empty(&anon_vma->head);
> > +		spin_unlock(&anon_vma->lock);
> > +		if (empty)
> > +			anon_vma_free(anon_vma);
> 
> Hmmm.. The anon_vma dropping looks different? Why can't we use
> drop_anon_vma() like in unmap_and_move()? Also we do not take the root lock
> like drop_anon_vma() does.

Oh, I failed to follow the latest code when rebasing.
The anon_vma code has been updated frequently recently.
I'll fix it.
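
The fix will probably just use the helper, roughly (assuming the current
drop_anon_vma(), which takes the root lock internally):

	/* In unmap_and_move_huge_page(): replace the open-coded
	 * dec-and-lock with the helper, as unmap_and_move() does. */
	if (anon_vma)
		drop_anon_vma(anon_vma);

	if (rcu_locked)
		rcu_read_unlock();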


> > @@ -788,6 +938,52 @@ out:
> >  	return nr_failed + retry;
> >  }
> >
> > +int migrate_huge_pages(struct list_head *from,
> > +		new_page_t get_new_page, unsigned long private, int offlining)
> > +{
> > +	int retry = 1;
> > +	int nr_failed = 0;
> > +	int pass = 0;
> > +	struct page *page;
> > +	struct page *page2;
> > +	int rc;
> > +
> > +	for (pass = 0; pass < 10 && retry; pass++) {
> > +		retry = 0;
> > +
> > +		list_for_each_entry_safe(page, page2, from, lru) {
> > +			cond_resched();
> > +
> > +			rc = unmap_and_move_huge_page(get_new_page,
> > +					private, page, pass > 2, offlining);
> > +
> > +			switch(rc) {
> > +			case -ENOMEM:
> > +				goto out;
> > +			case -EAGAIN:
> > +				retry++;
> > +				break;
> > +			case 0:
> > +				break;
> > +			default:
> > +				/* Permanent failure */
> > +				nr_failed++;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	rc = 0;
> > +out:
> > +
> > +	list_for_each_entry_safe(page, page2, from, lru)
> > +		put_page(page);
> > +
> > +	if (rc)
> > +		return rc;
> > +
> > +	return nr_failed + retry;
> > +}
> 
> A copy of migrate_pages(), with putback_lru_pages() omitted, as also proposed
> for upstream in a recent discussion.

Sure.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
  2010-09-23 17:12   ` Christoph Lameter
@ 2010-09-24  6:47     ` Naoya Horiguchi
  0 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-24  6:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Thu, Sep 23, 2010 at 12:12:55PM -0500, Christoph Lameter wrote:
> On Wed, 8 Sep 2010, Naoya Horiguchi wrote:
> 
> > Currently alloc_huge_page() raises the page refcount outside hugetlb_lock,
> > but this causes a race when dequeue_hwpoison_huge_page() runs concurrently
> > with alloc_huge_page().
> > To avoid it, this patch moves set_page_refcounted() inside hugetlb_lock.
> 
> Reviewed-by: Christoph Lameter <cl@linux.com>
> 
> One wonders, though, how many other of these huge races are still there.
> 
> "Normal" page migration is based on LRU isolation and therefore does not
> suffer from these problems on allocation, since the page is not yet on the
> LRU. LRU isolation is also a known issue, since memory reclaim does the same
> thing.

Yes.
For normal pages, allocation, reclaim and migration are protected from
each other by LRU isolation.
For huge pages, however, allocation and migration (reclaim is not available)
are protected by the reference count, and races between allocation and
hwpoison handling are avoided by hugetlb_lock.
I agree that this is complex and can cause unexpected races.

> This protection goes away if one goes directly to a page without going
> through the LRU. That should create more races...

To unify these protection mechanisms, we need the LRU list to become available
for hugepages, but we must wait for hugepage swapping to appear before that can
happen. Or should we implement a dummy LRU list until then? (Maybe that's even
messier...)

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  0 siblings, 0 replies; 36+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

We can't use the existing hugepage allocation functions to allocate hugepages
for page migration, because page migration can happen asynchronously with
respect to the running processes, and page migration users need to call the
allocation function with physical addresses (not virtual addresses) as
arguments.
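
For illustration, a migration caller would use the new function roughly like
this (a hypothetical sketch only; the actual soft-offline wiring comes in a
later patch of this series):

	/* Hypothetical get_new_page() callback for hugepage migration:
	 * the target is chosen from the physical page alone, no vma or
	 * virtual address involved. */
	static struct page *new_hugepage(struct page *hpage, unsigned long private,
					 int **result)
	{
		return alloc_huge_page_node(page_hstate(compound_head(hpage)),
					    page_to_nid(hpage));
	}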

ChangeLog since v3:
- unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 include/linux/hugetlb.h |    3 ++
 mm/hugetlb.c            |   78 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 56 insertions(+), 25 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index f479700..0b73c53 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+struct page *alloc_huge_page_node(struct hstate *h, int nid);
+
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
@@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
 
 #else
 struct hstate {};
+#define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 6871b41..d12431b 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -466,11 +466,22 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	h->free_huge_pages_node[nid]++;
 }
 
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+	if (list_empty(&h->hugepage_freelists[nid]))
+		return NULL;
+	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+	list_del(&page->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	int nid;
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
@@ -496,19 +507,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		nid = zone_to_nid(zone);
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-
-			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(h, vma);
-
-			break;
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+			page = dequeue_huge_page_node(h, zone_to_nid(zone));
+			if (page) {
+				if (!avoid_reserve)
+					decrement_hugepage_resv_vma(h, vma);
+				break;
+			}
 		}
 	}
 err:
@@ -770,11 +775,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 {
 	struct page *page;
-	unsigned int nid;
+	unsigned int r_nid;
 
 	if (h->order >= MAX_ORDER)
 		return NULL;
@@ -812,9 +816,14 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
-					__GFP_REPEAT|__GFP_NOWARN,
-					huge_page_order(h));
+	if (nid == NUMA_NO_NODE)
+		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+				   __GFP_REPEAT|__GFP_NOWARN,
+				   huge_page_order(h));
+	else
+		page = alloc_pages_exact_node(nid,
+			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
 
 	if (page && arch_prepare_hugepage(page)) {
 		__free_pages(page, huge_page_order(h));
@@ -829,13 +838,13 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		 */
 		put_page_testzero(page);
 		VM_BUG_ON(page_count(page));
-		nid = page_to_nid(page);
+		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
 		 * We incremented the global counters already
 		 */
-		h->nr_huge_pages_node[nid]++;
-		h->surplus_huge_pages_node[nid]++;
+		h->nr_huge_pages_node[r_nid]++;
+		h->surplus_huge_pages_node[r_nid]++;
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
 		h->nr_huge_pages--;
@@ -848,6 +857,25 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 }
 
 /*
+ * This allocation function is useful in the context where vma is irrelevant.
+ * E.g. soft-offlining uses this function because it only cares physical
+ * address of error page.
+ */
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	spin_lock(&hugetlb_lock);
+	page = dequeue_huge_page_node(h, nid);
+	spin_unlock(&hugetlb_lock);
+
+	if (!page)
+		page = alloc_buddy_huge_page(h, nid);
+
+	return page;
+}
+
+/*
  * Increase the hugetlb pool such that it can accomodate a reservation
  * of size 'delta'.
  */
@@ -871,7 +899,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NULL, 0);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -1052,7 +1080,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(h, vma, addr);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
-- 
1.7.2.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2010-09-24  6:49 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
2010-09-20 10:47   ` Mel Gorman
2010-09-22 20:41     ` Christoph Lameter
2010-09-08  1:19 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
2010-09-20 10:59   ` Mel Gorman
2010-09-22  4:41     ` Naoya Horiguchi
2010-09-22  8:37       ` Mel Gorman
2010-09-22 21:05   ` Christoph Lameter
2010-09-23  8:49     ` Mel Gorman
2010-09-23 16:02       ` Christoph Lameter
2010-09-08  1:19 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
2010-09-20 11:03   ` Mel Gorman
2010-09-20 11:15     ` Andi Kleen
2010-09-20 11:18       ` Mel Gorman
2010-09-23 16:21   ` Christoph Lameter
2010-09-24  3:24     ` Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
2010-09-20 11:10   ` Mel Gorman
2010-09-22  4:59     ` Naoya Horiguchi
2010-09-22  8:40       ` Mel Gorman
2010-09-23 16:52   ` Christoph Lameter
2010-09-24  5:58     ` Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() Naoya Horiguchi
2010-09-23 16:54   ` Christoph Lameter
2010-09-08  1:19 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
2010-09-23 17:12   ` Christoph Lameter
2010-09-24  6:47     ` Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 07/10] HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 09/10] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 10/10] page-types.c: fix name of unpoison interface Naoya Horiguchi
2010-09-09 10:33 ` [PATCH 0/10] Hugepage migration (v5) Andi Kleen
2010-09-09 22:56   ` Naoya Horiguchi
2010-09-20 11:14   ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).