* [PATCH 0/8] Hugepage migration (v3)
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Hi,

This is the 3rd version of the "hugepage migration" patch set.
I rebased it onto 2.6.36-rc2 and incorporated many of your comments.

In the previous discussion, I explained why hugepage migration encounters no race
with direct I/O even without additional page locking. Based on that reasoning,
I made no changes to the page locking in the migration code (i.e. only head pages are locked).


Future work:

- Migration can fail for various reasons depending on various factors,
  so it's useful if soft offlining can be retried when it notices that
  migration has failed. This is a more general problem because it also
  applies to soft offlining of normal-sized pages, so we leave it as
  future work.

- The corrupted hugepage counter implemented in the previous version was
  dropped because it's not directly related to the migration topic and has
  no serious impact on kernel behavior. We also leave it as future work.
  

Summary:

 [PATCH 1/8] hugetlb: fix metadata corruption in hugetlb_fault()
 [PATCH 2/8] hugetlb: add allocate function for hugepage migration
 [PATCH 3/8] hugetlb: rename hugepage allocation functions
 [PATCH 4/8] hugetlb: redefine hugepage copy functions
 [PATCH 5/8] hugetlb: hugepage migration core
 [PATCH 6/8] HWPOISON, hugetlb: soft offlining for hugepage
 [PATCH 7/8] HWPOISON, hugetlb: fix unpoison for hugepage
 [PATCH 8/8] page-types.c: fix name of unpoison interface

 Documentation/vm/page-types.c |    2 +-
 fs/hugetlbfs/inode.c          |   15 +++
 include/linux/hugetlb.h       |    9 ++
 include/linux/migrate.h       |   12 +++
 mm/hugetlb.c                  |  216 ++++++++++++++++++++++++++++++++---------
 mm/memory-failure.c           |   65 +++++++++----
 mm/migrate.c                  |  192 +++++++++++++++++++++++++++++++++----
 mm/vmscan.c                   |    9 ++-
 8 files changed, 434 insertions(+), 86 deletions(-)

Thanks,
Naoya Horiguchi


* [PATCH 1/8] hugetlb: fix metadata corruption in hugetlb_fault()
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch avoids metadata corruption when a memory failure occurs between
alloc_huge_page() and lock_page().
The corruption occurs because the page fault can fail while metadata changes
(such as refcount and mapcount) remain applied.
Since the PageHWPoison() check is there to keep a hwpoisoned page found in the
pagecache from being mapped into the process, it should be done in the
"found in pagecache" branch, not in the common path.
This patch moves the check into the "found in pagecache" branch and fixes the problem.

ChangeLog since v2:
- remove retry check in "new allocation" path.
- make description more detailed
- change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/hugetlb.c |   21 +++++++++------------
 1 files changed, 9 insertions(+), 12 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index cc5be78..6871b41 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2518,22 +2518,19 @@ retry:
 			hugepage_add_new_anon_rmap(page, vma, address);
 		}
 	} else {
+		/*
+		 * If a memory error occurs between mmap() and fault, some processes
+		 * don't have a hwpoisoned swap entry for the errored virtual address,
+		 * so we need to block the hugepage fault with a PG_hwpoison bit check.
+		 */
+		if (unlikely(PageHWPoison(page))) {
+			ret = VM_FAULT_HWPOISON;
+			goto backout_unlocked;
+		}
 		page_dup_rmap(page);
 	}
 
 	/*
-	 * Since memory error handler replaces pte into hwpoison swap entry
-	 * at the time of error handling, a process which reserved but not have
-	 * the mapping to the error hugepage does not have hwpoison swap entry.
-	 * So we need to block accesses from such a process by checking
-	 * PG_hwpoison bit here.
-	 */
-	if (unlikely(PageHWPoison(page))) {
-		ret = VM_FAULT_HWPOISON;
-		goto backout_unlocked;
-	}
-
-	/*
 	 * If we are going to COW a private mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that
 	 * any allocations necessary to record that reservation occur outside
-- 
1.7.2.1



* [PATCH 2/8] hugetlb: add allocate function for hugepage migration
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

We can't use the existing hugepage allocation functions to allocate hugepages
for page migration, because page migration can happen asynchronously with
the running processes, and page migration users should call the allocation
function with physical addresses (not virtual addresses) as arguments.
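
As a rough illustration (not part of this patch; the variable names are
placeholders), a migration user that knows only the struct page of the target
hugepage could allocate a destination hugepage on the same node like this:

	struct page *hpage = compound_head(page);
	struct hstate *h = page_hstate(hpage);
	/* destination hugepage, allocated on the same node as the source */
	struct page *new_hpage = alloc_huge_page_no_vma_node(h, page_to_nid(hpage));

	if (!new_hpage)
		return -ENOMEM;

The real caller is the soft offlining code added later in this series.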

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 include/linux/hugetlb.h |    3 ++
 mm/hugetlb.c            |   90 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 72 insertions(+), 21 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index f479700..142bd4f 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+struct page *alloc_huge_page_no_vma_node(struct hstate *h, int nid);
+
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
@@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
 
 #else
 struct hstate {};
+#define alloc_huge_page_no_vma_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 6871b41..31118d2 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -466,11 +466,22 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	h->free_huge_pages_node[nid]++;
 }
 
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+	if (list_empty(&h->hugepage_freelists[nid]))
+		return NULL;
+	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+	list_del(&page->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	int nid;
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
@@ -496,19 +507,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		nid = zone_to_nid(zone);
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-
-			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(h, vma);
-
-			break;
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+			page = dequeue_huge_page_node(h, zone_to_nid(zone));
+			if (page) {
+				if (!avoid_reserve)
+					decrement_hugepage_resv_vma(h, vma);
+				break;
+			}
 		}
 	}
 err:
@@ -615,9 +620,7 @@ int PageHuge(struct page *page)
 	return dtor == free_huge_page;
 }
 
-EXPORT_SYMBOL_GPL(PageHuge);
-
-static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+static struct page *__alloc_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
@@ -628,14 +631,59 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
+	if (page && arch_prepare_hugepage(page)) {
+		__free_pages(page, huge_page_order(h));
+		return NULL;
+	}
+
+	return page;
+}
+
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page = __alloc_huge_page_node(h, nid);
+	if (page)
+		prep_new_huge_page(h, page, nid);
+	return page;
+}
+
+static struct page *alloc_buddy_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page = __alloc_huge_page_node(h, nid);
 	if (page) {
-		if (arch_prepare_hugepage(page)) {
-			__free_pages(page, huge_page_order(h));
+		set_compound_page_dtor(page, free_huge_page);
+		spin_lock(&hugetlb_lock);
+		h->nr_huge_pages++;
+		h->nr_huge_pages_node[nid]++;
+		spin_unlock(&hugetlb_lock);
+		put_page_testzero(page);
+	}
+	return page;
+}
+
+/*
+ * This allocation function is useful in the context where vma is irrelevant.
+ * E.g. soft-offlining uses this function because it only cares physical
+ * address of error page.
+ */
+struct page *alloc_huge_page_no_vma_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	spin_lock(&hugetlb_lock);
+	page = dequeue_huge_page_node(h, nid);
+	spin_unlock(&hugetlb_lock);
+
+	if (!page) {
+		page = alloc_buddy_huge_page_node(h, nid);
+		if (!page) {
+			__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 			return NULL;
-		}
-		prep_new_huge_page(h, page, nid);
+		} else
+			__count_vm_event(HTLB_BUDDY_PGALLOC);
 	}
 
+	set_page_refcounted(page);
 	return page;
 }
 
-- 
1.7.2.1



* [PATCH 3/8] hugetlb: rename hugepage allocation functions
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

The function name alloc_huge_page_no_vma_node() has the verbose suffix "_no_vma".
This patch instead gives the existing alloc_huge_page() and its family the "_vma"
suffix, which makes the names easier to read.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   20 ++++++++++----------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 142bd4f..0b73c53 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -228,7 +228,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
-struct page *alloc_huge_page_no_vma_node(struct hstate *h, int nid);
+struct page *alloc_huge_page_node(struct hstate *h, int nid);
 
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
@@ -305,7 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
 
 #else
 struct hstate {};
-#define alloc_huge_page_no_vma_node(h, nid) NULL
+#define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 31118d2..674a25e 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -666,7 +666,7 @@ static struct page *alloc_buddy_huge_page_node(struct hstate *h, int nid)
  * E.g. soft-offlining uses this function because it only cares physical
  * address of error page.
  */
-struct page *alloc_huge_page_no_vma_node(struct hstate *h, int nid)
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
@@ -818,7 +818,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h,
+static struct page *alloc_buddy_huge_page_vma(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	struct page *page;
@@ -919,7 +919,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NULL, 0);
+		page = alloc_buddy_huge_page_vma(h, NULL, 0);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -1072,8 +1072,8 @@ static void vma_commit_reservation(struct hstate *h,
 	}
 }
 
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
-				    unsigned long addr, int avoid_reserve)
+static struct page *alloc_huge_page_vma(struct vm_area_struct *vma,
+					unsigned long addr, int avoid_reserve)
 {
 	struct hstate *h = hstate_vma(vma);
 	struct page *page;
@@ -1100,7 +1100,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(h, vma, addr);
+		page = alloc_buddy_huge_page_vma(h, vma, addr);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
@@ -1319,7 +1319,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 	 * First take pages out of surplus state.  Then make up the
 	 * remaining difference by allocating fresh huge pages.
 	 *
-	 * We might race with alloc_buddy_huge_page() here and be unable
+	 * We might race with alloc_buddy_huge_page_vma() here and be unable
 	 * to convert a surplus huge page to a normal huge page. That is
 	 * not critical, though, it just means the overall size of the
 	 * pool might be one hugepage larger than it needs to be, but
@@ -1358,7 +1358,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 	 * By placing pages into the surplus state independent of the
 	 * overcommit value, we are allowing the surplus pool size to
 	 * exceed overcommit. There are few sane options here. Since
-	 * alloc_buddy_huge_page() is checking the global counter,
+	 * alloc_buddy_huge_page_vma() is checking the global counter,
 	 * though, we'll note that we're not allowed to exceed surplus
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
@@ -2399,7 +2399,7 @@ retry_avoidcopy:
 
 	/* Drop page_table_lock as buddy allocator may be called */
 	spin_unlock(&mm->page_table_lock);
-	new_page = alloc_huge_page(vma, address, outside_reserve);
+	new_page = alloc_huge_page_vma(vma, address, outside_reserve);
 
 	if (IS_ERR(new_page)) {
 		page_cache_release(old_page);
@@ -2533,7 +2533,7 @@ retry:
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
-		page = alloc_huge_page(vma, address, 0);
+		page = alloc_huge_page_vma(vma, address, 0);
 		if (IS_ERR(page)) {
 			ret = -PTR_ERR(page);
 			goto out;
-- 
1.7.2.1



* [PATCH 4/8] hugetlb: redefine hugepage copy functions
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch modifies the hugepage copy functions to take only the destination
and source hugepages as arguments, for later use.
The old ones are renamed from copy_{gigantic,huge}_page() to
copy_user_{gigantic,huge}_page().
This naming convention is consistent with the one used for copy_highpage()
and copy_user_highpage().
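
After this patch the two variants are used roughly as follows (a sketch only;
the migration-side caller and its variable names are illustrative and only
appear later in this series):

	/* fault/COW path: still takes the faulting address and vma */
	copy_user_huge_page(new_page, old_page, address, vma);

	/* migration path: only source and destination hugepages are known */
	copy_huge_page(new_hpage, hpage);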

ChangeLog since v2:
- change copy_huge_page() from macro to inline dummy function
  to avoid compile warning when !CONFIG_HUGETLB_PAGE.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |    4 ++++
 mm/hugetlb.c            |   43 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 4 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 0b73c53..9e51f77 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -44,6 +44,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 void __isolate_hwpoisoned_huge_page(struct page *page);
+void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -102,6 +103,9 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
 #define __isolate_hwpoisoned_huge_page(page)	0
+static inline void copy_huge_page(struct page *dst, struct page *src)
+{
+}
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 674a25e..283563d 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -423,7 +423,7 @@ static void clear_huge_page(struct page *page,
 	}
 }
 
-static void copy_gigantic_page(struct page *dst, struct page *src,
+static void copy_user_gigantic_page(struct page *dst, struct page *src,
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
@@ -440,14 +440,15 @@ static void copy_gigantic_page(struct page *dst, struct page *src,
 		src = mem_map_next(src, src_base, i);
 	}
 }
-static void copy_huge_page(struct page *dst, struct page *src,
+
+static void copy_user_huge_page(struct page *dst, struct page *src,
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
 	struct hstate *h = hstate_vma(vma);
 
 	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src, addr, vma);
+		copy_user_gigantic_page(dst, src, addr, vma);
 		return;
 	}
 
@@ -458,6 +459,40 @@ static void copy_huge_page(struct page *dst, struct page *src,
 	}
 }
 
+static void copy_gigantic_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page(h); ) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void copy_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+
+	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
+		copy_gigantic_page(dst, src);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page(h); i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -2434,7 +2469,7 @@ retry_avoidcopy:
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 
-	copy_huge_page(new_page, old_page, address, vma);
+	copy_user_huge_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	/*
-- 
1.7.2.1



* [PATCH 5/8] hugetlb: hugepage migration core
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends the page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining, which is
triggered by corrected memory errors (added by the next patch.)
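
As a rough sketch of the intended call path (illustrative only; the real
caller is the soft offlining code added in the next patch, and the allocator
is the one introduced in patch 2/8 and renamed in patch 3/8):

	static struct page *new_hpage(struct page *page, unsigned long private,
				      int **result)
	{
		/* allocate the destination hugepage near the source */
		return alloc_huge_page_node(page_hstate(compound_head(page)),
					    page_to_nid(page));
	}

	/* caller side: pin the hugepage and hand it to migrate_pages() */
	LIST_HEAD(pagelist);

	if (isolate_lru_page(hpage))	/* only takes a reference for hugepages */
		return -EBUSY;
	list_add(&hpage->lru, &pagelist);
	ret = migrate_pages(&pagelist, new_hpage, 0, 0);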

Todo:
- there are other users of page migration such as memory policy,
  memory hotplug and memory compaction.
  They are not ready for hugepage support yet.

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 fs/hugetlbfs/inode.c    |   15 ++++
 include/linux/migrate.h |   12 +++
 mm/hugetlb.c            |   18 ++++-
 mm/migrate.c            |  192 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c             |    9 ++-
 5 files changed, 225 insertions(+), 21 deletions(-)

diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
index 6e5bd42..1f7ca50 100644
--- v2.6.36-rc2/fs/hugetlbfs/inode.c
+++ v2.6.36-rc2/fs/hugetlbfs/inode.c
@@ -31,6 +31,7 @@
 #include <linux/statfs.h>
 #include <linux/security.h>
 #include <linux/magic.h>
+#include <linux/migrate.h>
 
 #include <asm/uaccess.h>
 
@@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 	return 0;
 }
 
+static int hugetlbfs_migrate_page(struct address_space *mapping,
+				struct page *newpage, struct page *page)
+{
+	int rc;
+
+	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
+	if (rc)
+		return rc;
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
@@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
 	.write_begin	= hugetlbfs_write_begin,
 	.write_end	= hugetlbfs_write_end,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
+	.migratepage    = hugetlbfs_migrate_page,
 };
 
 
diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
index 7238231..f4c15ff 100644
--- v2.6.36-rc2/include/linux/migrate.h
+++ v2.6.36-rc2/include/linux/migrate.h
@@ -23,6 +23,9 @@ extern int migrate_prep_local(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page);
 #else
 #define PAGE_MIGRATION 0
 
@@ -40,6 +43,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
 	return -ENOSYS;
 }
 
+static inline void migrate_page_copy(struct page *newpage,
+				     struct page *page) {}
+
+static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 283563d..57bc1ec 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2236,6 +2236,19 @@ nomem:
 	return -ENOMEM;
 }
 
+static int is_hugetlb_entry_migration(pte_t pte)
+{
+	swp_entry_t swp;
+
+	if (huge_pte_none(pte) || pte_present(pte))
+		return 0;
+	swp = pte_to_swp_entry(pte);
+	if (non_swap_entry(swp) && is_migration_entry(swp)) {
+		return 1;
+	} else
+		return 0;
+}
+
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;
@@ -2670,7 +2683,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		if (unlikely(is_hugetlb_entry_migration(entry))) {
+			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			return 0;
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON;
 	}
 
diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
index 38e7cad..aae3b3d 100644
--- v2.6.36-rc2/mm/migrate.c
+++ v2.6.36-rc2/mm/migrate.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/memcontrol.h>
 #include <linux/syscalls.h>
+#include <linux/hugetlb.h>
 #include <linux/gfp.h>
 
 #include "internal.h"
@@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte_t *ptep, pte;
  	spinlock_t *ptl;
 
- 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
-		goto out;
+	if (unlikely(PageHuge(new))) {
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep)
+			goto out;
+		ptl = &mm->page_table_lock;
+	} else {
+		pgd = pgd_offset(mm, addr);
+		if (!pgd_present(*pgd))
+			goto out;
 
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		goto out;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto out;
 
-	pmd = pmd_offset(pud, addr);
-	if (!pmd_present(*pmd))
-		goto out;
+		pmd = pmd_offset(pud, addr);
+		if (!pmd_present(*pmd))
+			goto out;
 
-	ptep = pte_offset_map(pmd, addr);
+		ptep = pte_offset_map(pmd, addr);
 
-	if (!is_swap_pte(*ptep)) {
-		pte_unmap(ptep);
-		goto out;
- 	}
+		if (!is_swap_pte(*ptep)) {
+			pte_unmap(ptep);
+			goto out;
+		}
+
+		ptl = pte_lockptr(mm, pmd);
+	}
 
- 	ptl = pte_lockptr(mm, pmd);
  	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
@@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
+	if (PageHuge(new))
+		pte = pte_mkhuge(pte);
 	flush_cache_page(vma, addr, pte_pfn(pte));
 	set_pte_at(mm, addr, ptep, pte);
 
-	if (PageAnon(new))
+	if (PageHuge(new)) {
+		if (PageAnon(new))
+			hugepage_add_anon_rmap(new, vma, addr);
+		else
+			page_dup_rmap(new);
+	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr);
 	else
 		page_add_file_rmap(new);
@@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 }
 
 /*
+ * The expected number of remaining references is the same as that
+ * of migrate_page_move_mapping().
+ */
+int migrate_huge_page_move_mapping(struct address_space *mapping,
+				   struct page *newpage, struct page *page)
+{
+	int expected_count;
+	void **pslot;
+
+	if (!mapping) {
+		if (page_count(page) != 1)
+			return -EAGAIN;
+		return 0;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree,
+					page_index(page));
+
+	expected_count = 2 + page_has_private(page);
+	if (page_count(page) != expected_count ||
+	    (struct page *)radix_tree_deref_slot(pslot) != page) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	if (!page_freeze_refs(page, expected_count)) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	get_page(newpage);
+
+	radix_tree_replace_slot(pslot, newpage);
+
+	page_unfreeze_refs(page, expected_count);
+
+	__put_page(page);
+
+	spin_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
+/*
  * Copy the page to its new location
  */
-static void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	copy_highpage(newpage, page);
+	if (PageHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
 
 	if (PageError(page))
 		SetPageError(newpage);
@@ -724,6 +788,92 @@ move_newpage:
 }
 
 /*
+ * Counterpart of unmap_and_move() for hugepage migration.
+ *
+ * This function doesn't wait for the completion of hugepage I/O
+ * because there is no race between I/O and migration for hugepages.
+ * Note that currently hugepage I/O occurs only in direct I/O,
+ * where no lock is held and PG_writeback is irrelevant,
+ * and the writeback status of all subpages is counted in the reference
+ * count of the head page (i.e. if all subpages of a 2MB hugepage are
+ * under direct I/O, the reference count of the head page is 512 and a bit more.)
+ * This means that when we try to migrate a hugepage whose subpages are
+ * under direct I/O, some references remain after try_to_unmap() and
+ * hugepage migration fails without data corruption.
+ *
+ * There is also no race when direct I/O is issued on a page under migration,
+ * because the pte has been replaced with a migration swap entry and the
+ * direct I/O code will wait in the page fault for migration to complete.
+ */
+static int unmap_and_move_huge_page(new_page_t get_new_page,
+				unsigned long private, struct page *hpage,
+				int force, int offlining)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *new_hpage = get_new_page(hpage, private, &result);
+	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
+
+	if (!new_hpage)
+		return -ENOMEM;
+
+	rc = -EAGAIN;
+
+	if (!trylock_page(hpage)) {
+		if (!force)
+			goto out;
+		lock_page(hpage);
+	}
+
+	if (PageAnon(hpage)) {
+		rcu_read_lock();
+		rcu_locked = 1;
+
+		if (page_mapped(hpage)) {
+			anon_vma = page_anon_vma(hpage);
+			atomic_inc(&anon_vma->external_refcount);
+		}
+	}
+
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+	if (!page_mapped(hpage))
+		rc = move_to_new_page(new_hpage, hpage, 1);
+
+	if (rc)
+		remove_migration_ptes(hpage, hpage);
+
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+					    &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
+	if (rcu_locked)
+		rcu_read_unlock();
+out:
+	unlock_page(hpage);
+
+	if (rc != -EAGAIN) {
+		list_del(&hpage->lru);
+		putback_lru_page(hpage);
+	}
+
+	putback_lru_page(new_hpage);
+
+	if (result) {
+		if (rc)
+			*result = rc;
+		else
+			*result = page_to_nid(new_hpage);
+	}
+	return rc;
+}
+
+/*
  * migrate_pages
  *
  * The function takes one list of pages to migrate and a function
@@ -757,7 +907,11 @@ int migrate_pages(struct list_head *from,
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move(get_new_page, private,
+			if (PageHuge(page)) {
+				rc = unmap_and_move_huge_page(get_new_page,
+					private, page, pass > 2, offlining);
+			} else
+				rc = unmap_and_move(get_new_page, private,
 						page, pass > 2, offlining);
 
 			switch(rc) {
diff --git v2.6.36-rc2/mm/vmscan.c v2.6.36-rc2/mm/vmscan.c
index c391c32..69ce2a3 100644
--- v2.6.36-rc2/mm/vmscan.c
+++ v2.6.36-rc2/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -508,6 +509,10 @@ void putback_lru_page(struct page *page)
 
 	VM_BUG_ON(PageLRU(page));
 
+	if (PageHuge(page)) {
+		put_page(page);
+		return;
+	}
 redo:
 	ClearPageUnevictable(page);
 
@@ -1112,7 +1117,9 @@ int isolate_lru_page(struct page *page)
 {
 	int ret = -EBUSY;
 
-	if (PageLRU(page)) {
+	if (PageHuge(page) && get_page_unless_zero(compound_head(page)))
+		ret = 0;
+	else if (PageLRU(page)) {
 		struct zone *zone = page_zone(page);
 
 		spin_lock_irq(&zone->lru_lock);
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/8] hugetlb: hugepage migration core
@ 2010-08-24 23:55   ` Naoya Horiguchi
  0 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining which
is triggered by memory corrected errors (added by the next patch.)

Todo:
- there are other users of page migration such as memory policy,
  memory hotplug and memocy compaction.
  They are not ready for hugepage support for now.

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 fs/hugetlbfs/inode.c    |   15 ++++
 include/linux/migrate.h |   12 +++
 mm/hugetlb.c            |   18 ++++-
 mm/migrate.c            |  192 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c             |    9 ++-
 5 files changed, 225 insertions(+), 21 deletions(-)

diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
index 6e5bd42..1f7ca50 100644
--- v2.6.36-rc2/fs/hugetlbfs/inode.c
+++ v2.6.36-rc2/fs/hugetlbfs/inode.c
@@ -31,6 +31,7 @@
 #include <linux/statfs.h>
 #include <linux/security.h>
 #include <linux/magic.h>
+#include <linux/migrate.h>
 
 #include <asm/uaccess.h>
 
@@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 	return 0;
 }
 
+static int hugetlbfs_migrate_page(struct address_space *mapping,
+				struct page *newpage, struct page *page)
+{
+	int rc;
+
+	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
+	if (rc)
+		return rc;
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
@@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
 	.write_begin	= hugetlbfs_write_begin,
 	.write_end	= hugetlbfs_write_end,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
+	.migratepage    = hugetlbfs_migrate_page,
 };
 
 
diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
index 7238231..f4c15ff 100644
--- v2.6.36-rc2/include/linux/migrate.h
+++ v2.6.36-rc2/include/linux/migrate.h
@@ -23,6 +23,9 @@ extern int migrate_prep_local(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page);
 #else
 #define PAGE_MIGRATION 0
 
@@ -40,6 +43,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
 	return -ENOSYS;
 }
 
+static inline void migrate_page_copy(struct page *newpage,
+				     struct page *page) {}
+
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 283563d..57bc1ec 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2236,6 +2236,19 @@ nomem:
 	return -ENOMEM;
 }
 
+static int is_hugetlb_entry_migration(pte_t pte)
+{
+	swp_entry_t swp;
+
+	if (huge_pte_none(pte) || pte_present(pte))
+		return 0;
+	swp = pte_to_swp_entry(pte);
+	if (non_swap_entry(swp) && is_migration_entry(swp)) {
+		return 1;
+	} else
+		return 0;
+}
+
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;
@@ -2670,7 +2683,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		if (unlikely(is_hugetlb_entry_migration(entry))) {
+			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			return 0;
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON;
 	}
 
diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
index 38e7cad..aae3b3d 100644
--- v2.6.36-rc2/mm/migrate.c
+++ v2.6.36-rc2/mm/migrate.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/memcontrol.h>
 #include <linux/syscalls.h>
+#include <linux/hugetlb.h>
 #include <linux/gfp.h>
 
 #include "internal.h"
@@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte_t *ptep, pte;
  	spinlock_t *ptl;
 
- 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
-		goto out;
+	if (unlikely(PageHuge(new))) {
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep)
+			goto out;
+		ptl = &mm->page_table_lock;
+	} else {
+		pgd = pgd_offset(mm, addr);
+		if (!pgd_present(*pgd))
+			goto out;
 
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		goto out;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto out;
 
-	pmd = pmd_offset(pud, addr);
-	if (!pmd_present(*pmd))
-		goto out;
+		pmd = pmd_offset(pud, addr);
+		if (!pmd_present(*pmd))
+			goto out;
 
-	ptep = pte_offset_map(pmd, addr);
+		ptep = pte_offset_map(pmd, addr);
 
-	if (!is_swap_pte(*ptep)) {
-		pte_unmap(ptep);
-		goto out;
- 	}
+		if (!is_swap_pte(*ptep)) {
+			pte_unmap(ptep);
+			goto out;
+		}
+
+		ptl = pte_lockptr(mm, pmd);
+	}
 
- 	ptl = pte_lockptr(mm, pmd);
  	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
@@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
+	if (PageHuge(new))
+		pte = pte_mkhuge(pte);
 	flush_cache_page(vma, addr, pte_pfn(pte));
 	set_pte_at(mm, addr, ptep, pte);
 
-	if (PageAnon(new))
+	if (PageHuge(new)) {
+		if (PageAnon(new))
+			hugepage_add_anon_rmap(new, vma, addr);
+		else
+			page_dup_rmap(new);
+	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr);
 	else
 		page_add_file_rmap(new);
@@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 }
 
 /*
+ * The expected number of remaining references is the same as that
+ * of migrate_page_move_mapping().
+ */
+int migrate_huge_page_move_mapping(struct address_space *mapping,
+				   struct page *newpage, struct page *page)
+{
+	int expected_count;
+	void **pslot;
+
+	if (!mapping) {
+		if (page_count(page) != 1)
+			return -EAGAIN;
+		return 0;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree,
+					page_index(page));
+
+	expected_count = 2 + page_has_private(page);
+	if (page_count(page) != expected_count ||
+	    (struct page *)radix_tree_deref_slot(pslot) != page) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	if (!page_freeze_refs(page, expected_count)) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	get_page(newpage);
+
+	radix_tree_replace_slot(pslot, newpage);
+
+	page_unfreeze_refs(page, expected_count);
+
+	__put_page(page);
+
+	spin_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
+/*
  * Copy the page to its new location
  */
-static void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	copy_highpage(newpage, page);
+	if (PageHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
 
 	if (PageError(page))
 		SetPageError(newpage);
@@ -724,6 +788,92 @@ move_newpage:
 }
 
 /*
+ * Counterpart of unmap_and_move() for hugepage migration.
+ *
+ * This function doesn't wait for the completion of hugepage I/O
+ * because there is no race between I/O and migration for hugepages.
+ * Note that currently hugepage I/O occurs only in direct I/O
+ * where no lock is held and PG_writeback is irrelevant,
+ * and the writeback status of all subpages is counted in the reference
+ * count of the head page (i.e. if all subpages of a 2MB hugepage are
+ * under direct I/O, the reference count of the head page is 512 and a bit more.)
+ * This means that when we try to migrate a hugepage whose subpages are
+ * doing direct I/O, some references remain after try_to_unmap() and
+ * hugepage migration fails without data corruption.
+ *
+ * There is also no race when direct I/O is issued on a page under migration,
+ * because then the pte is replaced with a migration swap entry and the
+ * direct I/O code will wait in the page fault for the migration to complete.
+ */
+static int unmap_and_move_huge_page(new_page_t get_new_page,
+				unsigned long private, struct page *hpage,
+				int force, int offlining)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *new_hpage = get_new_page(hpage, private, &result);
+	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
+
+	if (!new_hpage)
+		return -ENOMEM;
+
+	rc = -EAGAIN;
+
+	if (!trylock_page(hpage)) {
+		if (!force)
+			goto out;
+		lock_page(hpage);
+	}
+
+	if (PageAnon(hpage)) {
+		rcu_read_lock();
+		rcu_locked = 1;
+
+		if (page_mapped(hpage)) {
+			anon_vma = page_anon_vma(hpage);
+			atomic_inc(&anon_vma->external_refcount);
+		}
+	}
+
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+	if (!page_mapped(hpage))
+		rc = move_to_new_page(new_hpage, hpage, 1);
+
+	if (rc)
+		remove_migration_ptes(hpage, hpage);
+
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+					    &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
+	if (rcu_locked)
+		rcu_read_unlock();
+	unlock_page(hpage);
+out:
+
+	if (rc != -EAGAIN) {
+		list_del(&hpage->lru);
+		putback_lru_page(hpage);
+	}
+
+	putback_lru_page(new_hpage);
+
+	if (result) {
+		if (rc)
+			*result = rc;
+		else
+			*result = page_to_nid(new_hpage);
+	}
+	return rc;
+}
+
+/*
  * migrate_pages
  *
  * The function takes one list of pages to migrate and a function
@@ -757,7 +907,11 @@ int migrate_pages(struct list_head *from,
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move(get_new_page, private,
+			if (PageHuge(page)) {
+				rc = unmap_and_move_huge_page(get_new_page,
+					private, page, pass > 2, offlining);
+			} else
+				rc = unmap_and_move(get_new_page, private,
 						page, pass > 2, offlining);
 
 			switch(rc) {
diff --git v2.6.36-rc2/mm/vmscan.c v2.6.36-rc2/mm/vmscan.c
index c391c32..69ce2a3 100644
--- v2.6.36-rc2/mm/vmscan.c
+++ v2.6.36-rc2/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -508,6 +509,10 @@ void putback_lru_page(struct page *page)
 
 	VM_BUG_ON(PageLRU(page));
 
+	if (PageHuge(page)) {
+		put_page(page);
+		return;
+	}
 redo:
 	ClearPageUnevictable(page);
 
@@ -1112,7 +1117,9 @@ int isolate_lru_page(struct page *page)
 {
 	int ret = -EBUSY;
 
-	if (PageLRU(page)) {
+	if (PageHuge(page) && get_page_unless_zero(compound_head(page)))
+		ret = 0;
+	else if (PageLRU(page)) {
 		struct zone *zone = page_zone(page);
 
 		spin_lock_irq(&zone->lru_lock);
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/8] HWPOISON, hugetlb: soft offlining for hugepage
  2010-08-24 23:55 ` Naoya Horiguchi
@ 2010-08-24 23:55   ` Naoya Horiguchi
  -1 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends the soft offlining framework to support hugepages.
When corrected memory errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating its data onto another
hugepage and disabling the original (possibly half-broken) one.
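
As a rough illustration of how this can be exercised from userspace, the
sketch below (not part of this patch, untested) soft-offlines one mapped
2MB hugepage via madvise(MADV_SOFT_OFFLINE); it assumes x86_64 values for
MAP_HUGETLB/MADV_SOFT_OFFLINE, CONFIG_MEMORY_FAILURE, CAP_SYS_ADMIN and a
preallocated hugepage pool:

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#ifndef MADV_SOFT_OFFLINE
	#define MADV_SOFT_OFFLINE 101		/* assumed, asm-generic value */
	#endif
	#ifndef MAP_HUGETLB
	#define MAP_HUGETLB 0x40000		/* assumed, x86 value */
	#endif

	int main(void)
	{
		size_t sz = 2UL << 20;		/* one 2MB hugepage */
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 'x', sz);		/* fault the hugepage in */

		/*
		 * Offline the first base page of the range; the kernel is
		 * expected to act on the containing hugepage, migrating its
		 * data before retiring it.
		 */
		if (madvise(p, getpagesize(), MADV_SOFT_OFFLINE))
			perror("madvise(MADV_SOFT_OFFLINE)");
		else
			printf("hugepage at %p soft-offlined\n", p);

		munmap(p, sz);
		return 0;
	}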

ChangeLog since v2:
- move refcount handling into isolate_lru_page()

ChangeLog since v1:
- add double check in isolating hwpoisoned hugepage
- define free/non-free checker for hugepage
- postpone calling put_page() for hugepage in soft_offline_page()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 include/linux/hugetlb.h |    2 +
 mm/hugetlb.c            |   26 ++++++++++++++++++++++++
 mm/memory-failure.c     |   49 ++++++++++++++++++++++++++++++++--------------
 3 files changed, 62 insertions(+), 15 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 9e51f77..c8c1891 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -44,6 +44,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 void __isolate_hwpoisoned_huge_page(struct page *page);
+void isolate_hwpoisoned_huge_page(struct page *page);
 void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
@@ -103,6 +104,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
 #define __isolate_hwpoisoned_huge_page(page)	0
+#define isolate_hwpoisoned_huge_page(page)	0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 57bc1ec..dee7972 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2987,3 +2987,29 @@ void __isolate_hwpoisoned_huge_page(struct page *hpage)
 	h->free_huge_pages_node[nid]--;
 	spin_unlock(&hugetlb_lock);
 }
+
+static int is_hugepage_on_freelist(struct page *hpage)
+{
+	struct page *page;
+	struct page *tmp;
+	struct hstate *h = page_hstate(hpage);
+	int nid = page_to_nid(hpage);
+
+	spin_lock(&hugetlb_lock);
+	list_for_each_entry_safe(page, tmp, &h->hugepage_freelists[nid], lru) {
+		if (page == hpage) {
+			spin_unlock(&hugetlb_lock);
+			return 1;
+		}
+	}
+	spin_unlock(&hugetlb_lock);
+	return 0;
+}
+
+void isolate_hwpoisoned_huge_page(struct page *hpage)
+{
+	lock_page(hpage);
+	if (is_hugepage_on_freelist(hpage))
+		__isolate_hwpoisoned_huge_page(hpage);
+	unlock_page(hpage);
+}
diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index 9c26eec..60178d2 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -1187,7 +1187,11 @@ EXPORT_SYMBOL(unpoison_memory);
 static struct page *new_page(struct page *p, unsigned long private, int **x)
 {
 	int nid = page_to_nid(p);
-	return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+	if (PageHuge(p))
+		return alloc_huge_page_node(page_hstate(compound_head(p)),
+						   nid);
+	else
+		return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
@@ -1215,8 +1219,16 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
 	 * was free.
 	 */
 	set_migratetype_isolate(p);
+	/*
+	 * When the target page is a free hugepage, just remove it
+	 * from free hugepage list.
+	 */
 	if (!get_page_unless_zero(compound_head(p))) {
-		if (is_free_buddy_page(p)) {
+		if (PageHuge(p)) {
+			pr_debug("get_any_page: %#lx free huge page\n", pfn);
+			ret = 0;
+			SetPageHWPoison(compound_head(p));
+		} else if (is_free_buddy_page(p)) {
 			pr_debug("get_any_page: %#lx free buddy page\n", pfn);
 			/* Set hwpoison bit while page is still isolated */
 			SetPageHWPoison(p);
@@ -1261,6 +1273,7 @@ int soft_offline_page(struct page *page, int flags)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
+	struct page *hpage = compound_head(page);
 
 	ret = get_any_page(page, pfn, flags);
 	if (ret < 0)
@@ -1271,7 +1284,7 @@ int soft_offline_page(struct page *page, int flags)
 	/*
 	 * Page cache page we can handle?
 	 */
-	if (!PageLRU(page)) {
+	if (!PageLRU(page) && !PageHuge(page)) {
 		/*
 		 * Try to free it.
 		 */
@@ -1287,21 +1300,21 @@ int soft_offline_page(struct page *page, int flags)
 		if (ret == 0)
 			goto done;
 	}
-	if (!PageLRU(page)) {
+	if (!PageLRU(page) && !PageHuge(page)) {
 		pr_debug("soft_offline: %#lx: unknown non LRU page type %lx\n",
 				pfn, page->flags);
 		return -EIO;
 	}
 
-	lock_page(page);
-	wait_on_page_writeback(page);
+	lock_page(hpage);
+	wait_on_page_writeback(hpage);
 
 	/*
 	 * Synchronized using the page lock with memory_failure()
 	 */
-	if (PageHWPoison(page)) {
-		unlock_page(page);
-		put_page(page);
+	if (PageHWPoison(hpage)) {
+		unlock_page(hpage);
+		put_page(hpage);
 		pr_debug("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
@@ -1311,7 +1324,7 @@ int soft_offline_page(struct page *page, int flags)
 	 * non dirty unmapped page cache pages.
 	 */
 	ret = invalidate_inode_page(page);
-	unlock_page(page);
+	unlock_page(hpage);
 
 	/*
 	 * Drop count because page migration doesn't like raised
@@ -1320,7 +1333,7 @@ int soft_offline_page(struct page *page, int flags)
 	 * RED-PEN would be better to keep it isolated here, but we
 	 * would need to fix isolation locking first.
 	 */
-	put_page(page);
+	put_page(hpage);
 	if (ret == 1) {
 		ret = 0;
 		pr_debug("soft_offline: %#lx: invalidated\n", pfn);
@@ -1336,7 +1349,7 @@ int soft_offline_page(struct page *page, int flags)
 	if (!ret) {
 		LIST_HEAD(pagelist);
 
-		list_add(&page->lru, &pagelist);
+		list_add(&hpage->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0);
 		if (ret) {
 			pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
@@ -1352,9 +1365,15 @@ int soft_offline_page(struct page *page, int flags)
 		return ret;
 
 done:
-	atomic_long_add(1, &mce_bad_pages);
-	SetPageHWPoison(page);
-	/* keep elevated page count for bad page */
+	if (!PageHWPoison(hpage))
+		atomic_long_add(1 << compound_order(hpage), &mce_bad_pages);
+	if (PageHuge(hpage)) {
+		set_page_hwpoison_huge_page(hpage);
+		isolate_hwpoisoned_huge_page(hpage);
+	} else {
+		SetPageHWPoison(page);
+		/* keep elevated page count for bad page */
+	}
 	return ret;
 }
 
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/8] HWPOISON, hugetlb: fix unpoison for hugepage
  2010-08-24 23:55 ` Naoya Horiguchi
@ 2010-08-24 23:55   ` Naoya Horiguchi
  -1 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently unpoisoning hugepages doesn't work: it's not enough to just
clear the PG_HWPoison bits; the hugepage being unpoisoned must also be
linked back to the free hugepage list.
To do this, we get and put the hwpoisoned hugepage whose refcount is 0.
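
For readers unfamiliar with the hugetlb release path, the reason a plain
get/put pair is enough is, roughly, that the final put_page() runs the
compound page destructor, which re-links the page onto the free list
(simplified sketch of the existing code, surplus accounting omitted):

	/* simplified sketch: compound page destructor for hugepages */
	static void free_huge_page(struct page *page)
	{
		struct hstate *h = page_hstate(page);

		spin_lock(&hugetlb_lock);
		/* surplus handling omitted in this sketch */
		enqueue_huge_page(h, page);	/* back on h->hugepage_freelists[] */
		spin_unlock(&hugetlb_lock);
	}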

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/memory-failure.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index 60178d2..ab36690 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -1154,9 +1154,19 @@ int unpoison_memory(unsigned long pfn)
 	nr_pages = 1 << compound_order(page);
 
 	if (!get_page_unless_zero(page)) {
-		if (TestClearPageHWPoison(p))
+		/* The page to be unpoisoned was free when it was hwpoisoned */
+		if (TestClearPageHWPoison(page))
 			atomic_long_sub(nr_pages, &mce_bad_pages);
 		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
+		if (PageHuge(page)) {
+			/*
+			 * To unpoison free hugepage, we get and put it
+			 * to move it back to the free list.
+			 */
+			get_page(page);
+			clear_page_hwpoison_huge_page(page);
+			put_page(page);
+		}
 		return 0;
 	}
 
@@ -1171,9 +1181,9 @@ int unpoison_memory(unsigned long pfn)
 		pr_debug("MCE: Software-unpoisoned page %#lx\n", pfn);
 		atomic_long_sub(nr_pages, &mce_bad_pages);
 		freeit = 1;
+		if (PageHuge(page))
+			clear_page_hwpoison_huge_page(page);
 	}
-	if (PageHuge(p))
-		clear_page_hwpoison_huge_page(page);
 	unlock_page(page);
 
 	put_page(page);
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/8] page-types.c: fix name of unpoison interface
  2010-08-24 23:55 ` Naoya Horiguchi
@ 2010-08-24 23:55   ` Naoya Horiguchi
  -1 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-24 23:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

debugfs:hwpoison/renew-pfn is the old name of the unpoison interface;
it is now hwpoison/unpoison-pfn. This patch updates page-types.c to use
the new name.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/page-types.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git v2.6.36-rc2/Documentation/vm/page-types.c v2.6.36-rc2/Documentation/vm/page-types.c
index ccd951f..cc96ee2 100644
--- v2.6.36-rc2/Documentation/vm/page-types.c
+++ v2.6.36-rc2/Documentation/vm/page-types.c
@@ -478,7 +478,7 @@ static void prepare_hwpoison_fd(void)
 	}
 
 	if (opt_unpoison && !hwpoison_forget_fd) {
-		sprintf(buf, "%s/renew-pfn", hwpoison_debug_fs);
+		sprintf(buf, "%s/unpoison-pfn", hwpoison_debug_fs);
 		hwpoison_forget_fd = checked_open(buf, O_WRONLY);
 	}
 }
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] hugetlb: fix metadata corruption in hugetlb_fault()
  2010-08-24 23:55   ` Naoya Horiguchi
@ 2010-08-25  0:17     ` Wu Fengguang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2010-08-25  0:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/8] hugetlb: rename hugepage allocation functions
  2010-08-24 23:55   ` Naoya Horiguchi
@ 2010-08-25  1:21     ` Wu Fengguang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2010-08-25  1:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Aug 25, 2010 at 07:55:22AM +0800, Naoya Horiguchi wrote:
> The function name alloc_huge_page_no_vma_node() has verbose suffix "_no_vma".
> This patch makes existing alloc_huge_page() and it's family have "_vma" instead,
> which makes it easier to read.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h |    4 ++--
>  mm/hugetlb.c            |   20 ++++++++++----------
>  2 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
> index 142bd4f..0b73c53 100644
> --- v2.6.36-rc2/include/linux/hugetlb.h
> +++ v2.6.36-rc2/include/linux/hugetlb.h
> @@ -228,7 +228,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>  
> -struct page *alloc_huge_page_no_vma_node(struct hstate *h, int nid);
> +struct page *alloc_huge_page_node(struct hstate *h, int nid);
>  
>  /* arch callback */
>  int __init alloc_bootmem_huge_page(struct hstate *h);
> @@ -305,7 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
>  
>  #else
>  struct hstate {};
> -#define alloc_huge_page_no_vma_node(h, nid) NULL
> +#define alloc_huge_page_node(h, nid) NULL
>  #define alloc_bootmem_huge_page(h) NULL
>  #define hstate_file(f) NULL
>  #define hstate_vma(v) NULL
> diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
> index 31118d2..674a25e 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -666,7 +666,7 @@ static struct page *alloc_buddy_huge_page_node(struct hstate *h, int nid)
>   * E.g. soft-offlining uses this function because it only cares physical
>   * address of error page.
>   */
> -struct page *alloc_huge_page_no_vma_node(struct hstate *h, int nid)
> +struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  {
>  	struct page *page;
>  
> @@ -818,7 +818,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  	return ret;
>  }
>  
> -static struct page *alloc_buddy_huge_page(struct hstate *h,
> +static struct page *alloc_buddy_huge_page_vma(struct hstate *h,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
>  	struct page *page;
> @@ -919,7 +919,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
>  retry:
>  	spin_unlock(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
> -		page = alloc_buddy_huge_page(h, NULL, 0);
> +		page = alloc_buddy_huge_page_vma(h, NULL, 0);

alloc_buddy_huge_page() doesn't make use of @vma at all, so the
parameters can be removed.

It looks cleaner to fold the alloc_huge_page_no_vma_node =>
alloc_huge_page_node rename into the previous patch, then split the
code-refactoring chunks out into a standalone patch, and finally
include this cleanup patch.

Thanks,
Fengguang
---
hugetlb: remove unused alloc_buddy_huge_page() parameters

alloc_buddy_huge_page() doesn't make use of @vma at all, so the
parameters can be removed.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cc5be78..3114b4c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -770,8 +770,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h)
 {
 	struct page *page;
 	unsigned int nid;
@@ -871,7 +870,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NULL, 0);
+		page = alloc_buddy_huge_page(h);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -1052,7 +1051,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(h, vma, addr);
+		page = alloc_buddy_huge_page(h);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] hugetlb: add allocate function for hugepage migration
  2010-08-24 23:55   ` Naoya Horiguchi
@ 2010-08-25  1:29     ` Wu Fengguang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2010-08-25  1:29 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

> +static struct page *alloc_buddy_huge_page_node(struct hstate *h, int nid)
> +{
> +	struct page *page = __alloc_huge_page_node(h, nid);
>  	if (page) {
> -		if (arch_prepare_hugepage(page)) {
> -			__free_pages(page, huge_page_order(h));
> +		set_compound_page_dtor(page, free_huge_page);
> +		spin_lock(&hugetlb_lock);
> +		h->nr_huge_pages++;
> +		h->nr_huge_pages_node[nid]++;
> +		spin_unlock(&hugetlb_lock);
> +		put_page_testzero(page);
> +	}
> +	return page;
> +}

One would expect alloc_buddy_huge_page_node() to differ from
alloc_buddy_huge_page() only in the alloc_pages()/alloc_pages_exact_node()
calls. However, you implement alloc_buddy_huge_page_node() in quite a
different way. Can the two functions be unified at all?
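
One possible shape, just to sketch the idea (untested; the helper name
and the "nid < 0 means any node" convention are made up here):

	static struct page *alloc_fresh_buddy_huge_page(struct hstate *h, int nid)
	{
		gfp_t gfp_mask = htlb_alloc_mask | __GFP_COMP |
				 __GFP_REPEAT | __GFP_NOWARN;
		struct page *page;

		if (nid < 0)		/* "any node" */
			page = alloc_pages(gfp_mask, huge_page_order(h));
		else
			page = alloc_pages_exact_node(nid, gfp_mask,
						      huge_page_order(h));

		if (page && arch_prepare_hugepage(page)) {
			__free_pages(page, huge_page_order(h));
			page = NULL;
		}
		return page;
	}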

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] HWPOISON, hugetlb: fix unpoison for hugepage
  2010-08-24 23:55   ` Naoya Horiguchi
@ 2010-08-25  2:54     ` Wu Fengguang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2010-08-25  2:54 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Aug 25, 2010 at 07:55:26AM +0800, Naoya Horiguchi wrote:
> Currently unpoisoning hugepages doesn't work because it's not enough
> to just clear PG_HWPoison bits and we need to link the hugepage
> to be unpoisoned back to the free hugepage list.
> To do this, we get and put hwpoisoned hugepage whose refcount is 0.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> ---
>  mm/memory-failure.c |   16 +++++++++++++---
>  1 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
> index 60178d2..ab36690 100644
> --- v2.6.36-rc2/mm/memory-failure.c
> +++ v2.6.36-rc2/mm/memory-failure.c
> @@ -1154,9 +1154,19 @@ int unpoison_memory(unsigned long pfn)
>  	nr_pages = 1 << compound_order(page);
>  
>  	if (!get_page_unless_zero(page)) {
> -		if (TestClearPageHWPoison(p))
> +		/* The page to be unpoisoned was free one when hwpoisoned */
> +		if (TestClearPageHWPoison(page))
>  			atomic_long_sub(nr_pages, &mce_bad_pages);
>  		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
> +		if (PageHuge(page)) {
> +			/*
> +			 * To unpoison free hugepage, we get and put it
> +			 * to move it back to the free list.
> +			 */
> +			get_page(page);
> +			clear_page_hwpoison_huge_page(page);
> +			put_page(page);
> +		}
>  		return 0;
>  	}

The free huge page detection here is racy.

alloc_huge_page() does not increase the page refcount inside
hugetlb_lock, and the alloc_huge_page()=>alloc_buddy_huge_page() path
even drops the lock temporarily! So we can never know reliably whether
a huge page is really free.

Here is a scratch fix. It is totally untested. I just want to point out
that with this patch, huge page unpoisoning should become easier.

Thanks,
Fengguang

---
 include/linux/hugetlb.h |    4 +-
 mm/hugetlb.c            |   51 +++++++++++++++++---------------------
 mm/memory-failure.c     |   24 ++++++-----------
 3 files changed, 34 insertions(+), 45 deletions(-)

--- linux-next.orig/mm/hugetlb.c	2010-08-25 10:16:15.000000000 +0800
+++ linux-next/mm/hugetlb.c	2010-08-25 10:47:17.000000000 +0800
@@ -502,6 +502,7 @@ static struct page *dequeue_huge_page_vm
 			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
+			set_page_refcounted(page);
 			h->free_huge_pages--;
 			h->free_huge_pages_node[nid]--;
 
@@ -822,12 +823,6 @@ static struct page *alloc_buddy_huge_pag
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
-		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
-		 */
-		put_page_testzero(page);
-		VM_BUG_ON(page_count(page));
 		nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
@@ -877,8 +872,6 @@ retry:
 			 * satisfy the entire reservation so we free what
 			 * we've allocated so far.
 			 */
-			spin_lock(&hugetlb_lock);
-			needed = 0;
 			goto free;
 		}
 
@@ -904,33 +897,28 @@ retry:
 	 * process from stealing the pages as they are added to the pool but
 	 * before they are reserved.
 	 */
-	needed += allocated;
 	h->resv_huge_pages += delta;
 	ret = 0;
-free:
+
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		if ((--needed) < 0)
-			break;
 		list_del(&page->lru);
+		/*
+		 * This page is now managed by the hugetlb allocator and has
+		 * no users -- drop the buddy allocator's reference.
+		 */
+		put_page_testzero(page);
+		VM_BUG_ON(page_count(page));
 		enqueue_huge_page(h, page);
 	}
-
+	spin_unlock(&hugetlb_lock);
+free:
 	/* Free unnecessary surplus pages to the buddy allocator */
 	if (!list_empty(&surplus_list)) {
-		spin_unlock(&hugetlb_lock);
 		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 			list_del(&page->lru);
-			/*
-			 * The page has a reference count of zero already, so
-			 * call free_huge_page directly instead of using
-			 * put_page.  This must be done with hugetlb_lock
-			 * unlocked which is safe because free_huge_page takes
-			 * hugetlb_lock before deciding how to free the page.
-			 */
-			free_huge_page(page);
+			put_page(page);
 		}
-		spin_lock(&hugetlb_lock);
 	}
 
 	return ret;
@@ -1058,7 +1046,6 @@ static struct page *alloc_huge_page(stru
 		}
 	}
 
-	set_page_refcounted(page);
 	set_page_private(page, (unsigned long) mapping);
 
 	vma_commit_reservation(h, vma, addr);
@@ -2875,18 +2862,26 @@ void hugetlb_unreserve_pages(struct inod
 	hugetlb_acct_memory(h, -(chg - freed));
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
 /*
  * This function is called from memory failure code.
  * Assume the caller holds page lock of the head page.
  */
-void __isolate_hwpoisoned_huge_page(struct page *hpage)
+int dequeue_hwpoisoned_huge_page(struct page *hpage)
 {
 	struct hstate *h = page_hstate(hpage);
 	int nid = page_to_nid(hpage);
+	int ret = -EBUSY;
 
 	spin_lock(&hugetlb_lock);
-	list_del(&hpage->lru);
-	h->free_huge_pages--;
-	h->free_huge_pages_node[nid]--;
+	if (!page_count(hpage)) {
+		list_del(&hpage->lru);
+		set_page_refcounted(hpage);
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ret = 0;
+	}
 	spin_unlock(&hugetlb_lock);
+	return ret;
 }
+#endif
--- linux-next.orig/include/linux/hugetlb.h	2010-08-25 10:43:18.000000000 +0800
+++ linux-next/include/linux/hugetlb.h	2010-08-25 10:48:43.000000000 +0800
@@ -43,7 +43,7 @@ int hugetlb_reserve_pages(struct inode *
 						struct vm_area_struct *vma,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-void __isolate_hwpoisoned_huge_page(struct page *page);
+int dequeue_hwpoisoned_huge_page(struct page *page);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -101,7 +101,7 @@ static inline void hugetlb_report_meminf
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
-#define __isolate_hwpoisoned_huge_page(page)	0
+#define dequeue_hwpoisoned_huge_page(page)	0
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
--- linux-next.orig/mm/memory-failure.c	2010-08-25 10:25:03.000000000 +0800
+++ linux-next/mm/memory-failure.c	2010-08-25 10:42:51.000000000 +0800
@@ -698,21 +698,6 @@ static int me_swapcache_clean(struct pag
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
-	struct page *hpage = compound_head(p);
-	/*
-	 * We can safely recover from error on free or reserved (i.e.
-	 * not in-use) hugepage by dequeuing it from freelist.
-	 * To check whether a hugepage is in-use or not, we can't use
-	 * page->lru because it can be used in other hugepage operations,
-	 * such as __unmap_hugepage_range() and gather_surplus_pages().
-	 * So instead we use page_mapping() and PageAnon().
-	 * We assume that this function is called with page lock held,
-	 * so there is no race between isolation and mapping/unmapping.
-	 */
-	if (!(page_mapping(hpage) || PageAnon(hpage))) {
-		__isolate_hwpoisoned_huge_page(hpage);
-		return RECOVERED;
-	}
 	return DELAYED;
 }
 
@@ -993,6 +978,11 @@ int __memory_failure(unsigned long pfn, 
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, "free buddy", DELAYED);
 			return 0;
+		} else if (PageHuge(hpage)) {
+			res = dequeue_hwpoisoned_huge_page(hpage);
+			action_result(pfn, "free huge",
+				      res ? DELAYED : RECOVERED);
+			return res;
 		} else {
 			action_result(pfn, "high order kernel", IGNORED);
 			return -EBUSY;
@@ -1221,6 +1211,10 @@ static int get_any_page(struct page *p, 
 			/* Set hwpoison bit while page is still isolated */
 			SetPageHWPoison(p);
 			ret = 0;
+		} else if (PageHuge(p)) {
+			ret = dequeue_hwpoisoned_huge_page(compound_head(p));
+			if (!ret)
+				SetPageHWPoison(p);
 		} else {
 			pr_debug("get_any_page: %#lx: unknown zero refcount page type %lx\n",
 				pfn, p->flags);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] HWPOISON, hugetlb: fix unpoison for hugepage
@ 2010-08-25  2:54     ` Wu Fengguang
  0 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2010-08-25  2:54 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Aug 25, 2010 at 07:55:26AM +0800, Naoya Horiguchi wrote:
> Currently unpoisoning hugepages doesn't work because it's not enough
> to just clear PG_HWPoison bits and we need to link the hugepage
> to be unpoisoned back to the free hugepage list.
> To do this, we get and put hwpoisoned hugepage whose refcount is 0.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> ---
>  mm/memory-failure.c |   16 +++++++++++++---
>  1 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
> index 60178d2..ab36690 100644
> --- v2.6.36-rc2/mm/memory-failure.c
> +++ v2.6.36-rc2/mm/memory-failure.c
> @@ -1154,9 +1154,19 @@ int unpoison_memory(unsigned long pfn)
>  	nr_pages = 1 << compound_order(page);
>  
>  	if (!get_page_unless_zero(page)) {
> -		if (TestClearPageHWPoison(p))
> +		/* The page to be unpoisoned was free one when hwpoisoned */
> +		if (TestClearPageHWPoison(page))
>  			atomic_long_sub(nr_pages, &mce_bad_pages);
>  		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
> +		if (PageHuge(page)) {
> +			/*
> +			 * To unpoison free hugepage, we get and put it
> +			 * to move it back to the free list.
> +			 */
> +			get_page(page);
> +			clear_page_hwpoison_huge_page(page);
> +			put_page(page);
> +		}
>  		return 0;
>  	}

It's racy in free huge page detection.

alloc_huge_page() does not increase page refcount inside hugetlb_lock,
the alloc_huge_page()=>alloc_buddy_huge_page() path even drops the
lock temporarily! Then we never know reliably if a huge page is really
free.

Here is a scratched fix. It is totally untested. Just want to notice
you that with this patch, the huge page unpoisoning should go easier.

Thanks,
Fengguang

---
 include/linux/hugetlb.h |    4 +-
 mm/hugetlb.c            |   51 +++++++++++++++++---------------------
 mm/memory-failure.c     |   24 ++++++-----------
 3 files changed, 34 insertions(+), 45 deletions(-)

--- linux-next.orig/mm/hugetlb.c	2010-08-25 10:16:15.000000000 +0800
+++ linux-next/mm/hugetlb.c	2010-08-25 10:47:17.000000000 +0800
@@ -502,6 +502,7 @@ static struct page *dequeue_huge_page_vm
 			page = list_entry(h->hugepage_freelists[nid].next,
 					  struct page, lru);
 			list_del(&page->lru);
+			set_page_refcounted(page);
 			h->free_huge_pages--;
 			h->free_huge_pages_node[nid]--;
 
@@ -822,12 +823,6 @@ static struct page *alloc_buddy_huge_pag
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
-		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
-		 */
-		put_page_testzero(page);
-		VM_BUG_ON(page_count(page));
 		nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
@@ -877,8 +872,6 @@ retry:
 			 * satisfy the entire reservation so we free what
 			 * we've allocated so far.
 			 */
-			spin_lock(&hugetlb_lock);
-			needed = 0;
 			goto free;
 		}
 
@@ -904,33 +897,28 @@ retry:
 	 * process from stealing the pages as they are added to the pool but
 	 * before they are reserved.
 	 */
-	needed += allocated;
 	h->resv_huge_pages += delta;
 	ret = 0;
-free:
+
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		if ((--needed) < 0)
-			break;
 		list_del(&page->lru);
+		/*
+		 * This page is now managed by the hugetlb allocator and has
+		 * no users -- drop the buddy allocator's reference.
+		 */
+		put_page_testzero(page);
+		VM_BUG_ON(page_count(page));
 		enqueue_huge_page(h, page);
 	}
-
+	spin_unlock(&hugetlb_lock);
+free:
 	/* Free unnecessary surplus pages to the buddy allocator */
 	if (!list_empty(&surplus_list)) {
-		spin_unlock(&hugetlb_lock);
 		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 			list_del(&page->lru);
-			/*
-			 * The page has a reference count of zero already, so
-			 * call free_huge_page directly instead of using
-			 * put_page.  This must be done with hugetlb_lock
-			 * unlocked which is safe because free_huge_page takes
-			 * hugetlb_lock before deciding how to free the page.
-			 */
-			free_huge_page(page);
+			put_page(page);
 		}
-		spin_lock(&hugetlb_lock);
 	}
 
 	return ret;
@@ -1058,7 +1046,6 @@ static struct page *alloc_huge_page(stru
 		}
 	}
 
-	set_page_refcounted(page);
 	set_page_private(page, (unsigned long) mapping);
 
 	vma_commit_reservation(h, vma, addr);
@@ -2875,18 +2862,26 @@ void hugetlb_unreserve_pages(struct inod
 	hugetlb_acct_memory(h, -(chg - freed));
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
 /*
  * This function is called from memory failure code.
  * Assume the caller holds page lock of the head page.
  */
-void __isolate_hwpoisoned_huge_page(struct page *hpage)
+int dequeue_hwpoisoned_huge_page(struct page *hpage)
 {
 	struct hstate *h = page_hstate(hpage);
 	int nid = page_to_nid(hpage);
+	int ret = -EBUSY;
 
 	spin_lock(&hugetlb_lock);
-	list_del(&hpage->lru);
-	h->free_huge_pages--;
-	h->free_huge_pages_node[nid]--;
+	if (!page_count(hpage)) {
+		list_del(&hpage->lru);
+		set_page_refcounted(hpage);
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ret = 0;
+	}
 	spin_unlock(&hugetlb_lock);
+	return ret;
 }
+#endif
--- linux-next.orig/include/linux/hugetlb.h	2010-08-25 10:43:18.000000000 +0800
+++ linux-next/include/linux/hugetlb.h	2010-08-25 10:48:43.000000000 +0800
@@ -43,7 +43,7 @@ int hugetlb_reserve_pages(struct inode *
 						struct vm_area_struct *vma,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-void __isolate_hwpoisoned_huge_page(struct page *page);
+int dequeue_hwpoisoned_huge_page(struct page *page);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -101,7 +101,7 @@ static inline void hugetlb_report_meminf
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
-#define __isolate_hwpoisoned_huge_page(page)	0
+#define dequeue_hwpoisoned_huge_page(page)	0
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
--- linux-next.orig/mm/memory-failure.c	2010-08-25 10:25:03.000000000 +0800
+++ linux-next/mm/memory-failure.c	2010-08-25 10:42:51.000000000 +0800
@@ -698,21 +698,6 @@ static int me_swapcache_clean(struct pag
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
-	struct page *hpage = compound_head(p);
-	/*
-	 * We can safely recover from error on free or reserved (i.e.
-	 * not in-use) hugepage by dequeuing it from freelist.
-	 * To check whether a hugepage is in-use or not, we can't use
-	 * page->lru because it can be used in other hugepage operations,
-	 * such as __unmap_hugepage_range() and gather_surplus_pages().
-	 * So instead we use page_mapping() and PageAnon().
-	 * We assume that this function is called with page lock held,
-	 * so there is no race between isolation and mapping/unmapping.
-	 */
-	if (!(page_mapping(hpage) || PageAnon(hpage))) {
-		__isolate_hwpoisoned_huge_page(hpage);
-		return RECOVERED;
-	}
 	return DELAYED;
 }
 
@@ -993,6 +978,11 @@ int __memory_failure(unsigned long pfn, 
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, "free buddy", DELAYED);
 			return 0;
+		} else if (PageHuge(hpage)) {
+			res = dequeue_hwpoisoned_huge_page(hpage);
+			action_result(pfn, "free huge",
+				      res ? DELAYED : RECOVERED);
+			return res;
 		} else {
 			action_result(pfn, "high order kernel", IGNORED);
 			return -EBUSY;
@@ -1221,6 +1211,10 @@ static int get_any_page(struct page *p, 
 			/* Set hwpoison bit while page is still isolated */
 			SetPageHWPoison(p);
 			ret = 0;
+		} else if (PageHuge(p)) {
+			ret = dequeue_hwpoisoned_huge_page(compound_head(p));
+			if (!ret)
+				SetPageHWPoison(p);
 		} else {
 			pr_debug("get_any_page: %#lx: unknown zero refcount page type %lx\n",
 				pfn, p->flags);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] HWPOISON, hugetlb: soft offlining for hugepage
  2010-08-24 23:55   ` Naoya Horiguchi
@ 2010-08-25  3:02     ` Wu Fengguang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2010-08-25  3:02 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

> +static int is_hugepage_on_freelist(struct page *hpage)
> +{
> +	struct page *page;
> +	struct page *tmp;
> +	struct hstate *h = page_hstate(hpage);
> +	int nid = page_to_nid(hpage);
> +
> +	spin_lock(&hugetlb_lock);
> +	list_for_each_entry_safe(page, tmp, &h->hugepage_freelists[nid], lru) {
> +		if (page == hpage) {
> +			spin_unlock(&hugetlb_lock);
> +			return 1;
> +		}
> +	}
> +	spin_unlock(&hugetlb_lock);
> +	return 0;
> +}

Ha! That looks better than the page_count test in my previous email.

> +void isolate_hwpoisoned_huge_page(struct page *hpage)
> +{
> +	lock_page(hpage);
> +	if (is_hugepage_on_freelist(hpage))
> +		__isolate_hwpoisoned_huge_page(hpage);
> +	unlock_page(hpage);
> +}

However, it should still be racy if the test and isolate steps are
not performed within the same hugetlb_lock critical section.
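
For illustration only (this is not code from the posted patches, and the
helper name here is hypothetical), one way to close that window is to do
the freelist walk and the dequeue within a single hugetlb_lock critical
section:

static int dequeue_hwpoisoned_huge_page_locked(struct page *hpage)
{
	struct hstate *h = page_hstate(hpage);
	int nid = page_to_nid(hpage);
	struct page *page, *tmp;
	int ret = -EBUSY;

	spin_lock(&hugetlb_lock);
	/*
	 * Check and isolate under one lock hold so the freelist cannot
	 * change between the "is it free?" test and the list_del().
	 */
	list_for_each_entry_safe(page, tmp, &h->hugepage_freelists[nid], lru) {
		if (page == hpage) {
			list_del(&hpage->lru);
			h->free_huge_pages--;
			h->free_huge_pages_node[nid]--;
			ret = 0;
			break;
		}
	}
	spin_unlock(&hugetlb_lock);
	return ret;
}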

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] hugetlb: add allocate function for hugepage migration
  2010-08-25  1:29     ` Wu Fengguang
@ 2010-08-26  8:24       ` Naoya Horiguchi
  -1 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-26  8:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Aug 25, 2010 at 09:29:41AM +0800, Wu Fengguang wrote:
> > +static struct page *alloc_buddy_huge_page_node(struct hstate *h, int nid)
> > +{
> > +	struct page *page = __alloc_huge_page_node(h, nid);
> >  	if (page) {
> > -		if (arch_prepare_hugepage(page)) {
> > -			__free_pages(page, huge_page_order(h));
> > +		set_compound_page_dtor(page, free_huge_page);
> > +		spin_lock(&hugetlb_lock);
> > +		h->nr_huge_pages++;
> > +		h->nr_huge_pages_node[nid]++;
> > +		spin_unlock(&hugetlb_lock);
> > +		put_page_testzero(page);
> > +	}
> > +	return page;
> > +}
> 
> One would expect alloc_buddy_huge_page_node() to differ from
> alloc_buddy_huge_page() only in the alloc_pages/alloc_pages_exact_node
> calls. However, you implement alloc_buddy_huge_page_node() in quite a
> different way. Can the two functions be unified at all?

Yes. I did it by adding a @nid argument to alloc_buddy_huge_page().
The code gets cleaner and works without problems.
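
For reference, a rough sketch of what the unified allocator could look
like (an illustration of the idea only, not the code that was actually
merged; treating NUMA_NO_NODE as the "allocate on any node" marker is an
assumption here, and the surplus accounting and refcount handling of the
real function are omitted for brevity):

static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
{
	struct page *page;
	/* htlb_alloc_mask is the file-local gfp mask used by hugetlb */
	gfp_t gfp = htlb_alloc_mask | __GFP_COMP | __GFP_REPEAT | __GFP_NOWARN;

	if (nid == NUMA_NO_NODE)
		page = alloc_pages(gfp, huge_page_order(h));
	else
		page = alloc_pages_exact_node(nid, gfp, huge_page_order(h));
	if (!page)
		return NULL;

	if (arch_prepare_hugepage(page)) {
		__free_pages(page, huge_page_order(h));
		return NULL;
	}

	set_compound_page_dtor(page, free_huge_page);
	spin_lock(&hugetlb_lock);
	h->nr_huge_pages++;
	h->nr_huge_pages_node[page_to_nid(page)]++;
	spin_unlock(&hugetlb_lock);

	return page;
}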

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/8] hugetlb: rename hugepage allocation functions
  2010-08-25  1:21     ` Wu Fengguang
@ 2010-08-26  8:25       ` Naoya Horiguchi
  -1 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-26  8:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Aug 25, 2010 at 09:21:31AM +0800, Wu Fengguang wrote:
> On Wed, Aug 25, 2010 at 07:55:22AM +0800, Naoya Horiguchi wrote:
> > The function name alloc_huge_page_no_vma_node() has verbose suffix "_no_vma".
> > This patch makes existing alloc_huge_page() and its family have "_vma" instead,
> > which makes it easier to read.
> > 
...
> > @@ -919,7 +919,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
> >  retry:
> >     spin_unlock(&hugetlb_lock);
> >     for (i = 0; i < needed; i++) {
> > -           page = alloc_buddy_huge_page(h, NULL, 0);
> > +           page = alloc_buddy_huge_page_vma(h, NULL, 0);
> 
> alloc_buddy_huge_page() doesn't make use of @vma at all, so the
> parameters can be removed.

OK.

> It looks cleaner to fold the
> alloc_huge_page_no_vma_node=>alloc_huge_page_node renames into the
> previous patch, from there split out the code refactor chunks into
> a standalone patch, and then include this cleanup patch.

When we unify alloc_buddy_huge_page() as commented on patch 2/8,
the _vma suffix is no longer necessary. Your suggestion to drop the
unused arguments sounds reasonable, so I will merge the attached
patch into patch 2/8.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] HWPOISON, hugetlb: fix unpoison for hugepage
  2010-08-25  2:54     ` Wu Fengguang
@ 2010-08-26  8:26       ` Naoya Horiguchi
  -1 siblings, 0 replies; 34+ messages in thread
From: Naoya Horiguchi @ 2010-08-26  8:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Mel Gorman,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Aug 25, 2010 at 10:54:32AM +0800, Wu Fengguang wrote:
> On Wed, Aug 25, 2010 at 07:55:26AM +0800, Naoya Horiguchi wrote:
> > Currently unpoisoning hugepages doesn't work because it's not enough
> > to just clear PG_HWPoison bits and we need to link the hugepage
> > to be unpoisoned back to the free hugepage list.
> > To do this, we get and put hwpoisoned hugepage whose refcount is 0.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> > ---
> >  mm/memory-failure.c |   16 +++++++++++++---
> >  1 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
> > index 60178d2..ab36690 100644
> > --- v2.6.36-rc2/mm/memory-failure.c
> > +++ v2.6.36-rc2/mm/memory-failure.c
> > @@ -1154,9 +1154,19 @@ int unpoison_memory(unsigned long pfn)
> >  	nr_pages = 1 << compound_order(page);
> >  
> >  	if (!get_page_unless_zero(page)) {
> > -		if (TestClearPageHWPoison(p))
> > +		/* The page to be unpoisoned was free one when hwpoisoned */
> > +		if (TestClearPageHWPoison(page))
> >  			atomic_long_sub(nr_pages, &mce_bad_pages);
> >  		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
> > +		if (PageHuge(page)) {
> > +			/*
> > +			 * To unpoison free hugepage, we get and put it
> > +			 * to move it back to the free list.
> > +			 */
> > +			get_page(page);
> > +			clear_page_hwpoison_huge_page(page);
> > +			put_page(page);
> > +		}
> >  		return 0;
> >  	}
> 
> It's racy in free huge page detection.
> 
> alloc_huge_page() does not increase page refcount inside hugetlb_lock,
> the alloc_huge_page()=>alloc_buddy_huge_page() path even drops the
> lock temporarily! Then we never know reliably if a huge page is really
> free.

I agree.
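
To spell out the window Fengguang describes in the pre-fix code (an
illustration only, simplified, not taken from the posted patches; it
assumes the pre-series behavior where alloc_huge_page() raised the
refcount only after dropping hugetlb_lock, which the first hunk above
removes):

	/*
	 * CPU 0 (unpoison path)           CPU 1 (alloc_huge_page)
	 *
	 *                                 spin_lock(&hugetlb_lock);
	 *                                 dequeue page from the freelist
	 *                                 spin_unlock(&hugetlb_lock);
	 * get_page_unless_zero() fails,
	 * so the page looks "free" and
	 * gets requeued, which is wrong
	 *                                 set_page_refcounted(page);
	 *
	 * A reliable test has to inspect state that only changes under
	 * hugetlb_lock, and act on it while still holding the lock.
	 */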

> Here is a scratch fix. It is totally untested. I just want to let you
> know that with this patch, huge page unpoisoning should become easier.

Great.
I adapted this patch to the real hugetlb code, and it passed the libhugetlbfs tests.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2010-08-26  8:28 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-24 23:55 [PATCH 0/8] Hugepage migration (v3) Naoya Horiguchi
2010-08-24 23:55 ` Naoya Horiguchi
2010-08-24 23:55 ` [PATCH 1/8] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-25  0:17   ` Wu Fengguang
2010-08-25  0:17     ` Wu Fengguang
2010-08-24 23:55 ` [PATCH 2/8] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-25  1:29   ` Wu Fengguang
2010-08-25  1:29     ` Wu Fengguang
2010-08-26  8:24     ` Naoya Horiguchi
2010-08-26  8:24       ` Naoya Horiguchi
2010-08-24 23:55 ` [PATCH 3/8] hugetlb: rename hugepage allocation functions Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-25  1:21   ` Wu Fengguang
2010-08-25  1:21     ` Wu Fengguang
2010-08-26  8:25     ` Naoya Horiguchi
2010-08-26  8:25       ` Naoya Horiguchi
2010-08-24 23:55 ` [PATCH 4/8] hugetlb: redefine hugepage copy functions Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-24 23:55 ` [PATCH 5/8] hugetlb: hugepage migration core Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-24 23:55 ` [PATCH 6/8] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-25  3:02   ` Wu Fengguang
2010-08-25  3:02     ` Wu Fengguang
2010-08-24 23:55 ` [PATCH 7/8] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
2010-08-25  2:54   ` Wu Fengguang
2010-08-25  2:54     ` Wu Fengguang
2010-08-26  8:26     ` Naoya Horiguchi
2010-08-26  8:26       ` Naoya Horiguchi
2010-08-24 23:55 ` [PATCH 8/8] page-types.c: fix name of unpoison interface Naoya Horiguchi
2010-08-24 23:55   ` Naoya Horiguchi
