* [PATCH v2 0/10] extend hugepage migration
@ 2013-03-22 20:23 Naoya Horiguchi
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
                   ` (9 more replies)
  0 siblings, 10 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Hi,

Here is the 2nd version of the hugepage migration patchset.
I rebased it onto v3.9-rc3 and incorporated all the feedback from Hillf,
Michal and Kosaki-san, unless I missed something. Thank you guys.

I hope we can move on to the 2nd iteration of the review process;
please let me know if there is anything you would like to discuss.

Thanks,
Naoya Horiguchi

--- General Description (exactly the same as in the previous post) ---

Hugepage migration is currently available only for soft offlining (moving
data from a half-corrupted page to another page in order to save it).
But it is also useful for some other users of page migration, so this
patchset tries to extend some of those users to support hugepages.

The targets of this patchset are the NUMA-related system calls (i.e.
migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug.
This patchset does not extend page migration in memory compaction,
because users of memory compaction mainly expect to construct THPs
by rearranging raw pages, and hugepage migration doesn't help with that.
CMA, another user of page migration, could also benefit from hugepage
migration, but support for it is not enabled here. This is because
I've never used CMA and need to learn more before extending and/or
testing hugepage migration in CMA. I'll add this in a later version if
it becomes ready, or post it as a separate patchset.

Migration of 1GB hugepages is not enabled for now, because I'm not sure
whether users of 1GB hugepages really want it. We need to keep a free
hugepage spare in order to migrate, and I don't think users want 1GB of
memory to sit idle for that purpose (currently we can't expand/shrink
the 1GB hugepage pool after boot).
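
For illustration, here is a minimal userspace sketch (not part of this
patchset) of the kind of mapping these system calls operate on. It assumes
a 2MB default hugepage size and that enough free hugepages are reserved in
the pool (vm.nr_hugepages) on both the source and the destination node:

#include <sys/mman.h>
#include <string.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumed default hugepage size */

/*
 * Map and fault in one anonymous hugepage; the NUMA syscalls extended by
 * this series (migrate_pages(2), move_pages(2), mbind(2)) can then move
 * its backing hugepage to another node.
 */
static void *map_one_hugepage(void)
{
	void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memset(p, 0, HPAGE_SIZE);	/* fault the hugepage in */
	return p;
}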

---
Easy patch access:
  git://github.com/Naoya-Horiguchi/linux.git extend_hugepage_migration_v2

Test code:
  git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

Naoya Horiguchi (10):
      migrate: add migrate_entry_wait_huge()
      migrate: make core migration code aware of hugepage
      soft-offline: use migrate_pages() instead of migrate_huge_page()
      migrate: clean up migrate_huge_page()
      migrate: add hugepage migration code to migrate_pages()
      migrate: add hugepage migration code to move_pages()
      mbind: add hugepage migration code to mbind()
      migrate: remove VM_HUGETLB from vma flag check in vma_migratable()
      memory-hotplug: enable memory hotplug to handle hugepage
      prepare to remove /proc/sys/vm/hugepages_treat_as_movable

 Documentation/sysctl/vm.txt |  13 +-----
 include/linux/hugetlb.h     |  15 ++++++
 include/linux/mempolicy.h   |   2 +-
 include/linux/migrate.h     |  11 +++--
 include/linux/swapops.h     |   4 ++
 mm/hugetlb.c                | 110 ++++++++++++++++++++++++++++++++++++++------
 mm/memory-failure.c         |  15 ++++--
 mm/memory.c                 |   6 ++-
 mm/memory_hotplug.c         |  42 +++++++++++++----
 mm/mempolicy.c              |  56 ++++++++++++++--------
 mm/migrate.c                | 100 ++++++++++++++++++++++++++--------------
 mm/page_alloc.c             |  12 +++++
 mm/page_isolation.c         |   5 ++
 13 files changed, 293 insertions(+), 98 deletions(-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-23 15:55   ` Rik van Riel
                     ` (3 more replies)
  2013-03-22 20:23 ` [PATCH 02/10] migrate: make core migration code aware of hugepage Naoya Horiguchi
                   ` (8 subsequent siblings)
  9 siblings, 4 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

When we have a page fault on an address that is backed by a hugepage
under migration, the kernel can't wait correctly until the migration
finishes. This is because pte_offset_map_lock() can't get a correct
migration entry for a hugepage. This patch adds migration_entry_wait_huge()
to separate the code paths for normal pages and hugepages.
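
Schematically (this just restates the hunk below; it is not additional
code), the difference between the two wait paths is where the entry and
its page table lock are looked up:

	/* normal page: the migration entry sits one level below the pmd */
	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);

	/* pmd-sized hugepage: the pmd entry itself is the huge pte, so
	 * descending another level as above would read a bogus entry */
	ptl = pte_lockptr(mm, pmd);
	__migration_entry_wait(mm, (pte_t *)pmd, ptl);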

ChangeLog v2:
 - remove dup in migrate_entry_wait_huge()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/swapops.h |  4 ++++
 mm/hugetlb.c            |  2 +-
 mm/migrate.c            | 25 ++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git v3.9-rc3.orig/include/linux/swapops.h v3.9-rc3/include/linux/swapops.h
index 47ead51..f68efdd 100644
--- v3.9-rc3.orig/include/linux/swapops.h
+++ v3.9-rc3/include/linux/swapops.h
@@ -137,6 +137,8 @@ static inline void make_migration_entry_read(swp_entry_t *entry)
 
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
+extern void migration_entry_wait_huge(struct mm_struct *mm, pmd_t *pmd,
+					unsigned long address);
 #else
 
 #define make_migration_entry(page, write) swp_entry(0, 0)
@@ -148,6 +150,8 @@ static inline int is_migration_entry(swp_entry_t swp)
 static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
+static inline void migration_entry_wait_huge(struct mm_struct *mm, pmd_t *pmd,
+					 unsigned long address) { }
 static inline int is_write_migration_entry(swp_entry_t entry)
 {
 	return 0;
diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index 0a0be33..98a478e 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
index 3bbaf5d..ec692a3 100644
--- v3.9-rc3.orig/mm/migrate.c
+++ v3.9-rc3/mm/migrate.c
@@ -200,15 +200,14 @@ static void remove_migration_ptes(struct page *old, struct page *new)
  * get to the page and wait until migration is finished.
  * When we return from this function the fault will be retried.
  */
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long address)
+static void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
+				spinlock_t *ptl)
 {
-	pte_t *ptep, pte;
-	spinlock_t *ptl;
+	pte_t pte;
 	swp_entry_t entry;
 	struct page *page;
 
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
 		goto out;
@@ -236,6 +235,22 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap_unlock(ptep, ptl);
 }
 
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long address)
+{
+	pte_t *ptep;
+	spinlock_t *ptl = pte_lockptr(mm, pmd);
+	ptep = pte_offset_map(pmd, address);
+	__migration_entry_wait(mm, ptep, ptl);
+}
+
+void migration_entry_wait_huge(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long address)
+{
+	spinlock_t *ptl = pte_lockptr(mm, pmd);
+	__migration_entry_wait(mm, (pte_t *)pmd, ptl);
+}
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 02/10] migrate: make core migration code aware of hugepage
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 10:57   ` Michal Hocko
  2013-03-22 20:23 ` [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page() Naoya Horiguchi
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Before enabling each user of page migration to support hugepages,
this patch adds the necessary changes to the core migration code.
The main change is that the list of pages to migrate can now link
not only LRU pages but also hugepages.
Along with this, functions such as migrate_pages() and
putback_movable_pages() are changed to handle hugepages.
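
As a rough sketch of the calling convention this enables (pieced together
from the hunks below and from later patches in this series;
migrate_hugepage_add() is only introduced in patch 5/10), a caller can put
both kinds of pages on one list and hand it to the migration core:

	LIST_HEAD(pagelist);
	int err;

	if (PageHuge(page))
		/* hugepage: taken off the active list with a reference held */
		migrate_hugepage_add(page, &pagelist);
	else if (!isolate_lru_page(page))
		list_add_tail(&page->lru, &pagelist);

	/* migrate_pages() dispatches each entry to unmap_and_move() or
	 * unmap_and_move_huge_page(); on failure put everything back */
	err = migrate_pages(&pagelist, get_new_page, private,
			    MIGRATE_SYNC, MR_SYSCALL);
	if (err)
		putback_movable_pages(&pagelist);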

ChangeLog v2:
 - move code removing VM_HUGETLB from vma_migratable check into a
   separate patch
 - hold hugetlb_lock in putback_active_hugepage
 - update comment near the definition of hugetlb_lock

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |  4 ++++
 include/linux/migrate.h |  6 ++++++
 mm/hugetlb.c            | 21 ++++++++++++++++++++-
 mm/migrate.c            | 24 +++++++++++++++++++++++-
 4 files changed, 53 insertions(+), 2 deletions(-)

diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
index 16e4e9a..baa0aa0 100644
--- v3.9-rc3.orig/include/linux/hugetlb.h
+++ v3.9-rc3/include/linux/hugetlb.h
@@ -66,6 +66,8 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						vm_flags_t vm_flags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
+void putback_active_hugepage(struct page *page);
+void putback_active_hugepages(struct list_head *l);
 void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
@@ -128,6 +130,8 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
 	return 0;
 }
 
+#define putback_active_hugepage(p) 0
+#define putback_active_hugepages(l) 0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
diff --git v3.9-rc3.orig/include/linux/migrate.h v3.9-rc3/include/linux/migrate.h
index a405d3dc..d4c6c08 100644
--- v3.9-rc3.orig/include/linux/migrate.h
+++ v3.9-rc3/include/linux/migrate.h
@@ -41,6 +41,9 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern int migrate_movable_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private,
+		enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
 		unsigned long private, enum migrate_mode mode);
 
@@ -62,6 +65,9 @@ static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, enum migrate_mode mode, int reason)
 	{ return -ENOSYS; }
+static inline int migrate_movable_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private, bool offlining,
+		enum migrate_mode mode, int reason) { return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, enum migrate_mode mode)
 	{ return -ENOSYS; }
diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index 98a478e..a787c44 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -48,7 +48,8 @@ static unsigned long __initdata default_hstate_max_huge_pages;
 static unsigned long __initdata default_hstate_size;
 
 /*
- * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
+ * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
+ * free_huge_pages, and surplus_huge_pages.
  */
 DEFINE_SPINLOCK(hugetlb_lock);
 
@@ -3182,3 +3183,21 @@ int dequeue_hwpoisoned_huge_page(struct page *hpage)
 	return ret;
 }
 #endif
+
+void putback_active_hugepage(struct page *page)
+{
+	VM_BUG_ON(!PageHead(page));
+	spin_lock(&hugetlb_lock);
+	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
+	spin_unlock(&hugetlb_lock);
+	put_page(page);
+}
+
+void putback_active_hugepages(struct list_head *l)
+{
+	struct page *page;
+	struct page *page2;
+
+	list_for_each_entry_safe(page, page2, l, lru)
+		putback_active_hugepage(page);
+}
diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
index ec692a3..f69f354 100644
--- v3.9-rc3.orig/mm/migrate.c
+++ v3.9-rc3/mm/migrate.c
@@ -100,6 +100,10 @@ void putback_movable_pages(struct list_head *l)
 	struct page *page2;
 
 	list_for_each_entry_safe(page, page2, l, lru) {
+		if (unlikely(PageHuge(page))) {
+			putback_active_hugepage(page);
+			continue;
+		}
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
@@ -1023,7 +1027,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move(get_new_page, private,
+			if (PageHuge(page))
+				rc = unmap_and_move_huge_page(get_new_page,
+						private, page, pass > 2, mode);
+			else
+				rc = unmap_and_move(get_new_page, private,
 						page, pass > 2, mode);
 
 			switch(rc) {
@@ -1056,6 +1064,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	return rc;
 }
 
+int migrate_movable_pages(struct list_head *from, new_page_t get_new_page,
+			unsigned long private,
+			enum migrate_mode mode, int reason)
+{
+	int err = 0;
+
+	if (!list_empty(from)) {
+		err = migrate_pages(from, get_new_page, private, mode, reason);
+		if (err)
+			putback_movable_pages(from);
+	}
+	return err;
+}
+
 int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
 		      unsigned long private, enum migrate_mode mode)
 {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
  2013-03-22 20:23 ` [PATCH 02/10] migrate: make core migration code aware of hugepage Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 12:31   ` Michal Hocko
  2013-03-26 11:29   ` Aneesh Kumar K.V
  2013-03-22 20:23 ` [PATCH 04/10] migrate: clean up migrate_huge_page() Naoya Horiguchi
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Currently migrate_huge_page() takes a pointer to the hugepage to be
migrated as an argument, instead of taking a pointer to a list of
hugepages to be migrated. This behavior was introduced in commit
189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was fine
because until now hugepage migration has been enabled only for
soft offlining, which handles a single hugepage per call.

But the situation changes with the later patches in this series,
which enable other users of page migration to support hugepages.
They can kick off migration of both normal pages and hugepages
in a single call, so we need to go back to the original implementation
of using a linked list to collect the hugepages to be migrated.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 15 ++++++++++++---
 mm/migrate.c        |  2 ++
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
index df0694c..4e01082 100644
--- v3.9-rc3.orig/mm/memory-failure.c
+++ v3.9-rc3/mm/memory-failure.c
@@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
+	LIST_HEAD(pagelist);
 
 	/*
 	 * This double-check of PageHWPoison is to avoid the race with
@@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	unlock_page(hpage);
 
 	/* Keep page count to indicate a given hugepage is isolated. */
-	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
-				MIGRATE_SYNC);
-	put_page(hpage);
+	list_move(&hpage->lru, &pagelist);
+	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
+				MIGRATE_SYNC, MR_MEMORY_FAILURE);
 	if (ret) {
 		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
 			pfn, ret, page->flags);
+		/*
+		 * We know that soft_offline_huge_page() tries to migrate
+		 * only one hugepage pointed to by hpage, so we need not
+		 * run through the pagelist here.
+		 */
+		putback_active_hugepage(hpage);
+		if (ret > 0)
+			ret = -EIO;
 	} else {
 		set_page_hwpoison_huge_page(hpage);
 		dequeue_hwpoisoned_huge_page(hpage);
diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
index f69f354..66030b6 100644
--- v3.9-rc3.orig/mm/migrate.c
+++ v3.9-rc3/mm/migrate.c
@@ -981,6 +981,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	unlock_page(hpage);
 out:
+	if (rc != -EAGAIN)
+		putback_active_hugepage(hpage);
 	put_page(new_hpage);
 	if (result) {
 		if (rc)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 04/10] migrate: clean up migrate_huge_page()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-04-05 21:13   ` KOSAKI Motohiro
  2013-03-22 20:23 ` [PATCH 05/10] migrate: add hugepage migration code to migrate_pages() Naoya Horiguchi
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

With the previous patch, soft_offline_huge_page() switched to using
migrate_pages(), so migrate_huge_page() is no longer used.
Let's remove it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/migrate.h |  5 -----
 mm/migrate.c            | 26 --------------------------
 2 files changed, 31 deletions(-)

diff --git v3.9-rc3.orig/include/linux/migrate.h v3.9-rc3/include/linux/migrate.h
index d4c6c08..0e05397 100644
--- v3.9-rc3.orig/include/linux/migrate.h
+++ v3.9-rc3/include/linux/migrate.h
@@ -44,8 +44,6 @@ extern int migrate_pages(struct list_head *l, new_page_t x,
 extern int migrate_movable_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private,
 		enum migrate_mode mode, int reason);
-extern int migrate_huge_page(struct page *, new_page_t x,
-		unsigned long private, enum migrate_mode mode);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -68,9 +66,6 @@ static inline int migrate_pages(struct list_head *l, new_page_t x,
 static inline int migrate_movable_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
 		enum migrate_mode mode, int reason) { return -ENOSYS; }
-static inline int migrate_huge_page(struct page *page, new_page_t x,
-		unsigned long private, enum migrate_mode mode)
-	{ return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
index 66030b6..38912ae 100644
--- v3.9-rc3.orig/mm/migrate.c
+++ v3.9-rc3/mm/migrate.c
@@ -1080,32 +1080,6 @@ int migrate_movable_pages(struct list_head *from, new_page_t get_new_page,
 	return err;
 }
 
-int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
-		      unsigned long private, enum migrate_mode mode)
-{
-	int pass, rc;
-
-	for (pass = 0; pass < 10; pass++) {
-		rc = unmap_and_move_huge_page(get_new_page, private,
-						hpage, pass > 2, mode);
-		switch (rc) {
-		case -ENOMEM:
-			goto out;
-		case -EAGAIN:
-			/* try again */
-			cond_resched();
-			break;
-		case MIGRATEPAGE_SUCCESS:
-			goto out;
-		default:
-			rc = -EIO;
-			goto out;
-		}
-	}
-out:
-	return rc;
-}
-
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 04/10] migrate: clean up migrate_huge_page() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 13:04   ` Michal Hocko
  2013-04-05 21:17   ` KOSAKI Motohiro
  2013-03-22 20:23 ` [PATCH 06/10] migrate: add hugepage migration code to move_pages() Naoya Horiguchi
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

This patch extends check_range() to handle vmas with VM_HUGETLB set.
We will be able to migrate hugepages with migrate_pages(2) after
applying the enablement patch, which comes later in this series.

Note that larger hugepages (covered by pud entries, e.g. 1GB on x86_64)
are simply skipped for now.
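
For illustration, a minimal userspace sketch (not part of this patch) of
the resulting behavior: once patch 8/10 lifts the VM_HUGETLB check, an
ordinary migrate_pages(2) call like the following also moves the caller's
hugepages, provided free hugepages exist on the destination node (link
with -lnuma for the syscall wrapper):

#include <numaif.h>

static long move_self(int from_node, int to_node)
{
	unsigned long old_nodes = 1UL << from_node;
	unsigned long new_nodes = 1UL << to_node;

	/* pid 0 means the calling process */
	return migrate_pages(0, 8 * sizeof(unsigned long) + 1,
			     &old_nodes, &new_nodes);
}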

ChangeLog v2:
 - remove unnecessary extern
 - fix page table lock in check_hugetlb_pmd_range
 - updated description and renamed patch title

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 10 ++++++++++
 mm/mempolicy.c          | 46 ++++++++++++++++++++++++++++++++++------------
 3 files changed, 46 insertions(+), 12 deletions(-)

diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
index baa0aa0..3c62b82 100644
--- v3.9-rc3.orig/include/linux/hugetlb.h
+++ v3.9-rc3/include/linux/hugetlb.h
@@ -68,6 +68,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 int dequeue_hwpoisoned_huge_page(struct page *page);
 void putback_active_hugepage(struct page *page);
 void putback_active_hugepages(struct list_head *l);
+void migrate_hugepage_add(struct page *page, struct list_head *list);
 void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
@@ -132,6 +133,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
 
 #define putback_active_hugepage(p) 0
 #define putback_active_hugepages(l) 0
+#define migrate_hugepage_add(p, l) 0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index a787c44..99ef969 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -3201,3 +3201,13 @@ void putback_active_hugepages(struct list_head *l)
 	list_for_each_entry_safe(page, page2, l, lru)
 		putback_active_hugepage(page);
 }
+
+void migrate_hugepage_add(struct page *page, struct list_head *list)
+{
+	VM_BUG_ON(!PageHead(page));
+	get_page(page);
+	spin_lock(&hugetlb_lock);
+	list_move_tail(&page->lru, list);
+	spin_unlock(&hugetlb_lock);
+	return;
+}
diff --git v3.9-rc3.orig/mm/mempolicy.c v3.9-rc3/mm/mempolicy.c
index 7431001..b9e323e 100644
--- v3.9-rc3.orig/mm/mempolicy.c
+++ v3.9-rc3/mm/mempolicy.c
@@ -512,6 +512,27 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return addr != end;
 }
 
+static void check_hugetlb_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
+		const nodemask_t *nodes, unsigned long flags,
+				    void *private)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	int nid;
+	struct page *page;
+
+	spin_lock(&vma->vm_mm->page_table_lock);
+	page = pte_page(huge_ptep_get((pte_t *)pmd));
+	nid = page_to_nid(page);
+	if (node_isset(nid, *nodes) != !!(flags & MPOL_MF_INVERT)
+	    && ((flags & MPOL_MF_MOVE && page_mapcount(page) == 1)
+		|| flags & MPOL_MF_MOVE_ALL))
+		migrate_hugepage_add(page, private);
+	spin_unlock(&vma->vm_mm->page_table_lock);
+#else
+	BUG();
+#endif
+}
+
 static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end,
 		const nodemask_t *nodes, unsigned long flags,
@@ -523,6 +544,11 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
+			check_hugetlb_pmd_range(vma, pmd, nodes,
+						flags, private);
+			continue;
+		}
 		split_huge_page_pmd(vma, addr, pmd);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
@@ -544,6 +570,8 @@ static inline int check_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
+			continue;
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		if (check_pmd_range(vma, pud, addr, next, nodes,
@@ -635,9 +663,6 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 				return ERR_PTR(-EFAULT);
 		}
 
-		if (is_vm_hugetlb_page(vma))
-			goto next;
-
 		if (flags & MPOL_MF_LAZY) {
 			change_prot_numa(vma, start, endvma);
 			goto next;
@@ -986,7 +1011,11 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
-	return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
+	if (PageHuge(page))
+		return alloc_huge_page_node(page_hstate(compound_head(page)),
+					node);
+	else
+		return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
@@ -998,7 +1027,6 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 {
 	nodemask_t nmask;
 	LIST_HEAD(pagelist);
-	int err = 0;
 
 	nodes_clear(nmask);
 	node_set(source, nmask);
@@ -1012,14 +1040,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 	check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
-	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_node_page, dest,
+	return migrate_movable_pages(&pagelist, new_node_page, dest,
 					MIGRATE_SYNC, MR_SYSCALL);
-		if (err)
-			putback_lru_pages(&pagelist);
-	}
-
-	return err;
 }
 
 /*
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 06/10] migrate: add hugepage migration code to move_pages()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 05/10] migrate: add hugepage migration code to migrate_pages() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 13:36   ` Michal Hocko
  2013-03-22 20:23 ` [PATCH 07/10] mbind: add hugepage migration code to mbind() Naoya Horiguchi
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

This patch extends move_pages() to handle vmas with VM_HUGETLB set.
We will be able to migrate hugepages with move_pages(2) after
applying the enablement patch, which comes later in this series.

We avoid taking a refcount on tail pages of a hugepage because, unlike
THP, a hugepage is never split, so we need not care about races with
splitting.

Migration of larger hugepages (1GB on x86_64) is not enabled.
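
For illustration, a minimal userspace sketch (not part of this patch):
with the series applied, a single move_pages(2) entry pointing into a
hugetlb mapping migrates the whole hugepage backing that address, assuming
a free hugepage is available on the destination node (link with -lnuma):

#include <numaif.h>

static int move_hugepage(void *hugepage_addr, int node)
{
	void *pages[1] = { hugepage_addr };
	int nodes[1] = { node };
	int status[1];

	/* pid 0 means the calling process; status[0] reports the result */
	return move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
}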

ChangeLog v2:
 - updated description and renamed patch title

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory.c  |  6 ++++--
 mm/migrate.c | 26 +++++++++++++++++++-------
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git v3.9-rc3.orig/mm/memory.c v3.9-rc3/mm/memory.c
index 494526a..3b6ad3d 100644
--- v3.9-rc3.orig/mm/memory.c
+++ v3.9-rc3/mm/memory.c
@@ -1503,7 +1503,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	if (pud_none(*pud))
 		goto no_page_table;
 	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-		BUG_ON(flags & FOLL_GET);
+		if (flags & FOLL_GET)
+			goto out;
 		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
 		goto out;
 	}
@@ -1514,8 +1515,9 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	if (pmd_none(*pmd))
 		goto no_page_table;
 	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
-		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
+		if (flags & FOLL_GET && PageHead(page))
+			get_page_foll(page);
 		goto out;
 	}
 	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
index 38912ae..ef8e4e3 100644
--- v3.9-rc3.orig/mm/migrate.c
+++ v3.9-rc3/mm/migrate.c
@@ -1104,7 +1104,11 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 
 	*result = &pm->status;
 
-	return alloc_pages_exact_node(pm->node,
+	if (PageHuge(p))
+		return alloc_huge_page_node(page_hstate(compound_head(p)),
+					pm->node);
+	else
+		return alloc_pages_exact_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
 }
 
@@ -1150,6 +1154,13 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (PageReserved(page))
 			goto put_and_set;
 
+		/*
+		 * follow_page(FOLL_GET) didn't take a refcount on tail pages
+		 * of a hugepage, so we skip putting them here.
+		 */
+		if (PageHuge(page) && PageTail(page))
+			goto set_status;
+
 		pp->page = page;
 		err = page_to_nid(page);
 
@@ -1164,6 +1175,12 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 				!migrate_all)
 			goto put_and_set;
 
+		if (PageHuge(page)) {
+			get_page(page);
+			list_move_tail(&page->lru, &pagelist);
+			goto put_and_set;
+		}
+
 		err = isolate_lru_page(page);
 		if (!err) {
 			list_add_tail(&page->lru, &pagelist);
@@ -1181,13 +1198,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		pp->status = err;
 	}
 
-	err = 0;
-	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_page_node,
+	err = migrate_movable_pages(&pagelist, new_page_node,
 				(unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
-		if (err)
-			putback_lru_pages(&pagelist);
-	}
 
 	up_read(&mm->mmap_sem);
 	return err;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 07/10] mbind: add hugepage migration code to mbind()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 06/10] migrate: add hugepage migration code to move_pages() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 13:49   ` Michal Hocko
  2013-04-05 22:18   ` KOSAKI Motohiro
  2013-03-22 20:23 ` [PATCH 08/10] migrate: remove VM_HUGETLB from vma flag check in vma_migratable() Naoya Horiguchi
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

This patch extends do_mbind() to handle vmas with VM_HUGETLB set.
We will be able to migrate hugepages with mbind(2) after
applying the enablement patch, which comes later in this series.
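
For illustration, a minimal userspace sketch (not part of this patch):
after the enablement patch, mbind(2) with MPOL_MF_MOVE relocates the
hugepages backing an existing hugetlb mapping (addr/len are assumed to
come from an mmap(MAP_HUGETLB) call, and node 1 is assumed to have free
hugepages):

#include <numaif.h>

static int bind_and_move_to_node1(void *addr, unsigned long len)
{
	unsigned long nodemask = 1UL << 1;	/* target node 1 */

	return mbind(addr, len, MPOL_BIND, &nodemask,
		     8 * sizeof(nodemask) + 1,
		     MPOL_MF_MOVE | MPOL_MF_STRICT);
}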

ChangeLog v2:
 - updated description and renamed patch title

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            |  2 +-
 mm/mempolicy.c          | 10 ++++------
 mm/migrate.c            |  7 ++++++-
 4 files changed, 14 insertions(+), 8 deletions(-)

diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
index 3c62b82..981eff8 100644
--- v3.9-rc3.orig/include/linux/hugetlb.h
+++ v3.9-rc3/include/linux/hugetlb.h
@@ -261,6 +261,8 @@ struct huge_bootmem_page {
 #endif
 };
 
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 
 /* arch callback */
@@ -356,6 +358,7 @@ static inline int hstate_index(struct hstate *h)
 
 #else
 struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index 99ef969..d9d3dd7 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -1117,7 +1117,7 @@ static void vma_commit_reservation(struct hstate *h,
 	}
 }
 
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
diff --git v3.9-rc3.orig/mm/mempolicy.c v3.9-rc3/mm/mempolicy.c
index b9e323e..ffba2ee 100644
--- v3.9-rc3.orig/mm/mempolicy.c
+++ v3.9-rc3/mm/mempolicy.c
@@ -1173,6 +1173,8 @@ static struct page *new_vma_page(struct page *page, unsigned long private, int *
 		vma = vma->vm_next;
 	}
 
+	if (PageHuge(page))
+		return alloc_huge_page(vma, address, 1);
 	/*
 	 * if !vma, alloc_page_vma() will use task or system default policy
 	 */
@@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (!err) {
 		int nr_failed = 0;
 
-		if (!list_empty(&pagelist)) {
-			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
-			nr_failed = migrate_pages(&pagelist, new_vma_page,
+		WARN_ON_ONCE(flags & MPOL_MF_LAZY);
+		nr_failed = migrate_movable_pages(&pagelist, new_vma_page,
 					(unsigned long)vma,
 					MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
-			if (nr_failed)
-				putback_lru_pages(&pagelist);
-		}
 
 		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
index ef8e4e3..e64cd55 100644
--- v3.9-rc3.orig/mm/migrate.c
+++ v3.9-rc3/mm/migrate.c
@@ -951,7 +951,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	struct page *new_hpage = get_new_page(hpage, private, &result);
 	struct anon_vma *anon_vma = NULL;
 
-	if (!new_hpage)
+	/*
+	 * Getting a new hugepage with alloc_huge_page() (which can happen
+	 * when migration is caused by mbind()) can return an ERR_PTR value,
+	 * so we need to take care of that case here.
+	 */
+	if (!new_hpage || IS_ERR_VALUE(new_hpage))
 		return -ENOMEM;
 
 	rc = -EAGAIN;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 08/10] migrate: remove VM_HUGETLB from vma flag check in vma_migratable()
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 07/10] mbind: add hugepage migration code to mbind() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-22 20:23 ` [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage Naoya Horiguchi
  2013-03-22 20:23 ` [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable Naoya Horiguchi
  9 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

This patch enables hugepage migration from migrate_pages(2),
move_pages(2), and mbind(2).

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/mempolicy.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git v3.9-rc3.orig/include/linux/mempolicy.h v3.9-rc3/include/linux/mempolicy.h
index 0d7df39..2e475b5 100644
--- v3.9-rc3.orig/include/linux/mempolicy.h
+++ v3.9-rc3/include/linux/mempolicy.h
@@ -173,7 +173,7 @@ extern int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
 /* Check if a vma is migratable */
 static inline int vma_migratable(struct vm_area_struct *vma)
 {
-	if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP))
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
 		return 0;
 	/*
 	 * Migration allocates pages in the highest zone. If we cannot
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (7 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 08/10] migrate: remove VM_HUGETLB from vma flag check in vma_migratable() Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 15:09   ` Michal Hocko
                     ` (2 more replies)
  2013-03-22 20:23 ` [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable Naoya Horiguchi
  9 siblings, 3 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Currently we can't offline memory blocks which contain hugepages because
a hugepage is considered an unmovable page. But now, with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.

What's different from other users of hugepage migration is that we need
to decompose all the hugepages inside the target memory block into free
buddy pages after hugepage migration, because otherwise free hugepages
remaining in the memory block interfere with memory offlining.
For this reason we introduce the new functions dissolve_free_huge_page()
and dissolve_free_huge_pages().

Other than that, what this patch does is straightforward: it adds hugepage
migration code, that is, hugepage handling in the functions which scan
over pfns and collect pages to be migrated, and a hugepage allocation
path in alloc_migrate_target().

As for larger hugepages (1GB on x86_64), hot-removing them is not easy
because they are larger than a memory block, so for now we simply let
the operation fail.
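
Schematically, the offline path after this patch looks like the following
(a sketch that just restates the hunks below, not additional code):

	/* inside __offline_pages(), for the range [start_pfn, end_pfn) */
	pfn = scan_movable_pages(start_pfn, end_pfn);
	if (pfn)	/* found LRU pages and/or in-use hugepages */
		/* isolate and migrate them; new hugepages come from
		 * alloc_migrate_target() -> alloc_huge_page_node() */
		ret = do_migrate_range(pfn, end_pfn);
	...
	/* free hugepages left in the range would keep the block busy,
	 * so break them back down into free buddy pages */
	dissolve_free_huge_pages(start_pfn, end_pfn);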

ChangeLog v2:
 - changed return value type of is_hugepage_movable() to bool
 - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
 - moved if(PageHuge) block before get_page_unless_zero() in do_migrate_range()
 - do_migrate_range() returns -EBUSY for hugepages larger than memory block
 - dissolve_free_huge_pages() calculates scan step and sets it to minimum
   hugepage size

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |  6 +++++
 mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c     | 42 +++++++++++++++++++++++++++--------
 mm/page_alloc.c         | 12 ++++++++++
 mm/page_isolation.c     |  5 +++++
 5 files changed, 114 insertions(+), 9 deletions(-)

diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
index 981eff8..8220a8a 100644
--- v3.9-rc3.orig/include/linux/hugetlb.h
+++ v3.9-rc3/include/linux/hugetlb.h
@@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 void putback_active_hugepage(struct page *page);
 void putback_active_hugepages(struct list_head *l);
 void migrate_hugepage_add(struct page *page, struct list_head *list);
+bool is_hugepage_movable(struct page *page);
 void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
@@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
 #define putback_active_hugepage(p) 0
 #define putback_active_hugepages(l) 0
 #define migrate_hugepage_add(p, l) 0
+#define is_hugepage_movable(x) 0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
@@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
 	return h - hstates;
 }
 
+extern void dissolve_free_huge_pages(unsigned long start_pfn,
+				     unsigned long end_pfn);
+
 #else
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
@@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 }
 #define hstate_index_to_shift(index) 0
 #define hstate_index(h) 0
+#define dissolve_free_huge_pages(s, e) 0
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index d9d3dd7..ef79871 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
+/* Dissolve a given free hugepage into free pages. */
+static void dissolve_free_huge_page(struct page *page)
+{
+	spin_lock(&hugetlb_lock);
+	if (PageHuge(page) && !page_count(page)) {
+		struct hstate *h = page_hstate(page);
+		int nid = page_to_nid(page);
+		list_del(&page->lru);
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		update_and_free_page(h, page);
+	}
+	spin_unlock(&hugetlb_lock);
+}
+
+/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
+void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
+{
+	unsigned int order = 8 * sizeof(void *);
+	unsigned long pfn;
+	struct hstate *h;
+
+	/* Set scan step to minimum hugepage size */
+	for_each_hstate(h)
+		if (order > huge_page_order(h))
+			order = huge_page_order(h);
+	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
+		dissolve_free_huge_page(pfn_to_page(pfn));
+}
+
 static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 {
 	struct page *page;
@@ -3155,6 +3185,34 @@ static int is_hugepage_on_freelist(struct page *hpage)
 	return 0;
 }
 
+/* Returns true for head pages of in-use hugepages, otherwise returns false. */
+bool is_hugepage_movable(struct page *hpage)
+{
+	struct page *page;
+	struct hstate *h;
+	bool ret = false;
+
+	VM_BUG_ON(!PageHuge(hpage));
+	/*
+	 * This function can be called for a tail page because memory hotplug
+	 * scans movability of pages by pfn range of a memory block.
+	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
+	 * the scan can start at the tail page of larger hugepages.
+	 * 1GB hugepage is not movable now, so we return with false for now.
+	 */
+	if (PageTail(hpage))
+		return false;
+	h = page_hstate(hpage);
+	spin_lock(&hugetlb_lock);
+	list_for_each_entry(page, &h->hugepage_activelist, lru)
+		if (page == hpage) {
+			ret = true;
+			break;
+		}
+	spin_unlock(&hugetlb_lock);
+	return ret;
+}
+
 /*
  * This function is called from memory failure code.
  * Assume the caller holds page lock of the head page.
diff --git v3.9-rc3.orig/mm/memory_hotplug.c v3.9-rc3/mm/memory_hotplug.c
index 9597eec..2d206e8 100644
--- v3.9-rc3.orig/mm/memory_hotplug.c
+++ v3.9-rc3/mm/memory_hotplug.c
@@ -30,6 +30,7 @@
 #include <linux/mm_inline.h>
 #include <linux/firmware-map.h>
 #include <linux/stop_machine.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 
@@ -1215,10 +1216,12 @@ static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
 }
 
 /*
- * Scanning pfn is much easier than scanning lru list.
- * Scan pfn from start to end and Find LRU page.
+ * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
+ * and hugepages). We scan pfn because it's much easier than scanning over
+ * linked list. This function returns the pfn of the first found movable
+ * page if it's found, otherwise 0.
  */
-static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
+static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
 {
 	unsigned long pfn;
 	struct page *page;
@@ -1227,6 +1230,12 @@ static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
 			page = pfn_to_page(pfn);
 			if (PageLRU(page))
 				return pfn;
+			if (PageHuge(page)) {
+				if (is_hugepage_movable(page))
+					return pfn;
+				else
+					pfn += (1 << compound_order(page)) - 1;
+			}
 		}
 	}
 	return 0;
@@ -1247,6 +1256,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
+
+		if (PageHuge(page)) {
+			struct page *head = compound_head(page);
+			pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
+			if (compound_order(head) > PFN_SECTION_SHIFT) {
+				ret = -EBUSY;
+				break;
+			}
+			if (!get_page_unless_zero(page))
+				continue;
+			list_move_tail(&head->lru, &source);
+			move_pages -= 1 << compound_order(head);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
 		/*
@@ -1279,7 +1303,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 	}
 	if (!list_empty(&source)) {
 		if (not_managed) {
-			putback_lru_pages(&source);
+			putback_movable_pages(&source);
 			goto out;
 		}
 
@@ -1287,10 +1311,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * alloc_migrate_target should be improooooved!!
 		 * migrate_pages returns # of failed pages.
 		 */
-		ret = migrate_pages(&source, alloc_migrate_target, 0,
+		ret = migrate_movable_pages(&source, alloc_migrate_target, 0,
 					MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
-		if (ret)
-			putback_lru_pages(&source);
 	}
 out:
 	return ret;
@@ -1533,8 +1555,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		drain_all_pages();
 	}
 
-	pfn = scan_lru_pages(start_pfn, end_pfn);
-	if (pfn) { /* We have page on LRU */
+	pfn = scan_movable_pages(start_pfn, end_pfn);
+	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
 		if (!ret) {
 			drain = 1;
@@ -1553,6 +1575,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	yield();
 	/* drain pcp pages, this is synchronous. */
 	drain_all_pages();
+	/* dissolve all free hugepages inside the memory block */
+	dissolve_free_huge_pages(start_pfn, end_pfn);
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
 	if (offlined_pages < 0) {
diff --git v3.9-rc3.orig/mm/page_alloc.c v3.9-rc3/mm/page_alloc.c
index 8fcced7..09a95e7 100644
--- v3.9-rc3.orig/mm/page_alloc.c
+++ v3.9-rc3/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/sched/rt.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5716,6 +5717,17 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 			continue;
 
 		page = pfn_to_page(check);
+
+		/*
+		 * Hugepages are not in LRU lists, but they're movable.
+		 * We need not scan over tail pages because we don't
+		 * handle each tail page individually in migration.
+		 */
+		if (PageHuge(page)) {
+			iter += (1 << compound_order(page)) - 1;
+			continue;
+		}
+
 		/*
 		 * We can't use page_count without pin a page
 		 * because another CPU can free compound page.
diff --git v3.9-rc3.orig/mm/page_isolation.c v3.9-rc3/mm/page_isolation.c
index 383bdbb..cf48ef6 100644
--- v3.9-rc3.orig/mm/page_isolation.c
+++ v3.9-rc3/mm/page_isolation.c
@@ -6,6 +6,7 @@
 #include <linux/page-isolation.h>
 #include <linux/pageblock-flags.h>
 #include <linux/memory.h>
+#include <linux/hugetlb.h>
 #include "internal.h"
 
 int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
@@ -252,6 +253,10 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 {
 	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
 
+	if (PageHuge(page))
+		return alloc_huge_page_node(page_hstate(compound_head(page)),
+					    numa_node_id());
+
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable
  2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
                   ` (8 preceding siblings ...)
  2013-03-22 20:23 ` [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage Naoya Horiguchi
@ 2013-03-22 20:23 ` Naoya Horiguchi
  2013-03-25 15:12   ` Michal Hocko
  9 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-22 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Now hugepages are definitely movable, so allocating hugepages from
ZONE_MOVABLE is natural and we have no reason to keep this parameter.
In order to allow userspace to prepare for the removal, let's leave
this sysctl handler as a noop for a while.

ChangeLog v2:
 - shift to noop function instead of completely removing the parameter
 - rename patch title

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 Documentation/sysctl/vm.txt | 13 ++-----------
 mm/hugetlb.c                | 17 ++++++-----------
 2 files changed, 8 insertions(+), 22 deletions(-)

diff --git v3.9-rc3.orig/Documentation/sysctl/vm.txt v3.9-rc3/Documentation/sysctl/vm.txt
index 078701f..1746176 100644
--- v3.9-rc3.orig/Documentation/sysctl/vm.txt
+++ v3.9-rc3/Documentation/sysctl/vm.txt
@@ -169,17 +169,8 @@ fragmentation index is <= extfrag_threshold. The default value is 500.
 
 hugepages_treat_as_movable
 
-This parameter is only useful when kernelcore= is specified at boot time to
-create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
-are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
-value written to hugepages_treat_as_movable allows huge pages to be allocated
-from ZONE_MOVABLE.
-
-Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
-pages pool can easily grow or shrink within. Assuming that applications are
-not running that mlock() a lot of memory, it is likely the huge pages pool
-can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
-into nr_hugepages and triggering page reclaim.
+This parameter is obsolete and planned to be removed. The value has no effect
+on the kernel's behavior.
 
 ==============================================================
 
diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
index ef79871..0d1705b 100644
--- v3.9-rc3.orig/mm/hugetlb.c
+++ v3.9-rc3/mm/hugetlb.c
@@ -33,7 +33,6 @@
 #include "internal.h"
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 
 int hugetlb_max_hstate __read_mostly;
@@ -543,7 +542,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 retry_cpuset:
 	cpuset_mems_cookie = get_mems_allowed();
 	zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask, &mpol, &nodemask);
+					GFP_HIGHUSER_MOVABLE, &mpol, &nodemask);
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
 	 * have no page reserves. This check ensures that reservations are
@@ -559,7 +558,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+		if (cpuset_zone_allowed_softwall(zone, GFP_HIGHUSER_MOVABLE)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
 				if (!avoid_reserve)
@@ -699,7 +698,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 		return NULL;
 
 	page = alloc_pages_exact_node(nid,
-		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+		GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
 		huge_page_order(h));
 	if (page) {
@@ -916,12 +915,12 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 	spin_unlock(&hugetlb_lock);
 
 	if (nid == NUMA_NO_NODE)
-		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+		page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP|
 				   __GFP_REPEAT|__GFP_NOWARN,
 				   huge_page_order(h));
 	else
 		page = alloc_pages_exact_node(nid,
-			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+			GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_THISNODE|
 			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
 
 	if (page && arch_prepare_hugepage(page)) {
@@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
 			void __user *buffer,
 			size_t *length, loff_t *ppos)
 {
-	proc_dointvec(table, write, buffer, length, ppos);
-	if (hugepages_treat_as_movable)
-		htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
-	else
-		htlb_alloc_mask = GFP_HIGHUSER;
+	/* hugepages_treat_as_movable is obsolete and to be removed. */
 	return 0;
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
@ 2013-03-23 15:55   ` Rik van Riel
  2013-03-25 10:13   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-03-23 15:55 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel

On 03/22/2013 04:23 PM, Naoya Horiguchi wrote:
> When we have a page fault for the address which is backed by a hugepage
> under migration, the kernel can't wait correctly until the migration
> finishes. This is because pte_offset_map_lock() can't get a correct
> migration entry for hugepage. This patch adds migration_entry_wait_huge()
> to separate code path between normal pages and hugepages.
>
> ChangeLog v2:
>   - remove dup in migrate_entry_wait_huge()
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
  2013-03-23 15:55   ` Rik van Riel
@ 2013-03-25 10:13   ` Michal Hocko
  2013-03-26  4:25     ` Naoya Horiguchi
  2013-04-05 20:33   ` KOSAKI Motohiro
  2013-04-05 20:33   ` KOSAKI Motohiro
  3 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 10:13 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:46, Naoya Horiguchi wrote:
> When we have a page fault for the address which is backed by a hugepage
> under migration, the kernel can't wait correctly until the migration
> finishes. This is because pte_offset_map_lock() can't get a correct
> migration entry for hugepage. This patch adds migration_entry_wait_huge()
> to separate code path between normal pages and hugepages.

The changelog is missing a description of what the effect of the bug is. I
assume that we end up busy looping on the hugepage page fault until the
migration finishes, right?

Should this be applied to the stable tree, or is the current usage of huge
page migration (HWPOISON) not widespread enough?

I like how you got rid of the duplication but I think this still doesn't
work for all archs/huge page sizes.

[...]
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0a0be33..98a478e 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);

e.g. ia64 returns pte_t *

[...]
> +void migration_entry_wait_huge(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long address)
> +{
> +	spinlock_t *ptl = pte_lockptr(mm, pmd);
> +	__migration_entry_wait(mm, (pte_t *)pmd, ptl);

So you are trying to get lockptr from pmd but you have pte in fact. No
biggy for !USE_SPLIT_PTLOCKS but doesn't work otherwise. So you probably
need an arch-specific huge_pte_lockptr callback for USE_SPLIT_PTLOCKS.

Or am I missing something here? All the pte usage in hugetlb is one
giant mess and I wouldn't be surprised if there were more places
hardcoded to pmd there.
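
FWIW, one possible direction (just a sketch, untested): pass the pte_t *
through and take mm->page_table_lock, which is what hugetlb_fault() itself
currently uses for hugetlb ptes, so pte_lockptr() and the pmd cast can be
avoided entirely:

	/*
	 * Sketch only: hugetlb ptes are protected by mm->page_table_lock
	 * in the current code, so the wait helper can take the pte_t *
	 * directly instead of pretending it is a pmd.
	 */
	void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
	{
		spinlock_t *ptl = &mm->page_table_lock;

		__migration_entry_wait(mm, pte, ptl);
	}

The hugetlb_fault() caller would then pass ptep through without the
(pmd_t *) cast.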

> +}
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/10] migrate: make core migration code aware of hugepage
  2013-03-22 20:23 ` [PATCH 02/10] migrate: make core migration code aware of hugepage Naoya Horiguchi
@ 2013-03-25 10:57   ` Michal Hocko
  2013-03-26  4:33     ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 10:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:47, Naoya Horiguchi wrote:
> Before enabling each user of page migration to support hugepage,
> this patch adds necessary changes on core migration code.
> The main change is that the list of pages to migrate can link
> not only LRU pages, but also hugepages.
> Along with this, functions such as migrate_pages() and
> putback_movable_pages() need to be changed to handle hugepages.
> 
> ChangeLog v2:
>  - move code removing VM_HUGETLB from vma_migratable check into a
>    separate patch
>  - hold hugetlb_lock in putback_active_hugepage
>  - update comment near the definition of hugetlb_lock
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h |  4 ++++
>  include/linux/migrate.h |  6 ++++++
>  mm/hugetlb.c            | 21 ++++++++++++++++++++-
>  mm/migrate.c            | 24 +++++++++++++++++++++++-
>  4 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
> index 16e4e9a..baa0aa0 100644
> --- v3.9-rc3.orig/include/linux/hugetlb.h
> +++ v3.9-rc3/include/linux/hugetlb.h
> @@ -66,6 +66,8 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
>  						vm_flags_t vm_flags);
>  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>  int dequeue_hwpoisoned_huge_page(struct page *page);
> +void putback_active_hugepage(struct page *page);
> +void putback_active_hugepages(struct list_head *l);
>  void copy_huge_page(struct page *dst, struct page *src);
>  
>  extern unsigned long hugepages_treat_as_movable;
> @@ -128,6 +130,8 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
>  	return 0;
>  }
>  
> +#define putback_active_hugepage(p) 0
> +#define putback_active_hugepages(l) 0
>  static inline void copy_huge_page(struct page *dst, struct page *src)
>  {
>  }
> diff --git v3.9-rc3.orig/include/linux/migrate.h v3.9-rc3/include/linux/migrate.h
> index a405d3dc..d4c6c08 100644
> --- v3.9-rc3.orig/include/linux/migrate.h
> +++ v3.9-rc3/include/linux/migrate.h
> @@ -41,6 +41,9 @@ extern int migrate_page(struct address_space *,
>  			struct page *, struct page *, enum migrate_mode);
>  extern int migrate_pages(struct list_head *l, new_page_t x,
>  		unsigned long private, enum migrate_mode mode, int reason);
> +extern int migrate_movable_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private,
> +		enum migrate_mode mode, int reason);
>  extern int migrate_huge_page(struct page *, new_page_t x,
>  		unsigned long private, enum migrate_mode mode);
>  
> @@ -62,6 +65,9 @@ static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t x,
>  		unsigned long private, enum migrate_mode mode, int reason)
>  	{ return -ENOSYS; }
> +static inline int migrate_movable_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private, bool offlining,
> +		enum migrate_mode mode, int reason) { return -ENOSYS; }
>  static inline int migrate_huge_page(struct page *page, new_page_t x,
>  		unsigned long private, enum migrate_mode mode)
>  	{ return -ENOSYS; }
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 98a478e..a787c44 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -48,7 +48,8 @@ static unsigned long __initdata default_hstate_max_huge_pages;
>  static unsigned long __initdata default_hstate_size;
>  
>  /*
> - * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> + * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> + * free_huge_pages, and surplus_huge_pages.
>   */

Could you split this out into a separate patch and add lockdep assertions
to the functions which do not take the lock directly but rely on it, so that
the locking is clearer?
e.g. dequeue_huge_page_node (sketch below), update_and_free_page, try_to_free_low, ...

It would be a nice cleanup and a lead for the future when somebody tries to
make the locking a bit saner.
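
For illustration, roughly what the assertion could look like in
dequeue_huge_page_node() (untested sketch, body copied from the current code
just to show where the assertion would go):

	static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
	{
		struct page *page;

		/* callers are expected to hold hugetlb_lock */
		lockdep_assert_held(&hugetlb_lock);

		if (list_empty(&h->hugepage_freelists[nid]))
			return NULL;
		page = list_entry(h->hugepage_freelists[nid].next,
				  struct page, lru);
		list_move(&page->lru, &h->hugepage_activelist);
		set_page_refcounted(page);
		h->free_huge_pages--;
		h->free_huge_pages_node[nid]--;
		return page;
	}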

>  DEFINE_SPINLOCK(hugetlb_lock);
>  
> @@ -3182,3 +3183,21 @@ int dequeue_hwpoisoned_huge_page(struct page *hpage)
>  	return ret;
>  }
>  #endif
> +
> +void putback_active_hugepage(struct page *page)
> +{
> +	VM_BUG_ON(!PageHead(page));
> +	spin_lock(&hugetlb_lock);
> +	list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
> +	spin_unlock(&hugetlb_lock);
> +	put_page(page);
> +}
> +
> +void putback_active_hugepages(struct list_head *l)
> +{
> +	struct page *page;
> +	struct page *page2;
> +
> +	list_for_each_entry_safe(page, page2, l, lru)
> +		putback_active_hugepage(page);
> +}
> diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> index ec692a3..f69f354 100644
> --- v3.9-rc3.orig/mm/migrate.c
> +++ v3.9-rc3/mm/migrate.c
> @@ -100,6 +100,10 @@ void putback_movable_pages(struct list_head *l)
>  	struct page *page2;
>  
>  	list_for_each_entry_safe(page, page2, l, lru) {
> +		if (unlikely(PageHuge(page))) {
> +			putback_active_hugepage(page);
> +			continue;
> +		}
>  		list_del(&page->lru);
>  		dec_zone_page_state(page, NR_ISOLATED_ANON +
>  				page_is_file_cache(page));
> @@ -1023,7 +1027,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		list_for_each_entry_safe(page, page2, from, lru) {
>  			cond_resched();
>  
> -			rc = unmap_and_move(get_new_page, private,
> +			if (PageHuge(page))
> +				rc = unmap_and_move_huge_page(get_new_page,
> +						private, page, pass > 2, mode);
> +			else
> +				rc = unmap_and_move(get_new_page, private,
>  						page, pass > 2, mode);
>  
>  			switch(rc) {
> @@ -1056,6 +1064,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	return rc;
>  }
>  
> +int migrate_movable_pages(struct list_head *from, new_page_t get_new_page,
> +			unsigned long private,
> +			enum migrate_mode mode, int reason)
> +{
> +	int err = 0;
> +
> +	if (!list_empty(from)) {
> +		err = migrate_pages(from, get_new_page, private, mode, reason);
> +		if (err)
> +			putback_movable_pages(from);
> +	}
> +	return err;
> +}
> +

There doesn't seem to be any caller for this function. Please move it to
the patch which uses it.

>  int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
>  		      unsigned long private, enum migrate_mode mode)
>  {
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-22 20:23 ` [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page() Naoya Horiguchi
@ 2013-03-25 12:31   ` Michal Hocko
  2013-03-26  4:34     ` Naoya Horiguchi
  2013-03-26 11:29   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 12:31 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:48, Naoya Horiguchi wrote:
> Currently migrate_huge_page() takes a pointer to a hugepage to be
> migrated as an argument, instead of taking a pointer to the list of
> hugepages to be migrated. This behavior was introduced in commit
> 189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK
> because until now hugepage migration is enabled only for soft-offlining
> which takes only one hugepage in a single call.
> 
> But the situation will change in the later patches in this series
> which enable other users of page migration to support hugepage migration.
> They can kick migration for both of normal pages and hugepages
> in a single call, so we need to go back to original implementation
> of using linked lists to collect the hugepages to be migrated.

If the purpose of this patch is to reduce code duplication, then you
should remove migrate_huge_page as it doesn't have any callers anymore.

[...]
> @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	unlock_page(hpage);
>  
>  	/* Keep page count to indicate a given hugepage is isolated. */
> -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> -				MIGRATE_SYNC);
> -	put_page(hpage);
> +	list_move(&hpage->lru, &pagelist);
> +	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> +				MIGRATE_SYNC, MR_MEMORY_FAILURE);
>  	if (ret) {
>  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
>  			pfn, ret, page->flags);
> +		/*
> +		 * We know that soft_offline_huge_page() tries to migrate
> +		 * only one hugepage pointed to by hpage, so we need not
> +		 * run through the pagelist here.
> +		 */
> +		putback_active_hugepage(hpage);

Maybe I am missing something, but why didn't we need to call this before
when using migrate_huge_page?

> +		if (ret > 0)
> +			ret = -EIO;
>  	} else {
>  		set_page_hwpoison_huge_page(hpage);
>  		dequeue_hwpoisoned_huge_page(hpage);
> diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> index f69f354..66030b6 100644
> --- v3.9-rc3.orig/mm/migrate.c
> +++ v3.9-rc3/mm/migrate.c
> @@ -981,6 +981,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  
>  	unlock_page(hpage);
>  out:
> +	if (rc != -EAGAIN)
> +		putback_active_hugepage(hpage);

And why do you put it here? If it is called from migrate_pages then the
caller already does the clean-up (putback_lru_pages).

>  	put_page(new_hpage);
>  	if (result) {
>  		if (rc)
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
  2013-03-22 20:23 ` [PATCH 05/10] migrate: add hugepage migration code to migrate_pages() Naoya Horiguchi
@ 2013-03-25 13:04   ` Michal Hocko
  2013-03-26  5:13     ` Naoya Horiguchi
  2013-04-05 21:17   ` KOSAKI Motohiro
  1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 13:04 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:50, Naoya Horiguchi wrote:
[...]
> @@ -523,6 +544,11 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
> +			check_hugetlb_pmd_range(vma, pmd, nodes,
> +						flags, private);

I am afraid this has the same issue with other huge page sizes that I
mentioned earlier.

> +			continue;
> +		}
>  		split_huge_page_pmd(vma, addr, pmd);
>  		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>  			continue;
[...]
> @@ -1012,14 +1040,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
>  	check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
>  			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>  
> -	if (!list_empty(&pagelist)) {
> -		err = migrate_pages(&pagelist, new_node_page, dest,
> +	return migrate_movable_pages(&pagelist, new_node_page, dest,
>  					MIGRATE_SYNC, MR_SYSCALL);
> -		if (err)
> -			putback_lru_pages(&pagelist);
> -	}
> -
> -	return err;

This is really confusing. Why doesn't migrate_pages do the putback cleanup
on its own while migrate_movable_pages does?
Please also move the migrate_movable_pages definition to this patch.

>  }
>  
>  /*
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] migrate: add hugepage migration code to move_pages()
  2013-03-22 20:23 ` [PATCH 06/10] migrate: add hugepage migration code to move_pages() Naoya Horiguchi
@ 2013-03-25 13:36   ` Michal Hocko
  2013-03-26  7:06     ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 13:36 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:51, Naoya Horiguchi wrote:
> This patch extends move_pages() to handle vma with VM_HUGETLB set.
> We will be able to migrate hugepage with move_pages(2) after
> applying the enablement patch which comes later in this series.
> 
> We avoid getting refcount on tail pages of hugepage, because unlike thp,
> hugepage is not split and we need not care about races with splitting.
> 
> And migration of larger (1GB for x86_64) hugepage are not enabled.
> 
> ChangeLog v2:
>  - updated description and renamed patch title
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/memory.c  |  6 ++++--
>  mm/migrate.c | 26 +++++++++++++++++++-------
>  2 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git v3.9-rc3.orig/mm/memory.c v3.9-rc3/mm/memory.c
> index 494526a..3b6ad3d 100644
> --- v3.9-rc3.orig/mm/memory.c
> +++ v3.9-rc3/mm/memory.c
> @@ -1503,7 +1503,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	if (pud_none(*pud))
>  		goto no_page_table;
>  	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
> -		BUG_ON(flags & FOLL_GET);
> +		if (flags & FOLL_GET)
> +			goto out;


>  		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
>  		goto out;
>  	}
> @@ -1514,8 +1515,9 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	if (pmd_none(*pmd))
>  		goto no_page_table;
>  	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> -		BUG_ON(flags & FOLL_GET);
>  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> +		if (flags & FOLL_GET && PageHead(page))
> +			get_page_foll(page);

Hmm, so the caller gets a non-NULL page without an elevated ref count
even when it asked for one. This means that all callers have to check
PageTail && hugetlb and put_page accordingly. That is _really_ fragile.
I think that returning NULL would make more sense in this case.
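
i.e. something along these lines (sketch only):

	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
		if (flags & FOLL_GET) {
			/*
			 * Refcounting of hugetlb tail pages is not
			 * supported, so give the caller nothing rather
			 * than a page it cannot safely put_page().
			 */
			if (page && PageHead(page))
				get_page_foll(page);
			else
				page = NULL;
		}
		goto out;
	}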

>  		goto out;
>  	}
>  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> @@ -1164,6 +1175,12 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
[...]
>  				!migrate_all)
>  			goto put_and_set;
>  
> +		if (PageHuge(page)) {
> +			get_page(page);
> +			list_move_tail(&page->lru, &pagelist);
> +			goto put_and_set;
> +		}

Why do you take an additional reference here? You have one from
follow_page already.

> +
>  		err = isolate_lru_page(page);
>  		if (!err) {
>  			list_add_tail(&page->lru, &pagelist);
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
  2013-03-22 20:23 ` [PATCH 07/10] mbind: add hugepage migration code to mbind() Naoya Horiguchi
@ 2013-03-25 13:49   ` Michal Hocko
  2013-04-05 22:23     ` KOSAKI Motohiro
  2013-04-05 22:18   ` KOSAKI Motohiro
  1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 13:49 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:52, Naoya Horiguchi wrote:
[...]
> --- v3.9-rc3.orig/mm/mempolicy.c
> +++ v3.9-rc3/mm/mempolicy.c
> @@ -1173,6 +1173,8 @@ static struct page *new_vma_page(struct page *page, unsigned long private, int *
>  		vma = vma->vm_next;
>  	}
>  
> +	if (PageHuge(page))
> +		return alloc_huge_page(vma, address, 1);
>  	/*
>  	 * if !vma, alloc_page_vma() will use task or system default policy
>  	 */
[...]
> diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> index ef8e4e3..e64cd55 100644
> --- v3.9-rc3.orig/mm/migrate.c
> +++ v3.9-rc3/mm/migrate.c
> @@ -951,7 +951,12 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	struct page *new_hpage = get_new_page(hpage, private, &result);
>  	struct anon_vma *anon_vma = NULL;
>  
> -	if (!new_hpage)
> +	/*
> +	 * Getting a new hugepage with alloc_huge_page() (which can happen
> +	 * when migration is caused by mbind()) can return ERR_PTR value,
> +	 * so we need take care of the case here.
> +	 */
> +	if (!new_hpage || IS_ERR_VALUE(new_hpage))
>  		return -ENOMEM;

Please no. get_new_page returns either NULL or a page; you are hooking up a
callback with the wrong convention here, and the error value doesn't make
any sense in this context. IMO you should just wrap alloc_huge_page in
something that returns NULL or a page.
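
Something like this, perhaps (just a sketch; the helper name is made up):

	/* hide the ERR_PTR convention from the migration path */
	static struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
						  unsigned long address)
	{
		struct page *page = alloc_huge_page(vma, address, 1);

		return IS_ERR(page) ? NULL : page;
	}

new_vma_page() could then simply return alloc_huge_page_noerr(vma, address)
for hugepages.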

>  
>  	rc = -EAGAIN;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-22 20:23 ` [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage Naoya Horiguchi
@ 2013-03-25 15:09   ` Michal Hocko
  2013-03-26 18:23     ` Naoya Horiguchi
  2013-03-26 12:01   ` Aneesh Kumar K.V
  2013-04-06  0:13   ` KOSAKI Motohiro
  2 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 15:09 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:54, Naoya Horiguchi wrote:
> Currently we can't offline memory blocks which contain hugepages because
> a hugepage is considered as an unmovable page. But now with this patch
> series, a hugepage has become movable, so by using hugepage migration we
> can offline such memory blocks.
> 
> What's different from other users of hugepage migration is that we need
> to decompose all the hugepages inside the target memory block into free
> buddy pages after hugepage migration, because otherwise free hugepages
> remaining in the memory block intervene the memory offlining.
> For this reason we introduce new functions dissolve_free_huge_page() and
> dissolve_free_huge_pages().
> 
> Other than that, what this patch does is straightforwardly to add hugepage
> migration code, that is, adding hugepage code to the functions which scan
> over pfn and collect hugepages to be migrated, and adding a hugepage
> allocation function to alloc_migrate_target().
> 
> As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
> over them because it's larger than memory block. So we now simply leave
> it to fail as it is.
> 
> ChangeLog v2:
>  - changed return value type of is_hugepage_movable() to bool
>  - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
>  - moved if(PageHuge) block before get_page_unless_zero() in do_migrate_range()
>  - do_migrate_range() returns -EBUSY for hugepages larger than memory block
>  - dissolve_free_huge_pages() calculates scan step and sets it to minimum
>    hugepage size
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h |  6 +++++
>  mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c     | 42 +++++++++++++++++++++++++++--------
>  mm/page_alloc.c         | 12 ++++++++++
>  mm/page_isolation.c     |  5 +++++
>  5 files changed, 114 insertions(+), 9 deletions(-)
> 
> diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
> index 981eff8..8220a8a 100644
> --- v3.9-rc3.orig/include/linux/hugetlb.h
> +++ v3.9-rc3/include/linux/hugetlb.h
> @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
>  void putback_active_hugepage(struct page *page);
>  void putback_active_hugepages(struct list_head *l);
>  void migrate_hugepage_add(struct page *page, struct list_head *list);
> +bool is_hugepage_movable(struct page *page);
>  void copy_huge_page(struct page *dst, struct page *src);
>  
>  extern unsigned long hugepages_treat_as_movable;
> @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
>  #define putback_active_hugepage(p) 0
>  #define putback_active_hugepages(l) 0
>  #define migrate_hugepage_add(p, l) 0
> +#define is_hugepage_movable(x) 0
>  static inline void copy_huge_page(struct page *dst, struct page *src)
>  {
>  }
> @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
>  	return h - hstates;
>  }
>  
> +extern void dissolve_free_huge_pages(unsigned long start_pfn,
> +				     unsigned long end_pfn);
> +
>  #else
>  struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
> @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
>  }
>  #define hstate_index_to_shift(index) 0
>  #define hstate_index(h) 0
> +#define dissolve_free_huge_pages(s, e) 0
>  #endif
>  
>  #endif /* _LINUX_HUGETLB_H */
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index d9d3dd7..ef79871 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  	return ret;
>  }
>  
> +/* Dissolve a given free hugepage into free pages. */
> +static void dissolve_free_huge_page(struct page *page)
> +{
> +	spin_lock(&hugetlb_lock);
> +	if (PageHuge(page) && !page_count(page)) {
> +		struct hstate *h = page_hstate(page);
> +		int nid = page_to_nid(page);
> +		list_del(&page->lru);
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		update_and_free_page(h, page);
> +	}

What about surplus pages?

> +	spin_unlock(&hugetlb_lock);
> +}
> +
> +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
> +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned int order = 8 * sizeof(void *);
> +	unsigned long pfn;
> +	struct hstate *h;
> +
> +	/* Set scan step to minimum hugepage size */
> +	for_each_hstate(h)
> +		if (order > huge_page_order(h))
> +			order = huge_page_order(h);
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
> +		dissolve_free_huge_page(pfn_to_page(pfn));

This assumes that start_pfn doesn't point at a tail page, otherwise you could
end up traversing only tail pages. This shouldn't happen normally, as
start_pfn will be bound to a memory block, but it looks a bit fragile.

It is a bit unfortunate that the offlining code is pfn-range oriented
while hugetlb pages are organized by node.
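
One defensive option for the scan above (sketch only; `order` computed as in
your patch) would be to always dissolve the head of whatever compound page
the scan lands on:

	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order) {
		struct page *page = pfn_to_page(pfn);

		/* cope with a range that starts inside a gigantic page */
		if (PageHuge(page))
			dissolve_free_huge_page(compound_head(page));
	}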

> +}
> +
>  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>  {
>  	struct page *page;
> @@ -3155,6 +3185,34 @@ static int is_hugepage_on_freelist(struct page *hpage)
>  	return 0;
>  }
>  
> +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> +bool is_hugepage_movable(struct page *hpage)
> +{
> +	struct page *page;
> +	struct hstate *h;
> +	bool ret = false;
> +
> +	VM_BUG_ON(!PageHuge(hpage));
> +	/*
> +	 * This function can be called for a tail page because memory hotplug
> +	 * scans movability of pages by pfn range of a memory block.
> +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> +	 * the scan can start at the tail page of larger hugepages.
> +	 * 1GB hugepage is not movable now, so we return with false for now.
> +	 */
> +	if (PageTail(hpage))
> +		return false;
> +	h = page_hstate(hpage);
> +	spin_lock(&hugetlb_lock);
> +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> +		if (page == hpage) {
> +			ret = true;
> +			break;
> +		}

Why are you checking that the page is active? It doesn't make much sense
to me because nothing prevents it from being freed/allocated right after
you release hugetlb_lock.

> +	spin_unlock(&hugetlb_lock);
> +	return ret;
> +}
> +
>  /*
>   * This function is called from memory failure code.
>   * Assume the caller holds page lock of the head page.
> diff --git v3.9-rc3.orig/mm/memory_hotplug.c v3.9-rc3/mm/memory_hotplug.c
> index 9597eec..2d206e8 100644
> --- v3.9-rc3.orig/mm/memory_hotplug.c
> +++ v3.9-rc3/mm/memory_hotplug.c
> @@ -30,6 +30,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/firmware-map.h>
>  #include <linux/stop_machine.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1215,10 +1216,12 @@ static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
>  }
>  
>  /*
> - * Scanning pfn is much easier than scanning lru list.
> - * Scan pfn from start to end and Find LRU page.
> + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> + * and hugepages). We scan pfn because it's much easier than scanning over
> + * linked list. This function returns the pfn of the first found movable
> + * page if it's found, otherwise 0.
>   */
> -static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> +static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  {
>  	unsigned long pfn;
>  	struct page *page;
> @@ -1227,6 +1230,12 @@ static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
>  			page = pfn_to_page(pfn);
>  			if (PageLRU(page))
>  				return pfn;
> +			if (PageHuge(page)) {
> +				if (is_hugepage_movable(page))
> +					return pfn;
> +				else
> +					pfn += (1 << compound_order(page)) - 1;

This doesn't look right to me. You have to consider where your tail
page is.
					pfn += (1 << compound_order(page)) - (page - compound_head(page)) - 1;
Or something nicer ;)

> +			}
>  		}
>  	}
>  	return 0;
> @@ -1247,6 +1256,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		if (!pfn_valid(pfn))
>  			continue;
>  		page = pfn_to_page(pfn);
> +
> +		if (PageHuge(page)) {
> +			struct page *head = compound_head(page);
> +			pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
> +			if (compound_order(head) > PFN_SECTION_SHIFT) {
> +				ret = -EBUSY;
> +				break;
> +			}
> +			if (!get_page_unless_zero(page))
> +				continue;

s/page/hpage/

> +			list_move_tail(&head->lru, &source);
> +			move_pages -= 1 << compound_order(head);
> +			continue;
> +		}
> +
>  		if (!get_page_unless_zero(page))
>  			continue;
>  		/*
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable
  2013-03-22 20:23 ` [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable Naoya Horiguchi
@ 2013-03-25 15:12   ` Michal Hocko
  2013-04-06  0:15     ` KOSAKI Motohiro
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-25 15:12 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Fri 22-03-13 16:23:55, Naoya Horiguchi wrote:
[...]
> @@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
>  			void __user *buffer,
>  			size_t *length, loff_t *ppos)
>  {
> -	proc_dointvec(table, write, buffer, length, ppos);
> -	if (hugepages_treat_as_movable)
> -		htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
> -	else
> -		htlb_alloc_mask = GFP_HIGHUSER;
> +	/* hugepages_treat_as_movable is obsolete and to be removed. */

WARN_ON_ONCE("This knob is obsolete and has no effect. It is scheduled for removal")
>  	return 0;
>  }
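
i.e. roughly (a sketch; WARN_ONCE(1, ...) is the more conventional spelling
for a one-time message):

	int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
				void __user *buffer,
				size_t *length, loff_t *ppos)
	{
		/* obsolete knob, kept only for compatibility */
		WARN_ONCE(1, "hugepages_treat_as_movable is obsolete and has no effect; it is scheduled for removal\n");
		return 0;
	}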
>  
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-03-25 10:13   ` Michal Hocko
@ 2013-03-26  4:25     ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26  4:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Mon, Mar 25, 2013 at 11:13:40AM +0100, Michal Hocko wrote:
> On Fri 22-03-13 16:23:46, Naoya Horiguchi wrote:
> > When we have a page fault for the address which is backed by a hugepage
> > under migration, the kernel can't wait correctly until the migration
> > finishes. This is because pte_offset_map_lock() can't get a correct
> > migration entry for hugepage. This patch adds migration_entry_wait_huge()
> > to separate code path between normal pages and hugepages.
> 
> The changelog is missing a description what is the effect of the bug. I
> assume that we end up busy looping on the huge page page fault until
> migration finishes, right?

Right. I'll add it in the description.

> Should this be applied to the stable tree or the current usage of the huge
> page migration (HWPOISON) is not spread enough?

Yes, it's better to also send it to stable.

> I like how you got rid of the duplication but I think this still doesn't
> work for all archs/huge page sizes.
> 
> [...]
> > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> > index 0a0be33..98a478e 100644
> > --- v3.9-rc3.orig/mm/hugetlb.c
> > +++ v3.9-rc3/mm/hugetlb.c
> > @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (ptep) {
> >  		entry = huge_ptep_get(ptep);
> >  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> > -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> > +			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);
> 
> e.g. ia64 returns pte_t *

We need an arch-independent fix.

> [...]
> > +void migration_entry_wait_huge(struct mm_struct *mm, pmd_t *pmd,
> > +				unsigned long address)
> > +{
> > +	spinlock_t *ptl = pte_lockptr(mm, pmd);
> > +	__migration_entry_wait(mm, (pte_t *)pmd, ptl);
> 
> So you are trying to get lockptr from pmd but you have pte in fact. No
> biggy for !USE_SPLIT_PTLOCKS but doesn't work otherwise. So you probably
> need a arch specific huge_pte_lockptr callback for USE_SPLIT_PTLOCKS.

I must fix it, thanks.
And it might be good to generalize the idea of USE_SPLIT_PTLOCKS to pud and pmd.

> Or am I missing something here? All the pte usage in hugetlb is one
> giant mess and I wouldn't be surprised if there were more places
> hardcoded to pmd there.

Yes, that's a big problem in hugetlb and we need many clean-ups.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/10] migrate: make core migration code aware of hugepage
  2013-03-25 10:57   ` Michal Hocko
@ 2013-03-26  4:33     ` Naoya Horiguchi
  2013-03-26  8:49       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26  4:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Mon, Mar 25, 2013 at 11:57:01AM +0100, Michal Hocko wrote:
> On Fri 22-03-13 16:23:47, Naoya Horiguchi wrote:
...
> > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> > index 98a478e..a787c44 100644
> > --- v3.9-rc3.orig/mm/hugetlb.c
> > +++ v3.9-rc3/mm/hugetlb.c
> > @@ -48,7 +48,8 @@ static unsigned long __initdata default_hstate_max_huge_pages;
> >  static unsigned long __initdata default_hstate_size;
> >
> >  /*
> > - * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> > + * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > + * free_huge_pages, and surplus_huge_pages.
> >   */
>
> Could you get this out into a separate patch and add lockdep assertions
> to functions which do not lock it directly but they rely on it so that
> the locking is more clear?
> e.g. dequeue_huge_page_node, update_and_free_page, try_to_free_low, ...

OK, I'll try it.

> It would a nice cleanup and a lead for future when somebody tries to
> make the locking a bit saner.
>
...
> > @@ -1056,6 +1064,20 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >  	return rc;
> >  }
> >
> > +int migrate_movable_pages(struct list_head *from, new_page_t get_new_page,
> > +			unsigned long private,
> > +			enum migrate_mode mode, int reason)
> > +{
> > +	int err = 0;
> > +
> > +	if (!list_empty(from)) {
> > +		err = migrate_pages(from, get_new_page, private, mode, reason);
> > +		if (err)
> > +			putback_movable_pages(from);
> > +	}
> > +	return err;
> > +}
> > +
>
> There doesn't seem to be any caller for this function. Please move it to
> the patch which uses it.

I would do that if there were only one user of this function, but I thought
it better to separate this part out as a change to the common code,
because the function is used by multiple users, which are added by
multiple patches later in this series.

I mean, doing

  Patch 1: core change
  Patch 2: user A (depends on patch 1)
  Patch 3: user B (depends on patch 1)
  Patch 4: user C (depends on patch 1)

is a bit cleaner and easier to bisect than

  Patch 1: core change + user A
  Patch 2: user B (depends on patch 1)
  Patch 3: user C (depends on patch 1)

I'm not sure which is the standard or well-accepted way.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-25 12:31   ` Michal Hocko
@ 2013-03-26  4:34     ` Naoya Horiguchi
  2013-03-26  9:49       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26  4:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Mon, Mar 25, 2013 at 01:31:28PM +0100, Michal Hocko wrote:
> On Fri 22-03-13 16:23:48, Naoya Horiguchi wrote:
> > Currently migrate_huge_page() takes a pointer to a hugepage to be
> > migrated as an argument, instead of taking a pointer to the list of
> > hugepages to be migrated. This behavior was introduced in commit
> > 189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK
> > because until now hugepage migration is enabled only for soft-offlining
> > which takes only one hugepage in a single call.
> > 
> > But the situation will change in the later patches in this series
> > which enable other users of page migration to support hugepage migration.
> > They can kick migration for both of normal pages and hugepages
> > in a single call, so we need to go back to original implementation
> > of using linked lists to collect the hugepages to be migrated.
> 
> If the purpose of this patch is to reduce code duplication then you
> should remove migrate_huge_page as it doesn't have any caller anymore.

Yes, that makes sense. I'll do this.

> [...]
> > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >  	unlock_page(hpage);
> >  
> >  	/* Keep page count to indicate a given hugepage is isolated. */
> > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> > -				MIGRATE_SYNC);
> > -	put_page(hpage);
> > +	list_move(&hpage->lru, &pagelist);
> > +	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> > +				MIGRATE_SYNC, MR_MEMORY_FAILURE);
> >  	if (ret) {
> >  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> >  			pfn, ret, page->flags);
> > +		/*
> > +		 * We know that soft_offline_huge_page() tries to migrate
> > +		 * only one hugepage pointed to by hpage, so we need not
> > +		 * run through the pagelist here.
> > +		 */
> > +		putback_active_hugepage(hpage);
> 
> Maybe I am missing something but why we didn't need to call this before
> when using migrate_huge_page?

migrate_huge_page() does not need list handling before/after the call,
because it's defined to migrate only one hugepage, and it takes a page as
an argument, not a list_head.

> > +		if (ret > 0)
> > +			ret = -EIO;
> >  	} else {
> >  		set_page_hwpoison_huge_page(hpage);
> >  		dequeue_hwpoisoned_huge_page(hpage);
> > diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> > index f69f354..66030b6 100644
> > --- v3.9-rc3.orig/mm/migrate.c
> > +++ v3.9-rc3/mm/migrate.c
> > @@ -981,6 +981,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >  
> >  	unlock_page(hpage);
> >  out:
> > +	if (rc != -EAGAIN)
> > +		putback_active_hugepage(hpage);
> 
> And why do you put it here? If it is called from migrate_pages then the
> caller already does the clean-up (putback_lru_pages).

What the caller of migrate_pages() cleans up is the (huge)pages which failed
to be migrated. What the above code cleans up is the source hugepage
after the migration succeeds.

The latter clean-up code originally existed, but was removed in 189ebff28
("hugetlb: simplify migrate_huge_page()").
That commit simplified the code based on the assumption that there was only
one user of hugepage migration, which is no longer true.
So the above hunk is partly a revert of that commit.
But it's not a simple revert, because there is one difference between
now and before 189ebff28: in-use hugepages are now linked to hugepage_activelist.
That is how we arrive at the above change.

Thanks,
Naoya

> >  	put_page(new_hpage);
> >  	if (result) {
> >  		if (rc)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
  2013-03-25 13:04   ` Michal Hocko
@ 2013-03-26  5:13     ` Naoya Horiguchi
  2013-03-26  8:55       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26  5:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Mon, Mar 25, 2013 at 02:04:16PM +0100, Michal Hocko wrote:
> On Fri 22-03-13 16:23:50, Naoya Horiguchi wrote:
> [...]
> > @@ -523,6 +544,11 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> >  	pmd = pmd_offset(pud, addr);
> >  	do {
> >  		next = pmd_addr_end(addr, end);
> > +		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
> > +			check_hugetlb_pmd_range(vma, pmd, nodes,
> > +						flags, private);
> 
> I am afraid this has the same issue with other huge page sizes I have
> mentioned earlier.

So we need arch-dependent helper functions. I'll try that, but it
might be better to start with enabling only x86_64 if it takes time
to implement this.

> > +			continue;
> > +		}
> >  		split_huge_page_pmd(vma, addr, pmd);
> >  		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> >  			continue;
> [...]
> > @@ -1012,14 +1040,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> >  	check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
> >  			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
> >  
> > -	if (!list_empty(&pagelist)) {
> > -		err = migrate_pages(&pagelist, new_node_page, dest,
> > +	return migrate_movable_pages(&pagelist, new_node_page, dest,
> >  					MIGRATE_SYNC, MR_SYSCALL);
> > -		if (err)
> > -			putback_lru_pages(&pagelist);
> > -	}
> > -
> > -	return err;
> 
> This is really confusing. Why migrate_pages doesn't do putback cleanup
> on its own but migrate_movable_pages does?

I consider migrate_movable_pages() a wrapper around migrate_pages(),
not a variant of it.
We can find the same pattern in the callers, like

  if (!list_empty(&pagelist)) {
        err = migrate_pages(...);
        if (err)
                putback_lru_pages(&pagelist);
  }

so it can be simplified with migrate_movable_pages().

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] migrate: add hugepage migration code to move_pages()
  2013-03-25 13:36   ` Michal Hocko
@ 2013-03-26  7:06     ` Naoya Horiguchi
  2013-03-26 10:02       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26  7:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Mon, Mar 25, 2013 at 02:36:44PM +0100, Michal Hocko wrote:
> On Fri 22-03-13 16:23:51, Naoya Horiguchi wrote:
> > This patch extends move_pages() to handle vma with VM_HUGETLB set.
> > We will be able to migrate hugepage with move_pages(2) after
> > applying the enablement patch which comes later in this series.
> > 
> > We avoid getting refcount on tail pages of hugepage, because unlike thp,
> > hugepage is not split and we need not care about races with splitting.
> > 
> > And migration of larger (1GB for x86_64) hugepage are not enabled.
> > 
> > ChangeLog v2:
> >  - updated description and renamed patch title
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  mm/memory.c  |  6 ++++--
> >  mm/migrate.c | 26 +++++++++++++++++++-------
> >  2 files changed, 23 insertions(+), 9 deletions(-)
> > 
> > diff --git v3.9-rc3.orig/mm/memory.c v3.9-rc3/mm/memory.c
> > index 494526a..3b6ad3d 100644
> > --- v3.9-rc3.orig/mm/memory.c
> > +++ v3.9-rc3/mm/memory.c
> > @@ -1503,7 +1503,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  	if (pud_none(*pud))
> >  		goto no_page_table;
> >  	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
> > -		BUG_ON(flags & FOLL_GET);
> > +		if (flags & FOLL_GET)
> > +			goto out;
> 
> 
> >  		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> >  		goto out;
> >  	}
> > @@ -1514,8 +1515,9 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  	if (pmd_none(*pmd))
> >  		goto no_page_table;
> >  	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > -		BUG_ON(flags & FOLL_GET);
> >  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > +		if (flags & FOLL_GET && PageHead(page))
> > +			get_page_foll(page);
> 
> Hmm, so the caller gets a non-null page without elevated ref counted
> even when he asked for it. This means that all callers have to check
> PageTail && hugetlb and put_page according to that. That is _really_
> fragile.

I agree. And refcounting of tail pages is already very fragile,
because get_page_foll() does something very tricky on tail pages,
where page->_mapcount is used as the refcount.
This seems to be there to handle a thp splitting problem,
and was never intended to be used for hugepages.
So I just avoid calling it for tail pages of hugepages on the caller's side.

> I think that returning NULL would make more sense in this case.

Sounds nice. I'll do this with some comment.

> >  		goto out;
> >  	}
> >  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> > @@ -1164,6 +1175,12 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> [...]
> >  				!migrate_all)
> >  			goto put_and_set;
> >  
> > +		if (PageHuge(page)) {
> > +			get_page(page);
> > +			list_move_tail(&page->lru, &pagelist);
> > +			goto put_and_set;
> > +		}
> 
> Why do you take an additional reference here? You have one from
> follow_page already.

For normal pages, follow_page(FOLL_GET) takes a refcount and
isolate_lru_page() takes another one, so I think the same should
be done for hugepages. The refcounting in this function looks tricky,
and I'm not sure why the existing code does it like that.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/10] migrate: make core migration code aware of hugepage
  2013-03-26  4:33     ` Naoya Horiguchi
@ 2013-03-26  8:49       ` Michal Hocko
  2013-04-05 20:41         ` KOSAKI Motohiro
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-26  8:49 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue 26-03-13 00:33:35, Naoya Horiguchi wrote:
> On Mon, Mar 25, 2013 at 11:57:01AM +0100, Michal Hocko wrote:
> > On Fri 22-03-13 16:23:47, Naoya Horiguchi wrote:
[...]
> > > +int migrate_movable_pages(struct list_head *from, new_page_t get_new_page,
> > > +			unsigned long private,
> > > +			enum migrate_mode mode, int reason)
> > > +{
> > > +	int err = 0;
> > > +
> > > +	if (!list_empty(from)) {
> > > +		err = migrate_pages(from, get_new_page, private, mode, reason);
> > > +		if (err)
> > > +			putback_movable_pages(from);
> > > +	}
> > > +	return err;
> > > +}
> > > +
> >
> > There doesn't seem to be any caller for this function. Please move it to
> > the patch which uses it.
> 
> I would do like that if there's only one user of this function, but I thought
> that it's better to separate this part as changes of common code
> because this function is commonly used by multiple users which are added by
> multiple patches later in this series.

Sure, there is no hard rule for this. I just find it much easier to
review if there is a caller of the introduced functionality. In this
particular case I only found out later that many migrate_pages callers
were changed to use migrate_movable_pages, which made the
putback_movable_pages cleanup inconsistent between the two.

It would help to mention the planned future usage of the introduced
function if you prefer to introduce it without users.

> I mean doing like
> 
>   Patch 1: core change
>   Patch 2: user A (depend on patch 1)
>   Patch 3: user B (depend on patch 1)
>   Patch 4: user C (depend on patch 1)
> 
> is a bit cleaner and easier in bisecting than doing like
> 
>   Patch 1: core change + user A
>   Patch 2: user B (depend on patch 1)
>   Patch 3: user C (depend on patch 1)
> 
> . I'm not sure which is standard or well-accepted way.

Whatever makes the review easy ;)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
  2013-03-26  5:13     ` Naoya Horiguchi
@ 2013-03-26  8:55       ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2013-03-26  8:55 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue 26-03-13 01:13:10, Naoya Horiguchi wrote:
> On Mon, Mar 25, 2013 at 02:04:16PM +0100, Michal Hocko wrote:
> > On Fri 22-03-13 16:23:50, Naoya Horiguchi wrote:
[...]
> > > @@ -1012,14 +1040,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> > >  	check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
> > >  			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
> > >  
> > > -	if (!list_empty(&pagelist)) {
> > > -		err = migrate_pages(&pagelist, new_node_page, dest,
> > > +	return migrate_movable_pages(&pagelist, new_node_page, dest,
> > >  					MIGRATE_SYNC, MR_SYSCALL);
> > > -		if (err)
> > > -			putback_lru_pages(&pagelist);
> > > -	}
> > > -
> > > -	return err;
> > 
> > This is really confusing. Why migrate_pages doesn't do putback cleanup
> > on its own but migrate_movable_pages does?
> 
> I consider migrate_movable_pages() as a wrapper of migrate_pages(),
> not the variant of migrate_pages().

The naming suggests that this is the same functionality for a "different"
type of page.

> We can find the same pattern in the callers like
> 
>   if (!list_empty(&pagelist)) {
>         err = migrate_pages(...);
>         if (err)
>                 putback_lru_pages(&pagelist);
>   }
> 
> , so it can be simplified by migrate_movable_pages().

I would rather see the same pattern for both. It could be error prone if
callers have to remember that one helper does its own putback cleanup and
the other does not.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-26  4:34     ` Naoya Horiguchi
@ 2013-03-26  9:49       ` Michal Hocko
  2013-03-26 20:35         ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-26  9:49 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue 26-03-13 00:34:40, Naoya Horiguchi wrote:
> On Mon, Mar 25, 2013 at 01:31:28PM +0100, Michal Hocko wrote:
> > On Fri 22-03-13 16:23:48, Naoya Horiguchi wrote:
[...]
> > > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > >  	unlock_page(hpage);
> > >  
> > >  	/* Keep page count to indicate a given hugepage is isolated. */
> > > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> > > -				MIGRATE_SYNC);
> > > -	put_page(hpage);
> > > +	list_move(&hpage->lru, &pagelist);
> > > +	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> > > +				MIGRATE_SYNC, MR_MEMORY_FAILURE);
> > >  	if (ret) {
> > >  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> > >  			pfn, ret, page->flags);
> > > +		/*
> > > +		 * We know that soft_offline_huge_page() tries to migrate
> > > +		 * only one hugepage pointed to by hpage, so we need not
> > > +		 * run through the pagelist here.
> > > +		 */
> > > +		putback_active_hugepage(hpage);
> > 
> > Maybe I am missing something but why we didn't need to call this before
> > when using migrate_huge_page?
> 
> migrate_huge_page() does not need list handling before/after the call,
> because it's defined to migrate only one hugepage, and it has a page as
> an argument, not list_head.

I do not understand this reasoning. migrate_huge_page calls
unmap_and_move_huge_page and migrate_pages does the same + accounting.
So what is the difference here? I suspect that putback_active_hugepage
was simply missing in this code path.

> > > +		if (ret > 0)
> > > +			ret = -EIO;
> > >  	} else {
> > >  		set_page_hwpoison_huge_page(hpage);
> > >  		dequeue_hwpoisoned_huge_page(hpage);
> > > diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> > > index f69f354..66030b6 100644
> > > --- v3.9-rc3.orig/mm/migrate.c
> > > +++ v3.9-rc3/mm/migrate.c
> > > @@ -981,6 +981,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> > >  
> > >  	unlock_page(hpage);
> > >  out:
> > > +	if (rc != -EAGAIN)
> > > +		putback_active_hugepage(hpage);
> > 
> > And why do you put it here? If it is called from migrate_pages then the
> > caller already does the clean-up (putback_lru_pages).
> 
> What the caller of migrate_pages() cleans up is the (huge)pages which failed
> to be migrated. And what the above code cleans up is the source hugepage
> after the migration succeeds.

Why would you want to put a successfully migrated page back? /me confused.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] migrate: add hugepage migration code to move_pages()
  2013-03-26  7:06     ` Naoya Horiguchi
@ 2013-03-26 10:02       ` Michal Hocko
  2013-03-26 20:37         ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-26 10:02 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue 26-03-13 03:06:18, Naoya Horiguchi wrote:
> On Mon, Mar 25, 2013 at 02:36:44PM +0100, Michal Hocko wrote:
> > On Fri 22-03-13 16:23:51, Naoya Horiguchi wrote:
[...]
> > > @@ -1514,8 +1515,9 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> > >  	if (pmd_none(*pmd))
> > >  		goto no_page_table;
> > >  	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > > -		BUG_ON(flags & FOLL_GET);
> > >  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > > +		if (flags & FOLL_GET && PageHead(page))
> > > +			get_page_foll(page);
> > 
> > Hmm, so the caller gets a non-null page without elevated ref counted
> > even when he asked for it. This means that all callers have to check
> > PageTail && hugetlb and put_page according to that. That is _really_
> > fragile.
> 
> I agree. And refcounting of tail pages are already very fragile,
> because get_page_foll() does something very tricky on tail pages,
> where we use page->_mapcount for refcount.
> This seems to be to handle some thp splitting problem,
> and is never intended to be used for hugepage.

Yes, this is a THP thingy.

> So I just avoid calling it for tail pages of hugepage in caller's side.
> 
> > I think that returning NULL would make more sense in this case.
> 
> Sounds nice. I'll do this with some comment.
> 
> > >  		goto out;
> > >  	}
> > >  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> > > @@ -1164,6 +1175,12 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> > [...]
> > >  				!migrate_all)
> > >  			goto put_and_set;
> > >  
> > > +		if (PageHuge(page)) {
> > > +			get_page(page);
> > > +			list_move_tail(&page->lru, &pagelist);
> > > +			goto put_and_set;
> > > +		}
> > 
> > Why do you take an additional reference here? You have one from
> > follow_page already.
> 
> For normal pages, follow_page(FOLL_GET) takes a refcount and
> isolate_lru_page() takes another one, so I think the same should
> be done for hugepages. Refcounting of this function looks tricky,
> and I'm not sure why existing code does like that.

Ohh, I see. But that whole reference is taken just to be released at
put_and_set, because isolate_lru_page elevates the reference count since
other users require that. I think you do not have to mimic this behavior
here; you can drop the get_page and use goto set_status.
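
i.e. something along these lines (sketch only; note that err has to be
cleared before jumping over put_and_set):

		if (PageHuge(page)) {
			/* follow_page(FOLL_GET) already holds our reference */
			err = 0;
			list_move_tail(&page->lru, &pagelist);
			goto set_status;
		}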

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-22 20:23 ` [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page() Naoya Horiguchi
  2013-03-25 12:31   ` Michal Hocko
@ 2013-03-26 11:29   ` Aneesh Kumar K.V
  2013-03-27 13:52     ` Michal Hocko
  1 sibling, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2013-03-26 11:29 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Currently migrate_huge_page() takes a pointer to a hugepage to be
> migrated as an argument, instead of taking a pointer to the list of
> hugepages to be migrated. This behavior was introduced in commit
> 189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK
> because until now hugepage migration is enabled only for soft-offlining
> which takes only one hugepage in a single call.
>
> But the situation will change in the later patches in this series
> which enable other users of page migration to support hugepage migration.
> They can kick migration for both of normal pages and hugepages
> in a single call, so we need to go back to original implementation
> of using linked lists to collect the hugepages to be migrated.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/memory-failure.c | 15 ++++++++++++---
>  mm/migrate.c        |  2 ++
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
> index df0694c..4e01082 100644
> --- v3.9-rc3.orig/mm/memory-failure.c
> +++ v3.9-rc3/mm/memory-failure.c
> @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	int ret;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
> +	LIST_HEAD(pagelist);
>
>  	/*
>  	 * This double-check of PageHWPoison is to avoid the race with
> @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	unlock_page(hpage);
>
>  	/* Keep page count to indicate a given hugepage is isolated. */
> -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> -				MIGRATE_SYNC);
> -	put_page(hpage);
> +	list_move(&hpage->lru, &pagelist);

We use hpage->lru to add the hpage to h->hugepage_activelist. Won't this
break hugetlb cgroup removal?


> +	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> +				MIGRATE_SYNC, MR_MEMORY_FAILURE);
>  	if (ret) {
>  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
>  			pfn, ret, page->flags);
> +		/*
> +		 * We know that soft_offline_huge_page() tries to migrate
> +		 * only one hugepage pointed to by hpage, so we need not
> +		 * run through the pagelist here.
> +		 */
> +		putback_active_hugepage(hpage);
> +		if (ret > 0)
> +			ret = -EIO;
>  	} else {
>  		set_page_hwpoison_huge_page(hpage);
>  		dequeue_hwpoisoned_huge_page(hpage);
> diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> index f69f354..66030b6 100644
> --- v3.9-rc3.orig/mm/migrate.c
> +++ v3.9-rc3/mm/migrate.c
> @@ -981,6 +981,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>
>  	unlock_page(hpage);
>  out:
> +	if (rc != -EAGAIN)
> +		putback_active_hugepage(hpage);
>  	put_page(new_hpage);
>  	if (result) {
>  		if (rc)
> -- 

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-22 20:23 ` [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage Naoya Horiguchi
  2013-03-25 15:09   ` Michal Hocko
@ 2013-03-26 12:01   ` Aneesh Kumar K.V
  2013-03-27 19:28     ` Naoya Horiguchi
  2013-04-06  0:13   ` KOSAKI Motohiro
  2 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2013-03-26 12:01 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mel Gorman, Hugh Dickins, KOSAKI Motohiro,
	Andi Kleen, Hillf Danton, Michal Hocko, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> +bool is_hugepage_movable(struct page *hpage)
> +{
> +	struct page *page;
> +	struct hstate *h;
> +	bool ret = false;
> +
> +	VM_BUG_ON(!PageHuge(hpage));
> +	/*
> +	 * This function can be called for a tail page because memory hotplug
> +	 * scans movability of pages by pfn range of a memory block.
> +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> +	 * the scan can start at the tail page of larger hugepages.
> +	 * 1GB hugepage is not movable now, so we return with false for now.
> +	 */
> +	if (PageTail(hpage))
> +		return false;
> +	h = page_hstate(hpage);
> +	spin_lock(&hugetlb_lock);
> +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> +		if (page == hpage) {
> +			ret = true;
> +			break;
> +		}
> +	spin_unlock(&hugetlb_lock);
> +	return ret;
> +}
> +

May be is_hugepage_active() ?


-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-25 15:09   ` Michal Hocko
@ 2013-03-26 18:23     ` Naoya Horiguchi
  2013-03-27 14:19       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26 18:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Mon, Mar 25, 2013 at 04:09:52PM +0100, Michal Hocko wrote:
> On Fri 22-03-13 16:23:54, Naoya Horiguchi wrote:
...
> > index d9d3dd7..ef79871 100644
> > --- v3.9-rc3.orig/mm/hugetlb.c
> > +++ v3.9-rc3/mm/hugetlb.c
> > @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> >  	return ret;
> >  }
> >  
> > +/* Dissolve a given free hugepage into free pages. */
> > +static void dissolve_free_huge_page(struct page *page)
> > +{
> > +	spin_lock(&hugetlb_lock);
> > +	if (PageHuge(page) && !page_count(page)) {
> > +		struct hstate *h = page_hstate(page);
> > +		int nid = page_to_nid(page);
> > +		list_del(&page->lru);
> > +		h->free_huge_pages--;
> > +		h->free_huge_pages_node[nid]--;
> > +		update_and_free_page(h, page);
> > +	}
> 
> What about surplus pages?

This function is only for free hugepages, not for surplus hugepages
(which are considered in-use hugepages).
dissolve_free_huge_pages() can be called only when all source hugepages
are free (i.e. all in-use hugepages have been successfully migrated).

> > +	spin_unlock(&hugetlb_lock);
> > +}
> > +
> > +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
> > +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > +	unsigned int order = 8 * sizeof(void *);
> > +	unsigned long pfn;
> > +	struct hstate *h;
> > +
> > +	/* Set scan step to minimum hugepage size */
> > +	for_each_hstate(h)
> > +		if (order > huge_page_order(h))
> > +			order = huge_page_order(h);
> > +	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
> > +		dissolve_free_huge_page(pfn_to_page(pfn));
> 
> This assumes that start_pfn doesn't at a tail page otherwise you could
> end up traversing only tail pages. This shouldn't happen normally as
> start_pfn will be bound to a memblock but it looks a bit fragile.

I think that this function is never called for such a memblock because
scan_movable_pages() (scan_lru_pages by its old name) skips a memblock
starting with a tail page.
But OK, to make the code more robust I'll add a check for whether the
first pfn is a tail page or not.
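
E.g. roughly like this (untested sketch; whether to bail out or to round
the pfn up to the head page is still to be decided):

	/* do not start the scan inside a larger (e.g. 1GB) hugepage */
	if (PageHuge(pfn_to_page(start_pfn)) &&
	    PageTail(pfn_to_page(start_pfn)))
		return;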

> 
> It is a bit unfortunate that the offlining code is pfn range oriented
> while hugetlb pages are organized by nodes.
> 
> > +}
> > +
> >  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> >  {
> >  	struct page *page;
> > @@ -3155,6 +3185,34 @@ static int is_hugepage_on_freelist(struct page *hpage)
> >  	return 0;
> >  }
> >  
> > +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> > +bool is_hugepage_movable(struct page *hpage)
> > +{
> > +	struct page *page;
> > +	struct hstate *h;
> > +	bool ret = false;
> > +
> > +	VM_BUG_ON(!PageHuge(hpage));
> > +	/*
> > +	 * This function can be called for a tail page because memory hotplug
> > +	 * scans movability of pages by pfn range of a memory block.
> > +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> > +	 * the scan can start at the tail page of larger hugepages.
> > +	 * 1GB hugepage is not movable now, so we return with false for now.
> > +	 */
> > +	if (PageTail(hpage))
> > +		return false;
> > +	h = page_hstate(hpage);
> > +	spin_lock(&hugetlb_lock);
> > +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> > +		if (page == hpage) {
> > +			ret = true;
> > +			break;
> > +		}
> 
> Why are you checking that the page is active?

This is the counterpart to the PageLRU check done for normal pages.

> It doesn't make much sense
> to me because nothing prevents it from being freed/allocated right after
> you release hugetlb_lock.

Such a race can also happen for normal pages because scan_movable_pages()
just checks the PageLRU flag without holding any lock.
But the caller, __offline_pages(), repeatedly calls scan_movable_pages()
until no page in the memblock is judged movable, and in that loop
do_migrate_range() does nothing for free (unmovable) pages.
So there is no behavioral problem even if the movable page is freed just
after the if (PageLRU) check in scan_movable_pages().
Note that in this loop, allocating pages in the memblock is forbidden
because we already did set_migratetype_isolate() for them, so we don't have
to worry about pages being allocated just after scan_movable_pages().

I want the same to be the case for hugepages. As you pointed out,
is_hugepage_movable() is not safe from such a race, but in the "being freed
just after is_hugepage_movable() returns true" case we have no problem
for the same reason described above.
However, in the "being allocated just after is_hugepage_movable() returns false"
case, it seems to be possible to hot-remove an active hugepage. I think we
can avoid this by adding a migratetype check in alloc_huge_page().
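
For example, the dequeue path used by hugepage allocation could skip
hugepages sitting on isolated pageblocks, roughly like this (untested
sketch; it assumes the is_migrate_isolate_page() helper from
page-isolation.h):

	/*
	 * In dequeue_huge_page_node(): do not hand out a free hugepage
	 * that sits in a pageblock currently isolated for offlining.
	 */
	struct page *page = NULL, *p;

	list_for_each_entry(p, &h->hugepage_freelists[nid], lru) {
		if (!is_migrate_isolate_page(p)) {
			page = p;
			break;
		}
	}
	if (!page)
		return NULL;
	list_move(&page->lru, &h->hugepage_activelist);
	set_page_refcounted(page);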

> > +	spin_unlock(&hugetlb_lock);
> > +	return ret;
> > +}
> > +
> >  /*
> >   * This function is called from memory failure code.
> >   * Assume the caller holds page lock of the head page.
> > diff --git v3.9-rc3.orig/mm/memory_hotplug.c v3.9-rc3/mm/memory_hotplug.c
> > index 9597eec..2d206e8 100644
> > --- v3.9-rc3.orig/mm/memory_hotplug.c
> > +++ v3.9-rc3/mm/memory_hotplug.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/mm_inline.h>
> >  #include <linux/firmware-map.h>
> >  #include <linux/stop_machine.h>
> > +#include <linux/hugetlb.h>
> >  
> >  #include <asm/tlbflush.h>
> >  
> > @@ -1215,10 +1216,12 @@ static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
> >  }
> >  
> >  /*
> > - * Scanning pfn is much easier than scanning lru list.
> > - * Scan pfn from start to end and Find LRU page.
> > + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> > + * and hugepages). We scan pfn because it's much easier than scanning over
> > + * linked list. This function returns the pfn of the first found movable
> > + * page if it's found, otherwise 0.
> >   */
> > -static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> > +static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> >  {
> >  	unsigned long pfn;
> >  	struct page *page;
> > @@ -1227,6 +1230,12 @@ static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> >  			page = pfn_to_page(pfn);
> >  			if (PageLRU(page))
> >  				return pfn;
> > +			if (PageHuge(page)) {
> > +				if (is_hugepage_movable(page))
> > +					return pfn;
> > +				else
> > +					pfn += (1 << compound_order(page)) - 1;
> 
> This doesn't look right to me. You have to consider where is your tail
> page.
> 					pfn += (1 << compound_order(page)) - (page - compound_head(page)) - 1;
> Or something nicer ;)

OK.

> > +			}
> >  		}
> >  	}
> >  	return 0;
> > @@ -1247,6 +1256,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >  		if (!pfn_valid(pfn))
> >  			continue;
> >  		page = pfn_to_page(pfn);
> > +
> > +		if (PageHuge(page)) {
> > +			struct page *head = compound_head(page);
> > +			pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
> > +			if (compound_order(head) > PFN_SECTION_SHIFT) {
> > +				ret = -EBUSY;
> > +				break;
> > +			}
> > +			if (!get_page_unless_zero(page))
> > +				continue;
> 
> s/page/hpage/

Yes, we should pin the head page.
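
I.e. something like (untested):

	if (!get_page_unless_zero(head))
		continue;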

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-26  9:49       ` Michal Hocko
@ 2013-03-26 20:35         ` Naoya Horiguchi
  2013-03-27 13:00           ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26 20:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue, Mar 26, 2013 at 10:49:50AM +0100, Michal Hocko wrote:
> On Tue 26-03-13 00:34:40, Naoya Horiguchi wrote:
> > On Mon, Mar 25, 2013 at 01:31:28PM +0100, Michal Hocko wrote:
> > > On Fri 22-03-13 16:23:48, Naoya Horiguchi wrote:
> [...]
> > > > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > > >  	unlock_page(hpage);
> > > >  
> > > >  	/* Keep page count to indicate a given hugepage is isolated. */
> > > > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> > > > -				MIGRATE_SYNC);
> > > > -	put_page(hpage);
> > > > +	list_move(&hpage->lru, &pagelist);
> > > > +	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> > > > +				MIGRATE_SYNC, MR_MEMORY_FAILURE);
> > > >  	if (ret) {
> > > >  		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> > > >  			pfn, ret, page->flags);
> > > > +		/*
> > > > +		 * We know that soft_offline_huge_page() tries to migrate
> > > > +		 * only one hugepage pointed to by hpage, so we need not
> > > > +		 * run through the pagelist here.
> > > > +		 */
> > > > +		putback_active_hugepage(hpage);
> > > 
> > > Maybe I am missing something but why we didn't need to call this before
> > > when using migrate_huge_page?
> > 
> > migrate_huge_page() does not need list handling before/after the call,
> > because it's defined to migrate only one hugepage, and it has a page as
> > an argument, not list_head.
> 
> I do not understand this reasoning. migrate_huge_page calls
> unmap_and_move_huge_page and migrate_pages does the same + accounting.
> So what is the difference here?

My previous comment missed the point, sorry.
Let me restate for your original question:
> > > Maybe I am missing something but why we didn't need to call this before
> > > when using migrate_huge_page?

Before 189ebff28, there was no hugepage_activelist and in-use hugepages were
not linked to any list, so put_page() was used instead.

And present question:
> I do not understand this reasoning. migrate_huge_page calls
> unmap_and_move_huge_page and migrate_pages does the same + accounting.
> So what is the difference here?

The difference is that migrate_huge_page() takes one hugepage as an argument,
while migrate_pages() takes a pagelist that can hold multiple hugepages.
I already said this before and I'm not sure it fully answers the question,
so let me explain another point about why this patch does it this way.

I think that we must do putback_*pages() for source pages whether migration
succeeds or not. But when we call migrate_pages() with a pagelist,
the caller can't access the successfully migrated source pages
after migrate_pages() returns, because they are no longer on the pagelist.
So putback of the successfully migrated source pages should be done *in*
unmap_and_move() and/or unmap_and_move_huge_page().

And when we used migrate_huge_page(), we passed the hugepage to be migrated
as an argument, so the caller could still access the page even if the
migration succeeded.

> I suspect that putback_active_hugepage
> was simply missing in this code path.

Commit 189ebff28 moved put_page() out of the if-block and removed put_page()
from unmap_and_move_huge_page(). As I wrote above, this is correct only when
migrate_huge_page() handles only one hugepage.
But this patch goes back to the pagelist implementation, so we should revert
that change.

> > > > +		if (ret > 0)
> > > > +			ret = -EIO;
> > > >  	} else {
> > > >  		set_page_hwpoison_huge_page(hpage);
> > > >  		dequeue_hwpoisoned_huge_page(hpage);
> > > > diff --git v3.9-rc3.orig/mm/migrate.c v3.9-rc3/mm/migrate.c
> > > > index f69f354..66030b6 100644
> > > > --- v3.9-rc3.orig/mm/migrate.c
> > > > +++ v3.9-rc3/mm/migrate.c
> > > > @@ -981,6 +981,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> > > >  
> > > >  	unlock_page(hpage);
> > > >  out:
> > > > +	if (rc != -EAGAIN)
> > > > +		putback_active_hugepage(hpage);
> > > 
> > > And why do you put it here? If it is called from migrate_pages then the
> > > caller already does the clean-up (putback_lru_pages).
> > 
> > What the caller of migrate_pages() cleans up is the (huge)pages which failed
> > to be migrated. And what the above code cleans up is the source hugepage
> > after the migration succeeds.
> 
> Why should you want to add successfully migrated page? /me confused.

When hugepage migration succeeds, the source hugepage is freed back to the
free hugepage pool (just after the copy of data and mappings finishes, the
refcount of the source hugepage should be 1, so free_huge_page() is called
from this putback_active_hugepage()).
As I stated above, the caller cannot access the source page, so we need to
do this here.
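
For reference, putback_active_hugepage() introduced earlier in this series
does roughly the following:

	void putback_active_hugepage(struct page *page)
	{
		VM_BUG_ON(!PageHead(page));
		spin_lock(&hugetlb_lock);
		list_move_tail(&page->lru,
				&(page_hstate(page))->hugepage_activelist);
		spin_unlock(&hugetlb_lock);
		/*
		 * Drops the isolation reference; for a successfully
		 * migrated source hugepage this is the last one, so
		 * free_huge_page() runs from here.
		 */
		put_page(page);
	}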

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] migrate: add hugepage migration code to move_pages()
  2013-03-26 10:02       ` Michal Hocko
@ 2013-03-26 20:37         ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-26 20:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue, Mar 26, 2013 at 11:02:21AM +0100, Michal Hocko wrote:
> On Tue 26-03-13 03:06:18, Naoya Horiguchi wrote:
> > On Mon, Mar 25, 2013 at 02:36:44PM +0100, Michal Hocko wrote:
> > > On Fri 22-03-13 16:23:51, Naoya Horiguchi wrote:
> > > > @@ -1164,6 +1175,12 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> > > [...]
> > > >  				!migrate_all)
> > > >  			goto put_and_set;
> > > >  
> > > > +		if (PageHuge(page)) {
> > > > +			get_page(page);
> > > > +			list_move_tail(&page->lru, &pagelist);
> > > > +			goto put_and_set;
> > > > +		}
> > > 
> > > Why do you take an additional reference here? You have one from
> > > follow_page already.
> > 
> > For normal pages, follow_page(FOLL_GET) takes a refcount and
> > isolate_lru_page() takes another one, so I think the same should
> > be done for hugepages. Refcounting of this function looks tricky,
> > and I'm not sure why existing code does like that.
> 
> Ohh, I see. But the whole reference is taken just to release it in goto
> put_and_set because isolate_lru_page elevates reference count because
> other users require that. I think you do not have to mimic this behavior
> here and you can drop get_page and use goto set_status.

OK, thanks.
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-26 20:35         ` Naoya Horiguchi
@ 2013-03-27 13:00           ` Michal Hocko
  2013-04-05 21:11             ` KOSAKI Motohiro
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-27 13:00 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue 26-03-13 16:35:35, Naoya Horiguchi wrote:
[...]
> The differences is that migrate_huge_page() has one hugepage as an argument,
> and migrate_pages() has a pagelist with multiple hugepages.
> I already told this before and I'm not sure it's enough to answer the question,
> so I explain another point about why this patch do like it.

OK, I am blind. It is
+       list_move(&hpage->lru, &pagelist);
+       ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
+                               MIGRATE_SYNC, MR_MEMORY_FAILURE);

which moves it from active_list and so you have to put it back.

> I think that we must do putback_*pages() for source pages whether migration
> succeeds or not.
> But when we call migrate_pages() with a pagelist,
> the caller can't access to the successfully migrated source pages
> after migrate_pages() returns, because they are no longer on the pagelist.
> So putback of the successfully migrated source pages should be done *in*
> unmap_and_move() and/or unmap_and_move_huge_page().

If the migration succeeds then the page becomes unused and is freed once
its last reference drops. So I do not see any reason to put it back on the
active list only to free it right afterwards.
On the other hand unmap_and_move() does the same thing (although page
reference counting is a bit more complicated in that case) so it would
be good to keep in sync with the regular pages case.

> And when we used migrate_huge_page(), we passed a hugepage to be migrated
> as an argument, so the caller can still access to the page even if the
> migration succeeds.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-26 11:29   ` Aneesh Kumar K.V
@ 2013-03-27 13:52     ` Michal Hocko
  2013-03-27 19:19       ` Naoya Horiguchi
                         ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Michal Hocko @ 2013-03-27 13:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

On Tue 26-03-13 16:59:40, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
[...]
> > diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
> > index df0694c..4e01082 100644
> > --- v3.9-rc3.orig/mm/memory-failure.c
> > +++ v3.9-rc3/mm/memory-failure.c
> > @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >  	int ret;
> >  	unsigned long pfn = page_to_pfn(page);
> >  	struct page *hpage = compound_head(page);
> > +	LIST_HEAD(pagelist);
> >
> >  	/*
> >  	 * This double-check of PageHWPoison is to avoid the race with
> > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >  	unlock_page(hpage);
> >
> >  	/* Keep page count to indicate a given hugepage is isolated. */
> > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> > -				MIGRATE_SYNC);
> > -	put_page(hpage);
> > +	list_move(&hpage->lru, &pagelist);
> 
> we use hpage->lru to add the hpage to h->hugepage_activelist. This will
> break a hugetlb cgroup removal isn't it ?

This particular part will not break removal because
hugetlb_cgroup_css_offline loops until hugetlb_cgroup_have_usage is 0.

A little bit offtopic:
Btw. hugetlb migration breaks charging even before this patchset
AFAICS. The above put_page should remove the last reference and then it
will uncharge it, but I do not see anything that would charge a new page.
This is all because regular LRU pages are uncharged when they are
unmapped. But this is a different story not related to this series.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-26 18:23     ` Naoya Horiguchi
@ 2013-03-27 14:19       ` Michal Hocko
  2013-03-27 21:29         ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2013-03-27 14:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Tue 26-03-13 14:23:24, Naoya Horiguchi wrote:
> On Mon, Mar 25, 2013 at 04:09:52PM +0100, Michal Hocko wrote:
> > On Fri 22-03-13 16:23:54, Naoya Horiguchi wrote:
> ...
> > > index d9d3dd7..ef79871 100644
> > > --- v3.9-rc3.orig/mm/hugetlb.c
> > > +++ v3.9-rc3/mm/hugetlb.c
> > > @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> > >  	return ret;
> > >  }
> > >  
> > > +/* Dissolve a given free hugepage into free pages. */
> > > +static void dissolve_free_huge_page(struct page *page)
> > > +{
> > > +	spin_lock(&hugetlb_lock);
> > > +	if (PageHuge(page) && !page_count(page)) {
> > > +		struct hstate *h = page_hstate(page);
> > > +		int nid = page_to_nid(page);
> > > +		list_del(&page->lru);
> > > +		h->free_huge_pages--;
> > > +		h->free_huge_pages_node[nid]--;
> > > +		update_and_free_page(h, page);
> > > +	}
> > 
> > What about surplus pages?
> 
> This function is only for free hugepage, not for surplus hugepages
> (which are considered as in-use hugepages.)

How do you want to get rid of those then? You cannot offline the node if
there are any pages...

> dissolve_free_huge_pages() can be called only when all source hugepages
> are free (all in-use hugepages are successfully migrated.)
> 
[...]
> > > +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> > > +bool is_hugepage_movable(struct page *hpage)
> > > +{
> > > +	struct page *page;
> > > +	struct hstate *h;
> > > +	bool ret = false;
> > > +
> > > +	VM_BUG_ON(!PageHuge(hpage));
> > > +	/*
> > > +	 * This function can be called for a tail page because memory hotplug
> > > +	 * scans movability of pages by pfn range of a memory block.
> > > +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> > > +	 * the scan can start at the tail page of larger hugepages.
> > > +	 * 1GB hugepage is not movable now, so we return with false for now.
> > > +	 */
> > > +	if (PageTail(hpage))
> > > +		return false;
> > > +	h = page_hstate(hpage);
> > > +	spin_lock(&hugetlb_lock);
> > > +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> > > +		if (page == hpage) {
> > > +			ret = true;
> > > +			break;
> > > +		}
> > 
> > Why are you checking that the page is active?
> 
> This is the counterpart to doing PageLRU check for normal pages.
> 
> > It doesn't make much sense
> > to me because nothing prevents it from being freed/allocated right after
> > you release hugetlb_lock.
> 
> Such a race can also happen for normal pages because scan_movable_pages()
> just check PageLRU flags without holding any lock.
> But the caller, __offline_pages(), repeats to call scan_movable_pages()
> until no page in the memblock are judged as movable, and in the repeat loop
> do_migrate_range() does nothing for free (unmovable) pages.
> So there is no behavioral problem even if the movable page is freed just
> after the if(PageLRU) check in scan_movable_page().

yes

> Note that in this loop, allocating pages in the memblock is forbidden
> because we already do set_migratetype_isolate() for them, so we don't have
> to worry about being allocated just after scan_movable_pages().

yes

> I want the same thing to be the case for hugepage. As you pointed out,
> is_hugepage_movable() is not safe from such a race, but in "being freed
> just after is_hugepage_movable() returns true" case we have no problem
> for the same reason described above.
 
yes, this was my point, sorry for not being clear about that. I meant
the costly test is pointless because it doesn't prevent any races and
doesn't tell us much.
If we made sure that all pages on the hugepage_freelists have reference
count 0 (which is not the case now and is yet another source of confusion)
then the whole loop could be replaced by a page_count check.
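
Then something as simple as the following would do (untested; it assumes
free hugepages consistently have a zero refcount):

	bool is_hugepage_movable(struct page *hpage)
	{
		VM_BUG_ON(!PageHuge(hpage));
		if (PageTail(hpage))
			return false;
		/* in-use hugepages have an elevated refcount, free ones 0 */
		return page_count(hpage) > 0;
	}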

> However, in "being allocated just after is_hugepage_movable() returns false"
> case, it seems to be possible to hot-remove an active hugepage.

check_pages_isolated should catch this but it is still racy.

> I think we can avoid this by adding migratetype check in
> alloc_huge_page().

I think dequeue_huge_page_vma should be sufficient, because we are going
through page allocator otherwise and that one is aware of migrate types.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-27 13:52     ` Michal Hocko
@ 2013-03-27 19:19       ` Naoya Horiguchi
  2013-03-28  8:53         ` Michal Hocko
  2013-03-29  5:26       ` Aneesh Kumar K.V
  2013-04-01  5:13       ` Aneesh Kumar K.V
  2 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-27 19:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Aneesh Kumar K.V, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

On Wed, Mar 27, 2013 at 02:52:50PM +0100, Michal Hocko wrote:
> On Tue 26-03-13 16:59:40, Aneesh Kumar K.V wrote:
> > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> [...]
> > > diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
> > > index df0694c..4e01082 100644
> > > --- v3.9-rc3.orig/mm/memory-failure.c
> > > +++ v3.9-rc3/mm/memory-failure.c
> > > @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > >  	int ret;
> > >  	unsigned long pfn = page_to_pfn(page);
> > >  	struct page *hpage = compound_head(page);
> > > +	LIST_HEAD(pagelist);
> > >
> > >  	/*
> > >  	 * This double-check of PageHWPoison is to avoid the race with
> > > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > >  	unlock_page(hpage);
> > >
> > >  	/* Keep page count to indicate a given hugepage is isolated. */
> > > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> > > -				MIGRATE_SYNC);
> > > -	put_page(hpage);
> > > +	list_move(&hpage->lru, &pagelist);
> > 
> > we use hpage->lru to add the hpage to h->hugepage_activelist. This will
> > break a hugetlb cgroup removal isn't it ?
> 
> This particular part will not break removal because
> hugetlb_cgroup_css_offline loops until hugetlb_cgroup_have_usage is 0.

Right.

> Little bit offtopic:
> Btw. hugetlb migration breaks to charging even before this patchset
> AFAICS. The above put_page should remove the last reference and then it
> will uncharge it but I do not see anything that would charge a new page.
> This is all because regula LRU pages are uncharged when they are
> unmapped. But this a different story not related to this series.

It seems to me that alloc_huge_page_node() needs to call
hugetlb_cgroup_charge_cgroup() before dequeuing a new hugepage.
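
Something along these lines (untested sketch; it ignores the
buddy-allocation fallback and the question of which cgroup should
actually be charged):

	struct page *alloc_huge_page_node(struct hstate *h, int nid)
	{
		struct hugetlb_cgroup *h_cg;
		struct page *page;

		if (hugetlb_cgroup_charge_cgroup(hstate_index(h),
					pages_per_huge_page(h), &h_cg))
			return NULL;

		spin_lock(&hugetlb_lock);
		page = dequeue_huge_page_node(h, nid);
		if (page)
			hugetlb_cgroup_commit_charge(hstate_index(h),
					pages_per_huge_page(h), h_cg, page);
		spin_unlock(&hugetlb_lock);

		if (!page)
			hugetlb_cgroup_uncharge_cgroup(hstate_index(h),
					pages_per_huge_page(h), h_cg);
		return page;
	}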

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-26 12:01   ` Aneesh Kumar K.V
@ 2013-03-27 19:28     ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-27 19:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel

On Tue, Mar 26, 2013 at 05:31:51PM +0530, Aneesh Kumar K.V wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> > +bool is_hugepage_movable(struct page *hpage)
> > +{
> > +	struct page *page;
> > +	struct hstate *h;
> > +	bool ret = false;
> > +
> > +	VM_BUG_ON(!PageHuge(hpage));
> > +	/*
> > +	 * This function can be called for a tail page because memory hotplug
> > +	 * scans movability of pages by pfn range of a memory block.
> > +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> > +	 * the scan can start at the tail page of larger hugepages.
> > +	 * 1GB hugepage is not movable now, so we return with false for now.
> > +	 */
> > +	if (PageTail(hpage))
> > +		return false;
> > +	h = page_hstate(hpage);
> > +	spin_lock(&hugetlb_lock);
> > +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> > +		if (page == hpage) {
> > +			ret = true;
> > +			break;
> > +		}
> > +	spin_unlock(&hugetlb_lock);
> > +	return ret;
> > +}
> > +
> 
> May be is_hugepage_active() ?

Yes, it would be nice.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-27 14:19       ` Michal Hocko
@ 2013-03-27 21:29         ` Naoya Horiguchi
  2013-03-27 21:58           ` Naoya Horiguchi
  2013-03-27 22:55           ` Michal Hocko
  0 siblings, 2 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-27 21:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Wed, Mar 27, 2013 at 03:19:21PM +0100, Michal Hocko wrote:
> On Tue 26-03-13 14:23:24, Naoya Horiguchi wrote:
> > On Mon, Mar 25, 2013 at 04:09:52PM +0100, Michal Hocko wrote:
> > > On Fri 22-03-13 16:23:54, Naoya Horiguchi wrote:
> > ...
> > > > index d9d3dd7..ef79871 100644
> > > > --- v3.9-rc3.orig/mm/hugetlb.c
> > > > +++ v3.9-rc3/mm/hugetlb.c
> > > > @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +/* Dissolve a given free hugepage into free pages. */
> > > > +static void dissolve_free_huge_page(struct page *page)
> > > > +{
> > > > +	spin_lock(&hugetlb_lock);
> > > > +	if (PageHuge(page) && !page_count(page)) {
> > > > +		struct hstate *h = page_hstate(page);
> > > > +		int nid = page_to_nid(page);
> > > > +		list_del(&page->lru);
> > > > +		h->free_huge_pages--;
> > > > +		h->free_huge_pages_node[nid]--;
> > > > +		update_and_free_page(h, page);
> > > > +	}
> > > 
> > > What about surplus pages?
> > 
> > This function is only for free hugepage, not for surplus hugepages
> > (which are considered as in-use hugepages.)
> 
> How do you want to get rid of those then? You cannot offline the node if
> there are any pages...

My intention is that offlining should fail if any active hugepages remain
in the memblock you try to offline.

> > dissolve_free_huge_pages() can be called only when all source hugepages
> > are free (all in-use hugepages are successfully migrated.)
> > 
...
> > > > +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> > > > +bool is_hugepage_movable(struct page *hpage)
> > > > +{
> > > > +	struct page *page;
> > > > +	struct hstate *h;
> > > > +	bool ret = false;
> > > > +
> > > > +	VM_BUG_ON(!PageHuge(hpage));
> > > > +	/*
> > > > +	 * This function can be called for a tail page because memory hotplug
> > > > +	 * scans movability of pages by pfn range of a memory block.
> > > > +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> > > > +	 * the scan can start at the tail page of larger hugepages.
> > > > +	 * 1GB hugepage is not movable now, so we return with false for now.
> > > > +	 */
> > > > +	if (PageTail(hpage))
> > > > +		return false;
> > > > +	h = page_hstate(hpage);
> > > > +	spin_lock(&hugetlb_lock);
> > > > +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> > > > +		if (page == hpage) {
> > > > +			ret = true;
> > > > +			break;
> > > > +		}
> > > 
> > > Why are you checking that the page is active?
> > 
> > This is the counterpart to doing PageLRU check for normal pages.
> > 
> > > It doesn't make much sense
> > > to me because nothing prevents it from being freed/allocated right after
> > > you release hugetlb_lock.
> > 
> > Such a race can also happen for normal pages because scan_movable_pages()
> > just check PageLRU flags without holding any lock.
> > But the caller, __offline_pages(), repeats to call scan_movable_pages()
> > until no page in the memblock are judged as movable, and in the repeat loop
> > do_migrate_range() does nothing for free (unmovable) pages.
> > So there is no behavioral problem even if the movable page is freed just
> > after the if(PageLRU) check in scan_movable_page().
> 
> yes
> 
> > Note that in this loop, allocating pages in the memblock is forbidden
> > because we already do set_migratetype_isolate() for them, so we don't have
> > to worry about being allocated just after scan_movable_pages().
> 
> yes
> 
> > I want the same thing to be the case for hugepage. As you pointed out,
> > is_hugepage_movable() is not safe from such a race, but in "being freed
> > just after is_hugepage_movable() returns true" case we have no problem
> > for the same reason described above.
>  
> yes, this was my point, sorry for not being clear about that. I meant
> the costly test is pointless because it doesn't prevent any races and
> doesn't tell us much.

Yes, walking over hugepage_activelist is not preferable when this list
grows long.

> If we made sure that all page on the hugepage_freelists have reference
> 0 (which is now not the case and it is yet another source of confusion)
> then the whole loop could be replaced by page_count check.

I think that free hugepages have refcount 0, but hwpoisoned hugepages
also have refcount 0. However, hwpoison can happen only on limited hardware
and we consider it exceptional, so a page_count check combined with a
PG_hwpoison flag check looks more reasonable to me.

> > However, in "being allocated just after is_hugepage_movable() returns false"
> > case, it seems to be possible to hot-remove an active hugepage.
> 
> check_pages_isolated should catch this but it is still racy.

Right.

> > I think we can avoid this by adding migratetype check in
> > alloc_huge_page().
> 
> I think dequeue_huge_page_vma should be sufficient, because we are going
> through page allocator otherwise and that one is aware of migrate types.

OK.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-27 21:29         ` Naoya Horiguchi
@ 2013-03-27 21:58           ` Naoya Horiguchi
  2013-03-27 22:55           ` Michal Hocko
  1 sibling, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-03-27 21:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Wed, Mar 27, 2013 at 05:29:19PM -0400, Naoya Horiguchi wrote:
...
> > If we made sure that all page on the hugepage_freelists have reference
> > 0 (which is now not the case and it is yet another source of confusion)
> > then the whole loop could be replaced by page_count check.
> 
> I think that free hugepages have refcount 0, but hwpoisoned hugepages
> have also refcount 0. But hwpoison can happen only on limited hardware
> and we consider them as exceptional, so replacing page_count check and
> checking PG_hwpoisoned flag looks more reasonable to me.

Sorry, my mistake. Hwpoisoned hugepages have refcount 1, so there's no
problem with using a page_count check.

Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-27 21:29         ` Naoya Horiguchi
  2013-03-27 21:58           ` Naoya Horiguchi
@ 2013-03-27 22:55           ` Michal Hocko
  1 sibling, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2013-03-27 22:55 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, linux-kernel

On Wed 27-03-13 17:29:19, Naoya Horiguchi wrote:
> On Wed, Mar 27, 2013 at 03:19:21PM +0100, Michal Hocko wrote:
[...]
> > If we made sure that all page on the hugepage_freelists have reference
> > 0 (which is now not the case and it is yet another source of confusion)
> > then the whole loop could be replaced by page_count check.
> 
> I think that free hugepages have refcount 0,

You are right. For some reason I totally missed that we drop
the reference from the page allocator (put_page_testzero in
gather_surplus_pages).

Sorry about the stupid question.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-27 19:19       ` Naoya Horiguchi
@ 2013-03-28  8:53         ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2013-03-28  8:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Aneesh Kumar K.V, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

On Wed 27-03-13 15:19:24, Naoya Horiguchi wrote:
> On Wed, Mar 27, 2013 at 02:52:50PM +0100, Michal Hocko wrote:
> > On Tue 26-03-13 16:59:40, Aneesh Kumar K.V wrote:
> > > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> > [...]
> > > > diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
> > > > index df0694c..4e01082 100644
> > > > --- v3.9-rc3.orig/mm/memory-failure.c
> > > > +++ v3.9-rc3/mm/memory-failure.c
> > > > @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > > >  	int ret;
> > > >  	unsigned long pfn = page_to_pfn(page);
> > > >  	struct page *hpage = compound_head(page);
> > > > +	LIST_HEAD(pagelist);
> > > >
> > > >  	/*
> > > >  	 * This double-check of PageHWPoison is to avoid the race with
> > > > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> > > >  	unlock_page(hpage);
> > > >
> > > >  	/* Keep page count to indicate a given hugepage is isolated. */
> > > > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> > > > -				MIGRATE_SYNC);
> > > > -	put_page(hpage);
> > > > +	list_move(&hpage->lru, &pagelist);
> > > 
> > > we use hpage->lru to add the hpage to h->hugepage_activelist. This will
> > > break a hugetlb cgroup removal isn't it ?
> > 
> > This particular part will not break removal because
> > hugetlb_cgroup_css_offline loops until hugetlb_cgroup_have_usage is 0.
> 
> Right.
> 
> > Little bit offtopic:
> > Btw. hugetlb migration breaks to charging even before this patchset
> > AFAICS. The above put_page should remove the last reference and then it
> > will uncharge it but I do not see anything that would charge a new page.
> > This is all because regula LRU pages are uncharged when they are
> > unmapped. But this a different story not related to this series.
> 
> It seems to me that alloc_huge_page_node() needs to call
> hugetlb_cgroup_charge_cgroup() before dequeuing a new hugepage.

This is not that easy because the new page has to be charged to the same
group as the original one but the migration process might be running in
the context of a different group.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-27 13:52     ` Michal Hocko
  2013-03-27 19:19       ` Naoya Horiguchi
@ 2013-03-29  5:26       ` Aneesh Kumar K.V
  2013-03-29  9:36         ` Michal Hocko
  2013-04-01  5:13       ` Aneesh Kumar K.V
  2 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2013-03-29  5:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

Michal Hocko <mhocko@suse.cz> writes:

> On Tue 26-03-13 16:59:40, Aneesh Kumar K.V wrote:
>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> [...]
>> > diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
>> > index df0694c..4e01082 100644
>> > --- v3.9-rc3.orig/mm/memory-failure.c
>> > +++ v3.9-rc3/mm/memory-failure.c
>> > @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
>> >  	int ret;
>> >  	unsigned long pfn = page_to_pfn(page);
>> >  	struct page *hpage = compound_head(page);
>> > +	LIST_HEAD(pagelist);
>> >
>> >  	/*
>> >  	 * This double-check of PageHWPoison is to avoid the race with
>> > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
>> >  	unlock_page(hpage);
>> >
>> >  	/* Keep page count to indicate a given hugepage is isolated. */
>> > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
>> > -				MIGRATE_SYNC);
>> > -	put_page(hpage);
>> > +	list_move(&hpage->lru, &pagelist);
>> 
>> we use hpage->lru to add the hpage to h->hugepage_activelist. This will
>> break a hugetlb cgroup removal isn't it ?
>
> This particular part will not break removal because
> hugetlb_cgroup_css_offline loops until hugetlb_cgroup_have_usage is 0.
>
> Little bit offtopic:
> Btw. hugetlb migration breaks to charging even before this patchset
> AFAICS. The above put_page should remove the last reference and then it
> will uncharge it but I do not see anything that would charge a new page.
> This is all because regula LRU pages are uncharged when they are
> unmapped. But this a different story not related to this series.


But when we call that put_page, we would have already moved the cgroup
information to the new page. We have

	h_cg = hugetlb_cgroup_from_page(oldhpage);
	set_hugetlb_cgroup(oldhpage, NULL);

	/* move the h_cg details to new cgroup */
	set_hugetlb_cgroup(newhpage, h_cg);


in hugetlb_cgroup_migrate

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-29  5:26       ` Aneesh Kumar K.V
@ 2013-03-29  9:36         ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2013-03-29  9:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

On Fri 29-03-13 10:56:00, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@suse.cz> writes:
[...]
> > Little bit offtopic:
> > Btw. hugetlb migration breaks to charging even before this patchset
> > AFAICS. The above put_page should remove the last reference and then it
> > will uncharge it but I do not see anything that would charge a new page.
> > This is all because regula LRU pages are uncharged when they are
> > unmapped. But this a different story not related to this series.
> 
> 
> But when we call that put_page, we would have alreayd move the cgroup
> information to the new page. We have
> 
> 	h_cg = hugetlb_cgroup_from_page(oldhpage);
> 	set_hugetlb_cgroup(oldhpage, NULL);
> 
> 	/* move the h_cg details to new cgroup */
> 	set_hugetlb_cgroup(newhpage, h_cg);
> 
> 
> in hugetlb_cgroup_migrate

Yes, but the res_counter charge would be missing for the new page after
put_page.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-27 13:52     ` Michal Hocko
  2013-03-27 19:19       ` Naoya Horiguchi
  2013-03-29  5:26       ` Aneesh Kumar K.V
@ 2013-04-01  5:13       ` Aneesh Kumar K.V
  2013-04-02  9:45         ` Michal Hocko
  2 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-01  5:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

Michal Hocko <mhocko@suse.cz> writes:

> On Tue 26-03-13 16:59:40, Aneesh Kumar K.V wrote:
>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> [...]
>> > diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
>> > index df0694c..4e01082 100644
>> > --- v3.9-rc3.orig/mm/memory-failure.c
>> > +++ v3.9-rc3/mm/memory-failure.c
>> > @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
>> >  	int ret;
>> >  	unsigned long pfn = page_to_pfn(page);
>> >  	struct page *hpage = compound_head(page);
>> > +	LIST_HEAD(pagelist);
>> >
>> >  	/*
>> >  	 * This double-check of PageHWPoison is to avoid the race with
>> > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
>> >  	unlock_page(hpage);
>> >
>> >  	/* Keep page count to indicate a given hugepage is isolated. */
>> > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
>> > -				MIGRATE_SYNC);
>> > -	put_page(hpage);
>> > +	list_move(&hpage->lru, &pagelist);
>> 
>> we use hpage->lru to add the hpage to h->hugepage_activelist. This will
>> break a hugetlb cgroup removal isn't it ?
>
> This particular part will not break removal because
> hugetlb_cgroup_css_offline loops until hugetlb_cgroup_have_usage is 0.
>

But we still need to hold hugetlb_lock around that, right?

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-04-01  5:13       ` Aneesh Kumar K.V
@ 2013-04-02  9:45         ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2013-04-02  9:45 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

On Mon 01-04-13 10:43:14, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@suse.cz> writes:
> 
> > On Tue 26-03-13 16:59:40, Aneesh Kumar K.V wrote:
> >> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> > [...]
> >> > diff --git v3.9-rc3.orig/mm/memory-failure.c v3.9-rc3/mm/memory-failure.c
> >> > index df0694c..4e01082 100644
> >> > --- v3.9-rc3.orig/mm/memory-failure.c
> >> > +++ v3.9-rc3/mm/memory-failure.c
> >> > @@ -1467,6 +1467,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >> >  	int ret;
> >> >  	unsigned long pfn = page_to_pfn(page);
> >> >  	struct page *hpage = compound_head(page);
> >> > +	LIST_HEAD(pagelist);
> >> >
> >> >  	/*
> >> >  	 * This double-check of PageHWPoison is to avoid the race with
> >> > @@ -1482,12 +1483,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >> >  	unlock_page(hpage);
> >> >
> >> >  	/* Keep page count to indicate a given hugepage is isolated. */
> >> > -	ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL,
> >> > -				MIGRATE_SYNC);
> >> > -	put_page(hpage);
> >> > +	list_move(&hpage->lru, &pagelist);
> >> 
> >> we use hpage->lru to add the hpage to h->hugepage_activelist. This will
> >> break a hugetlb cgroup removal isn't it ?
> >
> > This particular part will not break removal because
> > hugetlb_cgroup_css_offline loops until hugetlb_cgroup_have_usage is 0.
> >
> 
> But we still need to hold hugetlb_lock around that right ?

Right. Racing hugetlb_cgroup_move_parent and hugetlb_cgroup_migrate could
lead to the new page pointing to a NULL cgroup. That could be fixed by
checking the old page's cgroup for NULL inside hugetlb_lock and using
list_for_each_safe in hugetlb_cgroup_css_offline, no?
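
Roughly what I have in mind (untested, variable declarations omitted):

	/* in hugetlb_cgroup_migrate(), inside hugetlb_lock */
	h_cg = hugetlb_cgroup_from_page(oldhpage);
	if (h_cg) {
		set_hugetlb_cgroup(oldhpage, NULL);
		/* move the h_cg details to the new page */
		set_hugetlb_cgroup(newhpage, h_cg);
		list_move(&newhpage->lru, &h->hugepage_activelist);
	}

	/* and in hugetlb_cgroup_css_offline(), a safe walk so that pages
	 * moved off the activelist under us do not break the iteration */
	spin_lock(&hugetlb_lock);
	list_for_each_entry_safe(page, next, &h->hugepage_activelist, lru)
		hugetlb_cgroup_move_parent(idx, cgroup, page);
	spin_unlock(&hugetlb_lock);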
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
  2013-03-23 15:55   ` Rik van Riel
  2013-03-25 10:13   ` Michal Hocko
@ 2013-04-05 20:33   ` KOSAKI Motohiro
  2013-04-08 20:00     ` Naoya Horiguchi
  2013-04-05 20:33   ` KOSAKI Motohiro
  3 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 20:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel, kosaki.motohiro

> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0a0be33..98a478e 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);

Hm.

How do you test this? From an x86 point of view, this patch seems unnecessary
because hugetlb_fault calls "address &= hugetlb_mask()" first, and then
migration_entry_wait() could grab the right pte lock. And from a !x86 point of
view, this function still doesn't work because huge page != pmd on some arches.

I might be missing something though.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
                     ` (2 preceding siblings ...)
  2013-04-05 20:33   ` KOSAKI Motohiro
@ 2013-04-05 20:33   ` KOSAKI Motohiro
  3 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 20:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel, kosaki.motohiro

> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0a0be33..98a478e 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);

Hm.

How do you test this? From an x86 point of view, this patch seems unnecessary
because hugetlb_fault calls "address &= hugetlb_mask()" first, and then
migration_entry_wait() could grab the right pte lock. And from a !x86 point of
view, this function still doesn't work because huge page != pmd on some arches.

I might be missing something though.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/10] migrate: make core migration code aware of hugepage
  2013-03-26  8:49       ` Michal Hocko
@ 2013-04-05 20:41         ` KOSAKI Motohiro
  0 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 20:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel, kosaki.motohiro

>>> There doesn't seem to be any caller for this function. Please move it to
>>> the patch which uses it.
>>
>> I would do like that if there's only one user of this function, but I thought
>> that it's better to separate this part as changes of common code
>> because this function is commonly used by multiple users which are added by
>> multiple patches later in this series.
> 
> Sure there is no hard rule for this. I just find it much easier to
> review if there is a caller of introduced functionality. In this
> particular case I found out only later that many migrate_pages callers
> were changed to use migrate_movable_pages and made the
> putback_movable_pages cleanup inconsistent between the two.
> 
> It would help to mention what is the planned future usage of the
> introduced function if you prefer to introduce it without users.

I strongly agree with Michal.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
  2013-03-27 13:00           ` Michal Hocko
@ 2013-04-05 21:11             ` KOSAKI Motohiro
  0 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 21:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel, kosaki.motohiro

(3/27/13 9:00 AM), Michal Hocko wrote:
> On Tue 26-03-13 16:35:35, Naoya Horiguchi wrote:
> [...]
>> The differences is that migrate_huge_page() has one hugepage as an argument,
>> and migrate_pages() has a pagelist with multiple hugepages.
>> I already told this before and I'm not sure it's enough to answer the question,
>> so I explain another point about why this patch do like it.
> 
> OK, I am blind. It is
> +       list_move(&hpage->lru, &pagelist);
> +       ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> +                               MIGRATE_SYNC, MR_MEMORY_FAILURE);
> 
> which moves it from active_list and so you have to put it back.
> 
>> I think that we must do putback_*pages() for source pages whether migration
>> succeeds or not.
>> But when we call migrate_pages() with a pagelist,
>> the caller can't access to the successfully migrated source pages
>> after migrate_pages() returns, because they are no longer on the pagelist.
>> So putback of the successfully migrated source pages should be done *in*
>> unmap_and_move() and/or unmap_and_move_huge_page().
> 
> If the migration succeeds then the page becomes unused and free after
> its last reference drops. So I do not see any reason to put it back to
> active list and free it right afterwards.
> On the other hand unmap_and_move does the same thing (although page
> reference counting is a bit more complicated in that case) so it would
> be good to keep in sync with regular pages case.

Even if pages are isolated from the lists, there are several paths that
increase the page count. So putback_pages() closes a race when page count != 1.

I'm not sure, but I guess follow_hugepage() can hit the same race.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/10] migrate: clean up migrate_huge_page()
  2013-03-22 20:23 ` [PATCH 04/10] migrate: clean up migrate_huge_page() Naoya Horiguchi
@ 2013-04-05 21:13   ` KOSAKI Motohiro
  0 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 21:13 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel, kosaki.motohiro

(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> Due to the previous patch, soft_offline_huge_page() switches to use
> migrate_pages(), and migrate_huge_page() is not used any more.
> So let's remove it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
  2013-03-22 20:23 ` [PATCH 05/10] migrate: add hugepage migration code to migrate_pages() Naoya Horiguchi
  2013-03-25 13:04   ` Michal Hocko
@ 2013-04-05 21:17   ` KOSAKI Motohiro
  2013-04-08 20:21     ` Naoya Horiguchi
  1 sibling, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 21:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel, kosaki.motohiro

(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> This patch extends check_range() to handle vma with VM_HUGETLB set.
> We will be able to migrate hugepage with migrate_pages(2) after
> applying the enablement patch which comes later in this series.
> 
> Note that for larger hugepages (covered by pud entries, 1GB for
> x86_64 for example), we simply skip it now.

check_range() largely duplicates mm_walk and is a quirky subset of it.
Instead, could you replace it with mm_walk and enhance/clean up mm_walk?
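
For example, something roughly like this (untested sketch; the
queue_pages_* names are made up for illustration):

	static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
				       unsigned long addr, unsigned long end,
				       struct mm_walk *walk)
	{
		/* isolate the hugepage here if it matches the nodemask */
		return 0;
	}

	/* ... and where check_range() used to open-code the walk: */
	struct mm_walk queue_pages_walk = {
		.pmd_entry	= queue_pages_pmd_range,  /* made-up name */
		.hugetlb_entry	= queue_pages_hugetlb,
		.mm		= vma->vm_mm,
		.private	= private,
	};
	walk_page_range(vma->vm_start, vma->vm_end, &queue_pages_walk);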



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
  2013-03-22 20:23 ` [PATCH 07/10] mbind: add hugepage migration code to mbind() Naoya Horiguchi
  2013-03-25 13:49   ` Michal Hocko
@ 2013-04-05 22:18   ` KOSAKI Motohiro
  2013-04-08 20:25     ` Naoya Horiguchi
  1 sibling, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 22:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel, kosaki.motohiro

> @@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	if (!err) {
>  		int nr_failed = 0;
>  
> -		if (!list_empty(&pagelist)) {
> -			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
> -			nr_failed = migrate_pages(&pagelist, new_vma_page,
> +		WARN_ON_ONCE(flags & MPOL_MF_LAZY);

???
MPOL_MF_LAZY always outputs a warning? That seems really insane.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
  2013-03-25 13:49   ` Michal Hocko
@ 2013-04-05 22:23     ` KOSAKI Motohiro
  2013-04-06  7:04       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-05 22:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel, kosaki.motohiro

>> -	if (!new_hpage)
>> +	/*
>> +	 * Getting a new hugepage with alloc_huge_page() (which can happen
>> +	 * when migration is caused by mbind()) can return ERR_PTR value,
>> +	 * so we need take care of the case here.
>> +	 */
>> +	if (!new_hpage || IS_ERR_VALUE(new_hpage))
>>  		return -ENOMEM;
> 
> Please no. get_new_page returns NULL or a page. You are hooking a wrong
> callback here. The error value doesn't make any sense here. IMO you
> should just wrap alloc_huge_page by something that returns NULL or page.

I suggest just the opposite: new_vma_page() should always return -ENOMEM, -ENOSPC etc. instead
of NULL, and the caller should propagate it to userland.
I guess userland wants to distinguish why mbind() failed.

Anyway, if new_vma_page() has a chance to return both NULL and -ENOMEM, that's a bug.
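
A minimal sketch of the wrapper Michal suggests, keeping the get_new_page
contract of "a page or NULL" (the wrapper name is made up):

static struct page *new_hugepage_or_null(struct vm_area_struct *vma,
					 unsigned long addr, int avoid_reserve)
{
	struct page *page = alloc_huge_page(vma, addr, avoid_reserve);

	/* alloc_huge_page() can return an ERR_PTR value; fold it to NULL */
	return IS_ERR(page) ? NULL : page;
}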


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-03-22 20:23 ` [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage Naoya Horiguchi
  2013-03-25 15:09   ` Michal Hocko
  2013-03-26 12:01   ` Aneesh Kumar K.V
@ 2013-04-06  0:13   ` KOSAKI Motohiro
  2013-04-09 20:07     ` Naoya Horiguchi
  2 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-06  0:13 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel, kosaki.motohiro

(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> Currently we can't offline memory blocks which contain hugepages because
> a hugepage is considered as an unmovable page. But now with this patch
> series, a hugepage has become movable, so by using hugepage migration we
> can offline such memory blocks.
> 
> What's different from other users of hugepage migration is that we need
> to decompose all the hugepages inside the target memory block into free
> buddy pages after hugepage migration, because otherwise free hugepages
> remaining in the memory block would get in the way of the memory offlining.
> For this reason we introduce new functions dissolve_free_huge_page() and
> dissolve_free_huge_pages().
> 
> Other than that, what this patch does is straightforwardly to add hugepage
> migration code, that is, adding hugepage code to the functions which scan
> over pfn and collect hugepages to be migrated, and adding a hugepage
> allocation function to alloc_migrate_target().
> 
> As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
> over them because it's larger than memory block. So we now simply leave
> it to fail as it is.
> 
> ChangeLog v2:
>  - changed return value type of is_hugepage_movable() to bool
>  - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
>  - moved if(PageHuge) block before get_page_unless_zero() in do_migrate_range()
>  - do_migrate_range() returns -EBUSY for hugepages larger than memory block
>  - dissolve_free_huge_pages() calculates scan step and sets it to minimum
>    hugepage size
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/hugetlb.h |  6 +++++
>  mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c     | 42 +++++++++++++++++++++++++++--------
>  mm/page_alloc.c         | 12 ++++++++++
>  mm/page_isolation.c     |  5 +++++
>  5 files changed, 114 insertions(+), 9 deletions(-)
> 
> diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
> index 981eff8..8220a8a 100644
> --- v3.9-rc3.orig/include/linux/hugetlb.h
> +++ v3.9-rc3/include/linux/hugetlb.h
> @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
>  void putback_active_hugepage(struct page *page);
>  void putback_active_hugepages(struct list_head *l);
>  void migrate_hugepage_add(struct page *page, struct list_head *list);
> +bool is_hugepage_movable(struct page *page);
>  void copy_huge_page(struct page *dst, struct page *src);
>  
>  extern unsigned long hugepages_treat_as_movable;
> @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
>  #define putback_active_hugepage(p) 0
>  #define putback_active_hugepages(l) 0
>  #define migrate_hugepage_add(p, l) 0
> +#define is_hugepage_movable(x) 0

should be false instead of 0.


>  static inline void copy_huge_page(struct page *dst, struct page *src)
>  {
>  }
> @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
>  	return h - hstates;
>  }
>  
> +extern void dissolve_free_huge_pages(unsigned long start_pfn,
> +				     unsigned long end_pfn);
> +
>  #else
>  struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
> @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
>  }
>  #define hstate_index_to_shift(index) 0
>  #define hstate_index(h) 0
> +#define dissolve_free_huge_pages(s, e) 0

No need for the 0.

>  #endif
>  
>  #endif /* _LINUX_HUGETLB_H */
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index d9d3dd7..ef79871 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  	return ret;
>  }
>  
> +/* Dissolve a given free hugepage into free pages. */
> +static void dissolve_free_huge_page(struct page *page)
> +{
> +	spin_lock(&hugetlb_lock);
> +	if (PageHuge(page) && !page_count(page)) {
> +		struct hstate *h = page_hstate(page);
> +		int nid = page_to_nid(page);
> +		list_del(&page->lru);
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		update_and_free_page(h, page);
> +	}
> +	spin_unlock(&hugetlb_lock);
> +}
> +
> +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
> +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned int order = 8 * sizeof(void *);
> +	unsigned long pfn;
> +	struct hstate *h;
> +
> +	/* Set scan step to minimum hugepage size */
> +	for_each_hstate(h)
> +		if (order > huge_page_order(h))
> +			order = huge_page_order(h);
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
> +		dissolve_free_huge_page(pfn_to_page(pfn));
> +}

hotplug.c must not have such a pure hugepage function.


> +
>  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>  {
>  	struct page *page;
> @@ -3155,6 +3185,34 @@ static int is_hugepage_on_freelist(struct page *hpage)
>  	return 0;
>  }
>  
> +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> +bool is_hugepage_movable(struct page *hpage)
> +{
> +	struct page *page;
> +	struct hstate *h;
> +	bool ret = false;
> +
> +	VM_BUG_ON(!PageHuge(hpage));
> +	/*
> +	 * This function can be called for a tail page because memory hotplug
> +	 * scans movability of pages by pfn range of a memory block.
> +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> +	 * the scan can start at the tail page of larger hugepages.
> +	 * 1GB hugepage is not movable now, so we return with false for now.
> +	 */
> +	if (PageTail(hpage))
> +		return false;
> +	h = page_hstate(hpage);
> +	spin_lock(&hugetlb_lock);
> +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> +		if (page == hpage) {
> +			ret = true;
> +			break;
> +		}
> +	spin_unlock(&hugetlb_lock);
> +	return ret;
> +}
> +
>  /*
>   * This function is called from memory failure code.
>   * Assume the caller holds page lock of the head page.
> diff --git v3.9-rc3.orig/mm/memory_hotplug.c v3.9-rc3/mm/memory_hotplug.c
> index 9597eec..2d206e8 100644
> --- v3.9-rc3.orig/mm/memory_hotplug.c
> +++ v3.9-rc3/mm/memory_hotplug.c
> @@ -30,6 +30,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/firmware-map.h>
>  #include <linux/stop_machine.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1215,10 +1216,12 @@ static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
>  }
>  
>  /*
> - * Scanning pfn is much easier than scanning lru list.
> - * Scan pfn from start to end and Find LRU page.
> + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> + * and hugepages). We scan pfn because it's much easier than scanning over
> + * linked list. This function returns the pfn of the first found movable
> + * page if it's found, otherwise 0.
>   */
> -static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> +static unsigned long scan_movable_pages(unsigned long start, unsigned long end)

We can kill scan_lru_pages() completely. It's a mere minor optimization, and memory
hotremove is definitely not a hot path.


>  {
>  	unsigned long pfn;
>  	struct page *page;
> @@ -1227,6 +1230,12 @@ static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
>  			page = pfn_to_page(pfn);
>  			if (PageLRU(page))
>  				return pfn;
> +			if (PageHuge(page)) {
> +				if (is_hugepage_movable(page))
> +					return pfn;
> +				else
> +					pfn += (1 << compound_order(page)) - 1;
> +			}
>  		}
>  	}
>  	return 0;
> @@ -1247,6 +1256,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		if (!pfn_valid(pfn))
>  			continue;
>  		page = pfn_to_page(pfn);
> +
> +		if (PageHuge(page)) {
> +			struct page *head = compound_head(page);
> +			pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
> +			if (compound_order(head) > PFN_SECTION_SHIFT) {
> +				ret = -EBUSY;
> +				break;
> +			}
> +			if (!get_page_unless_zero(page))
> +				continue;
> +			list_move_tail(&head->lru, &source);
> +			move_pages -= 1 << compound_order(head);
> +			continue;
> +		}
> +
>  		if (!get_page_unless_zero(page))
>  			continue;
>  		/*
> @@ -1279,7 +1303,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  	}
>  	if (!list_empty(&source)) {
>  		if (not_managed) {
> -			putback_lru_pages(&source);
> +			putback_movable_pages(&source);
>  			goto out;
>  		}
>  
> @@ -1287,10 +1311,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		 * alloc_migrate_target should be improooooved!!
>  		 * migrate_pages returns # of failed pages.
>  		 */
> -		ret = migrate_pages(&source, alloc_migrate_target, 0,
> +		ret = migrate_movable_pages(&source, alloc_migrate_target, 0,
>  					MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> -		if (ret)
> -			putback_lru_pages(&source);
>  	}
>  out:
>  	return ret;
> @@ -1533,8 +1555,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		drain_all_pages();
>  	}

After applying your patch, __offline_pages() frees hugetlb persistent and surplus pages.
Thus it should allocate huge pages of the same size on other nodes.
Otherwise, the total number of hugepages implicitly decreases and an application may crash
after the page offlining succeeds.


>  
> -	pfn = scan_lru_pages(start_pfn, end_pfn);
> -	if (pfn) { /* We have page on LRU */
> +	pfn = scan_movable_pages(start_pfn, end_pfn);
> +	if (pfn) { /* We have movable pages */
>  		ret = do_migrate_range(pfn, end_pfn);
>  		if (!ret) {
>  			drain = 1;
> @@ -1553,6 +1575,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	yield();
>  	/* drain pcp pages, this is synchronous. */
>  	drain_all_pages();
> +	/* dissolve all free hugepages inside the memory block */
> +	dissolve_free_huge_pages(start_pfn, end_pfn);
>  	/* check again */
>  	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
>  	if (offlined_pages < 0) {
> diff --git v3.9-rc3.orig/mm/page_alloc.c v3.9-rc3/mm/page_alloc.c
> index 8fcced7..09a95e7 100644
> --- v3.9-rc3.orig/mm/page_alloc.c
> +++ v3.9-rc3/mm/page_alloc.c
> @@ -59,6 +59,7 @@
>  #include <linux/migrate.h>
>  #include <linux/page-debug-flags.h>
>  #include <linux/sched/rt.h>
> +#include <linux/hugetlb.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -5716,6 +5717,17 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  			continue;
>  
>  		page = pfn_to_page(check);
> +
> +		/*
> +		 * Hugepages are not in LRU lists, but they're movable.
> +		 * We need not scan over tail pages because we don't
> +		 * handle each tail page individually in migration.
> +		 */
> +		if (PageHuge(page)) {
> +			iter += (1 << compound_order(page)) - 1;
> +			continue;
> +		}

Your patch description says we can't move 1GB hugepages, so this seems
too brutal.


> +
>  		/*
>  		 * We can't use page_count without pin a page
>  		 * because another CPU can free compound page.
> diff --git v3.9-rc3.orig/mm/page_isolation.c v3.9-rc3/mm/page_isolation.c
> index 383bdbb..cf48ef6 100644
> --- v3.9-rc3.orig/mm/page_isolation.c
> +++ v3.9-rc3/mm/page_isolation.c
> @@ -6,6 +6,7 @@
>  #include <linux/page-isolation.h>
>  #include <linux/pageblock-flags.h>
>  #include <linux/memory.h>
> +#include <linux/hugetlb.h>
>  #include "internal.h"
>  
>  int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
> @@ -252,6 +253,10 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>  {
>  	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
>  
> +	if (PageHuge(page))
> +		return alloc_huge_page_node(page_hstate(compound_head(page)),
> +					    numa_node_id());

numa_node_id() is really silly. This might lead to allocation from the node being offlined.

And offline_pages() should mark the hstate as isolated, like normal pages, to prohibit
new allocation there first.
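
A rough sketch of that direction; the node selection below is only one way to
stay off the source node and falls back to it if it is the only one online:

	if (PageHuge(page)) {
		int nid = next_node(page_to_nid(page), node_online_map);

		if (nid == MAX_NUMNODES)
			nid = first_node(node_online_map);
		/* allocate the destination hugepage away from the source node */
		return alloc_huge_page_node(page_hstate(compound_head(page)),
					    nid);
	}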




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable
  2013-03-25 15:12   ` Michal Hocko
@ 2013-04-06  0:15     ` KOSAKI Motohiro
  0 siblings, 0 replies; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-06  0:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel, kosaki.motohiro

(3/25/13 11:12 AM), Michal Hocko wrote:
> On Fri 22-03-13 16:23:55, Naoya Horiguchi wrote:
> [...]
>> @@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
>>  			void __user *buffer,
>>  			size_t *length, loff_t *ppos)
>>  {
>> -	proc_dointvec(table, write, buffer, length, ppos);
>> -	if (hugepages_treat_as_movable)
>> -		htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
>> -	else
>> -		htlb_alloc_mask = GFP_HIGHUSER;
>> +	/* hugepages_treat_as_movable is obsolete and to be removed. */
> 
> WARN_ON_ONCE("This knob is obsolete and has no effect. It is scheduled for removal")

Indeed.
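
A small sketch of what that handler could look like, warning only on writes
(illustrative wording):

int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
			void __user *buffer,
			size_t *length, loff_t *ppos)
{
	if (write)
		WARN_ON_ONCE("hugepages_treat_as_movable is obsolete "
			     "and scheduled for removal");
	/* the knob still parses but no longer affects htlb_alloc_mask */
	return proc_dointvec(table, write, buffer, length, ppos);
}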

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
  2013-04-05 22:23     ` KOSAKI Motohiro
@ 2013-04-06  7:04       ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2013-04-06  7:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mel Gorman,
	Hugh Dickins, KOSAKI Motohiro, Andi Kleen, Hillf Danton,
	linux-kernel

On Fri 05-04-13 18:23:07, KOSAKI Motohiro wrote:
> >> -	if (!new_hpage)
> >> +	/*
> >> +	 * Getting a new hugepage with alloc_huge_page() (which can happen
> >> +	 * when migration is caused by mbind()) can return ERR_PTR value,
> >> +	 * so we need take care of the case here.
> >> +	 */
> >> +	if (!new_hpage || IS_ERR_VALUE(new_hpage))
> >>  		return -ENOMEM;
> > 
> > Please no. get_new_page returns NULL or a page. You are hooking a wrong
> > callback here. The error value doesn't make any sense here. IMO you
> > should just wrap alloc_huge_page by something that returns NULL or page.
> 
> I suggest just the opposite: new_vma_page() should always return -ENOMEM, -ENOSPC etc. instead
> of NULL, and the caller should propagate it to userland.
> I guess userland wants to distinguish why mbind() failed.

Sure, and I wasn't suggesting changing alloc_huge_page. I was just
pointing out that new_page_t callbacks used to return a page or NULL, and the hugetlb
unmap_and_move shouldn't be any different in that regard, so using
alloc_huge_page directly is not a good fit here.

> Anyway, if new_vma_page() has a chance to return both NULL and
> -ENOMEM, that's a bug.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
  2013-04-05 20:33   ` KOSAKI Motohiro
@ 2013-04-08 20:00     ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-04-08 20:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel

On Fri, Apr 05, 2013 at 04:33:32PM -0400, KOSAKI Motohiro wrote:
> > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> > index 0a0be33..98a478e 100644
> > --- v3.9-rc3.orig/mm/hugetlb.c
> > +++ v3.9-rc3/mm/hugetlb.c
> > @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (ptep) {
> >  		entry = huge_ptep_get(ptep);
> >  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> > -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> > +			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);
> 
> Hm.
> 
> How do you test this? From the x86 point of view, this patch seems unnecessary because
> hugetlb_fault() does "address &= hugetlb_mask()" first, and then migration_entry_wait()
> can grab the right pte lock. And from the !x86 point of view, this function still doesn't work
> because a huge page != pmd on some arches.

I kicked off hugepage migration for an address range that I was repeatedly accessing
in a loop, and checked what happened (whether a soft lockup occurs or not.)
But I don't fully understand what the problem is, and I might have defined
the problem wrongly, so give me time to clarify it.

And I fully agree that this function should be arch dependent.
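
A rough sketch of the direction, assuming a __migration_entry_wait() helper is
factored out of migration_entry_wait() so the hugetlb side can pass the lock it
actually takes:

void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
{
	/* hugetlb PTEs are protected by mm->page_table_lock here */
	spinlock_t *ptl = &mm->page_table_lock;

	__migration_entry_wait(mm, pte, ptl);
}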

> I might be missing though.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
  2013-04-05 21:17   ` KOSAKI Motohiro
@ 2013-04-08 20:21     ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-04-08 20:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel

On Fri, Apr 05, 2013 at 05:17:16PM -0400, KOSAKI Motohiro wrote:
> (3/22/13 4:23 PM), Naoya Horiguchi wrote:
> > This patch extends check_range() to handle vma with VM_HUGETLB set.
> > We will be able to migrate hugepage with migrate_pages(2) after
> > applying the enablement patch which comes later in this series.
> > 
> > Note that for larger hugepages (covered by pud entries, 1GB for
> > x86_64 for example), we simply skip it now.
> 
> check_range() largely duplicates mm_walk and is a quirky subset of it.
> Instead, could you replace it with mm_walk and enhance/clean up mm_walk?

OK, I'll try this.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
  2013-04-05 22:18   ` KOSAKI Motohiro
@ 2013-04-08 20:25     ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-04-08 20:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel

On Fri, Apr 05, 2013 at 06:18:02PM -0400, KOSAKI Motohiro wrote:
> > @@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned long len,
> >  	if (!err) {
> >  		int nr_failed = 0;
> >  
> > -		if (!list_empty(&pagelist)) {
> > -			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
> > -			nr_failed = migrate_pages(&pagelist, new_vma_page,
> > +		WARN_ON_ONCE(flags & MPOL_MF_LAZY);
> 
> ???
> MPOL_MF_LAZY always outputs a warning? That seems really insane.

So I'll stop replacing this block with a migrate_movable_pages() call and
leave this WARN as it is.
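
That is, roughly keep the pre-patch structure with the warning inside the
guard (sketch only):

		if (!list_empty(&pagelist)) {
			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
			nr_failed = migrate_pages(&pagelist, new_vma_page,
					(unsigned long)vma,
					MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
			if (nr_failed)
				putback_lru_pages(&pagelist);
		}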

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-04-06  0:13   ` KOSAKI Motohiro
@ 2013-04-09 20:07     ` Naoya Horiguchi
  2013-04-09 21:27       ` KOSAKI Motohiro
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-04-09 20:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins,
	KOSAKI Motohiro, Andi Kleen, Hillf Danton, Michal Hocko,
	linux-kernel

On Fri, Apr 05, 2013 at 08:13:47PM -0400, KOSAKI Motohiro wrote:
> (3/22/13 4:23 PM), Naoya Horiguchi wrote:
> > Currently we can't offline memory blocks which contain hugepages because
> > a hugepage is considered as an unmovable page. But now with this patch
> > series, a hugepage has become movable, so by using hugepage migration we
> > can offline such memory blocks.
> > 
> > What's different from other users of hugepage migration is that we need
> > to decompose all the hugepages inside the target memory block into free
> > buddy pages after hugepage migration, because otherwise free hugepages
> > remaining in the memory block would get in the way of the memory offlining.
> > For this reason we introduce new functions dissolve_free_huge_page() and
> > dissolve_free_huge_pages().
> > 
> > Other than that, what this patch does is straightforwardly to add hugepage
> > migration code, that is, adding hugepage code to the functions which scan
> > over pfn and collect hugepages to be migrated, and adding a hugepage
> > allocation function to alloc_migrate_target().
> > 
> > As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
> > over them because it's larger than memory block. So we now simply leave
> > it to fail as it is.
> > 
> > ChangeLog v2:
> >  - changed return value type of is_hugepage_movable() to bool
> >  - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
> >  - moved if(PageHuge) block before get_page_unless_zero() in do_migrate_range()
> >  - do_migrate_range() returns -EBUSY for hugepages larger than memory block
> >  - dissolve_free_huge_pages() calculates scan step and sets it to minimum
> >    hugepage size
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  include/linux/hugetlb.h |  6 +++++
> >  mm/hugetlb.c            | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/memory_hotplug.c     | 42 +++++++++++++++++++++++++++--------
> >  mm/page_alloc.c         | 12 ++++++++++
> >  mm/page_isolation.c     |  5 +++++
> >  5 files changed, 114 insertions(+), 9 deletions(-)
> > 
> > diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
> > index 981eff8..8220a8a 100644
> > --- v3.9-rc3.orig/include/linux/hugetlb.h
> > +++ v3.9-rc3/include/linux/hugetlb.h
> > @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
> >  void putback_active_hugepage(struct page *page);
> >  void putback_active_hugepages(struct list_head *l);
> >  void migrate_hugepage_add(struct page *page, struct list_head *list);
> > +bool is_hugepage_movable(struct page *page);
> >  void copy_huge_page(struct page *dst, struct page *src);
> >  
> >  extern unsigned long hugepages_treat_as_movable;
> > @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
> >  #define putback_active_hugepage(p) 0
> >  #define putback_active_hugepages(l) 0
> >  #define migrate_hugepage_add(p, l) 0
> > +#define is_hugepage_movable(x) 0
> 
> should be false instead of 0.

OK.

> 
> >  static inline void copy_huge_page(struct page *dst, struct page *src)
> >  {
> >  }
> > @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
> >  	return h - hstates;
> >  }
> >  
> > +extern void dissolve_free_huge_pages(unsigned long start_pfn,
> > +				     unsigned long end_pfn);
> > +
> >  #else
> >  struct hstate {};
> >  #define alloc_huge_page(v, a, r) NULL
> > @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
> >  }
> >  #define hstate_index_to_shift(index) 0
> >  #define hstate_index(h) 0
> > +#define dissolve_free_huge_pages(s, e) 0
> 
> No need for the 0.

OK.

> >  #endif
> >  
> >  #endif /* _LINUX_HUGETLB_H */
> > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> > index d9d3dd7..ef79871 100644
> > --- v3.9-rc3.orig/mm/hugetlb.c
> > +++ v3.9-rc3/mm/hugetlb.c
> > @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> >  	return ret;
> >  }
> >  
> > +/* Dissolve a given free hugepage into free pages. */
> > +static void dissolve_free_huge_page(struct page *page)
> > +{
> > +	spin_lock(&hugetlb_lock);
> > +	if (PageHuge(page) && !page_count(page)) {
> > +		struct hstate *h = page_hstate(page);
> > +		int nid = page_to_nid(page);
> > +		list_del(&page->lru);
> > +		h->free_huge_pages--;
> > +		h->free_huge_pages_node[nid]--;
> > +		update_and_free_page(h, page);
> > +	}
> > +	spin_unlock(&hugetlb_lock);
> > +}
> > +
> > +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
> > +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > +	unsigned int order = 8 * sizeof(void *);
> > +	unsigned long pfn;
> > +	struct hstate *h;
> > +
> > +	/* Set scan step to minimum hugepage size */
> > +	for_each_hstate(h)
> > +		if (order > huge_page_order(h))
> > +			order = huge_page_order(h);
> > +	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
> > +		dissolve_free_huge_page(pfn_to_page(pfn));
> > +}
> 
> hotplug.c must not have such a pure hugepage function.

This code is placed in mm/hugetlb.c.

> >  {
> >  	struct page *page;
> > @@ -3155,6 +3185,34 @@ static int is_hugepage_on_freelist(struct page *hpage)
> >  	return 0;
> >  }
> >  
> > +/* Returns true for head pages of in-use hugepages, otherwise returns false. */
> > +bool is_hugepage_movable(struct page *hpage)
> > +{
> > +	struct page *page;
> > +	struct hstate *h;
> > +	bool ret = false;
> > +
> > +	VM_BUG_ON(!PageHuge(hpage));
> > +	/*
> > +	 * This function can be called for a tail page because memory hotplug
> > +	 * scans movability of pages by pfn range of a memory block.
> > +	 * Larger hugepages (1GB for x86_64) are larger than memory block, so
> > +	 * the scan can start at the tail page of larger hugepages.
> > +	 * 1GB hugepage is not movable now, so we return with false for now.
> > +	 */
> > +	if (PageTail(hpage))
> > +		return false;
> > +	h = page_hstate(hpage);
> > +	spin_lock(&hugetlb_lock);
> > +	list_for_each_entry(page, &h->hugepage_activelist, lru)
> > +		if (page == hpage) {
> > +			ret = true;
> > +			break;
> > +		}
> > +	spin_unlock(&hugetlb_lock);
> > +	return ret;
> > +}
> > +
> >  /*
> >   * This function is called from memory failure code.
> >   * Assume the caller holds page lock of the head page.
> > diff --git v3.9-rc3.orig/mm/memory_hotplug.c v3.9-rc3/mm/memory_hotplug.c
> > index 9597eec..2d206e8 100644
> > --- v3.9-rc3.orig/mm/memory_hotplug.c
> > +++ v3.9-rc3/mm/memory_hotplug.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/mm_inline.h>
> >  #include <linux/firmware-map.h>
> >  #include <linux/stop_machine.h>
> > +#include <linux/hugetlb.h>
> >  
> >  #include <asm/tlbflush.h>
> >  
> > @@ -1215,10 +1216,12 @@ static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
> >  }
> >  
> >  /*
> > - * Scanning pfn is much easier than scanning lru list.
> > - * Scan pfn from start to end and Find LRU page.
> > + * Scan pfn range [start,end) to find movable/migratable pages (LRU pages
> > + * and hugepages). We scan pfn because it's much easier than scanning over
> > + * linked list. This function returns the pfn of the first found movable
> > + * page if it's found, otherwise 0.
> >   */
> > -static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> > +static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> 
> We can kill scan_lru_pages() completely. It's a mere minor optimization, and memory
> hotremove is definitely not a hot path.

OK.

> 
> >  {
> >  	unsigned long pfn;
> >  	struct page *page;
> > @@ -1227,6 +1230,12 @@ static unsigned long scan_lru_pages(unsigned long start, unsigned long end)
> >  			page = pfn_to_page(pfn);
> >  			if (PageLRU(page))
> >  				return pfn;
> > +			if (PageHuge(page)) {
> > +				if (is_hugepage_movable(page))
> > +					return pfn;
> > +				else
> > +					pfn += (1 << compound_order(page)) - 1;
> > +			}
> >  		}
> >  	}
> >  	return 0;
> > @@ -1247,6 +1256,21 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >  		if (!pfn_valid(pfn))
> >  			continue;
> >  		page = pfn_to_page(pfn);
> > +
> > +		if (PageHuge(page)) {
> > +			struct page *head = compound_head(page);
> > +			pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
> > +			if (compound_order(head) > PFN_SECTION_SHIFT) {
> > +				ret = -EBUSY;
> > +				break;
> > +			}
> > +			if (!get_page_unless_zero(page))
> > +				continue;
> > +			list_move_tail(&head->lru, &source);
> > +			move_pages -= 1 << compound_order(head);
> > +			continue;
> > +		}
> > +
> >  		if (!get_page_unless_zero(page))
> >  			continue;
> >  		/*
> > @@ -1279,7 +1303,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >  	}
> >  	if (!list_empty(&source)) {
> >  		if (not_managed) {
> > -			putback_lru_pages(&source);
> > +			putback_movable_pages(&source);
> >  			goto out;
> >  		}
> >  
> > @@ -1287,10 +1311,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >  		 * alloc_migrate_target should be improooooved!!
> >  		 * migrate_pages returns # of failed pages.
> >  		 */
> > -		ret = migrate_pages(&source, alloc_migrate_target, 0,
> > +		ret = migrate_movable_pages(&source, alloc_migrate_target, 0,
> >  					MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> > -		if (ret)
> > -			putback_lru_pages(&source);
> >  	}
> >  out:
> >  	return ret;
> > @@ -1533,8 +1555,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >  		drain_all_pages();
> >  	}
> 
> After applying your patch, __offline_pages() frees hugetlb persistent and surplus pages.
> Thus it should allocate huge pages of the same size on other nodes.
> Otherwise, the total number of hugepages implicitly decreases and an application may crash
> after the page offlining succeeds.

I'm not sure that this should be done in the kernel, because the user processes
which trigger page migration should know more about what they want.
It seems more reasonable to me to leave it to userspace.

> 
> >  
> > -	pfn = scan_lru_pages(start_pfn, end_pfn);
> > -	if (pfn) { /* We have page on LRU */
> > +	pfn = scan_movable_pages(start_pfn, end_pfn);
> > +	if (pfn) { /* We have movable pages */
> >  		ret = do_migrate_range(pfn, end_pfn);
> >  		if (!ret) {
> >  			drain = 1;
> > @@ -1553,6 +1575,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >  	yield();
> >  	/* drain pcp pages, this is synchronous. */
> >  	drain_all_pages();
> > +	/* dissolve all free hugepages inside the memory block */
> > +	dissolve_free_huge_pages(start_pfn, end_pfn);
> >  	/* check again */
> >  	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
> >  	if (offlined_pages < 0) {
> > diff --git v3.9-rc3.orig/mm/page_alloc.c v3.9-rc3/mm/page_alloc.c
> > index 8fcced7..09a95e7 100644
> > --- v3.9-rc3.orig/mm/page_alloc.c
> > +++ v3.9-rc3/mm/page_alloc.c
> > @@ -59,6 +59,7 @@
> >  #include <linux/migrate.h>
> >  #include <linux/page-debug-flags.h>
> >  #include <linux/sched/rt.h>
> > +#include <linux/hugetlb.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/div64.h>
> > @@ -5716,6 +5717,17 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
> >  			continue;
> >  
> >  		page = pfn_to_page(check);
> > +
> > +		/*
> > +		 * Hugepages are not in LRU lists, but they're movable.
> > +		 * We need not scan over tail pages because we don't
> > +		 * handle each tail page individually in migration.
> > +		 */
> > +		if (PageHuge(page)) {
> > +			iter += (1 << compound_order(page)) - 1;
> > +			continue;
> > +		}
> 
> Your patch description says we can't move 1GB hugepages, so this seems
> too brutal.

iter should be set to the last tail page of the hugepage.

> > +
> >  		/*
> >  		 * We can't use page_count without pin a page
> >  		 * because another CPU can free compound page.
> > diff --git v3.9-rc3.orig/mm/page_isolation.c v3.9-rc3/mm/page_isolation.c
> > index 383bdbb..cf48ef6 100644
> > --- v3.9-rc3.orig/mm/page_isolation.c
> > +++ v3.9-rc3/mm/page_isolation.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/page-isolation.h>
> >  #include <linux/pageblock-flags.h>
> >  #include <linux/memory.h>
> > +#include <linux/hugetlb.h>
> >  #include "internal.h"
> >  
> >  int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
> > @@ -252,6 +253,10 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
> >  {
> >  	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
> >  
> > +	if (PageHuge(page))
> > +		return alloc_huge_page_node(page_hstate(compound_head(page)),
> > +					    numa_node_id());
> 
> numa_node_id() is really silly. This might lead to allocation from the node being offlined.

Right, it should've been alloc_huge_page().

> And offline_pages() should mark the hstate as isolated, like normal pages, to prohibit
> new allocation there first.

It seems that alloc_migrate_target() calls alloc_page() for normal pages,
and the destination pages can be in the same node as the source pages
(new page allocations from the same memblock are prohibited.)
So if we want to avoid new page allocation from the same node,
this is a problem for both normal and huge pages.

BTW, is it correct to think that all users of memory hotplug assume
that they want to hotplug a whole node (not just part of it)?
If that's correct, introducing a kind of "allocate pages from the nearest
neighbor node" could be an improvement.
But I'm not sure yet how hard it is to implement.
Or, if my assumption is wrong, what kind of real use cases are there for doing
memory hotplug in smaller (zone/memblock) units?

Anyway, thank you for the detailed reviews/comments.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-04-09 20:07     ` Naoya Horiguchi
@ 2013-04-09 21:27       ` KOSAKI Motohiro
  2013-04-09 22:43         ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-09 21:27 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins, Andi Kleen,
	Hillf Danton, Michal Hocko, LKML

>> numa_node_id() is really silly. This might lead to allocation from the node being offlined.
>
> Right, it should've been alloc_huge_page().
>
>> And offline_pages() should mark the hstate as isolated, like normal pages, to prohibit
>> new allocation there first.
>
> It seems that alloc_migrate_target() calls alloc_page() for normal pages,
> and the destination pages can be in the same node as the source pages
> (new page allocations from the same memblock are prohibited.)

No, it can't. Memory hotplug changes the buddy attribute to MIGRATE_ISOLATE first,
so alloc_page() never allocates from the source node. However, huge pages don't use
the buddy allocator, so we need another trick.


> So if we want to avoid new page allocation from the same node,
> this is a problem for both normal and huge pages.
>
> BTW, is it correct to think that all users of memory hotplug assume
> that they want to hotplug a whole node (not just part of it)?

Both are valid use cases. An admin can offline a part of memory to isolate a
broken memory range.

But I'm sure almost all users want to remove a whole node.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-04-09 21:27       ` KOSAKI Motohiro
@ 2013-04-09 22:43         ` Naoya Horiguchi
  2013-04-10  1:56           ` KOSAKI Motohiro
  0 siblings, 1 reply; 66+ messages in thread
From: Naoya Horiguchi @ 2013-04-09 22:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins, Andi Kleen,
	Hillf Danton, Michal Hocko, LKML

On Tue, Apr 09, 2013 at 05:27:44PM -0400, KOSAKI Motohiro wrote:
> >> numa_node_id() is really silly. This might lead to allocation from the node being offlined.
> >
> > Right, it should've been alloc_huge_page().
> >
> >> And offline_pages() should mark the hstate as isolated, like normal pages, to prohibit
> >> new allocation there first.
> >
> > It seems that alloc_migrate_target() calls alloc_page() for normal pages,
> > and the destination pages can be in the same node as the source pages
> > (new page allocations from the same memblock are prohibited.)
> 
> No, it can't. Memory hotplug changes the buddy attribute to MIGRATE_ISOLATE first,
> so alloc_page() never allocates from the source node. However, huge pages don't use
> the buddy allocator, so we need another trick.

MIGRATE_ISOLATE is changed only within the range [start_pfn, end_pfn)
given as the argument of __offline_pages() (see also start_isolate_page_range()),
so it's set only for pages within the single memblock to be offlined.

BTW, in a previous discussion I already agreed to check the migrate type
in the hugepage allocation code (maybe it will be in dequeue_huge_page_vma()),
so what you are concerned about should be solved in the next post.
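
A minimal sketch of that check in the dequeue path, assuming
is_migrate_isolate_page() from <linux/page-isolation.h> can be used there:

	/* in dequeue_huge_page_node(): skip hugepages in isolated pageblocks */
	struct page *page = NULL;
	struct page *p;

	list_for_each_entry(p, &h->hugepage_freelists[nid], lru) {
		if (!is_migrate_isolate_page(p)) {
			page = p;
			break;
		}
	}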

> 
> > So if we want to avoid new page allocation from the same node,
> > this is a problem for both normal and huge pages.
> >
> > BTW, is it correct to think that all users of memory hotplug assume
> > that they want to hotplug a whole node (not just part of it)?
> 
> Both are valid use cases. An admin can offline a part of memory to isolate a
> broken memory range.
>
> But I'm sure almost all users want to remove a whole node.

OK. So I'll think about "allocation in the nearest neighbor node",
although it can be in a separate patch if it's hard to implement.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-04-09 22:43         ` Naoya Horiguchi
@ 2013-04-10  1:56           ` KOSAKI Motohiro
  2013-04-10  2:24             ` Naoya Horiguchi
  0 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2013-04-10  1:56 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins, Andi Kleen,
	Hillf Danton, Michal Hocko, LKML

On Tue, Apr 9, 2013 at 6:43 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> On Tue, Apr 09, 2013 at 05:27:44PM -0400, KOSAKI Motohiro wrote:
>> >> numa_node_id() is really silly. This might lead to allocation from the node being offlined.
>> >
>> > Right, it should've been alloc_huge_page().
>> >
>> >> And offline_pages() should mark the hstate as isolated, like normal pages, to prohibit
>> >> new allocation there first.
>> >
>> > It seems that alloc_migrate_target() calls alloc_page() for normal pages,
>> > and the destination pages can be in the same node as the source pages
>> > (new page allocations from the same memblock are prohibited.)
>>
>> No, it can't. Memory hotplug changes the buddy attribute to MIGRATE_ISOLATE first,
>> so alloc_page() never allocates from the source node. However, huge pages don't use
>> the buddy allocator, so we need another trick.
>
> MIGRATE_ISOLATE is changed only within the range [start_pfn, end_pfn)
> given as the argument of __offline_pages() (see also start_isolate_page_range()),
> so it's set only for pages within the single memblock to be offlined.

For partial memory hot-remove, that's correct behavior; a different
node is not required.


> BTW, in a previous discussion I already agreed to check the migrate type
> in the hugepage allocation code (maybe it will be in dequeue_huge_page_vma()),
> so what you are concerned about should be solved in the next post.

Umm.. Maybe I missed such discussion. Do you have a pointer?


>> > So if we want to avoid new page allocation from the same node,
>> > this is a problem for both normal and huge pages.
>> >
>> > BTW, is it correct to think that all users of memory hotplug assume
>> > that they want to hotplug a whole node (not just part of it)?
>>
>> Both are valid use cases. An admin can offline a part of memory to isolate a
>> broken memory range.
>>
>> But I'm sure almost all users want to remove a whole node.
>
> OK. So I'll think about "allocation in the nearest neighbor node",
> although it can be in a separate patch if it's hard to implement.

That's fine.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
  2013-04-10  1:56           ` KOSAKI Motohiro
@ 2013-04-10  2:24             ` Naoya Horiguchi
  0 siblings, 0 replies; 66+ messages in thread
From: Naoya Horiguchi @ 2013-04-10  2:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Andrew Morton, Mel Gorman, Hugh Dickins, Andi Kleen,
	Hillf Danton, Michal Hocko, LKML

On Tue, Apr 09, 2013 at 09:56:58PM -0400, KOSAKI Motohiro wrote:
> On Tue, Apr 9, 2013 at 6:43 PM, Naoya Horiguchi
...
> > MIGRATE_ISOLATE is changed only within the range [start_pfn, end_pfn)
> > given as the argument of __offline_pages() (see also start_isolate_page_range()),
> > so it's set only for pages within the single memblock to be offlined.
> 
> For partial memory hot-remove, that's correct behavior; a different
> node is not required.
> 
> > BTW, in a previous discussion I already agreed to check the migrate type
> > in the hugepage allocation code (maybe it will be in dequeue_huge_page_vma()),
> > so what you are concerned about should be solved in the next post.
> 
> Umm.. Maybe I missed such discussion. Do you have a pointer?

Please see the bottom of the following:
  http://thread.gmane.org/gmane.linux.kernel.mm/96665/focus=96920
It's not exactly the same, but we need to prevent allocation
from the memblock being hot-removed, not only for efficiency
but also to avoid the race.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2013-04-10  2:26 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-22 20:23 [PATCH v2 0/10] extend hugepage migration Naoya Horiguchi
2013-03-22 20:23 ` [PATCH 01/10] migrate: add migrate_entry_wait_huge() Naoya Horiguchi
2013-03-23 15:55   ` Rik van Riel
2013-03-25 10:13   ` Michal Hocko
2013-03-26  4:25     ` Naoya Horiguchi
2013-04-05 20:33   ` KOSAKI Motohiro
2013-04-08 20:00     ` Naoya Horiguchi
2013-04-05 20:33   ` KOSAKI Motohiro
2013-03-22 20:23 ` [PATCH 02/10] migrate: make core migration code aware of hugepage Naoya Horiguchi
2013-03-25 10:57   ` Michal Hocko
2013-03-26  4:33     ` Naoya Horiguchi
2013-03-26  8:49       ` Michal Hocko
2013-04-05 20:41         ` KOSAKI Motohiro
2013-03-22 20:23 ` [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page() Naoya Horiguchi
2013-03-25 12:31   ` Michal Hocko
2013-03-26  4:34     ` Naoya Horiguchi
2013-03-26  9:49       ` Michal Hocko
2013-03-26 20:35         ` Naoya Horiguchi
2013-03-27 13:00           ` Michal Hocko
2013-04-05 21:11             ` KOSAKI Motohiro
2013-03-26 11:29   ` Aneesh Kumar K.V
2013-03-27 13:52     ` Michal Hocko
2013-03-27 19:19       ` Naoya Horiguchi
2013-03-28  8:53         ` Michal Hocko
2013-03-29  5:26       ` Aneesh Kumar K.V
2013-03-29  9:36         ` Michal Hocko
2013-04-01  5:13       ` Aneesh Kumar K.V
2013-04-02  9:45         ` Michal Hocko
2013-03-22 20:23 ` [PATCH 04/10] migrate: clean up migrate_huge_page() Naoya Horiguchi
2013-04-05 21:13   ` KOSAKI Motohiro
2013-03-22 20:23 ` [PATCH 05/10] migrate: add hugepage migration code to migrate_pages() Naoya Horiguchi
2013-03-25 13:04   ` Michal Hocko
2013-03-26  5:13     ` Naoya Horiguchi
2013-03-26  8:55       ` Michal Hocko
2013-04-05 21:17   ` KOSAKI Motohiro
2013-04-08 20:21     ` Naoya Horiguchi
2013-03-22 20:23 ` [PATCH 06/10] migrate: add hugepage migration code to move_pages() Naoya Horiguchi
2013-03-25 13:36   ` Michal Hocko
2013-03-26  7:06     ` Naoya Horiguchi
2013-03-26 10:02       ` Michal Hocko
2013-03-26 20:37         ` Naoya Horiguchi
2013-03-22 20:23 ` [PATCH 07/10] mbind: add hugepage migration code to mbind() Naoya Horiguchi
2013-03-25 13:49   ` Michal Hocko
2013-04-05 22:23     ` KOSAKI Motohiro
2013-04-06  7:04       ` Michal Hocko
2013-04-05 22:18   ` KOSAKI Motohiro
2013-04-08 20:25     ` Naoya Horiguchi
2013-03-22 20:23 ` [PATCH 08/10] migrate: remove VM_HUGETLB from vma flag check in vma_migratable() Naoya Horiguchi
2013-03-22 20:23 ` [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage Naoya Horiguchi
2013-03-25 15:09   ` Michal Hocko
2013-03-26 18:23     ` Naoya Horiguchi
2013-03-27 14:19       ` Michal Hocko
2013-03-27 21:29         ` Naoya Horiguchi
2013-03-27 21:58           ` Naoya Horiguchi
2013-03-27 22:55           ` Michal Hocko
2013-03-26 12:01   ` Aneesh Kumar K.V
2013-03-27 19:28     ` Naoya Horiguchi
2013-04-06  0:13   ` KOSAKI Motohiro
2013-04-09 20:07     ` Naoya Horiguchi
2013-04-09 21:27       ` KOSAKI Motohiro
2013-04-09 22:43         ` Naoya Horiguchi
2013-04-10  1:56           ` KOSAKI Motohiro
2013-04-10  2:24             ` Naoya Horiguchi
2013-03-22 20:23 ` [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable Naoya Horiguchi
2013-03-25 15:12   ` Michal Hocko
2013-04-06  0:15     ` KOSAKI Motohiro
