* [PATCH 0/4] more sensible hugetlb migration for hotplug/CMA
@ 2017-06-08  7:45 Michal Hocko
  2017-06-08  7:45 ` [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-08  7:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Vlastimil Babka, Naoya Horiguchi, Xishi Qiu,
	zhong jiang, Joonsoo Kim, LKML

Hi,
I have received a bug report for hugetlb migration triggered by memory
hotplug on a distribution kernel, but the very same issue is still
present in the current upstream code. The bug is described in patch 2,
but in short, new_node_page doesn't try to consume preallocated hugetlb
pages from the pool on any node other than the next one, which is
really suboptimal. As a result, memory hotremove is very likely to fail
even though there are many hugetlb pages in the pool. I think it is
fair to call this a bug.

Patches 1 and 3 are cleanups and the last patch is still an RFC because
I am not sure we really need/want to go that way. The thing is that the
page allocator relies on zonelists to do the proper allocation fallback
wrt. NUMA distances. We do not have anything like that for hugetlb
allocations because they are not zone aware in general. Making them
fully zonelist (or alternatively nodelist) aware is quite a large
project, I guess. Instead, I admittedly went the path of least
resistance and provided a much simpler approach. More on that in
patch 4. If this doesn't seem good enough I will drop it from the
series, but to me it looks like a reasonable compromise code-wise.

Thoughts, ideas, objections?

Diffstat
 include/linux/hugetlb.h  |  3 +++
 include/linux/migrate.h  | 17 +++++++++++++++++
 include/linux/nodemask.h | 20 ++++++++++++++++++++
 mm/hugetlb.c             | 30 ++++++++++++++++++++++++++++++
 mm/memory_hotplug.c      | 25 ++++++-------------------
 mm/page_isolation.c      | 18 ++----------------
 6 files changed, 78 insertions(+), 35 deletions(-)

Shortlog
Michal Hocko (4):
      mm, memory_hotplug: simplify empty node mask handling in new_node_page
      hugetlb, memory_hotplug: prefer to use reserved pages for migration
      mm: unify new_node_page and alloc_migrate_target
      hugetlb: add support for preferred node to alloc_huge_page_nodemask


* [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page
  2017-06-08  7:45 [PATCH 0/4] more sensible hugetlb migration for hotplug/CMA Michal Hocko
@ 2017-06-08  7:45 ` Michal Hocko
  2017-06-08  8:15   ` Vlastimil Babka
  2017-06-08  7:45 ` [PATCH 2/4] hugetlb, memory_hotplug: prefer to use reserved pages for migration Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-06-08  7:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Vlastimil Babka, Naoya Horiguchi, Xishi Qiu,
	zhong jiang, Joonsoo Kim, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

new_node_page tries to allocate the target page on a different NUMA node
than the source page. This makes sense in most cases during hotplug
because we are likely to offline the whole NUMA node. But there are
cases where there are no other nodes to fall back to (e.g. when
offlining parts of the only existing node) and we have to fall back to
allocating from the source node. The current code does that, but it can
be simplified by checking the nmask and updating it before we even try
to allocate rather than special casing it.

This patch shouldn't introduce any functional change.
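
To illustrate the mask handling described above, here is a minimal
userspace sketch. It is not the kernel code: nodemask_model_t and
migration_target_mask are hypothetical names, and a 64-bit integer
stands in for nodemask_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n has memory. */
typedef uint64_t nodemask_model_t;

/*
 * Model of the mask preparation in new_node_page: drop the source node,
 * but put it back if that leaves no candidate node at all.
 */
static nodemask_model_t migration_target_mask(nodemask_model_t nodes_with_memory,
                                              unsigned int source_nid)
{
    nodemask_model_t mask = nodes_with_memory;

    assert(source_nid < 64);
    mask &= ~(UINT64_C(1) << source_nid);   /* node_clear(nid, nmask) */
    if (mask == 0)                          /* nodes_empty(nmask) */
        mask |= UINT64_C(1) << source_nid;  /* node_set(nid, nmask) */
    return mask;
}
```

With nodes 0-2 populated and node 1 as the source, the result contains
nodes 0 and 2 only. On a single-node system the source node is put back,
so the allocation can still succeed.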

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d61509752112..1ca373bdffbf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1432,7 +1432,15 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
 	int nid = page_to_nid(page);
 	nodemask_t nmask = node_states[N_MEMORY];
-	struct page *new_page = NULL;
+
+	/*
+	 * try to allocate from a different node but reuse this node if there
+	 * are no other online nodes to be used (e.g. we are offlining a part
+	 * of the only existing node)
+	 */
+	node_clear(nid, nmask);
+	if (nodes_empty(nmask))
+		node_set(nid, nmask);
 
 	/*
 	 * TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1443,18 +1451,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					next_node_in(nid, nmask));
 
-	node_clear(nid, nmask);
-
 	if (PageHighMem(page)
 	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
 		gfp_mask |= __GFP_HIGHMEM;
 
-	if (!nodes_empty(nmask))
-		new_page = __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask);
-	if (!new_page)
-		new_page = __alloc_pages(gfp_mask, 0, nid);
-
-	return new_page;
+	return __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask);
 }
 
 #define NR_OFFLINE_AT_ONCE_PAGES	(256)
-- 
2.11.0


* [PATCH 2/4] hugetlb, memory_hotplug: prefer to use reserved pages for migration
  2017-06-08  7:45 [PATCH 0/4] more sensible hugetlb migration for hotplug/CMA Michal Hocko
  2017-06-08  7:45 ` [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page Michal Hocko
@ 2017-06-08  7:45 ` Michal Hocko
  2017-06-08  8:22   ` Vlastimil Babka
  2017-06-08  7:45 ` [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target Michal Hocko
  2017-06-08  7:45 ` [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
  3 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-06-08  7:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Vlastimil Babka, Naoya Horiguchi, Xishi Qiu,
	zhong jiang, Joonsoo Kim, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

new_node_page will try to use the origin's next NUMA node as the
migration destination for hugetlb pages. If such a node doesn't have any
preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol to
allocate a surplus page instead. This is quite suboptimal for any
configuration where hugetlb pages are not distributed evenly across all
NUMA nodes. Say we have a hotpluggable node 4 and the spare hugetlb
pages are on node 0:
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
/sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

Now we consume the whole pool on node 4 and try to offline this
node. All the allocated pages should be moved to node0 which has enough
preallocated pages to hold them. With the current implementation
offlining very likely fails because hugetlb allocations during runtime
are much less reliable.

Fix this by reusing the nodemask which excludes the migration source:
first try to find a node which has a page in the preallocated pool, and
fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
consumed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 27 +++++++++++++++++++++++++++
 mm/memory_hotplug.c     |  9 ++-------
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index dbb118c566cd..c469191bb13b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -349,6 +349,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
+struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
@@ -524,6 +525,7 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
 #define alloc_huge_page_node(h, nid) NULL
+#define alloc_huge_page_nodemask(h, preferred_nid, nmask) NULL
 #define alloc_huge_page_noerr(v, a, r) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 761a669d0b62..01c11ceb47d6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1723,6 +1723,33 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
 	return page;
 }
 
+struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
+{
+	struct page *page = NULL;
+	int node;
+
+	spin_lock(&hugetlb_lock);
+	if (h->free_huge_pages - h->resv_huge_pages > 0) {
+		for_each_node_mask(node, *nmask) {
+			page = dequeue_huge_page_node_exact(h, node);
+			if (page)
+				break;
+		}
+	}
+	spin_unlock(&hugetlb_lock);
+	if (page)
+		return page;
+
+	/* No reservations, try to overcommit */
+	for_each_node_mask(node, *nmask) {
+		page = __alloc_buddy_huge_page_no_mpol(h, node);
+		if (page)
+			return page;
+	}
+
+	return NULL;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1ca373bdffbf..6e0d964ac561 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1442,14 +1442,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	if (nodes_empty(nmask))
 		node_set(nid, nmask);
 
-	/*
-	 * TODO: allocate a destination hugepage from a nearest neighbor node,
-	 * accordance with memory policy of the user process if possible. For
-	 * now as a simple work-around, we use the next node for destination.
-	 */
 	if (PageHuge(page))
-		return alloc_huge_page_node(page_hstate(compound_head(page)),
-					next_node_in(nid, nmask));
+		return alloc_huge_page_nodemask(
+				page_hstate(compound_head(page)), &nmask);
 
 	if (PageHighMem(page)
 	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
-- 
2.11.0


* [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target
  2017-06-08  7:45 [PATCH 0/4] more sensible hugetlb migration for hotplug/CMA Michal Hocko
  2017-06-08  7:45 ` [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page Michal Hocko
  2017-06-08  7:45 ` [PATCH 2/4] hugetlb, memory_hotplug: prefer to use reserved pages for migration Michal Hocko
@ 2017-06-08  7:45 ` Michal Hocko
  2017-06-08  8:36   ` Vlastimil Babka
  2017-06-08  7:45 ` [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
  3 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2017-06-08  7:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Vlastimil Babka, Naoya Horiguchi, Xishi Qiu,
	zhong jiang, Joonsoo Kim, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node
when mem-offline") has duplicated a large part of alloc_migrate_target
with some hotplug specific special casing. To be more precise, it tried
to enforce the allocation from a different node than the original page.
As a result the two functions diverged in their shared logic, e.g. the
hugetlb allocation strategy. Let's unify the two and express the
different NUMA requirements by the given nodemask. new_node_page will
simply exclude the node it doesn't care about and alloc_migrate_target
will use all the available nodes. alloc_migrate_target will then learn
to migrate hugetlb pages more sanely and use the preallocated pool when
possible.

Please note that alloc_migrate_target used to call alloc_page resp.
alloc_pages_current, so it followed the memory policy of the current
context. That is quite strange when we consider that it is used in the
context of alloc_contig_range, which just tries to migrate pages which
stand in the way.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/migrate.h | 17 +++++++++++++++++
 mm/memory_hotplug.c     | 11 +----------
 mm/page_isolation.c     | 18 ++----------------
 3 files changed, 20 insertions(+), 26 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 48e24844b3c5..f80c9882403a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -4,6 +4,7 @@
 #include <linux/mm.h>
 #include <linux/mempolicy.h>
 #include <linux/migrate_mode.h>
+#include <linux/hugetlb.h>
 
 typedef struct page *new_page_t(struct page *page, unsigned long private,
 				int **reason);
@@ -30,6 +31,22 @@ enum migrate_reason {
 /* In mm/debug.c; also keep sync with include/trace/events/migrate.h */
 extern char *migrate_reason_names[MR_TYPES];
 
+static inline struct page *new_page_nodemask(struct page *page, int preferred_nid,
+		nodemask_t *nodemask)
+{
+	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
+
+	if (PageHuge(page))
+		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
+				nodemask);
+
+	if (PageHighMem(page)
+	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
+		gfp_mask |= __GFP_HIGHMEM;
+
+	return __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
+}
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6e0d964ac561..d2f13f2f3ebf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1429,7 +1429,6 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
 static struct page *new_node_page(struct page *page, unsigned long private,
 		int **result)
 {
-	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
 	int nid = page_to_nid(page);
 	nodemask_t nmask = node_states[N_MEMORY];
 
@@ -1442,15 +1441,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	if (nodes_empty(nmask))
 		node_set(nid, nmask);
 
-	if (PageHuge(page))
-		return alloc_huge_page_nodemask(
-				page_hstate(compound_head(page)), &nmask);
-
-	if (PageHighMem(page)
-	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
-		gfp_mask |= __GFP_HIGHMEM;
-
-	return __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask);
+	return new_page_nodemask(page, nid, &nmask);
 }
 
 #define NR_OFFLINE_AT_ONCE_PAGES	(256)
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 3606104893e0..757410d9f758 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -8,6 +8,7 @@
 #include <linux/memory.h>
 #include <linux/hugetlb.h>
 #include <linux/page_owner.h>
+#include <linux/migrate.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -294,20 +295,5 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
-	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
-
-	/*
-	 * TODO: allocate a destination hugepage from a nearest neighbor node,
-	 * accordance with memory policy of the user process if possible. For
-	 * now as a simple work-around, we use the next node for destination.
-	 */
-	if (PageHuge(page))
-		return alloc_huge_page_node(page_hstate(compound_head(page)),
-					    next_node_in(page_to_nid(page),
-							 node_online_map));
-
-	if (PageHighMem(page))
-		gfp_mask |= __GFP_HIGHMEM;
-
-	return alloc_page(gfp_mask);
+	return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
 }
-- 
2.11.0


* [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-08  7:45 [PATCH 0/4] more sensible hugetlb migration for hotplug/CMA Michal Hocko
                   ` (2 preceding siblings ...)
  2017-06-08  7:45 ` [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target Michal Hocko
@ 2017-06-08  7:45 ` Michal Hocko
  2017-06-08  8:38   ` Vlastimil Babka
  2017-06-12 15:21   ` Michal Hocko
  3 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-08  7:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Vlastimil Babka, Naoya Horiguchi, Xishi Qiu,
	zhong jiang, Joonsoo Kim, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask. This might lead to filling up low NUMA nodes while
others are not used. We can reduce this risk by introducing a concept
of the preferred node similar to what we have in the regular page
allocator. We will start allocating from the preferred nid and then
iterate over all allowed nodes until we try them all. Introduce
for_each_node_mask_preferred helper which does the iteration and reuse
the available preferred node in new_page_nodemask which is currently
the only caller of alloc_huge_page_nodemask.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h  |  3 ++-
 include/linux/migrate.h  |  2 +-
 include/linux/nodemask.h | 20 ++++++++++++++++++++
 mm/hugetlb.c             |  9 ++++++---
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c469191bb13b..9831a4434dd7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -349,7 +349,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
+struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+				const nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f80c9882403a..af3ccf93efaa 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -38,7 +38,7 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
 
 	if (PageHuge(page))
 		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
-				nodemask);
+				preferred_nid, nodemask);
 
 	if (PageHighMem(page)
 	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index cf0b91c3ec12..797aa74392bc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -42,6 +42,8 @@
  * void nodes_shift_left(dst, src, n)	Shift left
  *
  * int first_node(mask)			Number lowest set bit, or MAX_NUMNODES
+ * int first_node_from(nid, mask)	First node starting from nid, or wrap
+ * 					from first or MAX_NUMNODES
  * int next_node(node, mask)		Next node past 'node', or MAX_NUMNODES
  * int next_node_in(node, mask)		Next node past 'node', or wrap to first,
  *					or MAX_NUMNODES
@@ -268,6 +270,15 @@ static inline int __next_node(int n, const nodemask_t *srcp)
 #define next_node_in(n, src) __next_node_in((n), &(src))
 int __next_node_in(int node, const nodemask_t *srcp);
 
+#define first_node_from(nid, mask) __first_node_from(nid, &(mask))
+static inline int __first_node_from(int nid, const nodemask_t *mask)
+{
+	if (test_bit(nid, mask->bits))
+		return nid;
+
+	return __next_node_in(nid, mask);
+}
+
 static inline void init_nodemask_of_node(nodemask_t *mask, int node)
 {
 	nodes_clear(*mask);
@@ -369,10 +380,19 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
 	for ((node) = first_node(mask);			\
 		(node) < MAX_NUMNODES;			\
 		(node) = next_node((node), (mask)))
+
+#define for_each_node_mask_preferred(node, iter, preferred, mask)	\
+	for ((node) = first_node_from((preferred), (mask)), iter = 0;	\
+		(iter) < nodes_weight((mask));				\
+		(node) = next_node_in((node), (mask)), (iter)++)
+
 #else /* MAX_NUMNODES == 1 */
 #define for_each_node_mask(node, mask)			\
 	if (!nodes_empty(mask))				\
 		for ((node) = 0; (node) < 1; (node)++)
+
+#define for_each_node_mask_preferred(node, iter, preferred, mask) \
+	for_each_node_mask(node, mask)
 #endif /* MAX_NUMNODES */
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 01c11ceb47d6..ebf5c9b890d5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1723,14 +1723,17 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
 	return page;
 }
 
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
+struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+		const nodemask_t *nmask)
 {
 	struct page *page = NULL;
+	int iter;
 	int node;
 
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
-		for_each_node_mask(node, *nmask) {
+		/* It would be nicer to iterate in the node distance order */
+		for_each_node_mask_preferred(node, iter, preferred_nid, *nmask) {
 			page = dequeue_huge_page_node_exact(h, node);
 			if (page)
 				break;
@@ -1741,7 +1744,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
 		return page;
 
 	/* No reservations, try to overcommit */
-	for_each_node_mask(node, *nmask) {
+	for_each_node_mask_preferred(node, iter, preferred_nid, *nmask) {
 		page = __alloc_buddy_huge_page_no_mpol(h, node);
 		if (page)
 			return page;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 30+ messages in thread
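
Editorial sketch: the visiting order that the commit message above describes (start at the preferred node, then walk every allowed node once, wrapping around) can be modelled in userspace C. This is only an illustration under stated assumptions, not the kernel implementation: a plain unsigned int stands in for nodemask_t, and MODEL_MAX_NODES and all model_* names are hypothetical.

```c
#include <assert.h>

#define MODEL_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/* Number of set bits in the mask, in the spirit of nodes_weight(). */
static int model_weight(unsigned int mask)
{
	int n = 0;

	for (int i = 0; i < MODEL_MAX_NODES; i++)
		n += (mask >> i) & 1u;
	return n;
}

/*
 * Next set bit strictly after 'node', wrapping around to the start.
 * Returns MODEL_MAX_NODES when the mask is empty.
 */
static int model_next_node_in(int node, unsigned int mask)
{
	for (int step = 1; step <= MODEL_MAX_NODES; step++) {
		int cand = (node + step) % MODEL_MAX_NODES;

		if ((mask >> cand) & 1u)
			return cand;
	}
	return MODEL_MAX_NODES;
}

/* 'nid' itself if it is in the mask, otherwise the next node after it. */
static int model_first_node_from(int nid, unsigned int mask)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	if ((mask >> nid) & 1u)
		return nid;
	return model_next_node_in(nid, mask);
}

/*
 * Record the order in which the allowed nodes would be tried.
 * 'out' must have room for MODEL_MAX_NODES entries; returns the count.
 */
static int model_preferred_order(int preferred, unsigned int mask, int *out)
{
	int count = model_weight(mask);
	int node = model_first_node_from(preferred, mask);

	for (int iter = 0; iter < count; iter++) {
		out[iter] = node;
		node = model_next_node_in(node, mask);
	}
	return count;
}
```

With allowed nodes {0, 1, 5, 6} (mask 0x63) and preferred node 4, the model visits 5, 6, 0, 1. The order follows node numbers, not NUMA distance, which matches the "It would be nicer to iterate in the node distance order" comment in the patch.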

* Re: [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page
  2017-06-08  7:45 ` [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page Michal Hocko
@ 2017-06-08  8:15   ` Vlastimil Babka
  0 siblings, 0 replies; 30+ messages in thread
From: Vlastimil Babka @ 2017-06-08  8:15 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML, Michal Hocko

On 06/08/2017 09:45 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> new_node_page tries to allocate the target page on a different NUMA node
> than the source page. This makes sense in most cases during the hotplug
> because we are likely to offline the whole numa node. But there are
> cases where there are no other nodes to fall back to (e.g. when offlining
> parts of the only existing node) and we have to fall back to allocating
> from the source node. The current code does that but it can be
> simplified by checking the nmask and updating it before we even try to
> allocate rather than special casing it.
> 
> This patch shouldn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 30+ messages in thread
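
Editorial sketch: the simplification described in the quoted commit message (compute the allowed node mask up front and put the source node back if nothing else is left) can be modelled as below. This is a minimal userspace illustration that assumes a plain unsigned int bitmask in place of nodemask_t; model_target_mask is a hypothetical name, not a kernel function.

```c
#include <assert.h>

/*
 * Candidate target nodes for migration: every node with memory except
 * the source node. If that leaves no node at all (e.g. the source is
 * the only node with memory), fall back to the source node itself.
 */
static unsigned int model_target_mask(unsigned int nodes_with_memory,
				      int source_nid)
{
	unsigned int nmask;

	assert(source_nid >= 0 && source_nid < 32);

	nmask = nodes_with_memory & ~(1u << source_nid);
	if (nmask == 0)
		nmask = 1u << source_nid;
	return nmask;
}
```

The allocation call that follows then needs no special case: it always receives a non-empty mask.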

* Re: [PATCH 2/4] hugetlb, memory_hotplug: prefer to use reserved pages for migration
  2017-06-08  7:45 ` [PATCH 2/4] hugetlb, memory_hotplug: prefer to use reserved pages for migration Michal Hocko
@ 2017-06-08  8:22   ` Vlastimil Babka
  0 siblings, 0 replies; 30+ messages in thread
From: Vlastimil Babka @ 2017-06-08  8:22 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML, Michal Hocko

On 06/08/2017 09:45 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> new_node_page will try to use the origin's next NUMA node as the
> migration destination for hugetlb pages. If such a node doesn't have any
> preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol to
> allocate a surplus page instead. This is quite suboptimal for any
> configuration when hugetlb pages are not distributed to all NUMA nodes
> evenly. Say we have a hotpluggable node 4 and spare hugetlb pages are
> on node 0
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
> /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
> /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
> 
> Now we consume the whole pool on node 4 and try to offline this
> node. All the allocated pages should be moved to node0 which has enough
> preallocated pages to hold them. With the current implementation
> offlining very likely fails because hugetlb allocations during runtime
> are much less reliable.
> 
> Fix this by reusing the nodemask which excludes the migration source:
> first try to find a node which has a page in the preallocated pool and
> fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
> consumed.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 30+ messages in thread
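
Editorial sketch: the two-pass strategy described in the quoted commit message (look for a preallocated page on any allowed node first, and attempt a fresh surplus allocation only afterwards) can be modelled in userspace C as follows. The struct, its fields, and model_alloc_target are hypothetical, and the model deliberately ignores reservations and locking, which the real code has to handle.

```c
#include <assert.h>

#define MODEL_NODES 8	/* hypothetical number of NUMA nodes */

struct model_pool {
	int free_pages[MODEL_NODES];	/* preallocated, unused huge pages */
	int buddy_ok[MODEL_NODES];	/* nonzero: a fresh allocation succeeds */
};

/*
 * Pick a migration target among the nodes set in 'allowed'.
 * Returns the node the page came from, or -1 if nothing is available.
 * '*from_pool' is set to 1 for a pool page and 0 for a fresh allocation.
 */
static int model_alloc_target(struct model_pool *p, unsigned int allowed,
			      int *from_pool)
{
	/* Pass 1: consume the preallocated pool on any allowed node. */
	for (int nid = 0; nid < MODEL_NODES; nid++) {
		if (!((allowed >> nid) & 1u))
			continue;
		if (p->free_pages[nid] > 0) {
			p->free_pages[nid]--;
			*from_pool = 1;
			return nid;
		}
	}

	/* Pass 2: the pool is exhausted, try a fresh allocation. */
	for (int nid = 0; nid < MODEL_NODES; nid++) {
		if (!((allowed >> nid) & 1u))
			continue;
		if (p->buddy_ok[nid]) {
			*from_pool = 0;
			return nid;
		}
	}
	return -1;
}
```

In the scenario from the commit message, offlining node 4 gives an allowed mask of 0xEF. The model then takes pages from node 0's pool before it considers any fresh allocation, whereas a "next node only" strategy would look at node 5 and miss the pool entirely.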

* Re: [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target
  2017-06-08  7:45 ` [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target Michal Hocko
@ 2017-06-08  8:36   ` Vlastimil Babka
  2017-06-08  8:40     ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2017-06-08  8:36 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML, Michal Hocko

On 06/08/2017 09:45 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node
> when mem-offline") has duplicated a large part of alloc_migrate_target
> with some hotplug specific special casing. To be more precise it tried
> to enforce the allocation from a different node than the original page.
> As a result the two functions diverged in their shared logic, e.g. the
> hugetlb allocation strategy. Let's unify the two and express different
> NUMA requirements by the given nodemask. new_node_page will simply
> exclude the node it doesn't care about and alloc_migrate_target will
> use all the available nodes. alloc_migrate_target will then learn to
> migrate hugetlb pages more sanely and use preallocated pool when
> possible.
> 
> Please note that alloc_migrate_target used to call alloc_page resp.
> alloc_pages_current, so it applied the memory policy of the current
> context, which is quite strange when we consider that it is used in the
> context of alloc_contig_range, which just tries to migrate pages that
> stand in the way.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 3606104893e0..757410d9f758 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -8,6 +8,7 @@
>  #include <linux/memory.h>
>  #include <linux/hugetlb.h>
>  #include <linux/page_owner.h>
> +#include <linux/migrate.h>
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -294,20 +295,5 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
>  struct page *alloc_migrate_target(struct page *page, unsigned long private,
>  				  int **resultp)
>  {
> -	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
> -
> -	/*
> -	 * TODO: allocate a destination hugepage from a nearest neighbor node,
> -	 * accordance with memory policy of the user process if possible. For
> -	 * now as a simple work-around, we use the next node for destination.
> -	 */
> -	if (PageHuge(page))
> -		return alloc_huge_page_node(page_hstate(compound_head(page)),
> -					    next_node_in(page_to_nid(page),
> -							 node_online_map));
> -
> -	if (PageHighMem(page))
> -		gfp_mask |= __GFP_HIGHMEM;
> -
> -	return alloc_page(gfp_mask);
> +	return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);

This replaces the N_ONLINE (node_online_map) with N_MEMORY for huge
pages. Assuming that's OK.

>  }
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread
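
Editorial sketch: the review remark above notes that the hugetlb path switches from the online node map to the set of nodes with memory. The userspace model below illustrates why the two can differ, assuming a system where one online node has no memory. The bitmask representation and all model_* names are hypothetical and only approximate the kernel helpers named in the diff.

```c
#include <assert.h>

#define MODEL_NODES 8	/* hypothetical number of NUMA nodes */

/*
 * Next set bit strictly after 'node', wrapping around to the start.
 * Returns MODEL_NODES when the mask is empty.
 */
static int model_next_in(int node, unsigned int mask)
{
	for (int step = 1; step <= MODEL_NODES; step++) {
		int cand = (node + step) % MODEL_NODES;

		if ((mask >> cand) & 1u)
			return cand;
	}
	return MODEL_NODES;
}

/* Old hugetlb path: the next online node after the page's own node. */
static int model_old_hugetlb_target(int page_nid, unsigned int online)
{
	return model_next_in(page_nid, online);
}

/* New path: nonzero if 'nid' is among the nodes that have memory. */
static int model_new_allows(int nid, unsigned int with_memory)
{
	return (with_memory >> nid) & 1u;
}
```

With nodes 0 to 3 online (0x0F) and node 2 memoryless (nodes with memory 0x0B), the old rule picks node 2 for a page on node 1, while the new mask never offers node 2 as a target.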

* Re: [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-08  7:45 ` [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
@ 2017-06-08  8:38   ` Vlastimil Babka
  2017-06-12  9:06     ` Michal Hocko
  2017-06-12 15:21   ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2017-06-08  8:38 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML, Michal Hocko

On 06/08/2017 09:45 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> alloc_huge_page_nodemask tries to allocate from any numa node in the
> allowed node mask. This might lead to filling up low NUMA nodes while
> others are not used. We can reduce this risk by introducing a concept
> of the preferred node similar to what we have in the regular page
> allocator. We will start allocating from the preferred nid and then
> iterate over all allowed nodes until we try them all. Introduce
> for_each_node_mask_preferred helper which does the iteration and reuse
> the available preferred node in new_page_nodemask which is currently
> the only caller of alloc_huge_page_nodemask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

That's better, yeah. I don't think it would be too hard to use a
zonelist though. What do others think?

> ---
>  include/linux/hugetlb.h  |  3 ++-
>  include/linux/migrate.h  |  2 +-
>  include/linux/nodemask.h | 20 ++++++++++++++++++++
>  mm/hugetlb.c             |  9 ++++++---
>  4 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c469191bb13b..9831a4434dd7 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -349,7 +349,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
>  struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +				const nodemask_t *nmask);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>  			pgoff_t idx);
>  
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f80c9882403a..af3ccf93efaa 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -38,7 +38,7 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
>  
>  	if (PageHuge(page))
>  		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> -				nodemask);
> +				preferred_nid, nodemask);
>  
>  	if (PageHighMem(page)
>  	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index cf0b91c3ec12..797aa74392bc 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -42,6 +42,8 @@
>   * void nodes_shift_left(dst, src, n)	Shift left
>   *
>   * int first_node(mask)			Number lowest set bit, or MAX_NUMNODES
> + * int first_node_from(nid, mask)	First node starting from nid, or wrap
> + * 					from first or MAX_NUMNODES
>   * int next_node(node, mask)		Next node past 'node', or MAX_NUMNODES
>   * int next_node_in(node, mask)		Next node past 'node', or wrap to first,
>   *					or MAX_NUMNODES
> @@ -268,6 +270,15 @@ static inline int __next_node(int n, const nodemask_t *srcp)
>  #define next_node_in(n, src) __next_node_in((n), &(src))
>  int __next_node_in(int node, const nodemask_t *srcp);
>  
> +#define first_node_from(nid, mask) __first_node_from(nid, &(mask))
> +static inline int __first_node_from(int nid, const nodemask_t *mask)
> +{
> +	if (test_bit(nid, mask->bits))
> +		return nid;
> +
> +	return __next_node_in(nid, mask);
> +}
> +
>  static inline void init_nodemask_of_node(nodemask_t *mask, int node)
>  {
>  	nodes_clear(*mask);
> @@ -369,10 +380,19 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
>  	for ((node) = first_node(mask);			\
>  		(node) < MAX_NUMNODES;			\
>  		(node) = next_node((node), (mask)))
> +
> +#define for_each_node_mask_preferred(node, iter, preferred, mask)	\
> +	for ((node) = first_node_from((preferred), (mask)), iter = 0;	\
> +		(iter) < nodes_weight((mask));				\
> +		(node) = next_node_in((node), (mask)), (iter)++)
> +
>  #else /* MAX_NUMNODES == 1 */
>  #define for_each_node_mask(node, mask)			\
>  	if (!nodes_empty(mask))				\
>  		for ((node) = 0; (node) < 1; (node)++)
> +
> +#define for_each_node_mask_preferred(node, iter, preferred, mask) \
> +	for_each_node_mask(node, mask)
>  #endif /* MAX_NUMNODES */
>  
>  /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c11ceb47d6..ebf5c9b890d5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1723,14 +1723,17 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +		const nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
> +	int iter;
>  	int node;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> -		for_each_node_mask(node, *nmask) {
> +		/* It would be nicer to iterate in the node distance order */
> +		for_each_node_mask_preferred(node, iter, preferred_nid, *nmask) {
>  			page = dequeue_huge_page_node_exact(h, node);
>  			if (page)
>  				break;
> @@ -1741,7 +1744,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	for_each_node_mask(node, *nmask) {
> +	for_each_node_mask_preferred(node, iter, preferred_nid, *nmask) {
>  		page = __alloc_buddy_huge_page_no_mpol(h, node);
>  		if (page)
>  			return page;
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target
  2017-06-08  8:36   ` Vlastimil Babka
@ 2017-06-08  8:40     ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-08  8:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML

On Thu 08-06-17 10:36:13, Vlastimil Babka wrote:
> On 06/08/2017 09:45 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node
> > when mem-offline") has duplicated a large part of alloc_migrate_target
> > with some hotplug specific special casing. To be more precise it tried
> > to enforce the allocation from a different node than the original page.
> > As a result the two functions diverged in their shared logic, e.g. the
> > hugetlb allocation strategy. Let's unify the two and express different
> > NUMA requirements by the given nodemask. new_node_page will simply
> > exclude the node it doesn't care about and alloc_migrate_target will
> > use all the available nodes. alloc_migrate_target will then learn to
> > migrate hugetlb pages more sanely and use preallocated pool when
> > possible.
> > 
> > Please note that alloc_migrate_target used to call alloc_page resp.
> > alloc_pages_current, so it followed the memory policy of the current
> > context. That is quite strange when we consider that it is used in the
> > context of alloc_contig_range which just tries to migrate pages which
> > stand in the way.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 3606104893e0..757410d9f758 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/memory.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/migrate.h>
> >  #include "internal.h"
> >  
> >  #define CREATE_TRACE_POINTS
> > @@ -294,20 +295,5 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> >  struct page *alloc_migrate_target(struct page *page, unsigned long private,
> >  				  int **resultp)
> >  {
> > -	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
> > -
> > -	/*
> > -	 * TODO: allocate a destination hugepage from a nearest neighbor node,
> > -	 * accordance with memory policy of the user process if possible. For
> > -	 * now as a simple work-around, we use the next node for destination.
> > -	 */
> > -	if (PageHuge(page))
> > -		return alloc_huge_page_node(page_hstate(compound_head(page)),
> > -					    next_node_in(page_to_nid(page),
> > -							 node_online_map));
> > -
> > -	if (PageHighMem(page))
> > -		gfp_mask |= __GFP_HIGHMEM;
> > -
> > -	return alloc_page(gfp_mask);
> > +	return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
> 
> This replaces the N_ONLINE (node_online_map) with N_MEMORY for huge
> pages. Assuming that's OK.

Yes, this is what 231e97e2b8ec ("mem-hotplug: use nodes that contain
memory as mask in new_node_page()") fixed in new_node_page but did not
bother to fix in alloc_migrate_target. Another argument for removing the
code duplication. Thanks for pointing it out anyway!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-08  8:38   ` Vlastimil Babka
@ 2017-06-12  9:06     ` Michal Hocko
  2017-06-12 11:48       ` Michal Hocko
  2017-06-12 11:53       ` Vlastimil Babka
  0 siblings, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-12  9:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML

On Thu 08-06-17 10:38:06, Vlastimil Babka wrote:
> On 06/08/2017 09:45 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > alloc_huge_page_nodemask tries to allocate from any numa node in the
> > allowed node mask. This might lead to filling up low NUMA nodes while
> > others are not used. We can reduce this risk by introducing a concept
> > of the preferred node similar to what we have in the regular page
> > allocator. We will start allocating from the preferred nid and then
> > iterate over all allowed nodes until we try them all. Introduce
> > for_each_node_mask_preferred helper which does the iteration and reuse
> > the available preferred node in new_page_nodemask which is currently
> > the only caller of alloc_huge_page_nodemask.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> That's better, yeah. I don't think it would be too hard to use a
> zonelist though. What do others think?

OK, so I've given it a try. This is not tested yet but it doesn't look all
that bad. dequeue_huge_page_node will most probably see some cleanup on
top but I've kept it for simplicity for now.
---
From 597ab787ac081b57db13ce5576700163d0c1208c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 7 Jun 2017 10:31:59 +0200
Subject: [PATCH] hugetlb: add support for preferred node to
 alloc_huge_page_nodemask

alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask. This might lead to filling up low NUMA nodes while
others are not used. We can reduce this risk by introducing a concept
of the preferred node similar to what we have in the regular page
allocator. We will start allocating from the preferred nid and then
iterate over all allowed nodes in the zonelist order until we try them
all.

This is mimicking the page allocator logic except it operates on
per-node mempools. dequeue_huge_page_vma already does this so distill
the zonelist logic into a more generic dequeue_huge_page_nodemask
and use it in alloc_huge_page_nodemask.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h |   3 +-
 include/linux/migrate.h |   2 +-
 mm/hugetlb.c            | 111 +++++++++++++++++++++++++-----------------------
 3 files changed, 60 insertions(+), 56 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c469191bb13b..d4c33a8583be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -349,7 +349,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
+struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+				nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f80c9882403a..af3ccf93efaa 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -38,7 +38,7 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
 
 	if (PageHuge(page))
 		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
-				nodemask);
+				preferred_nid, nodemask);
 
 	if (PageHighMem(page)
 	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 01c11ceb47d6..bbb3a1a46c64 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -897,29 +897,58 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	return page;
 }
 
-static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+/* Movability of hugepages depends on migration support. */
+static inline gfp_t htlb_alloc_mask(struct hstate *h)
 {
-	struct page *page;
-	int node;
+	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
+		return GFP_HIGHUSER_MOVABLE;
+	else
+		return GFP_HIGHUSER;
+}
 
-	if (nid != NUMA_NO_NODE)
-		return dequeue_huge_page_node_exact(h, nid);
+static struct page *dequeue_huge_page_nodemask(struct hstate *h, int nid,
+		nodemask_t *nmask)
+{
+	unsigned int cpuset_mems_cookie;
+	struct zonelist *zonelist;
+	struct page *page = NULL;
+	struct zone *zone;
+	struct zoneref *z;
+	gfp_t gfp_mask;
+	int node = -1;
+
+	gfp_mask = htlb_alloc_mask(h);
+	zonelist = node_zonelist(nid, gfp_mask);
+
+retry_cpuset:
+	cpuset_mems_cookie = read_mems_allowed_begin();
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
+		if (!cpuset_zone_allowed(zone, gfp_mask))
+			continue;
+		/*
+		 * no need to ask again on the same node. Pool is node rather than
+		 * zone aware
+		 */
+		if (zone_to_nid(zone) == node)
+			continue;
+		node = zone_to_nid(zone);
 
-	for_each_online_node(node) {
 		page = dequeue_huge_page_node_exact(h, node);
 		if (page)
-			return page;
+			break;
 	}
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+		goto retry_cpuset;
+
 	return NULL;
 }
 
-/* Movability of hugepages depends on migration support. */
-static inline gfp_t htlb_alloc_mask(struct hstate *h)
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 {
-	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
-		return GFP_HIGHUSER_MOVABLE;
-	else
-		return GFP_HIGHUSER;
+	if (nid != NUMA_NO_NODE)
+		return dequeue_huge_page_node_exact(h, nid);
+
+	return dequeue_huge_page_nodemask(h, nid, NULL);
 }
 
 static struct page *dequeue_huge_page_vma(struct hstate *h,
@@ -927,15 +956,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				unsigned long address, int avoid_reserve,
 				long chg)
 {
-	struct page *page = NULL;
+	struct page *page;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
-	gfp_t gfp_mask;
 	int nid;
-	struct zonelist *zonelist;
-	struct zone *zone;
-	struct zoneref *z;
-	unsigned int cpuset_mems_cookie;
 
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
@@ -950,32 +974,14 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		goto err;
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
-	gfp_mask = htlb_alloc_mask(h);
-	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
-	zonelist = node_zonelist(nid, gfp_mask);
-
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-						MAX_NR_ZONES - 1, nodemask) {
-		if (cpuset_zone_allowed(zone, gfp_mask)) {
-			page = dequeue_huge_page_node(h, zone_to_nid(zone));
-			if (page) {
-				if (avoid_reserve)
-					break;
-				if (!vma_has_reserves(vma, chg))
-					break;
-
-				SetPagePrivate(page);
-				h->resv_huge_pages--;
-				break;
-			}
-		}
+	nid = huge_node(vma, address, htlb_alloc_mask(h), &mpol, &nodemask);
+	page = dequeue_huge_page_nodemask(h, nid, nodemask);
+	if (page && !(avoid_reserve || (!vma_has_reserves(vma, chg)))) {
+		SetPagePrivate(page);
+		h->resv_huge_pages--;
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
 	return page;
 
 err:
@@ -1723,29 +1729,26 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
 	return page;
 }
 
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
+struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+		nodemask_t *nmask)
 {
 	struct page *page = NULL;
-	int node;
 
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
-		for_each_node_mask(node, *nmask) {
-			page = dequeue_huge_page_node_exact(h, node);
-			if (page)
-				break;
-		}
+		page = dequeue_huge_page_nodemask(h, preferred_nid, nmask);
+		if (page)
+			goto unlock;
 	}
+unlock:
 	spin_unlock(&hugetlb_lock);
 	if (page)
 		return page;
 
 	/* No reservations, try to overcommit */
-	for_each_node_mask(node, *nmask) {
-		page = __alloc_buddy_huge_page_no_mpol(h, node);
-		if (page)
-			return page;
-	}
+	page = __alloc_buddy_huge_page_no_mpol(h, preferred_nid);
+	if (page)
+		return page;
 
 	return NULL;
 }
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs
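
The zonelist walk that the patch above factors out into
dequeue_huge_page_nodemask() can be sketched in user space. This is a
rough model under stated assumptions, not the kernel code: per-node pools
are plain counters, the zonelist is an array already sorted by distance
from the preferred node, the allowed nodes are a bitmask, and all
sketch_* names are hypothetical. The cpuset retry and hugetlb_lock are
left out. Note that in the draft as posted the loop breaks out with page
set and then reaches "return NULL"; the sketch returns the result
directly from the loop instead.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_NODES 4	/* hypothetical node count */

/* One zonelist entry: a zone identified by its node and zone index. */
struct sketch_zoneref {
	int nid;
	int zone_idx;
};

/* Take one page from the pool of exactly this node; -1 if that pool is empty. */
static int sketch_dequeue_node_exact(int *free_pages, int nid)
{
	if (free_pages[nid] == 0)
		return -1;
	free_pages[nid]--;
	return nid;
}

/*
 * Walk the zonelist in order, skip nodes outside 'allowed', and ask each
 * node's pool only once even when several of its zones are listed back
 * to back. Returns the node that supplied the page, or -1.
 */
static int sketch_dequeue_nodemask(int *free_pages,
				   const struct sketch_zoneref *zonelist,
				   size_t nr_zones, unsigned int allowed,
				   int *pools_asked)
{
	int last_nid = -1;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		int nid = zonelist[i].nid;

		assert(nid >= 0 && nid < SKETCH_NR_NODES);
		if (!(allowed & (1u << nid)))
			continue;
		/* The pool is node aware, not zone aware: do not ask twice. */
		if (nid == last_nid)
			continue;
		last_nid = nid;

		(*pools_asked)++;
		if (sketch_dequeue_node_exact(free_pages, nid) >= 0)
			return nid;	/* hand the page back to the caller */
	}
	return -1;
}
```

For a zonelist that lists two zones of node 1, three zones of node 0 and
one zone of node 2, with free pages only on node 2, the sketch asks three
pools (1, 0, 2) instead of six zones before it succeeds.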

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-12  9:06     ` Michal Hocko
@ 2017-06-12 11:48       ` Michal Hocko
  2017-06-12 11:53       ` Vlastimil Babka
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-12 11:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML

On Mon 12-06-17 11:06:56, Michal Hocko wrote:
[...]
> @@ -1723,29 +1729,26 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +		nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
> -	int node;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> -		for_each_node_mask(node, *nmask) {
> -			page = dequeue_huge_page_node_exact(h, node);
> -			if (page)
> -				break;
> -		}
> +		page = dequeue_huge_page_nodemask(h, preferred_nid, nmask);
> +		if (page)
> +			goto unlock;
>  	}
> +unlock:
>  	spin_unlock(&hugetlb_lock);
>  	if (page)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	for_each_node_mask(node, *nmask) {
> -		page = __alloc_buddy_huge_page_no_mpol(h, node);
> -		if (page)
> -			return page;
> -	}
> +	page = __alloc_buddy_huge_page_no_mpol(h, preferred_nid);
> +	if (page)
> +		return page;

I was too quick. The fallback allocation needs some more love. I am
working on this but it quickly gets quite hairy so let's see whether
this still can converge to something reasonable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-12  9:06     ` Michal Hocko
  2017-06-12 11:48       ` Michal Hocko
@ 2017-06-12 11:53       ` Vlastimil Babka
  2017-06-12 12:20         ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2017-06-12 11:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML

On 06/12/2017 11:06 AM, Michal Hocko wrote:
> On Thu 08-06-17 10:38:06, Vlastimil Babka wrote:
>> On 06/08/2017 09:45 AM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> alloc_huge_page_nodemask tries to allocate from any numa node in the
>>> allowed node mask. This might lead to filling up low NUMA nodes while
>>> others are not used. We can reduce this risk by introducing a concept
>>> of the preferred node similar to what we have in the regular page
>>> allocator. We will start allocating from the preferred nid and then
>>> iterate over all allowed nodes until we try them all. Introduce
>>> for_each_node_mask_preferred helper which does the iteration and reuse
>>> the available preferred node in new_page_nodemask which is currently
>>> the only caller of alloc_huge_page_nodemask.
>>>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> That's better, yeah. I don't think it would be too hard to use a
>> zonelist though. What do others think?
> 
> OK, so I've given it a try. This is not tested yet but it doesn't look all
> that bad. dequeue_huge_page_node will most probably see some cleanup on
> top but I've kept it for simplicity for now.
> ---
> From 597ab787ac081b57db13ce5576700163d0c1208c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 7 Jun 2017 10:31:59 +0200
> Subject: [PATCH] hugetlb: add support for preferred node to
>  alloc_huge_page_nodemask
> 
> alloc_huge_page_nodemask tries to allocate from any numa node in the
> allowed node mask. This might lead to filling up low NUMA nodes while
> others are not used. We can reduce this risk by introducing a concept
> of the preferred node similar to what we have in the regular page
> allocator. We will start allocating from the preferred nid and then
> iterate over all allowed nodes in the zonelist order until we try them
> all.
> 
> This is mimicking the page allocator logic except it operates on
> per-node mempools. dequeue_huge_page_vma already does this so distill
> the zonelist logic into a more generic dequeue_huge_page_nodemask
> and use it in alloc_huge_page_nodemask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/hugetlb.h |   3 +-
>  include/linux/migrate.h |   2 +-
>  mm/hugetlb.c            | 111 +++++++++++++++++++++++++-----------------------
>  3 files changed, 60 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c469191bb13b..d4c33a8583be 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -349,7 +349,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
>  struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +				nodemask_t *nmask);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>  			pgoff_t idx);
>  
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f80c9882403a..af3ccf93efaa 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -38,7 +38,7 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
>  
>  	if (PageHuge(page))
>  		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> -				nodemask);
> +				preferred_nid, nodemask);
>  
>  	if (PageHighMem(page)
>  	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c11ceb47d6..bbb3a1a46c64 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -897,29 +897,58 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> +/* Movability of hugepages depends on migration support. */
> +static inline gfp_t htlb_alloc_mask(struct hstate *h)
>  {
> -	struct page *page;
> -	int node;
> +	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
> +		return GFP_HIGHUSER_MOVABLE;
> +	else
> +		return GFP_HIGHUSER;
> +}
>  
> -	if (nid != NUMA_NO_NODE)
> -		return dequeue_huge_page_node_exact(h, nid);
> +static struct page *dequeue_huge_page_nodemask(struct hstate *h, int nid,
> +		nodemask_t *nmask)
> +{
> +	unsigned int cpuset_mems_cookie;
> +	struct zonelist *zonelist;
> +	struct page *page = NULL;
> +	struct zone *zone;
> +	struct zoneref *z;
> +	gfp_t gfp_mask;
> +	int node = -1;
> +
> +	gfp_mask = htlb_alloc_mask(h);
> +	zonelist = node_zonelist(nid, gfp_mask);
> +
> +retry_cpuset:
> +	cpuset_mems_cookie = read_mems_allowed_begin();
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
> +		if (!cpuset_zone_allowed(zone, gfp_mask))
> +			continue;
> +		/*
> +		 * no need to ask again on the same node. Pool is node rather than
> +		 * zone aware
> +		 */
> +		if (zone_to_nid(zone) == node)
> +			continue;
> +		node = zone_to_nid(zone);
>  
> -	for_each_online_node(node) {
>  		page = dequeue_huge_page_node_exact(h, node);
>  		if (page)
> -			return page;
> +			break;
>  	}
> +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> +		goto retry_cpuset;
> +
>  	return NULL;
>  }
>  
> -/* Movability of hugepages depends on migration support. */
> -static inline gfp_t htlb_alloc_mask(struct hstate *h)
> +static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
>  {
> -	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
> -		return GFP_HIGHUSER_MOVABLE;
> -	else
> -		return GFP_HIGHUSER;
> +	if (nid != NUMA_NO_NODE)
> +		return dequeue_huge_page_node_exact(h, nid);
> +
> +	return dequeue_huge_page_nodemask(h, nid, NULL);

This with nid == NUMA_NO_NODE will break at node_zonelist(nid,
gfp_mask); in dequeue_huge_page_nodemask(). I guess just use the local
node as preferred.

>  }
>  
>  static struct page *dequeue_huge_page_vma(struct hstate *h,
> @@ -927,15 +956,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  				unsigned long address, int avoid_reserve,
>  				long chg)
>  {
> -	struct page *page = NULL;
> +	struct page *page;
>  	struct mempolicy *mpol;
>  	nodemask_t *nodemask;
> -	gfp_t gfp_mask;
>  	int nid;
> -	struct zonelist *zonelist;
> -	struct zone *zone;
> -	struct zoneref *z;
> -	unsigned int cpuset_mems_cookie;
>  
>  	/*
>  	 * A child process with MAP_PRIVATE mappings created by their parent
> @@ -950,32 +974,14 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
>  		goto err;
>  
> -retry_cpuset:
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> -	gfp_mask = htlb_alloc_mask(h);
> -	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> -	zonelist = node_zonelist(nid, gfp_mask);
> -
> -	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> -						MAX_NR_ZONES - 1, nodemask) {
> -		if (cpuset_zone_allowed(zone, gfp_mask)) {
> -			page = dequeue_huge_page_node(h, zone_to_nid(zone));
> -			if (page) {
> -				if (avoid_reserve)
> -					break;
> -				if (!vma_has_reserves(vma, chg))
> -					break;
> -
> -				SetPagePrivate(page);
> -				h->resv_huge_pages--;
> -				break;
> -			}
> -		}
> +	nid = huge_node(vma, address, htlb_alloc_mask(h), &mpol, &nodemask);
> +	page = dequeue_huge_page_nodemask(h, nid, nodemask);
> +	if (page && !(avoid_reserve || (!vma_has_reserves(vma, chg)))) {

Ugh that's hard to parse.
What about: if (page && !avoid_reserve && vma_has_reserves(...)) ?

> +		SetPagePrivate(page);
> +		h->resv_huge_pages--;
>  	}
>  
>  	mpol_cond_put(mpol);
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> -		goto retry_cpuset;
>  	return page;
>  
>  err:
> @@ -1723,29 +1729,26 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +		nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
> -	int node;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> -		for_each_node_mask(node, *nmask) {
> -			page = dequeue_huge_page_node_exact(h, node);
> -			if (page)
> -				break;
> -		}
> +		page = dequeue_huge_page_nodemask(h, preferred_nid, nmask);
> +		if (page)
> +			goto unlock;
>  	}
> +unlock:
>  	spin_unlock(&hugetlb_lock);
>  	if (page)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	for_each_node_mask(node, *nmask) {
> -		page = __alloc_buddy_huge_page_no_mpol(h, node);
> -		if (page)
> -			return page;
> -	}
> +	page = __alloc_buddy_huge_page_no_mpol(h, preferred_nid);
> +	if (page)
> +		return page;
>  
>  	return NULL;
>  }
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-12 11:53       ` Vlastimil Babka
@ 2017-06-12 12:20         ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-12 12:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Naoya Horiguchi, Xishi Qiu, zhong jiang,
	Joonsoo Kim, LKML

On Mon 12-06-17 13:53:51, Vlastimil Babka wrote:
> On 06/12/2017 11:06 AM, Michal Hocko wrote:
[...]
> > -/* Movability of hugepages depends on migration support. */
> > -static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > +static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> >  {
> > -	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
> > -		return GFP_HIGHUSER_MOVABLE;
> > -	else
> > -		return GFP_HIGHUSER;
> > +	if (nid != NUMA_NO_NODE)
> > +		return dequeue_huge_page_node_exact(h, nid);
> > +
> > +	return dequeue_huge_page_nodemask(h, nid, NULL);
> 
> This with nid == NUMA_NO_NODE will break at node_zonelist(nid,
> gfp_mask); in dequeue_huge_page_nodemask(). I guess just use the local
> node as preferred.

You are right. Anyway I have a patch to remove this helper altogether.

> > -retry_cpuset:
> > -	cpuset_mems_cookie = read_mems_allowed_begin();
> > -	gfp_mask = htlb_alloc_mask(h);
> > -	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> > -	zonelist = node_zonelist(nid, gfp_mask);
> > -
> > -	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > -						MAX_NR_ZONES - 1, nodemask) {
> > -		if (cpuset_zone_allowed(zone, gfp_mask)) {
> > -			page = dequeue_huge_page_node(h, zone_to_nid(zone));
> > -			if (page) {
> > -				if (avoid_reserve)
> > -					break;
> > -				if (!vma_has_reserves(vma, chg))
> > -					break;
> > -
> > -				SetPagePrivate(page);
> > -				h->resv_huge_pages--;
> > -				break;
> > -			}
> > -		}
> > +	nid = huge_node(vma, address, htlb_alloc_mask(h), &mpol, &nodemask);
> > +	page = dequeue_huge_page_nodemask(h, nid, nodemask);
> > +	if (page && !(avoid_reserve || (!vma_has_reserves(vma, chg)))) {
> 
> Ugh that's hard to parse.
> What about: if (page && !avoid_reserve && vma_has_reserves(...)) ?

Yeah, I have just translated the two breaks into a single condition
without scratching my head too much. If you think that this face of De Morgan
is nicer I can use it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-08  7:45 ` [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
  2017-06-08  8:38   ` Vlastimil Babka
@ 2017-06-12 15:21   ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2017-06-12 15:21 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Vlastimil Babka, Naoya Horiguchi, Xishi Qiu,
	zhong jiang, Joonsoo Kim, LKML

JFTR, I am dropping this patch and will follow up with a series which
will make the hugetlb allocation reflect node/zone ordering. It doesn't
make much sense to wait for those with this series because it doesn't
depend on it.

I have some preliminary work but I would like to give it a few days before
I post it.

On Thu 08-06-17 09:45:53, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> alloc_huge_page_nodemask tries to allocate from any numa node in the
> allowed node mask. This might lead to filling up low NUMA nodes while
> others are not used. We can reduce this risk by introducing a concept
> of the preferred node similar to what we have in the regular page
> allocator. We will start allocating from the preferred nid and then
> iterate over all allowed nodes until we try them all. Introduce
> for_each_node_mask_preferred helper which does the iteration and reuse
> the available preferred node in new_page_nodemask which is currently
> the only caller of alloc_huge_page_nodemask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/hugetlb.h  |  3 ++-
>  include/linux/migrate.h  |  2 +-
>  include/linux/nodemask.h | 20 ++++++++++++++++++++
>  mm/hugetlb.c             |  9 ++++++---
>  4 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c469191bb13b..9831a4434dd7 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -349,7 +349,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  struct page *alloc_huge_page_node(struct hstate *h, int nid);
>  struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +				const nodemask_t *nmask);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>  			pgoff_t idx);
>  
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f80c9882403a..af3ccf93efaa 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -38,7 +38,7 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
>  
>  	if (PageHuge(page))
>  		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> -				nodemask);
> +				preferred_nid, nodemask);
>  
>  	if (PageHighMem(page)
>  	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index cf0b91c3ec12..797aa74392bc 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -42,6 +42,8 @@
>   * void nodes_shift_left(dst, src, n)	Shift left
>   *
>   * int first_node(mask)			Number lowest set bit, or MAX_NUMNODES
> + * int first_node_from(nid, mask)	First node starting from nid, or wrap
> + * 					from first or MAX_NUMNODES
>   * int next_node(node, mask)		Next node past 'node', or MAX_NUMNODES
>   * int next_node_in(node, mask)		Next node past 'node', or wrap to first,
>   *					or MAX_NUMNODES
> @@ -268,6 +270,15 @@ static inline int __next_node(int n, const nodemask_t *srcp)
>  #define next_node_in(n, src) __next_node_in((n), &(src))
>  int __next_node_in(int node, const nodemask_t *srcp);
>  
> +#define first_node_from(nid, mask) __first_node_from(nid, &(mask))
> +static inline int __first_node_from(int nid, const nodemask_t *mask)
> +{
> +	if (test_bit(nid, mask->bits))
> +		return nid;
> +
> +	return __next_node_in(nid, mask);
> +}
> +
>  static inline void init_nodemask_of_node(nodemask_t *mask, int node)
>  {
>  	nodes_clear(*mask);
> @@ -369,10 +380,19 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
>  	for ((node) = first_node(mask);			\
>  		(node) < MAX_NUMNODES;			\
>  		(node) = next_node((node), (mask)))
> +
> +#define for_each_node_mask_preferred(node, iter, preferred, mask)	\
> +	for ((node) = first_node_from((preferred), (mask)), iter = 0;	\
> +		(iter) < nodes_weight((mask));				\
> +		(node) = next_node_in((node), (mask)), (iter)++)
> +
>  #else /* MAX_NUMNODES == 1 */
>  #define for_each_node_mask(node, mask)			\
>  	if (!nodes_empty(mask))				\
>  		for ((node) = 0; (node) < 1; (node)++)
> +
> +#define for_each_node_mask_preferred(node, iter, preferred, mask) \
> +	for_each_node_mask(node, mask)
>  #endif /* MAX_NUMNODES */
>  
>  /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 01c11ceb47d6..ebf5c9b890d5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1723,14 +1723,17 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +		const nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
> +	int iter;
>  	int node;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> -		for_each_node_mask(node, *nmask) {
> +		/* It would be nicer to iterate in the node distance order */
> +		for_each_node_mask_preferred(node, iter, preferred_nid, *nmask) {
>  			page = dequeue_huge_page_node_exact(h, node);
>  			if (page)
>  				break;
> @@ -1741,7 +1744,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	for_each_node_mask(node, *nmask) {
> +	for_each_node_mask_preferred(node, iter, preferred_nid, *nmask) {
>  		page = __alloc_buddy_huge_page_no_mpol(h, node);
>  		if (page)
>  			return page;
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-06-12 15:21 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-08  7:45 [PATCH 0/4] more sensible hugetlb migration for hotplug/CMA Michal Hocko
2017-06-08  7:45 ` [PATCH 1/4] mm, memory_hotplug: simplify empty node mask handling in new_node_page Michal Hocko
2017-06-08  8:15   ` Vlastimil Babka
2017-06-08  7:45 ` [PATCH 2/4] hugetlb, memory_hotplug: prefer to use reserved pages for migration Michal Hocko
2017-06-08  8:22   ` Vlastimil Babka
2017-06-08  7:45 ` [PATCH 3/4] mm: unify new_node_page and alloc_migrate_target Michal Hocko
2017-06-08  8:36   ` Vlastimil Babka
2017-06-08  8:40     ` Michal Hocko
2017-06-08  7:45 ` [RFC PATCH 4/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
2017-06-08  8:38   ` Vlastimil Babka
2017-06-12  9:06     ` Michal Hocko
2017-06-12 11:48       ` Michal Hocko
2017-06-12 11:53       ` Vlastimil Babka
2017-06-12 12:20         ` Michal Hocko
2017-06-12 15:21   ` Michal Hocko
