* [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue
@ 2017-06-13  9:00 Michal Hocko
  2017-06-13  9:00 ` [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers Michal Hocko
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-13  9:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML

Hi,
while working on a hugetlb migration issue addressed in a separate
patchset [1] I have noticed that hugetlb allocations from the
preallocated pool are quite suboptimal. There is no fallback mechanism
implemented and no notion of a preferred node. I have tried to work
around it in [2] but Vlastimil was right to push back for a more
robust solution. It seems that such a solution is to reuse the
zonelist approach we use for the page allocator.

This series has 4 patches. The first one makes the hugetlb allocation
layers clearer. The second one implements the zonelist based hugetlb
pool allocation and introduces preferred node semantics which are used
by the migration callbacks. The third and the last patches are pure
cleanups.

Note that this series depends on [1] (without the last patch which
is replaced by this work). You can find the whole series in
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git branch
attempts/hugetlb-zonelists

I am sending this as an RFC because I might be missing some subtle
dependencies which led to the original design.

Shortlog
Michal Hocko (4):
      mm, hugetlb: unclutter hugetlb allocation layers
      hugetlb: add support for preferred node to alloc_huge_page_nodemask
      mm, hugetlb: get rid of dequeue_huge_page_node
      mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration

And the diffstat looks promising as well

 include/linux/hugetlb.h |   3 +-
 include/linux/migrate.h |   2 +-
 mm/hugetlb.c            | 233 ++++++++++++++++--------------------------------
 mm/memory-failure.c     |  10 +--
 4 files changed, 82 insertions(+), 166 deletions(-)

[1] http://lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170608074553.22152-5-mhocko@kernel.org


* [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers
  2017-06-13  9:00 [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
@ 2017-06-13  9:00 ` Michal Hocko
  2017-06-14 13:18   ` Vlastimil Babka
  2017-06-13  9:00 ` [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-06-13  9:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Hugetlb allocation path for fresh huge pages is unnecessarily complex
and it mixes different interfaces between layers. __alloc_buddy_huge_page
is the central place to perform a new allocation. It checks for the
hugetlb overcommit and then relies on __hugetlb_alloc_buddy_huge_page to
invoke the page allocator. This is all good except that
__alloc_buddy_huge_page pushes vma and address down the callchain and
so __hugetlb_alloc_buddy_huge_page has to deal with two different
allocation modes - one for memory policy requests and another for node
specific (or, to make it more obscure, node non-specific) requests.
This just screams for a reorganization.

This patch pulls out all the vma specific handling up to
__alloc_buddy_huge_page_with_mpol where it belongs.
__alloc_buddy_huge_page will get nodemask argument and
__hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
page allocator.

In short:
__alloc_buddy_huge_page_with_mpol - memory policy handling
  __alloc_buddy_huge_page - overcommit handling and accounting
    __hugetlb_alloc_buddy_huge_page - page allocator layer
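
To make the intended split easier to picture, below is a minimal
user-space sketch of the three layers (plain C; ints and bitmasks stand
in for hstate accounting and nodemask_t, and the names only loosely
mirror the kernel functions - this is an illustration, not the kernel
code):

	#include <stdio.h>

	/* lowest layer: "page allocator" - report what would be asked for */
	static void *hugetlb_alloc_buddy(int nid, unsigned nmask)
	{
		printf("alloc huge page, preferred node %d, mask 0x%x\n",
		       nid, nmask);
		return (void *)1;	/* pretend the allocation succeeded */
	}

	/* middle layer: overcommit accounting around the raw allocation */
	static int surplus_huge_pages, nr_overcommit_huge_pages = 2;

	static void *alloc_buddy_huge_page(int nid, unsigned nmask)
	{
		if (surplus_huge_pages >= nr_overcommit_huge_pages)
			return NULL;	/* over the overcommit limit */
		surplus_huge_pages++;
		return hugetlb_alloc_buddy(nid, nmask);
	}

	/* top layer: resolve the "memory policy" into nid + mask, call down */
	static void *alloc_buddy_huge_page_with_mpol(void)
	{
		int nid = 1;		/* what huge_node() would return */
		unsigned nmask = 0x3;	/* allowed nodes from the policy */

		return alloc_buddy_huge_page(nid, nmask);
	}

	int main(void)
	{
		if (!alloc_buddy_huge_page_with_mpol())
			printf("allocation failed\n");
		return 0;
	}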

Also note that the cpuset retry loop in __hugetlb_alloc_buddy_huge_page
is not really needed because the page allocator already handles cpuset
updates.

Finally __hugetlb_alloc_buddy_huge_page had a special case for node
specific allocations (when no policy is applied and there is a node
given). This has relied on __GFP_THISNODE not to fall back to a different
node. alloc_huge_page_node is the only caller which relies on this
behavior. Keep it for now and emulate it by a proper nodemask.

Not only does this remove quite some code, it should also make those
layers easier to follow and clearer wrt responsibilities.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h |   2 +-
 mm/hugetlb.c            | 134 +++++++++++-------------------------------------
 2 files changed, 31 insertions(+), 105 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c469191bb13b..016831fcdca1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -349,7 +349,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
+struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 01c11ceb47d6..3d5f25d589b3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1531,82 +1531,18 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 	return rc;
 }
 
-/*
- * There are 3 ways this can get called:
- * 1. With vma+addr: we use the VMA's memory policy
- * 2. With !vma, but nid=NUMA_NO_NODE:  We try to allocate a huge
- *    page from any node, and let the buddy allocator itself figure
- *    it out.
- * 3. With !vma, but nid!=NUMA_NO_NODE.  We allocate a huge page
- *    strictly from 'nid'
- */
 static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
-		struct vm_area_struct *vma, unsigned long addr, int nid)
+		int nid, nodemask_t *nmask)
 {
 	int order = huge_page_order(h);
 	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
-	unsigned int cpuset_mems_cookie;
-
-	/*
-	 * We need a VMA to get a memory policy.  If we do not
-	 * have one, we use the 'nid' argument.
-	 *
-	 * The mempolicy stuff below has some non-inlined bits
-	 * and calls ->vm_ops.  That makes it hard to optimize at
-	 * compile-time, even when NUMA is off and it does
-	 * nothing.  This helps the compiler optimize it out.
-	 */
-	if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
-		/*
-		 * If a specific node is requested, make sure to
-		 * get memory from there, but only when a node
-		 * is explicitly specified.
-		 */
-		if (nid != NUMA_NO_NODE)
-			gfp |= __GFP_THISNODE;
-		/*
-		 * Make sure to call something that can handle
-		 * nid=NUMA_NO_NODE
-		 */
-		return alloc_pages_node(nid, gfp, order);
-	}
-
-	/*
-	 * OK, so we have a VMA.  Fetch the mempolicy and try to
-	 * allocate a huge page with it.  We will only reach this
-	 * when CONFIG_NUMA=y.
-	 */
-	do {
-		struct page *page;
-		struct mempolicy *mpol;
-		int nid;
-		nodemask_t *nodemask;
-
-		cpuset_mems_cookie = read_mems_allowed_begin();
-		nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
-		mpol_cond_put(mpol);
-		page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
-		if (page)
-			return page;
-	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 
-	return NULL;
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+	return __alloc_pages_nodemask(gfp, order, nid, nmask);
 }
 
-/*
- * There are two ways to allocate a huge page:
- * 1. When you have a VMA and an address (like a fault)
- * 2. When you have no VMA (like when setting /proc/.../nr_hugepages)
- *
- * 'vma' and 'addr' are only for (1).  'nid' is always NUMA_NO_NODE in
- * this case which signifies that the allocation should be done with
- * respect for the VMA's memory policy.
- *
- * For (2), we ignore 'vma' and 'addr' and use 'nid' exclusively. This
- * implies that memory policies will not be taken in to account.
- */
-static struct page *__alloc_buddy_huge_page(struct hstate *h,
-		struct vm_area_struct *vma, unsigned long addr, int nid)
+static struct page *__alloc_buddy_huge_page(struct hstate *h, int nid, nodemask_t *nmask)
 {
 	struct page *page;
 	unsigned int r_nid;
@@ -1615,15 +1551,6 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h,
 		return NULL;
 
 	/*
-	 * Make sure that anyone specifying 'nid' is not also specifying a VMA.
-	 * This makes sure the caller is picking _one_ of the modes with which
-	 * we can call this function, not both.
-	 */
-	if (vma || (addr != -1)) {
-		VM_WARN_ON_ONCE(addr == -1);
-		VM_WARN_ON_ONCE(nid != NUMA_NO_NODE);
-	}
-	/*
 	 * Assume we will successfully allocate the surplus page to
 	 * prevent racing processes from causing the surplus to exceed
 	 * overcommit
@@ -1656,7 +1583,7 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = __hugetlb_alloc_buddy_huge_page(h, vma, addr, nid);
+	page = __hugetlb_alloc_buddy_huge_page(h, nid, nmask);
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
@@ -1681,26 +1608,22 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h,
 }
 
 /*
- * Allocate a huge page from 'nid'.  Note, 'nid' may be
- * NUMA_NO_NODE, which means that it may be allocated
- * anywhere.
- */
-static
-struct page *__alloc_buddy_huge_page_no_mpol(struct hstate *h, int nid)
-{
-	unsigned long addr = -1;
-
-	return __alloc_buddy_huge_page(h, NULL, addr, nid);
-}
-
-/*
  * Use the VMA's mpolicy to allocate a huge page from the buddy.
  */
 static
 struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	return __alloc_buddy_huge_page(h, vma, addr, NUMA_NO_NODE);
+	struct page *page;
+	struct mempolicy *mpol;
+	int nid;
+	nodemask_t *nodemask;
+
+	nid = huge_node(vma, addr, htlb_alloc_mask(h), &mpol, &nodemask);
+	page = __alloc_buddy_huge_page(h, nid, nodemask);
+	mpol_cond_put(mpol);
+
+	return page;
 }
 
 /*
@@ -1717,13 +1640,22 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
 		page = dequeue_huge_page_node(h, nid);
 	spin_unlock(&hugetlb_lock);
 
-	if (!page)
-		page = __alloc_buddy_huge_page_no_mpol(h, nid);
+	if (!page) {
+		nodemask_t nmask;
+
+		if (nid != NUMA_NO_NODE) {
+			nmask = NODE_MASK_NONE;
+			node_set(nid, nmask);
+		} else {
+			nmask = node_states[N_MEMORY];
+		}
+		page = __alloc_buddy_huge_page(h, nid, &nmask);
+	}
 
 	return page;
 }
 
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
+struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
 {
 	struct page *page = NULL;
 	int node;
@@ -1741,13 +1673,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
 		return page;
 
 	/* No reservations, try to overcommit */
-	for_each_node_mask(node, *nmask) {
-		page = __alloc_buddy_huge_page_no_mpol(h, node);
-		if (page)
-			return page;
-	}
-
-	return NULL;
+	return __alloc_buddy_huge_page(h, NUMA_NO_NODE, nmask);
 }
 
 /*
@@ -1775,7 +1701,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE);
+		page = __alloc_buddy_huge_page(h, NUMA_NO_NODE, NULL);
 		if (!page) {
 			alloc_ok = false;
 			break;
-- 
2.11.0


* [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-13  9:00 [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
  2017-06-13  9:00 ` [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers Michal Hocko
@ 2017-06-13  9:00 ` Michal Hocko
  2017-06-14 16:17   ` Vlastimil Babka
  2017-06-14 22:12   ` Mike Kravetz
  2017-06-13  9:00 ` [RFC PATCH 3/4] mm, hugetlb: get rid of dequeue_huge_page_node Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-13  9:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask starting from lower numa nodes. This might lead to
filling up those low NUMA nodes while others are not used. We can reduce
this risk by introducing a concept of the preferred node similar to what
we have in the regular page allocator. We will start allocating from the
preferred nid and then iterate over all allowed nodes in the zonelist
order until we try them all.

This is mimicking the page allocator logic except it operates on
per-node mempools. dequeue_huge_page_vma already does this so distill
the zonelist logic into a more generic dequeue_huge_page_nodemask
and use it in alloc_huge_page_nodemask.
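
For illustration, here is a minimal user-space sketch of that fallback
(an array models the zonelist for the preferred node and plain counters
model the per-node pools - this is not the kernel code):

	#include <stdio.h>

	struct zone { int nid; };

	/* zonelist for preferred node 1: local zones first, then node 0 */
	static const struct zone zonelist[] = {
		{ .nid = 1 }, { .nid = 0 }, { .nid = 0 },
	};

	static int free_huge_pages[2] = { 1, 0 };	/* per-node pools */

	static int dequeue_nodemask(unsigned allowed_mask)
	{
		int node = -1;
		size_t i;

		for (i = 0; i < sizeof(zonelist) / sizeof(zonelist[0]); i++) {
			if (!(allowed_mask & (1u << zonelist[i].nid)))
				continue;
			/* no need to ask again on the same node */
			if (zonelist[i].nid == node)
				continue;
			node = zonelist[i].nid;

			if (free_huge_pages[node] > 0) {
				free_huge_pages[node]--;
				return node;	/* "dequeued" a page */
			}
		}
		return -1;
	}

	int main(void)
	{
		/* node 1 is preferred but empty, so fall back to node 0 */
		printf("dequeued from node %d\n", dequeue_nodemask(0x3));
		return 0;
	}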

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h |   3 +-
 include/linux/migrate.h |   2 +-
 mm/hugetlb.c            | 106 +++++++++++++++++++++++++-----------------------
 3 files changed, 59 insertions(+), 52 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 016831fcdca1..d4c33a8583be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -349,7 +349,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
-struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask);
+struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+				nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f80c9882403a..af3ccf93efaa 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -38,7 +38,7 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
 
 	if (PageHuge(page))
 		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
-				nodemask);
+				preferred_nid, nodemask);
 
 	if (PageHighMem(page)
 	    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d5f25d589b3..696de029f0fa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -897,29 +897,58 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	return page;
 }
 
-static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+/* Movability of hugepages depends on migration support. */
+static inline gfp_t htlb_alloc_mask(struct hstate *h)
 {
-	struct page *page;
-	int node;
+	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
+		return GFP_HIGHUSER_MOVABLE;
+	else
+		return GFP_HIGHUSER;
+}
 
-	if (nid != NUMA_NO_NODE)
-		return dequeue_huge_page_node_exact(h, nid);
+static struct page *dequeue_huge_page_nodemask(struct hstate *h, int nid,
+		nodemask_t *nmask)
+{
+	unsigned int cpuset_mems_cookie;
+	struct zonelist *zonelist;
+	struct page *page = NULL;
+	struct zone *zone;
+	struct zoneref *z;
+	gfp_t gfp_mask;
+	int node = -1;
+
+	gfp_mask = htlb_alloc_mask(h);
+	zonelist = node_zonelist(nid, gfp_mask);
+
+retry_cpuset:
+	cpuset_mems_cookie = read_mems_allowed_begin();
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
+		if (!cpuset_zone_allowed(zone, gfp_mask))
+			continue;
+		/*
+		 * no need to ask again on the same node. Pool is node rather than
+		 * zone aware
+		 */
+		if (zone_to_nid(zone) == node)
+			continue;
+		node = zone_to_nid(zone);
 
-	for_each_online_node(node) {
 		page = dequeue_huge_page_node_exact(h, node);
 		if (page)
-			return page;
+			break;
 	}
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+		goto retry_cpuset;
+
 	return NULL;
 }
 
-/* Movability of hugepages depends on migration support. */
-static inline gfp_t htlb_alloc_mask(struct hstate *h)
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 {
-	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
-		return GFP_HIGHUSER_MOVABLE;
-	else
-		return GFP_HIGHUSER;
+	if (nid != NUMA_NO_NODE)
+		return dequeue_huge_page_node_exact(h, nid);
+
+	return dequeue_huge_page_nodemask(h, nid, NULL);
 }
 
 static struct page *dequeue_huge_page_vma(struct hstate *h,
@@ -927,15 +956,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 				unsigned long address, int avoid_reserve,
 				long chg)
 {
-	struct page *page = NULL;
+	struct page *page;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
-	gfp_t gfp_mask;
 	int nid;
-	struct zonelist *zonelist;
-	struct zone *zone;
-	struct zoneref *z;
-	unsigned int cpuset_mems_cookie;
 
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
@@ -950,32 +974,14 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		goto err;
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
-	gfp_mask = htlb_alloc_mask(h);
-	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
-	zonelist = node_zonelist(nid, gfp_mask);
-
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-						MAX_NR_ZONES - 1, nodemask) {
-		if (cpuset_zone_allowed(zone, gfp_mask)) {
-			page = dequeue_huge_page_node(h, zone_to_nid(zone));
-			if (page) {
-				if (avoid_reserve)
-					break;
-				if (!vma_has_reserves(vma, chg))
-					break;
-
-				SetPagePrivate(page);
-				h->resv_huge_pages--;
-				break;
-			}
-		}
+	nid = huge_node(vma, address, htlb_alloc_mask(h), &mpol, &nodemask);
+	page = dequeue_huge_page_nodemask(h, nid, nodemask);
+	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
+		SetPagePrivate(page);
+		h->resv_huge_pages--;
 	}
 
 	mpol_cond_put(mpol);
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
 	return page;
 
 err:
@@ -1655,25 +1661,25 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
 	return page;
 }
 
-struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
+
+struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+		nodemask_t *nmask)
 {
 	struct page *page = NULL;
-	int node;
 
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
-		for_each_node_mask(node, *nmask) {
-			page = dequeue_huge_page_node_exact(h, node);
-			if (page)
-				break;
-		}
+		page = dequeue_huge_page_nodemask(h, preferred_nid, nmask);
+		if (page)
+			goto unlock;
 	}
+unlock:
 	spin_unlock(&hugetlb_lock);
 	if (page)
 		return page;
 
 	/* No reservations, try to overcommit */
-	return __alloc_buddy_huge_page(h, NUMA_NO_NODE, nmask);
+	return __alloc_buddy_huge_page(h, preferred_nid, nmask);
 }
 
 /*
-- 
2.11.0


* [RFC PATCH 3/4] mm, hugetlb: get rid of dequeue_huge_page_node
  2017-06-13  9:00 [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
  2017-06-13  9:00 ` [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers Michal Hocko
  2017-06-13  9:00 ` [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
@ 2017-06-13  9:00 ` Michal Hocko
  2017-06-13  9:00 ` [RFC PATCH 4/4] mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration Michal Hocko
  2017-06-16 11:44 ` [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
  4 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-13  9:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

dequeue_huge_page_node has a single caller alloc_huge_page_node and we
already have to handle NUMA_NO_NODE specially there. So get rid of the
helper and use the same numa mask trick for hugetlb dequeue as we use
for the allocation.
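
For reference, the nodemask trick amounts to this, shown with a plain
bitmask in a small user-space sketch (the all-nodes value is a made-up
stand-in for node_states[N_MEMORY]; illustration only, not the kernel
code):

	#include <stdio.h>

	#define NUMA_NO_NODE		(-1)
	#define ALL_MEMORY_NODES	0xfu	/* stand-in for node_states[N_MEMORY] */

	static unsigned build_nodemask(int nid)
	{
		/* a real nid: NODE_MASK_NONE + node_set(nid, mask) */
		if (nid != NUMA_NO_NODE)
			return 1u << nid;
		/* NUMA_NO_NODE: all nodes which have memory */
		return ALL_MEMORY_NODES;
	}

	int main(void)
	{
		printf("nid 2        -> mask 0x%x\n", build_nodemask(2));
		printf("NUMA_NO_NODE -> mask 0x%x\n", build_nodemask(NUMA_NO_NODE));
		return 0;
	}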

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/hugetlb.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 696de029f0fa..f58d6362c2c3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -943,14 +943,6 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, int nid,
 	return NULL;
 }
 
-static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
-{
-	if (nid != NUMA_NO_NODE)
-		return dequeue_huge_page_node_exact(h, nid);
-
-	return dequeue_huge_page_nodemask(h, nid, NULL);
-}
-
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve,
@@ -1640,23 +1632,22 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
 struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page = NULL;
+	nodemask_t nmask;
+
+	if (nid != NUMA_NO_NODE) {
+		nmask = NODE_MASK_NONE;
+		node_set(nid, nmask);
+	} else {
+		nmask = node_states[N_MEMORY];
+	}
 
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0)
-		page = dequeue_huge_page_node(h, nid);
+		page = dequeue_huge_page_nodemask(h, nid, &nmask);
 	spin_unlock(&hugetlb_lock);
 
-	if (!page) {
-		nodemask_t nmask;
-
-		if (nid != NUMA_NO_NODE) {
-			nmask = NODE_MASK_NONE;
-			node_set(nid, nmask);
-		} else {
-			nmask = node_states[N_MEMORY];
-		}
+	if (!page)
 		page = __alloc_buddy_huge_page(h, nid, &nmask);
-	}
 
 	return page;
 }
-- 
2.11.0


* [RFC PATCH 4/4] mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration
  2017-06-13  9:00 [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
                   ` (2 preceding siblings ...)
  2017-06-13  9:00 ` [RFC PATCH 3/4] mm, hugetlb: get rid of dequeue_huge_page_node Michal Hocko
@ 2017-06-13  9:00 ` Michal Hocko
  2017-06-14 16:22   ` Vlastimil Babka
  2017-06-16 11:44 ` [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
  4 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-06-13  9:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

new_page is yet another duplication of the migration callback which has
to handle hugetlb migration specially. We can safely use the generic
new_page_nodemask for the same purpose.

Please note that gigantic hugetlb pages do not need any special handling
because alloc_huge_page_nodemask will make sure to check pages in all
per node pools. The reason this was done previously was that
alloc_huge_page_node treated NUMA_NO_NODE and a specific node
differently and so alloc_huge_page_node(nid) would check on this
specific node.

Noticed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory-failure.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3615bffbd269..7040f60ecb71 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1487,16 +1487,8 @@ EXPORT_SYMBOL(unpoison_memory);
 static struct page *new_page(struct page *p, unsigned long private, int **x)
 {
 	int nid = page_to_nid(p);
-	if (PageHuge(p)) {
-		struct hstate *hstate = page_hstate(compound_head(p));
 
-		if (hstate_is_gigantic(hstate))
-			return alloc_huge_page_node(hstate, NUMA_NO_NODE);
-
-		return alloc_huge_page_node(hstate, nid);
-	} else {
-		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
-	}
+	return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
 }
 
 /*
-- 
2.11.0


* Re: [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers
  2017-06-13  9:00 ` [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers Michal Hocko
@ 2017-06-14 13:18   ` Vlastimil Babka
  2017-06-14 13:42     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2017-06-14 13:18 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML,
	Michal Hocko

On 06/13/2017 11:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Hugetlb allocation path for fresh huge pages is unnecessarily complex
> and it mixes different interfaces between layers. __alloc_buddy_huge_page
> is the central place to perform a new allocation. It checks for the
> hugetlb overcommit and then relies on __hugetlb_alloc_buddy_huge_page to
> invoke the page allocator. This is all good except that
> __alloc_buddy_huge_page pushes vma and address down the callchain and
> so __hugetlb_alloc_buddy_huge_page has to deal with two different
> allocation modes - one for memory policy and other node specific (or to
> make it more obscure node non-specific) requests. This just screams for a
> reorganization.
> 
> This patch pulls out all the vma specific handling up to
> __alloc_buddy_huge_page_with_mpol where it belongs.
> __alloc_buddy_huge_page will get nodemask argument and
> __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
> page allocator.
> 
> In short:
> __alloc_buddy_huge_page_with_mpol - memory policy handling
>   __alloc_buddy_huge_page - overcommit handling and accounting
>     __hugetlb_alloc_buddy_huge_page - page allocator layer
> 
> Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
> is not really needed because the page allocator already handles the
> cpusets update.
> 
> Finally __hugetlb_alloc_buddy_huge_page had a special case for node
> specific allocations (when no policy is applied and there is a node
> given). This has relied on __GFP_THISNODE to not fallback to a different
> node. alloc_huge_page_node is the only caller which relies on this
> behavior. Keep it for now and emulate it by a proper nodemask.
> 
> Not only this removes quite some code it also should make those layers
> easier to follow and clear wrt responsibilities.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/hugetlb.h |   2 +-
>  mm/hugetlb.c            | 134 +++++++++++-------------------------------------
>  2 files changed, 31 insertions(+), 105 deletions(-)

Very nice cleanup indeed!

> @@ -1717,13 +1640,22 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  		page = dequeue_huge_page_node(h, nid);
>  	spin_unlock(&hugetlb_lock);
>  
> -	if (!page)
> -		page = __alloc_buddy_huge_page_no_mpol(h, nid);
> +	if (!page) {
> +		nodemask_t nmask;
> +
> +		if (nid != NUMA_NO_NODE) {
> +			nmask = NODE_MASK_NONE;
> +			node_set(nid, nmask);

TBH I don't like this hack too much, and would rather see __GFP_THISNODE
involved, which picks a different (short) zonelist. Also it's allocating
nodemask on stack, which we generally avoid? Although the callers
currently seem to be shallow.

> +		} else {
> +			nmask = node_states[N_MEMORY];

If nothing, this case could pass NULL? Although that would lead to
uglier code too...

> +		}
> +		page = __alloc_buddy_huge_page(h, nid, &nmask);
> +	}
>  
>  	return page;
>  }
>  
> -struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
> +struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
>  	int node;
> @@ -1741,13 +1673,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	for_each_node_mask(node, *nmask) {
> -		page = __alloc_buddy_huge_page_no_mpol(h, node);
> -		if (page)
> -			return page;
> -	}
> -
> -	return NULL;
> +	return __alloc_buddy_huge_page(h, NUMA_NO_NODE, nmask);
>  }
>  
>  /*
> @@ -1775,7 +1701,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
>  retry:
>  	spin_unlock(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
> -		page = __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE);
> +		page = __alloc_buddy_huge_page(h, NUMA_NO_NODE, NULL);
>  		if (!page) {
>  			alloc_ok = false;
>  			break;
> 


* Re: [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers
  2017-06-14 13:18   ` Vlastimil Babka
@ 2017-06-14 13:42     ` Michal Hocko
  2017-06-14 14:04       ` Michal Hocko
  2017-06-14 15:06       ` Vlastimil Babka
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-14 13:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML

On Wed 14-06-17 15:18:26, Vlastimil Babka wrote:
> On 06/13/2017 11:00 AM, Michal Hocko wrote:
[...]
> > @@ -1717,13 +1640,22 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
> >  		page = dequeue_huge_page_node(h, nid);
> >  	spin_unlock(&hugetlb_lock);
> >  
> > -	if (!page)
> > -		page = __alloc_buddy_huge_page_no_mpol(h, nid);
> > +	if (!page) {
> > +		nodemask_t nmask;
> > +
> > +		if (nid != NUMA_NO_NODE) {
> > +			nmask = NODE_MASK_NONE;
> > +			node_set(nid, nmask);
> 
> TBH I don't like this hack too much, and would rather see __GFP_THISNODE
> involved, which picks a different (short) zonelist. Also it's allocating
> nodemask on stack, which we generally avoid? Although the callers
> currently seem to be shallow.

Fair enough. That would require pulling gfp mask handling up the call
chain. This on top of this patch + refreshes for other patches later in
the series as they will conflict now?
---
commit dcd863b48fb2c93e5aebce818e75c30978e26cf1
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Jun 14 15:41:07 2017 +0200

    fold me
    
    - pull gfp mask out of __hugetlb_alloc_buddy_huge_page and make it an
      explicit argument to allow __GFP_THISNODE in alloc_huge_page_node per
      Vlastimil

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d5f25d589b3..afc87de5de5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1532,17 +1532,18 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 }
 
 static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
-		int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask)
 {
 	int order = huge_page_order(h);
-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
 
+	gfp_mask |= __GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
-	return __alloc_pages_nodemask(gfp, order, nid, nmask);
+	return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
 }
 
-static struct page *__alloc_buddy_huge_page(struct hstate *h, int nid, nodemask_t *nmask)
+static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
+		int nid, nodemask_t *nmask)
 {
 	struct page *page;
 	unsigned int r_nid;
@@ -1583,7 +1584,7 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, int nid, nodemask_
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = __hugetlb_alloc_buddy_huge_page(h, nid, nmask);
+	page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
@@ -1616,11 +1617,12 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
 {
 	struct page *page;
 	struct mempolicy *mpol;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
 	int nid;
 	nodemask_t *nodemask;
 
-	nid = huge_node(vma, addr, htlb_alloc_mask(h), &mpol, &nodemask);
-	page = __alloc_buddy_huge_page(h, nid, nodemask);
+	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
+	page = __alloc_buddy_huge_page(h, gfp_mask, nid, nodemask);
 	mpol_cond_put(mpol);
 
 	return page;
@@ -1633,30 +1635,26 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
  */
 struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
+	gfp_t gfp_mask = htlb_alloc_mask(h);
 	struct page *page = NULL;
 
+	if (nid != NUMA_NO_NODE)
+		gfp_mask |= __GFP_THISNODE;
+
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0)
 		page = dequeue_huge_page_node(h, nid);
 	spin_unlock(&hugetlb_lock);
 
-	if (!page) {
-		nodemask_t nmask;
-
-		if (nid != NUMA_NO_NODE) {
-			nmask = NODE_MASK_NONE;
-			node_set(nid, nmask);
-		} else {
-			nmask = node_states[N_MEMORY];
-		}
-		page = __alloc_buddy_huge_page(h, nid, &nmask);
-	}
+	if (!page)
+		page = __alloc_buddy_huge_page(h, gfp_mask, nid, NULL);
 
 	return page;
 }
 
 struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
 {
+	gfp_t gfp_mask = htlb_alloc_mask(h);
 	struct page *page = NULL;
 	int node;
 
@@ -1673,7 +1671,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
 		return page;
 
 	/* No reservations, try to overcommit */
-	return __alloc_buddy_huge_page(h, NUMA_NO_NODE, nmask);
+	return __alloc_buddy_huge_page(h, gfp_mask, NUMA_NO_NODE, nmask);
 }
 
 /*
@@ -1701,7 +1699,8 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = __alloc_buddy_huge_page(h, NUMA_NO_NODE, NULL);
+		page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h),
+				NUMA_NO_NODE, NULL);
 		if (!page) {
 			alloc_ok = false;
 			break;
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers
  2017-06-14 13:42     ` Michal Hocko
@ 2017-06-14 14:04       ` Michal Hocko
  2017-06-14 15:06       ` Vlastimil Babka
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-14 14:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML

On Wed 14-06-17 15:42:58, Michal Hocko wrote:
> On Wed 14-06-17 15:18:26, Vlastimil Babka wrote:
> > On 06/13/2017 11:00 AM, Michal Hocko wrote:
> [...]
> > > @@ -1717,13 +1640,22 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
> > >  		page = dequeue_huge_page_node(h, nid);
> > >  	spin_unlock(&hugetlb_lock);
> > >  
> > > -	if (!page)
> > > -		page = __alloc_buddy_huge_page_no_mpol(h, nid);
> > > +	if (!page) {
> > > +		nodemask_t nmask;
> > > +
> > > +		if (nid != NUMA_NO_NODE) {
> > > +			nmask = NODE_MASK_NONE;
> > > +			node_set(nid, nmask);
> > 
> > TBH I don't like this hack too much, and would rather see __GFP_THISNODE
> > involved, which picks a different (short) zonelist. Also it's allocating
> > nodemask on stack, which we generally avoid? Although the callers
> > currently seem to be shallow.
> 
> Fair enough. That would require pulling gfp mask handling up the call
> chain. This on top of this patch + refreshes for other patches later in
> the series as they will conflict now?

I've rebased the attempts/hugetlb-zonelists branch for easier review.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers
  2017-06-14 13:42     ` Michal Hocko
  2017-06-14 14:04       ` Michal Hocko
@ 2017-06-14 15:06       ` Vlastimil Babka
  2017-06-14 15:28         ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2017-06-14 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML

On 06/14/2017 03:42 PM, Michal Hocko wrote:
> On Wed 14-06-17 15:18:26, Vlastimil Babka wrote:
>> On 06/13/2017 11:00 AM, Michal Hocko wrote:
> [...]
>>> @@ -1717,13 +1640,22 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>>>  		page = dequeue_huge_page_node(h, nid);
>>>  	spin_unlock(&hugetlb_lock);
>>>  
>>> -	if (!page)
>>> -		page = __alloc_buddy_huge_page_no_mpol(h, nid);
>>> +	if (!page) {
>>> +		nodemask_t nmask;
>>> +
>>> +		if (nid != NUMA_NO_NODE) {
>>> +			nmask = NODE_MASK_NONE;
>>> +			node_set(nid, nmask);
>>
>> TBH I don't like this hack too much, and would rather see __GFP_THISNODE
>> involved, which picks a different (short) zonelist. Also it's allocating
>> nodemask on stack, which we generally avoid? Although the callers
>> currently seem to be shallow.
> 
> Fair enough. That would require pulling gfp mask handling up the call
> chain. This on top of this patch + refreshes for other patches later in
> the series as they will conflict now?

For the orig patch + fold (squashed locally from your mmotm/... branch)

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Please update the commit description which still mentions the nodemask
emulation of __GFP_THISNODE.

Also I noticed that the goal of patch 2 is already partially achieved
here, because alloc_huge_page_nodemask() will now allocate using
zonelist. It won't dequeue that way yet, though.

> ---
> commit dcd863b48fb2c93e5aebce818e75c30978e26cf1
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Wed Jun 14 15:41:07 2017 +0200
> 
>     fold me
>     
>     - pull gfp mask out of __hugetlb_alloc_buddy_huge_page and make it an
>       explicit argument to allow __GFP_THISNODE in alloc_huge_page_node per
>       Vlastimil
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3d5f25d589b3..afc87de5de5c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1532,17 +1532,18 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>  }
>  
>  static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> -		int nid, nodemask_t *nmask)
> +		gfp_t gfp_mask, int nid, nodemask_t *nmask)
>  {
>  	int order = huge_page_order(h);
> -	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
>  
> +	gfp_mask |= __GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
>  	if (nid == NUMA_NO_NODE)
>  		nid = numa_mem_id();
> -	return __alloc_pages_nodemask(gfp, order, nid, nmask);
> +	return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
>  }
>  
> -static struct page *__alloc_buddy_huge_page(struct hstate *h, int nid, nodemask_t *nmask)
> +static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
> +		int nid, nodemask_t *nmask)
>  {
>  	struct page *page;
>  	unsigned int r_nid;
> @@ -1583,7 +1584,7 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, int nid, nodemask_
>  	}
>  	spin_unlock(&hugetlb_lock);
>  
> -	page = __hugetlb_alloc_buddy_huge_page(h, nid, nmask);
> +	page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
>  
>  	spin_lock(&hugetlb_lock);
>  	if (page) {
> @@ -1616,11 +1617,12 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  {
>  	struct page *page;
>  	struct mempolicy *mpol;
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
>  	int nid;
>  	nodemask_t *nodemask;
>  
> -	nid = huge_node(vma, addr, htlb_alloc_mask(h), &mpol, &nodemask);
> -	page = __alloc_buddy_huge_page(h, nid, nodemask);
> +	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> +	page = __alloc_buddy_huge_page(h, gfp_mask, nid, nodemask);
>  	mpol_cond_put(mpol);
>  
>  	return page;
> @@ -1633,30 +1635,26 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
>   */
>  struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  {
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
>  	struct page *page = NULL;
>  
> +	if (nid != NUMA_NO_NODE)
> +		gfp_mask |= __GFP_THISNODE;
> +
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0)
>  		page = dequeue_huge_page_node(h, nid);
>  	spin_unlock(&hugetlb_lock);
>  
> -	if (!page) {
> -		nodemask_t nmask;
> -
> -		if (nid != NUMA_NO_NODE) {
> -			nmask = NODE_MASK_NONE;
> -			node_set(nid, nmask);
> -		} else {
> -			nmask = node_states[N_MEMORY];
> -		}
> -		page = __alloc_buddy_huge_page(h, nid, &nmask);
> -	}
> +	if (!page)
> +		page = __alloc_buddy_huge_page(h, gfp_mask, nid, NULL);
>  
>  	return page;
>  }
>  
>  struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
>  {
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
>  	struct page *page = NULL;
>  	int node;
>  
> @@ -1673,7 +1671,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	return __alloc_buddy_huge_page(h, NUMA_NO_NODE, nmask);
> +	return __alloc_buddy_huge_page(h, gfp_mask, NUMA_NO_NODE, nmask);
>  }
>  
>  /*
> @@ -1701,7 +1699,8 @@ static int gather_surplus_pages(struct hstate *h, int delta)
>  retry:
>  	spin_unlock(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
> -		page = __alloc_buddy_huge_page(h, NUMA_NO_NODE, NULL);
> +		page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h),
> +				NUMA_NO_NODE, NULL);
>  		if (!page) {
>  			alloc_ok = false;
>  			break;
> 


* Re: [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers
  2017-06-14 15:06       ` Vlastimil Babka
@ 2017-06-14 15:28         ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-14 15:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML

On Wed 14-06-17 17:06:47, Vlastimil Babka wrote:
> On 06/14/2017 03:42 PM, Michal Hocko wrote:
> > On Wed 14-06-17 15:18:26, Vlastimil Babka wrote:
> >> On 06/13/2017 11:00 AM, Michal Hocko wrote:
> > [...]
> >>> @@ -1717,13 +1640,22 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
> >>>  		page = dequeue_huge_page_node(h, nid);
> >>>  	spin_unlock(&hugetlb_lock);
> >>>  
> >>> -	if (!page)
> >>> -		page = __alloc_buddy_huge_page_no_mpol(h, nid);
> >>> +	if (!page) {
> >>> +		nodemask_t nmask;
> >>> +
> >>> +		if (nid != NUMA_NO_NODE) {
> >>> +			nmask = NODE_MASK_NONE;
> >>> +			node_set(nid, nmask);
> >>
> >> TBH I don't like this hack too much, and would rather see __GFP_THISNODE
> >> involved, which picks a different (short) zonelist. Also it's allocating
> >> nodemask on stack, which we generally avoid? Although the callers
> >> currently seem to be shallow.
> > 
> > Fair enough. That would require pulling gfp mask handling up the call
> > chain. This on top of this patch + refreshes for other patches later in
> > the series as they will conflict now?
> 
> For the orig patch + fold (squashed locally from your mmotm/... branch)
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> Please update the commit description which still mentions the nodemask
> emulation of __GFP_THISNODE.

yes I will do that when squashing them.

> Also I noticed that the goal of patch 2 is already partially achieved
> here, because alloc_huge_page_nodemask() will now allocate using
> zonelist. It won't dequeue that way yet, though.

Well, the primary point of the latter is to allow for the preferred node.
I didn't find a proper way to split the two things and still have a
reasonably comprehensible diff. So I've focused on the real allocation
here and pools in the other patch. Hope that makes some sense.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-13  9:00 ` [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
@ 2017-06-14 16:17   ` Vlastimil Babka
  2017-06-14 16:41     ` Michal Hocko
  2017-06-14 22:12   ` Mike Kravetz
  1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2017-06-14 16:17 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML,
	Michal Hocko

On 06/13/2017 11:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> alloc_huge_page_nodemask tries to allocate from any numa node in the
> allowed node mask starting from lower numa nodes. This might lead to
> filling up those low NUMA nodes while others are not used. We can reduce
> this risk by introducing a concept of the preferred node similar to what
> we have in the regular page allocator. We will start allocating from the
> preferred nid and then iterate over all allowed nodes in the zonelist
> order until we try them all.
> 
> This is mimicking the page allocator logic except it operates on
> per-node mempools. dequeue_huge_page_vma already does this so distill
> the zonelist logic into a more generic dequeue_huge_page_nodemask
> and use it in alloc_huge_page_nodemask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I've reviewed the current version in git, where patch 3/4 is folded.

Noticed some things below, but after fixing:
Acked-by: Vlastimil Babka <vbabka@suse.cz>


> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -897,29 +897,58 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> +/* Movability of hugepages depends on migration support. */
> +static inline gfp_t htlb_alloc_mask(struct hstate *h)
>  {
> -	struct page *page;
> -	int node;
> +	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
> +		return GFP_HIGHUSER_MOVABLE;
> +	else
> +		return GFP_HIGHUSER;
> +}
>  
> -	if (nid != NUMA_NO_NODE)
> -		return dequeue_huge_page_node_exact(h, nid);
> +static struct page *dequeue_huge_page_nodemask(struct hstate *h, int nid,
> +		nodemask_t *nmask)
> +{
> +	unsigned int cpuset_mems_cookie;
> +	struct zonelist *zonelist;
> +	struct page *page = NULL;
> +	struct zone *zone;
> +	struct zoneref *z;
> +	gfp_t gfp_mask;
> +	int node = -1;
> +
> +	gfp_mask = htlb_alloc_mask(h);
> +	zonelist = node_zonelist(nid, gfp_mask);
> +
> +retry_cpuset:
> +	cpuset_mems_cookie = read_mems_allowed_begin();
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
> +		if (!cpuset_zone_allowed(zone, gfp_mask))
> +			continue;
> +		/*
> +		 * no need to ask again on the same node. Pool is node rather than
> +		 * zone aware
> +		 */
> +		if (zone_to_nid(zone) == node)
> +			continue;
> +		node = zone_to_nid(zone);
>  
> -	for_each_online_node(node) {
>  		page = dequeue_huge_page_node_exact(h, node);
>  		if (page)
> -			return page;
> +			break;

Either keep return page here...

>  	}
> +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> +		goto retry_cpuset;
> +
>  	return NULL;

... or return page here.

>  }
>  
> -/* Movability of hugepages depends on migration support. */
> -static inline gfp_t htlb_alloc_mask(struct hstate *h)
> +static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
>  {
> -	if (hugepages_treat_as_movable || hugepage_migration_supported(h))
> -		return GFP_HIGHUSER_MOVABLE;
> -	else
> -		return GFP_HIGHUSER;
> +	if (nid != NUMA_NO_NODE)
> +		return dequeue_huge_page_node_exact(h, nid);
> +
> +	return dequeue_huge_page_nodemask(h, nid, NULL);
>  }
>  

...

> @@ -1655,25 +1661,25 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
>  	return page;
>  }
>  
> -struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
> +
> +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> +		nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
> -	int node;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> -		for_each_node_mask(node, *nmask) {
> -			page = dequeue_huge_page_node_exact(h, node);
> -			if (page)
> -				break;
> -		}
> +		page = dequeue_huge_page_nodemask(h, preferred_nid, nmask);



> +		if (page)
> +			goto unlock;
>  	}
> +unlock:

This doesn't seem needed?

>  	spin_unlock(&hugetlb_lock);
>  	if (page)
>  		return page;
>  
>  	/* No reservations, try to overcommit */
> -	return __alloc_buddy_huge_page(h, NUMA_NO_NODE, nmask);
> +	return __alloc_buddy_huge_page(h, preferred_nid, nmask);
>  }
>  
>  /*
> 


* Re: [RFC PATCH 4/4] mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration
  2017-06-13  9:00 ` [RFC PATCH 4/4] mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration Michal Hocko
@ 2017-06-14 16:22   ` Vlastimil Babka
  0 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2017-06-14 16:22 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML,
	Michal Hocko

On 06/13/2017 11:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> new_page is yet another duplication of the migration callback which has
> to handle hugetlb migration specially. We can safely use the generic
> new_page_nodemask for the same purpose.
> 
> Please note that gigantic hugetlb pages do not need any special handling
> because alloc_huge_page_nodemask will make sure to check pages in all
> per node pools. The reason this was done previously was that
> alloc_huge_page_node treated NUMA_NO_NODE and a specific node
> differently and so alloc_huge_page_node(nid) would check on this
> specific node.
> 
> Noticed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory-failure.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3615bffbd269..7040f60ecb71 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1487,16 +1487,8 @@ EXPORT_SYMBOL(unpoison_memory);
>  static struct page *new_page(struct page *p, unsigned long private, int **x)
>  {
>  	int nid = page_to_nid(p);
> -	if (PageHuge(p)) {
> -		struct hstate *hstate = page_hstate(compound_head(p));
>  
> -		if (hstate_is_gigantic(hstate))
> -			return alloc_huge_page_node(hstate, NUMA_NO_NODE);
> -
> -		return alloc_huge_page_node(hstate, nid);
> -	} else {
> -		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
> -	}
> +	return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
>  }
>  
>  /*
> 


* Re: [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-14 16:17   ` Vlastimil Babka
@ 2017-06-14 16:41     ` Michal Hocko
  2017-06-14 16:57       ` Vlastimil Babka
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-06-14 16:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML

On Wed 14-06-17 18:17:18, Vlastimil Babka wrote:
> On 06/13/2017 11:00 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > alloc_huge_page_nodemask tries to allocate from any numa node in the
> > allowed node mask starting from lower numa nodes. This might lead to
> > filling up those low NUMA nodes while others are not used. We can reduce
> > this risk by introducing a concept of the preferred node similar to what
> > we have in the regular page allocator. We will start allocating from the
> > preferred nid and then iterate over all allowed nodes in the zonelist
> > order until we try them all.
> > 
> > This is mimicking the page allocator logic except it operates on
> > per-node mempools. dequeue_huge_page_vma already does this so distill
> > the zonelist logic into a more generic dequeue_huge_page_nodemask
> > and use it in alloc_huge_page_nodemask.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> I've reviewed the current version in git, where patch 3/4 is folded.
> 
> Noticed some things below, but after fixing:
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

[...]
> > +retry_cpuset:
> > +	cpuset_mems_cookie = read_mems_allowed_begin();
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
> > +		if (!cpuset_zone_allowed(zone, gfp_mask))
> > +			continue;
> > +		/*
> > +		 * no need to ask again on the same node. Pool is node rather than
> > +		 * zone aware
> > +		 */
> > +		if (zone_to_nid(zone) == node)
> > +			continue;
> > +		node = zone_to_nid(zone);
> >  
> > -	for_each_online_node(node) {
> >  		page = dequeue_huge_page_node_exact(h, node);
> >  		if (page)
> > -			return page;
> > +			break;
> 
> Either keep return page here...
> 
> >  	}
> > +	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> > +		goto retry_cpuset;
> > +
> >  	return NULL;
> 
> ... or return page here.

Oops, I went with the former.

[...]

> > -struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
> > +
> > +struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> > +		nodemask_t *nmask)
> >  {
> >  	struct page *page = NULL;
> > -	int node;
> >  
> >  	spin_lock(&hugetlb_lock);
> >  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> > -		for_each_node_mask(node, *nmask) {
> > -			page = dequeue_huge_page_node_exact(h, node);
> > -			if (page)
> > -				break;
> > -		}
> > +		page = dequeue_huge_page_nodemask(h, preferred_nid, nmask);
> 
> 
> 
> > +		if (page)
> > +			goto unlock;
> >  	}
> > +unlock:
> 
> This doesn't seem needed?

This on top?
---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9ac0ae725c5e..f9868e095afa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -902,7 +902,6 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 {
 	unsigned int cpuset_mems_cookie;
 	struct zonelist *zonelist;
-	struct page *page = NULL;
 	struct zone *zone;
 	struct zoneref *z;
 	int node = -1;
@@ -912,6 +911,8 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
+		struct page *page;
+
 		if (!cpuset_zone_allowed(zone, gfp_mask))
 			continue;
 		/*
@@ -924,9 +925,9 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 
 		page = dequeue_huge_page_node_exact(h, node);
 		if (page)
-			break;
+			return page;
 	}
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+	if (unlikely(read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
 	return NULL;
@@ -1655,18 +1656,18 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 		nodemask_t *nmask)
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h);
-	struct page *page = NULL;
 
 	spin_lock(&hugetlb_lock);
 	if (h->free_huge_pages - h->resv_huge_pages > 0) {
+		struct page *page;
+
 		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
-		if (page)
-			goto unlock;
+		if (page) {
+			spin_unlock(&hugetlb_lock);
+			return page;
+		}
 	}
-unlock:
 	spin_unlock(&hugetlb_lock);
-	if (page)
-		return page;
 
 	/* No reservations, try to overcommit */
 
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-14 16:41     ` Michal Hocko
@ 2017-06-14 16:57       ` Vlastimil Babka
  0 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2017-06-14 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Naoya Horiguchi, Mike Kravetz, Mel Gorman, Andrew Morton, LKML

On 06/14/2017 06:41 PM, Michal Hocko wrote:
> 
> This on top?
> ---
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9ac0ae725c5e..f9868e095afa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -902,7 +902,6 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
>  {
>  	unsigned int cpuset_mems_cookie;
>  	struct zonelist *zonelist;
> -	struct page *page = NULL;
>  	struct zone *zone;
>  	struct zoneref *z;
>  	int node = -1;
> @@ -912,6 +911,8 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
>  retry_cpuset:
>  	cpuset_mems_cookie = read_mems_allowed_begin();
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
> +		struct page *page;
> +
>  		if (!cpuset_zone_allowed(zone, gfp_mask))
>  			continue;
>  		/*
> @@ -924,9 +925,9 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
>  
>  		page = dequeue_huge_page_node_exact(h, node);
>  		if (page)
> -			break;
> +			return page;
>  	}
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> +	if (unlikely(read_mems_allowed_retry(cpuset_mems_cookie)))
>  		goto retry_cpuset;
>  
>  	return NULL;

OK

> @@ -1655,18 +1656,18 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
>  		nodemask_t *nmask)
>  {
>  	gfp_t gfp_mask = htlb_alloc_mask(h);
> -	struct page *page = NULL;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (h->free_huge_pages - h->resv_huge_pages > 0) {
> +		struct page *page;
> +
>  		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
> -		if (page)
> -			goto unlock;
> +		if (page) {
> +			spin_unlock(&hugetlb_lock);
> +			return page;
> +		}

I thought you would just continue after the if (this is not a for-loop
after all), but this works too.
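
(For illustration, the fall-through variant alluded to above would look roughly like
this, keeping the NULL-initialized page from the original version; just a sketch, not
the code that was actually applied:)

	gfp_t gfp_mask = htlb_alloc_mask(h);
	struct page *page = NULL;

	/* try the pre-allocated pool first */
	spin_lock(&hugetlb_lock);
	if (h->free_huge_pages - h->resv_huge_pages > 0)
		page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
	spin_unlock(&hugetlb_lock);

	/* fall through to the overcommit path if the pool had nothing */
	if (page)
		return page;

	/* No reservations, try to overcommit */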

>  	}
> -unlock:
>  	spin_unlock(&hugetlb_lock);
> -	if (page)
> -		return page;
>  
>  	/* No reservations, try to overcommit */
>  
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-13  9:00 ` [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
  2017-06-14 16:17   ` Vlastimil Babka
@ 2017-06-14 22:12   ` Mike Kravetz
  2017-06-15  0:12     ` Mike Kravetz
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2017-06-14 22:12 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Naoya Horiguchi, Mel Gorman, Vlastimil Babka, Andrew Morton,
	LKML, Michal Hocko

On 06/13/2017 02:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> alloc_huge_page_nodemask tries to allocate from any numa node in the
> allowed node mask starting from lower numa nodes. This might lead to
> filling up those low NUMA nodes while others are not used. We can reduce
> this risk by introducing a concept of the preferred node similar to what
> we have in the regular page allocator. We will start allocating from the
> preferred nid and then iterate over all allowed nodes in the zonelist
> order until we try them all.
> 
> This is mimicking the page allocator logic except it operates on
> per-node mempools. dequeue_huge_page_vma already does this so distill
> the zonelist logic into a more generic dequeue_huge_page_nodemask
> and use it in alloc_huge_page_nodemask.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
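
(For readers following along, a condensed sketch of the dequeue walk the changelog
above describes, boiled down from the hunk posted earlier in the thread; the cpuset
checks, the mems_allowed retry and the same-node dedup are left out,
dequeue_huge_page_node_exact is the per-node helper used elsewhere in this series,
and dequeue_sketch is only an illustrative name:)

static struct page *dequeue_sketch(struct hstate *h, gfp_t gfp_mask,
				   int preferred_nid, nodemask_t *nmask)
{
	/* the preferred node's zonelist orders all other nodes by distance */
	struct zonelist *zonelist = node_zonelist(preferred_nid, gfp_mask);
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					gfp_zone(gfp_mask), nmask) {
		struct page *page;

		/* pop a pre-allocated huge page from this node's pool, if any */
		page = dequeue_huge_page_node_exact(h, zone_to_nid(zone));
		if (page)
			return page;
	}
	return NULL;
}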


I built attempts/hugetlb-zonelists, threw it on a test machine, ran the
libhugetlbfs test suite and saw failures.  The failures started with this
patch: commit 7e8b09f14495 in your tree.  I have not yet started to look
into the failures.  It is even possible that the tests are making bad
assumptions, but there certainly appear to be changes in behavior visible
to the application(s).

FYI - My 'test machine' is an x86 KVM instance with 8GB memory simulating
2 nodes.  Huge page allocations before running tests:
node0
512	free_hugepages
512	nr_hugepages
0	surplus_hugepages
node1
512	free_hugepages
512	nr_hugepages
0	surplus_hugepages

I can take a closer look at the failures tomorrow.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-14 22:12   ` Mike Kravetz
@ 2017-06-15  0:12     ` Mike Kravetz
  2017-06-15  8:12       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2017-06-15  0:12 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Naoya Horiguchi, Mel Gorman, Vlastimil Babka, Andrew Morton,
	LKML, Michal Hocko

On 06/14/2017 03:12 PM, Mike Kravetz wrote:
> On 06/13/2017 02:00 AM, Michal Hocko wrote:
>> From: Michal Hocko <mhocko@suse.com>
>>
>> alloc_huge_page_nodemask tries to allocate from any numa node in the
>> allowed node mask starting from lower numa nodes. This might lead to
>> filling up those low NUMA nodes while others are not used. We can reduce
>> this risk by introducing a concept of the preferred node similar to what
>> we have in the regular page allocator. We will start allocating from the
>> preferred nid and then iterate over all allowed nodes in the zonelist
>> order until we try them all.
>>
>> This is mimicking the page allocator logic except it operates on
>> per-node mempools. dequeue_huge_page_vma already does this so distill
>> the zonelist logic into a more generic dequeue_huge_page_nodemask
>> and use it in alloc_huge_page_nodemask.
>>
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> ---
> 
> 
> I built attempts/hugetlb-zonelists, threw it on a test machine, ran the
> libhugetlbfs test suite and saw failures.  The failures started with this
> patch: commit 7e8b09f14495 in your tree.  I have not yet started to look
> into the failures.  It is even possible that the tests are making bad
> assumptions, but there certainly appear to be changes in behavior visible
> to the application(s).

Never mind.  The failures were the result of dequeue_huge_page_nodemask() always
returning NULL.  Vlastimil already noticed this issue and provided a
solution.

-- 
Mike Kravetz

> 
> FYI - My 'test machine' is an x86 KVM instance with 8GB memory simulating
> 2 nodes.  Huge page allocations before running tests:
> node0
> 512	free_hugepages
> 512	nr_hugepages
> 0	surplus_hugepages
> node1
> 512	free_hugepages
> 512	nr_hugepages
> 0	surplus_hugepages
> 
> I can take a closer look at the failures tomorrow.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask
  2017-06-15  0:12     ` Mike Kravetz
@ 2017-06-15  8:12       ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-15  8:12 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML

On Wed 14-06-17 17:12:31, Mike Kravetz wrote:
> On 06/14/2017 03:12 PM, Mike Kravetz wrote:
> > On 06/13/2017 02:00 AM, Michal Hocko wrote:
> >> From: Michal Hocko <mhocko@suse.com>
> >>
> >> alloc_huge_page_nodemask tries to allocate from any numa node in the
> >> allowed node mask starting from lower numa nodes. This might lead to
> >> filling up those low NUMA nodes while others are not used. We can reduce
> >> this risk by introducing a concept of the preferred node similar to what
> >> we have in the regular page allocator. We will start allocating from the
> >> preferred nid and then iterate over all allowed nodes in the zonelist
> >> order until we try them all.
> >>
> >> This is mimicking the page allocator logic except it operates on
> >> per-node mempools. dequeue_huge_page_vma already does this so distill
> >> the zonelist logic into a more generic dequeue_huge_page_nodemask
> >> and use it in alloc_huge_page_nodemask.
> >>
> >> Signed-off-by: Michal Hocko <mhocko@suse.com>
> >> ---
> > 
> > 
> > I built attempts/hugetlb-zonelists, threw it on a test machine, ran the
> > libhugetlbfs test suite and saw failures.  The failures started with this
> > patch: commit 7e8b09f14495 in your tree.  I have not yet started to look
> > into the failures.  It is even possible that the tests are making bad
> > assumptions, but there certainly appear to be changes in behavior visible
> > to the application(s).
> 
> Never mind.  The failures were the result of dequeue_huge_page_nodemask() always
> returning NULL.  Vlastimil already noticed this issue and provided a
> solution.

I have pushed my current version to the same branch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue
  2017-06-13  9:00 [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
                   ` (3 preceding siblings ...)
  2017-06-13  9:00 ` [RFC PATCH 4/4] mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration Michal Hocko
@ 2017-06-16 11:44 ` Michal Hocko
  4 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-06-16 11:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, Mike Kravetz, Mel Gorman, Vlastimil Babka,
	Andrew Morton, LKML

JFYI, I will repost sometime next week unless there is further feedback
by then.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-06-16 11:44 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-13  9:00 [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
2017-06-13  9:00 ` [RFC PATCH 1/4] mm, hugetlb: unclutter hugetlb allocation layers Michal Hocko
2017-06-14 13:18   ` Vlastimil Babka
2017-06-14 13:42     ` Michal Hocko
2017-06-14 14:04       ` Michal Hocko
2017-06-14 15:06       ` Vlastimil Babka
2017-06-14 15:28         ` Michal Hocko
2017-06-13  9:00 ` [RFC PATCH 2/4] hugetlb: add support for preferred node to alloc_huge_page_nodemask Michal Hocko
2017-06-14 16:17   ` Vlastimil Babka
2017-06-14 16:41     ` Michal Hocko
2017-06-14 16:57       ` Vlastimil Babka
2017-06-14 22:12   ` Mike Kravetz
2017-06-15  0:12     ` Mike Kravetz
2017-06-15  8:12       ` Michal Hocko
2017-06-13  9:00 ` [RFC PATCH 3/4] mm, hugetlb: get rid of dequeue_huge_page_node Michal Hocko
2017-06-13  9:00 ` [RFC PATCH 4/4] mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migration Michal Hocko
2017-06-14 16:22   ` Vlastimil Babka
2017-06-16 11:44 ` [RFC PATCH 0/4] mm, hugetlb: allow proper node fallback dequeue Michal Hocko
