* [PATCH] mm, hugetlb: use memory policy when available
@ 2015-10-20 19:53 ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2015-10-20 19:53 UTC (permalink / raw)
  To: dave
  Cc: akpm, n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I have a hugetlbfs user who never explicitly allocates huge pages
with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
the pages be allocated from the buddy allocator at fault time.

This works, but they noticed that mbind() was having no effect:
the pages were being allocated without respect for the policy they
specified.
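
For illustration, a minimal userspace sketch of the usage in question
(the sysctl values, node number, and 2MB huge page size below are
assumptions for the example, not details from the report):

	/*
	 * Sketch: no preallocated pool, overcommit only:
	 *   echo 0  > /proc/sys/vm/nr_hugepages
	 *   echo 64 > /proc/sys/vm/nr_overcommit_hugepages
	 * Build with: gcc demo.c -lnuma
	 */
	#define _GNU_SOURCE
	#include <numaif.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;			/* one 2MB huge page */
		unsigned long nodemask = 1UL << 0;	/* bind to node 0 */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Set the policy first... */
		if (mbind(p, len, MPOL_BIND, &nodemask,
			  8 * sizeof(nodemask), 0) != 0) {
			perror("mbind");
			return 1;
		}
		/*
		 * ...then fault the page in.  With an empty pool this
		 * falls back to the buddy allocator, which never sees
		 * the policy -- the bug described above.
		 */
		memset(p, 0, len);
		return 0;
	}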

The code in question is this:

> struct page *alloc_huge_page(struct vm_area_struct *vma,
...
>         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>         if (!page) {
>                 page = alloc_buddy_huge_page(h, NUMA_NO_NODE);

dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
But, it only grabs _existing_ huge pages from the huge page pool.  If the
pool is empty, we fall back to alloc_buddy_huge_page() which obviously
can't do anything with the VMA's policy because it isn't even passed the
VMA.

Almost everybody preallocates huge pages.  That's probably why nobody has
ever noticed this.  Looking back at the git history, I don't think this
has _ever_ worked since alloc_buddy_huge_page() was introduced in commit
7893d1d5, 8 years ago.

The fix is to pass vma/addr down into the places where we actually call
into the buddy allocator.  It's fairly straightforward plumbing.  This has
been lightly tested.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/mm/hugetlb.c |  111 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 99 insertions(+), 12 deletions(-)

diff -puN mm/hugetlb.c~hugetlbfs-fix-mbind-when-demand-allocating mm/hugetlb.c
--- a/mm/hugetlb.c~hugetlbfs-fix-mbind-when-demand-allocating	2015-10-20 12:50:55.598628613 -0700
+++ b/mm/hugetlb.c	2015-10-20 12:50:55.605628929 -0700
@@ -1437,7 +1437,76 @@ void dissolve_free_huge_pages(unsigned l
 		dissolve_free_huge_page(pfn_to_page(pfn));
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
+/*
+ * There are 3 ways this can get called:
+ * 1. With vma+addr: we use the VMA's memory policy.
+ * 2. With !vma, but nid=NUMA_NO_NODE: we try to allocate a huge
+ *    page from any node, and let the buddy allocator itself
+ *    figure it out.
+ * 3. With !vma, but nid!=NUMA_NO_NODE: we allocate a huge page
+ *    strictly from 'nid'.
+ */
+static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
+		struct vm_area_struct *vma, unsigned long addr, int nid)
+{
+	int order = huge_page_order(h);
+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+	unsigned int cpuset_mems_cookie;
+
+	/*
+	 * We need a VMA to get a memory policy.  If we do not
+	 * have one, we use the 'nid' argument
+	 */
+	if (!vma) {
+		/*
+		 * If a specific node is requested, make sure
+		 * that memory comes strictly from that node
+		 * (__GFP_THISNODE).
+		 */
+		if (nid != NUMA_NO_NODE)
+			gfp |= __GFP_THISNODE;
+		/*
+		 * Make sure to call something that can handle
+		 * nid=NUMA_NO_NODE
+		 */
+		return alloc_pages_node(nid, gfp, order);
+	}
+
+	/*
+	 * OK, so we have a VMA.  Fetch the mempolicy and try to
+	 * allocate a huge page with it.
+	 */
+	do {
+		struct page *page;
+		struct mempolicy *mpol;
+		struct zonelist *zl;
+		nodemask_t *nodemask;
+
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
+		mpol_cond_put(mpol);
+		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
+		if (page)
+			return page;
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+	return NULL;
+}
+
+/*
+ * There are two ways to allocate a huge page:
+ * 1. When you have a VMA and an address (like a fault)
+ * 2. When you have no VMA (like when setting /proc/.../nr_hugepages)
+ *
+ * 'vma' and 'addr' are only for (1).  'nid' is always NUMA_NO_NODE in
+ * this case, which signifies that the allocation should be done with
+ * respect for the VMA's memory policy.
+ *
+ * For (2), we ignore 'vma' and 'addr' and use 'nid' exclusively. This
+ * implies that memory policies will not be taken into account.
+ */
+static struct page *__alloc_buddy_huge_page(struct hstate *h,
+		struct vm_area_struct *vma, unsigned long addr, int nid)
 {
 	struct page *page;
 	unsigned int r_nid;
@@ -1445,6 +1514,10 @@ static struct page *alloc_buddy_huge_pag
 	if (hstate_is_gigantic(h))
 		return NULL;
 
+	if (vma || addr) {
+		WARN_ON_ONCE(!addr || addr == -1);
+		WARN_ON_ONCE(nid != NUMA_NO_NODE);
+	}
 	/*
 	 * Assume we will successfully allocate the surplus page to
 	 * prevent racing processes from causing the surplus to exceed
@@ -1478,14 +1551,7 @@ static struct page *alloc_buddy_huge_pag
 	}
 	spin_unlock(&hugetlb_lock);
 
-	if (nid == NUMA_NO_NODE)
-		page = alloc_pages(htlb_alloc_mask(h)|__GFP_COMP|
-				   __GFP_REPEAT|__GFP_NOWARN,
-				   huge_page_order(h));
-	else
-		page = __alloc_pages_node(nid,
-			htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
+	page = __hugetlb_alloc_buddy_huge_page(h, vma, addr, nid);
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
@@ -1510,6 +1576,27 @@ static struct page *alloc_buddy_huge_pag
 }
 
 /*
+ * Allocate a huge page from 'nid'.  Note, 'nid' may be
+ * NUMA_NO_NODE, which means that the page may be
+ * allocated on any node.
+ */
+struct page *__alloc_buddy_huge_page_no_mpol(struct hstate *h, int nid)
+{
+	unsigned long addr = -1;
+
+	return __alloc_buddy_huge_page(h, NULL, addr, nid);
+}
+
+/*
+ * Use the VMA's mpolicy to allocate a huge page from the buddy.
+ */
+struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+	return __alloc_buddy_huge_page(h, vma, addr, NUMA_NO_NODE);
+}
+
+/*
  * This allocation function is useful in the context where vma is irrelevant.
  * E.g. soft-offlining uses this function because it only cares physical
  * address of error page.
@@ -1524,7 +1611,7 @@ struct page *alloc_huge_page_node(struct
 	spin_unlock(&hugetlb_lock);
 
 	if (!page)
-		page = alloc_buddy_huge_page(h, nid);
+		page = __alloc_buddy_huge_page_no_mpol(h, nid);
 
 	return page;
 }
@@ -1554,7 +1641,7 @@ static int gather_surplus_pages(struct h
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
+		page = __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE);
 		if (!page) {
 			alloc_ok = false;
 			break;
@@ -1787,7 +1874,7 @@ struct page *alloc_huge_page(struct vm_a
 	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
 	if (!page) {
 		spin_unlock(&hugetlb_lock);
-		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
+		page = __alloc_buddy_huge_page_with_mpol(h, vma, addr);
 		if (!page)
 			goto out_uncharge_cgroup;
 
_

* Re: [PATCH] mm, hugetlb: use memory policy when available
  2015-10-20 19:53 ` Dave Hansen
@ 2015-10-20 22:19   ` Andrew Morton
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2015-10-20 22:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen

On Tue, 20 Oct 2015 12:53:17 -0700 Dave Hansen <dave@sr71.net> wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> I have a hugetlbfs user who never explicitly allocates huge pages
> with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
> 
> This works, but they noticed that mbind() was having no effect:
> the pages were being allocated without respect for the policy they
> specified.
> 
> The code in question is this:
> 
> > struct page *alloc_huge_page(struct vm_area_struct *vma,
> ...
> >         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
> >         if (!page) {
> >                 page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> 
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool.  If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
> 
> Almost everybody preallocates huge pages.  That's probably why nobody has
> ever noticed this.  Looking back at the git history, I don't think this
> has _ever_ worked since alloc_buddy_huge_page() was introduced in commit
> 7893d1d5, 8 years ago.
> 
> The fix is to pass vma/addr down into the places where we actually call
> into the buddy allocator.  It's fairly straightforward plumbing.  This has
> been lightly tested.

huh.  Fair enough.

>  b/mm/hugetlb.c |  111 ++++++++++++++++++++++++++++++++++++++++++++++++++-------

Is it worth deporking this for the CONFIG_NUMA=n case?
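
One possible shape for that (a sketch using IS_ENABLED(), not the
actual cleanup): let the compiler discard the mempolicy path when
NUMA is off, since there is no policy to honor in that case:

	if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
		/*
		 * No NUMA, or no VMA to take a policy from:
		 * fall through to a plain buddy allocation.
		 */
		if (nid != NUMA_NO_NODE)
			gfp |= __GFP_THISNODE;
		return alloc_pages_node(nid, gfp, order);
	}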



* Re: [PATCH] mm, hugetlb: use memory policy when available
  2015-10-20 19:53 ` Dave Hansen
@ 2015-10-21 15:12   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2015-10-21 15:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen

On Tue, Oct 20, 2015 at 12:53:17PM -0700, Dave Hansen wrote:
> @@ -1445,6 +1514,10 @@ static struct page *alloc_buddy_huge_pag
>  	if (hstate_is_gigantic(h))
>  		return NULL;
>  
> +	if (vma || addr) {
> +		WARN_ON_ONCE(!addr || addr == -1);

Trinity triggered the WARN for me:

[  118.647212] WARNING: CPU: 10 PID: 9621 at /home/kas/linux/mm/mm/hugetlb.c:1514 __alloc_buddy_huge_page+0x2c8/0x300()
[  118.648698] Modules linked in:
[  118.649105] CPU: 10 PID: 9621 Comm: trinity-c147 Not tainted 4.2.0-dirty #651
[  118.649909] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
[  118.650929]  ffffffff81ca6ad8 ffff88081f7f3c68 ffffffff818a9977 0000000080000001
[  118.651889]  0000000000000000 ffff88081f7f3ca8 ffffffff810574d6 ffff88081f7f3c98
[  118.652965]  0000000000000000 ffffffff830a87e0 00000000ffffffff ffffffffffffffff
[  118.653988] Call Trace:
[  118.654315]  [<ffffffff818a9977>] dump_stack+0x4f/0x7b
[  118.654936]  [<ffffffff810574d6>] warn_slowpath_common+0x86/0xc0
[  118.655630]  [<ffffffff810575ca>] warn_slowpath_null+0x1a/0x20
[  118.656427]  [<ffffffff811ac5e8>] __alloc_buddy_huge_page+0x2c8/0x300
[  118.657185]  [<ffffffff811ad081>] hugetlb_acct_memory+0xa1/0x3d0
[  118.657897]  [<ffffffff811ab241>] ? region_chg+0x1f1/0x200
[  118.658559]  [<ffffffff811ae932>] hugetlb_reserve_pages+0x92/0x250
[  118.659289]  [<ffffffff812d517c>] hugetlb_file_setup+0x14c/0x320
[  118.659994]  [<ffffffff813d2fd5>] newseg+0x135/0x370
[  118.660713]  [<ffffffff813cc134>] ? ipcget+0x44/0x2d0
[  118.661306]  [<ffffffff813cc360>] ipcget+0x270/0x2d0
[  118.661911]  [<ffffffff813d3525>] SyS_shmget+0x45/0x50
[  118.663409]  [<ffffffff818b2c7c>] tracesys_phase2+0x84/0x89
[  118.664199] ---[ end trace d2829191292b44ef ]---


> +		WARN_ON_ONCE(nid != NUMA_NO_NODE);
> +	}
>  	/*
>  	 * Assume we will successfully allocate the surplus page to
>  	 * prevent racing processes from causing the surplus to exceed
-- 
 Kirill A. Shutemov

* Re: [PATCH] mm, hugetlb: use memory policy when available
  2015-10-20 19:53 ` Dave Hansen
@ 2015-10-22 21:39   ` Sasha Levin
  -1 siblings, 0 replies; 14+ messages in thread
From: Sasha Levin @ 2015-10-22 21:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen

On 10/20/2015 03:53 PM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> I have a hugetlbfs user who never explicitly allocates huge pages
> with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
> 
> This works, but they noticed that mbind() was having no effect:
> the pages were being allocated without respect for the policy they
> specified.
> 
> The code in question is this:
> 
>> > struct page *alloc_huge_page(struct vm_area_struct *vma,
> ...
>> >         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>> >         if (!page) {
>> >                 page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool.  If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
> 
> Almost everybody preallocates huge pages.  That's probably why nobody has
> ever noticed this.  Looking back at the git history, I don't think this
> has _ever_ worked since alloc_buddy_huge_page() was introduced in commit
> 7893d1d5, 8 years ago.
> 
> The fix is to pass vma/addr down into the places where we actually call
> into the buddy allocator.  It's fairly straightforward plumbing.  This has
> been lightly tested.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Hey Dave,

Trinity seems to be able to hit the newly added warnings pretty easily:

[  339.282065] WARNING: CPU: 4 PID: 10181 at mm/hugetlb.c:1520 __alloc_buddy_huge_page+0xff/0xa80()
[  339.360228] Modules linked in:
[  339.360838] CPU: 4 PID: 10181 Comm: trinity-c291 Not tainted 4.3.0-rc6-next-20151022-sasha-00040-g5ecc711-dirty #2608
[  339.362629]  ffff88015e59c000 00000000e6475701 ffff88015e61f9a0 ffffffff9dd3ef48
[  339.363896]  0000000000000000 ffff88015e61f9e0 ffffffff9c32d1ca ffffffff9c7175bf
[  339.365167]  ffffffffabddc0c8 ffff88015e61faf0 0000000000000000 ffffffffffffffff
[  339.366387] Call Trace:
[  339.366831]  [<ffffffff9dd3ef48>] dump_stack+0x4e/0x86
[  339.367648]  [<ffffffff9c32d1ca>] warn_slowpath_common+0xfa/0x120
[  339.368635]  [<ffffffff9c7175bf>] ? __alloc_buddy_huge_page+0xff/0xa80
[  339.369631]  [<ffffffff9c32d3ca>] warn_slowpath_null+0x1a/0x20
[  339.370574]  [<ffffffff9c7175bf>] __alloc_buddy_huge_page+0xff/0xa80
[  339.371551]  [<ffffffff9c7174c0>] ? return_unused_surplus_pages+0x120/0x120
[  339.372698]  [<ffffffff9dda0327>] ? debug_smp_processor_id+0x17/0x20
[  339.373683]  [<ffffffff9c41574b>] ? get_lock_stats+0x1b/0x80
[  339.374551]  [<ffffffff9c42e901>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  339.375744]  [<ffffffff9c433870>] ? do_raw_spin_unlock+0x1d0/0x1e0
[  339.376728]  [<ffffffff9c718333>] hugetlb_acct_memory+0x193/0x990
[  339.377663]  [<ffffffff9c7181a0>] ? dequeue_huge_page_node+0x260/0x260
[  339.378658]  [<ffffffff9c41c970>] ? trace_hardirqs_on_caller+0x540/0x5e0
[  339.379671]  [<ffffffff9c71e469>] hugetlb_reserve_pages+0x229/0x330
[  339.380738]  [<ffffffff9cba273b>] hugetlb_file_setup+0x54b/0x810
[  339.381689]  [<ffffffff9cba21f0>] ? hugetlbfs_fallocate+0x9e0/0x9e0
[  339.382653]  [<ffffffff9dd669f0>] ? scnprintf+0x100/0x100
[  339.383526]  [<ffffffff9da638af>] newseg+0x49f/0xa70
[  339.384371]  [<ffffffff9dda0327>] ? debug_smp_processor_id+0x17/0x20
[  339.385345]  [<ffffffff9da63410>] ? shm_try_destroy_orphaned+0x190/0x190
[  339.386365]  [<ffffffff9da52cf0>] ? ipcget+0x60/0x510
[  339.387139]  [<ffffffff9da52d1f>] ipcget+0x8f/0x510
[  339.387902]  [<ffffffff9c0046f0>] ? do_audit_syscall_entry+0x2b0/0x2b0
[  339.388931]  [<ffffffff9da64e1a>] SyS_shmget+0x11a/0x160
[  339.389737]  [<ffffffff9da64d00>] ? is_file_shm_hugepages+0x40/0x40
[  339.393268]  [<ffffffff9c006ac2>] ? syscall_trace_enter_phase2+0x462/0x5f0
[  339.395643]  [<ffffffffa55ce0f8>] tracesys_phase2+0x88/0x8d


Thanks,
Sasha

* Re: [PATCH] mm, hugetlb: use memory policy when available
  2015-10-22 21:39   ` Sasha Levin
@ 2015-10-22 21:42     ` Dave Hansen
  -1 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2015-10-22 21:42 UTC (permalink / raw)
  To: Sasha Levin
  Cc: akpm, n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen

On 10/22/2015 02:39 PM, Sasha Levin wrote:
> Trinity seems to be able to hit the newly added warnings pretty easily:

Kirill reported the same thing.  Is it fixed with this applied?

> http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-use-memory-policy-when-available-fix.patch
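
(For reference: the warning trips because the no-mpol wrapper passes
vma == NULL with addr == -1, so "if (vma || addr)" takes the branch
and the "addr == -1" test fires on its own sentinel.  A sketch of the
kind of correction needed -- the actual -fix patch above may differ:)

	/*
	 * Treat vma == NULL with addr == -1 as the legitimate
	 * "no mempolicy" case; only a caller passing a VMA or a
	 * real address must leave nid as NUMA_NO_NODE.
	 */
	if (vma || addr != -1) {
		WARN_ON_ONCE(addr == -1);
		WARN_ON_ONCE(nid != NUMA_NO_NODE);
	}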



* Re: [PATCH] mm, hugetlb: use memory policy when available
  2015-10-22 21:42     ` Dave Hansen
@ 2015-11-03 19:12       ` Sasha Levin
  -1 siblings, 0 replies; 14+ messages in thread
From: Sasha Levin @ 2015-11-03 19:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen

On 10/22/2015 05:42 PM, Dave Hansen wrote:
> On 10/22/2015 02:39 PM, Sasha Levin wrote:
>> > Trinity seems to be able to hit the newly added warnings pretty easily:
> Kirill reported the same thing.  Is it fixed with this applied?
> 
>> > http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-use-memory-policy-when-available-fix.patch

Yup, that works for me.


Thanks,
Sasha

* Re: [PATCH] mm, hugetlb: use memory policy when available
  2015-10-20 19:53 ` Dave Hansen
@ 2015-11-05 13:47   ` Vlastimil Babka
  -1 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2015-11-05 13:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, n-horiguchi, mike.kravetz, hillf.zj, rientjes, linux-mm,
	linux-kernel, dave.hansen

On 10/20/2015 09:53 PM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> I have a hugetlbfs user who never explicitly allocates huge pages
> with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
> 
> This works, but they noticed that mbind() was having no effect:
> the pages were being allocated without respect for the policy they
> specified.
> 
> The code in question is this:
> 
>> struct page *alloc_huge_page(struct vm_area_struct *vma,
> ...
>>         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>>         if (!page) {
>>                 page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> 
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool.  If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
> 
> Almost everybody preallocates huge pages.  That's probably why nobody has
> ever noticed this.  Looking back at the git history, I don't think this
> has _ever_ worked since alloc_buddy_huge_page() was introduced in commit
> 7893d1d5, 8 years ago.
> 
> The fix is to pass vma/addr down into the places where we actually call
> into the buddy allocator.  It's fairly straightforward plumbing.  This has
> been lightly tested.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Together with the fix and the NUMA=n cleanup,

Acked-by: Vlastimil Babka <vbabka@suse.cz>

