* [PATCH v3 0/5] hugetlb: add support for gigantic page allocation at runtime
@ 2014-04-10 17:58 ` Luiz Capitulino
  0 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-10 17:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, mtosatti, aarcange, mgorman, akpm, andi, davidlohr,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

[Full introduction right after the changelog]

Changelog
---------

v3

- Dropped unnecessary WARN_ON() call [Kirill]
- Always check if the pfn range lies within a zone [Yasuaki]
- Renamed some function arguments for consistency

v2

- Rewrote allocation loop to avoid scanning useless PFNs [Yasuaki]
- Dropped incomplete multi-arch support [Naoya]
- Added patch to drop __init from prep_compound_gigantic_page()
- Restricted the feature to x86_64 (more details in patch 5/5)
- Added review-bys plus minor changelog changes

Introduction
------------

The HugeTLB subsystem uses the buddy allocator to allocate hugepages at
runtime. This means that runtime hugepage allocation is limited to orders
below MAX_ORDER. For archs supporting gigantic pages (that is, pages whose
order is MAX_ORDER or above), this in turn means that those pages can't be
allocated at runtime.

HugeTLB supports gigantic page allocation during boottime, via the boot
allocator. To this end the kernel provides the command-line options
hugepagesz= and hugepages=, which can be used to instruct the kernel to
allocate N gigantic pages during boot.

For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
this has to be done at boot via the hugepagesz= and hugepages= command-line
options.
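
For instance, appending something like the following to the kernel
command line would reserve two 1G pages at boot (the page count here is
just an example):

 hugepagesz=1G hugepages=2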

Now, gigantic page allocation at boottime has two serious problems:

 1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
    evenly distributes boottime allocated hugepages among nodes.

    For example, suppose you have a four-node NUMA machine and want
    to allocate four 1G gigantic pages at boottime. The kernel will
    allocate one gigantic page per node.

    On the other hand, we do have users who want to be able to specify
    which NUMA node gigantic pages should be allocated from, so that they
    can place virtual machines on a specific NUMA node.

 2. Gigantic pages allocated at boottime can't be freed

At this point it's important to observe that regular hugepages allocated
at runtime don't have those problems. This is because the HugeTLB interface
for runtime allocation in sysfs supports NUMA, and runtime-allocated pages
can be freed just fine via the buddy allocator.

This series adds support for allocating gigantic pages at runtime. It does
so by allocating gigantic pages via CMA instead of the buddy allocator.
Releasing gigantic pages is also supported via CMA. As this series builds
on top of the existing HugeTLB interface, allocating and releasing gigantic
pages works just like it does for regular-sized hugepages. This also means
that NUMA support just works.

For example, to allocate two 1G gigantic pages on node 1, one can do:

 # echo 2 > \
   /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

And, to release all gigantic pages on the same node:

 # echo 0 > \
   /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
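
To verify the current count, one can simply read the same sysfs file back:

 # cat \
   /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages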

Please, refer to patch 5/5 for full technical details.

Finally, please note that this series is a follow-up to a previous series
that tried to extend the command-line option set to be NUMA aware:

 http://marc.info/?l=linux-mm&m=139593335312191&w=2

During the discussion of that series it was agreed that having runtime
allocation support for gigantic pages was a better solution.

Luiz Capitulino (5):
  hugetlb: prep_compound_gigantic_page(): drop __init marker
  hugetlb: add hstate_is_gigantic()
  hugetlb: update_and_free_page(): don't clear PG_reserved bit
  hugetlb: move helpers up in the file
  hugetlb: add support for gigantic page allocation at runtime

 include/linux/hugetlb.h |   5 +
 mm/hugetlb.c            | 336 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 245 insertions(+), 96 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/5] hugetlb: prep_compound_gigantic_page(): drop __init marker
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-10 17:58   ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-10 17:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, mtosatti, aarcange, mgorman, akpm, andi, davidlohr,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

This function is going to be used by non-init code in a future
commit.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
 mm/hugetlb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd30f22..957231b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -690,8 +690,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	put_page(page); /* free it into the hugepage allocator */
 }
 
-static void __init prep_compound_gigantic_page(struct page *page,
-					       unsigned long order)
+static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 {
 	int i;
 	int nr_pages = 1 << order;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/5] hugetlb: add hstate_is_gigantic()
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-10 17:58   ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-10 17:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, mtosatti, aarcange, mgorman, akpm, andi, davidlohr,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 include/linux/hugetlb.h |  5 +++++
 mm/hugetlb.c            | 28 ++++++++++++++--------------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5b337cf..62a8b88 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -343,6 +343,11 @@ static inline unsigned huge_page_shift(struct hstate *h)
 	return h->order + PAGE_SHIFT;
 }
 
+static inline bool hstate_is_gigantic(struct hstate *h)
+{
+	return huge_page_order(h) >= MAX_ORDER;
+}
+
 static inline unsigned int pages_per_huge_page(struct hstate *h)
 {
 	return 1 << h->order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 957231b..a5e679b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -611,7 +611,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
 
-	VM_BUG_ON(h->order >= MAX_ORDER);
+	VM_BUG_ON(hstate_is_gigantic(h));
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
@@ -664,7 +664,7 @@ static void free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
-	if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+	if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
 		/* remove the page from active list */
 		list_del(&page->lru);
 		update_and_free_page(h, page);
@@ -768,7 +768,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
-	if (h->order >= MAX_ORDER)
+	if (hstate_is_gigantic(h))
 		return NULL;
 
 	page = alloc_pages_exact_node(nid,
@@ -962,7 +962,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 	struct page *page;
 	unsigned int r_nid;
 
-	if (h->order >= MAX_ORDER)
+	if (hstate_is_gigantic(h))
 		return NULL;
 
 	/*
@@ -1155,7 +1155,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 	h->resv_huge_pages -= unused_resv_pages;
 
 	/* Cannot return gigantic pages currently */
-	if (h->order >= MAX_ORDER)
+	if (hstate_is_gigantic(h))
 		return;
 
 	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
@@ -1354,7 +1354,7 @@ static void __init gather_bootmem_prealloc(void)
 		 * fix confusing memory reports from free(1) and another
 		 * side-effects, like CommitLimit going negative.
 		 */
-		if (h->order > (MAX_ORDER - 1))
+		if (hstate_is_gigantic(h))
 			adjust_managed_page_count(page, 1 << h->order);
 	}
 }
@@ -1364,7 +1364,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	unsigned long i;
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
-		if (h->order >= MAX_ORDER) {
+		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_fresh_huge_page(h,
@@ -1380,7 +1380,7 @@ static void __init hugetlb_init_hstates(void)
 
 	for_each_hstate(h) {
 		/* oversize hugepages were init'ed in early boot */
-		if (h->order < MAX_ORDER)
+		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
 	}
 }
@@ -1414,7 +1414,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
 {
 	int i;
 
-	if (h->order >= MAX_ORDER)
+	if (hstate_is_gigantic(h))
 		return;
 
 	for_each_node_mask(i, *nodes_allowed) {
@@ -1477,7 +1477,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 {
 	unsigned long min_count, ret;
 
-	if (h->order >= MAX_ORDER)
+	if (hstate_is_gigantic(h))
 		return h->max_huge_pages;
 
 	/*
@@ -1604,7 +1604,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
 		goto out;
 
 	h = kobj_to_hstate(kobj, &nid);
-	if (h->order >= MAX_ORDER) {
+	if (hstate_is_gigantic(h)) {
 		err = -EINVAL;
 		goto out;
 	}
@@ -1687,7 +1687,7 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
 	unsigned long input;
 	struct hstate *h = kobj_to_hstate(kobj, NULL);
 
-	if (h->order >= MAX_ORDER)
+	if (hstate_is_gigantic(h))
 		return -EINVAL;
 
 	err = kstrtoul(buf, 10, &input);
@@ -2112,7 +2112,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 
 	tmp = h->max_huge_pages;
 
-	if (write && h->order >= MAX_ORDER)
+	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
 
 	table->data = &tmp;
@@ -2165,7 +2165,7 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 
 	tmp = h->nr_overcommit_huge_pages;
 
-	if (write && h->order >= MAX_ORDER)
+	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
 
 	table->data = &tmp;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/5] hugetlb: update_and_free_page(): don't clear PG_reserved bit
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-10 17:58   ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-10 17:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, mtosatti, aarcange, mgorman, akpm, andi, davidlohr,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

Hugepages never get the PG_reserved bit set, so don't clear it.

However, note that if the bit gets mistakenly set, free_pages_check() will
catch it.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
 mm/hugetlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a5e679b..8cbaa97 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -618,8 +618,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
-				1 << PG_active | 1 << PG_reserved |
-				1 << PG_private | 1 << PG_writeback);
+				1 << PG_active | 1 << PG_private |
+				1 << PG_writeback);
 	}
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
 	set_compound_page_dtor(page, NULL);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/5] hugetlb: move helpers up in the file
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-10 17:58   ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-10 17:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, mtosatti, aarcange, mgorman, akpm, andi, davidlohr,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

The next commit will add new code that calls the
for_each_node_mask_to_alloc() macro. Move it, its buddy
for_each_node_mask_to_free(), and their dependencies up in the file so
the new code can use them. This is just code movement, no logic change.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 mm/hugetlb.c | 146 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 73 insertions(+), 73 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8cbaa97..6f1ca74 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -607,6 +607,79 @@ err:
 	return NULL;
 }
 
+/*
+ * common helper functions for hstate_next_node_to_{alloc|free}.
+ * We may have allocated or freed a huge page based on a different
+ * nodes_allowed previously, so h->next_node_to_{alloc|free} might
+ * be outside of *nodes_allowed.  Ensure that we use an allowed
+ * node for alloc or free.
+ */
+static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	nid = next_node(nid, *nodes_allowed);
+	if (nid == MAX_NUMNODES)
+		nid = first_node(*nodes_allowed);
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+
+	return nid;
+}
+
+static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+	if (!node_isset(nid, *nodes_allowed))
+		nid = next_node_allowed(nid, nodes_allowed);
+	return nid;
+}
+
+/*
+ * returns the previously saved node ["this node"] from which to
+ * allocate a persistent huge page for the pool and advance the
+ * next node from which to allocate, handling wrap at end of node
+ * mask.
+ */
+static int hstate_next_node_to_alloc(struct hstate *h,
+					nodemask_t *nodes_allowed)
+{
+	int nid;
+
+	VM_BUG_ON(!nodes_allowed);
+
+	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
+	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+
+	return nid;
+}
+
+/*
+ * helper for free_pool_huge_page() - return the previously saved
+ * node ["this node"] from which to free a huge page.  Advance the
+ * next node id whether or not we find a free huge page to free so
+ * that the next attempt to free addresses the next node.
+ */
+static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
+{
+	int nid;
+
+	VM_BUG_ON(!nodes_allowed);
+
+	nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
+	h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
+
+	return nid;
+}
+
+#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask)		\
+	for (nr_nodes = nodes_weight(*mask);				\
+		nr_nodes > 0 &&						\
+		((node = hstate_next_node_to_alloc(hs, mask)) || 1);	\
+		nr_nodes--)
+
+#define for_each_node_mask_to_free(hs, nr_nodes, node, mask)		\
+	for (nr_nodes = nodes_weight(*mask);				\
+		nr_nodes > 0 &&						\
+		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
+		nr_nodes--)
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -786,79 +859,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 	return page;
 }
 
-/*
- * common helper functions for hstate_next_node_to_{alloc|free}.
- * We may have allocated or freed a huge page based on a different
- * nodes_allowed previously, so h->next_node_to_{alloc|free} might
- * be outside of *nodes_allowed.  Ensure that we use an allowed
- * node for alloc or free.
- */
-static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
-{
-	nid = next_node(nid, *nodes_allowed);
-	if (nid == MAX_NUMNODES)
-		nid = first_node(*nodes_allowed);
-	VM_BUG_ON(nid >= MAX_NUMNODES);
-
-	return nid;
-}
-
-static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
-{
-	if (!node_isset(nid, *nodes_allowed))
-		nid = next_node_allowed(nid, nodes_allowed);
-	return nid;
-}
-
-/*
- * returns the previously saved node ["this node"] from which to
- * allocate a persistent huge page for the pool and advance the
- * next node from which to allocate, handling wrap at end of node
- * mask.
- */
-static int hstate_next_node_to_alloc(struct hstate *h,
-					nodemask_t *nodes_allowed)
-{
-	int nid;
-
-	VM_BUG_ON(!nodes_allowed);
-
-	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
-	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
-
-	return nid;
-}
-
-/*
- * helper for free_pool_huge_page() - return the previously saved
- * node ["this node"] from which to free a huge page.  Advance the
- * next node id whether or not we find a free huge page to free so
- * that the next attempt to free addresses the next node.
- */
-static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
-{
-	int nid;
-
-	VM_BUG_ON(!nodes_allowed);
-
-	nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
-	h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
-
-	return nid;
-}
-
-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask)		\
-	for (nr_nodes = nodes_weight(*mask);				\
-		nr_nodes > 0 &&						\
-		((node = hstate_next_node_to_alloc(hs, mask)) || 1);	\
-		nr_nodes--)
-
-#define for_each_node_mask_to_free(hs, nr_nodes, node, mask)		\
-	for (nr_nodes = nodes_weight(*mask);				\
-		nr_nodes > 0 &&						\
-		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
-		nr_nodes--)
-
 static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 {
 	struct page *page;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-10 17:58   ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-10 17:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, mtosatti, aarcange, mgorman, akpm, andi, davidlohr,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

HugeTLB is limited to allocating hugepages whose order is less than
MAX_ORDER. This is so because HugeTLB allocates hugepages via the buddy
allocator. Gigantic pages (that is, pages whose order is MAX_ORDER or
above) have to be allocated at boottime.

However, boottime allocation has at least two serious problems: first, it
is not NUMA aware, and second, gigantic pages allocated at boottime can't
be freed.

This commit solves both issues by adding support for allocating gigantic
pages during runtime. It works just like regular sized hugepages,
meaning that the interface in sysfs is the same, it supports NUMA,
and gigantic pages can be freed.

For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
gigantic pages on node 1, one can do:

 # echo 2 > \
   /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

And to free them all:

 # echo 0 > \
   /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

The one problem with gigantic page allocation at runtime is that it
can't be serviced by the buddy allocator. To overcome that problem, this
commit scans all zones of a node looking for a large enough contiguous
region. When one is found, it's allocated via CMA, that is, by calling
alloc_contig_range() to do the actual allocation. For example, on x86_64
we scan all zones looking for a free 1GB contiguous region; when one is
found, it's allocated with alloc_contig_range().

One expected issue with that approach is that such gigantic contiguous
regions tend to vanish as the system runs. The best way to avoid this for
now is to make gigantic page allocations very early during system boot, say
from an init script. Other possible optimizations include using compaction,
which is supported by CMA but is not explicitly used by this commit.
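
For instance, a minimal early-boot snippet could look like the following
(using the node-agnostic sysfs path; the page count is just an example):

 # echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages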

It's also important to note the following:

 1. Gigantic pages allocated at boottime by the hugepages= command-line
    option can be freed at runtime just fine

 2. This commit adds support for gigantic pages only to x86_64. The
    reason is that I don't have access to nor experience with other archs.
    The code is arch independent though, so it should be simple to add
    support for other archs

 3. I didn't add support for hugepage overcommit, that is, allocating
    a gigantic page on demand when
    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
    think it's reasonable to do the hard and long work required for
    allocating a gigantic page at fault time. But it should be simple
    to add this if wanted

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
 mm/hugetlb.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 156 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6f1ca74..161dc39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -680,11 +680,150 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
+#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
+static void destroy_compound_gigantic_page(struct page *page,
+					unsigned long order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		__ClearPageTail(p);
+		set_page_refcounted(p);
+		p->first_page = NULL;
+	}
+
+	set_compound_order(page, 0);
+	__ClearPageHead(page);
+}
+
+static void free_gigantic_page(struct page *page, unsigned order)
+{
+	free_contig_range(page_to_pfn(page), 1 << order);
+}
+
+static int __alloc_gigantic_page(unsigned long start_pfn,
+				unsigned long nr_pages)
+{
+	unsigned long end_pfn = start_pfn + nr_pages;
+	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+}
+
+static bool pfn_range_valid_gigantic(unsigned long start_pfn,
+				unsigned long nr_pages)
+{
+	unsigned long i, end_pfn = start_pfn + nr_pages;
+	struct page *page;
+
+	for (i = start_pfn; i < end_pfn; i++) {
+		if (!pfn_valid(i))
+			return false;
+
+		page = pfn_to_page(i);
+
+		if (PageReserved(page))
+			return false;
+
+		if (page_count(page) > 0)
+			return false;
+
+		if (PageHuge(page))
+			return false;
+	}
+
+	return true;
+}
+
+static bool zone_spans_last_pfn(const struct zone *zone,
+			unsigned long start_pfn, unsigned long nr_pages)
+{
+	unsigned long last_pfn = start_pfn + nr_pages - 1;
+	return zone_spans_pfn(zone, last_pfn);
+}
+
+static struct page *alloc_gigantic_page(int nid, unsigned order)
+{
+	unsigned long nr_pages = 1 << order;
+	unsigned long ret, pfn, flags;
+	struct zone *z;
+
+	z = NODE_DATA(nid)->node_zones;
+	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
+		spin_lock_irqsave(&z->lock, flags);
+
+		pfn = ALIGN(z->zone_start_pfn, nr_pages);
+		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
+			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
+				/*
+				 * We release the zone lock here because
+				 * alloc_contig_range() will also lock the zone
+				 * at some point. If there's an allocation
+				 * spinning on this lock, it may win the race
+				 * and cause alloc_contig_range() to fail...
+				 */
+				spin_unlock_irqrestore(&z->lock, flags);
+				ret = __alloc_gigantic_page(pfn, nr_pages);
+				if (!ret)
+					return pfn_to_page(pfn);
+				spin_lock_irqsave(&z->lock, flags);
+			}
+			pfn += nr_pages;
+		}
+
+		spin_unlock_irqrestore(&z->lock, flags);
+	}
+
+	return NULL;
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
+static void prep_compound_gigantic_page(struct page *page, unsigned long order);
+
+static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	page = alloc_gigantic_page(nid, huge_page_order(h));
+	if (page) {
+		prep_compound_gigantic_page(page, huge_page_order(h));
+		prep_new_huge_page(h, page, nid);
+	}
+
+	return page;
+}
+
+static int alloc_fresh_gigantic_page(struct hstate *h,
+				nodemask_t *nodes_allowed)
+{
+	struct page *page = NULL;
+	int nr_nodes, node;
+
+	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+		page = alloc_fresh_gigantic_page_node(h, node);
+		if (page)
+			return 1;
+	}
+
+	return 0;
+}
+
+static inline bool gigantic_page_supported(void) { return true; }
+#else
+static inline bool gigantic_page_supported(void) { return false; }
+static inline void free_gigantic_page(struct page *page, unsigned order) { }
+static inline void destroy_compound_gigantic_page(struct page *page,
+						unsigned long order) { }
+static inline int alloc_fresh_gigantic_page(struct hstate *h,
+					nodemask_t *nodes_allowed) { return 0; }
+#endif
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
 
-	VM_BUG_ON(hstate_is_gigantic(h));
+	if (hstate_is_gigantic(h) && !gigantic_page_supported())
+		return;
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
@@ -697,8 +836,13 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
 	set_compound_page_dtor(page, NULL);
 	set_page_refcounted(page);
-	arch_release_hugepage(page);
-	__free_pages(page, huge_page_order(h));
+	if (hstate_is_gigantic(h)) {
+		destroy_compound_gigantic_page(page, huge_page_order(h));
+		free_gigantic_page(page, huge_page_order(h));
+	} else {
+		arch_release_hugepage(page);
+		__free_pages(page, huge_page_order(h));
+	}
 }
 
 struct hstate *size_to_hstate(unsigned long size)
@@ -737,7 +881,7 @@ static void free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
-	if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
+	if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		list_del(&page->lru);
 		update_and_free_page(h, page);
@@ -841,9 +985,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
-	if (hstate_is_gigantic(h))
-		return NULL;
-
 	page = alloc_pages_exact_node(nid,
 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
 						__GFP_REPEAT|__GFP_NOWARN,
@@ -1477,7 +1618,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 {
 	unsigned long min_count, ret;
 
-	if (hstate_is_gigantic(h))
+	if (hstate_is_gigantic(h) && !gigantic_page_supported())
 		return h->max_huge_pages;
 
 	/*
@@ -1504,7 +1645,11 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 		 * and reducing the surplus.
 		 */
 		spin_unlock(&hugetlb_lock);
-		ret = alloc_fresh_huge_page(h, nodes_allowed);
+		if (hstate_is_gigantic(h)) {
+			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
+		} else {
+			ret = alloc_fresh_huge_page(h, nodes_allowed);
+		}
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -1604,7 +1749,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
 		goto out;
 
 	h = kobj_to_hstate(kobj, &nid);
-	if (hstate_is_gigantic(h)) {
+	if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
 		err = -EINVAL;
 		goto out;
 	}
@@ -2112,7 +2257,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 
 	tmp = h->max_huge_pages;
 
-	if (write && hstate_is_gigantic(h))
+	if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
 		return -EINVAL;
 
 	table->data = &tmp;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support for gigantic page allocation at runtime
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-10 21:44   ` Davidlohr Bueso
  -1 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2014-04-10 21:44 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, akpm, andi,
	rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi, kirill

On Thu, 2014-04-10 at 13:58 -0400, Luiz Capitulino wrote:
> [Full introduction right after the changelog]
> 
> Changelog
> ---------
> 
> v3
> 
> - Dropped unnecessary WARN_ON() call [Kirill]
> - Always check if the pfn range lies within a zone [Yasuaki]
> - Renamed some function arguments for consistency
> 
> v2
> 
> - Rewrote allocation loop to avoid scanning unless PFNs [Yasuaki]
> - Dropped incomplete multi-arch support [Naoya]
> - Added patch to drop __init from prep_compound_gigantic_page()
> - Restricted the feature to x86_64 (more details in patch 5/5)
> - Added review-bys plus minor changelog changes
> 
> Introduction
> ------------
> 
> The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> runtime. This means that hugepages allocation during runtime is limited to
> MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> greater than MAX_ORDER), this in turn means that those pages can't be
> allocated at runtime.
> 
> HugeTLB supports gigantic page allocation during boottime, via the boot
> allocator. To this end the kernel provides the command-line options
> hugepagesz= and hugepages=, which can be used to instruct the kernel to
> allocate N gigantic pages during boot.
> 
> For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
> be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
> this has to be done at boot via the hugepagesz= and hugepages= command-line
> options.
> 
> Now, gigantic page allocation at boottime has two serious problems:
> 
>  1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
>     evenly distributes boottime allocated hugepages among nodes.
> 
>     For example, suppose you have a four-node NUMA machine and want
>     to allocate four 1G gigantic pages at boottime. The kernel will
>     allocate one gigantic page per node.
> 
>     On the other hand, we do have users who want to be able to specify
>     which NUMA node gigantic pages should allocated from. So that they
>     can place virtual machines on a specific NUMA node.
> 
>  2. Gigantic pages allocated at boottime can't be freed
> 
> At this point it's important to observe that regular hugepages allocated
> at runtime don't have those problems. This is so because HugeTLB interface
> for runtime allocation in sysfs supports NUMA and runtime allocated pages
> can be freed just fine via the buddy allocator.
> 
> This series adds support for allocating gigantic pages at runtime. It does
> so by allocating gigantic pages via CMA instead of the buddy allocator.
> Releasing gigantic pages is also supported via CMA. As this series builds
> on top of the existing HugeTLB interface, it makes gigantic page allocation
> and releasing just like regular sized hugepages. This also means that NUMA
> support just works.
> 
> For example, to allocate two 1G gigantic pages on node 1, one can do:
> 
>  # echo 2 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And, to release all gigantic pages on the same node:
> 
>  # echo 0 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> Please, refer to patch 5/5 for full technical details.
> 
> Finally, please note that this series is a follow up for a previous series
> that tried to extend the command-line options set to be NUMA aware:
> 
>  http://marc.info/?l=linux-mm&m=139593335312191&w=2
> 
> During the discussion of that series it was agreed that having runtime
> allocation support for gigantic pages was a better solution.
> 
> Luiz Capitulino (5):
>   hugetlb: prep_compound_gigantic_page(): drop __init marker
>   hugetlb: add hstate_is_gigantic()
>   hugetlb: update_and_free_page(): don't clear PG_reserved bit
>   hugetlb: move helpers up in the file
>   hugetlb: add support for gigantic page allocation at runtime

Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-10 17:58 ` Luiz Capitulino
                   ` (6 preceding siblings ...)
  (?)
@ 2014-04-11 12:08 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-04-11 12:08 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, akpm, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi

On Thu, Apr 10, 2014 at 01:58:40PM -0400, Luiz Capitulino wrote:
> [Full introduction right after the changelog]
> 
> Changelog
> ---------
> 
> v3
> 
> - Dropped unnecessary WARN_ON() call [Kirill]
> - Always check if the pfn range lies within a zone [Yasuaki]
> - Renamed some function arguments for consistency
> 
> v2
> 
> - Rewrote allocation loop to avoid scanning unless PFNs [Yasuaki]
> - Dropped incomplete multi-arch support [Naoya]
> - Added patch to drop __init from prep_compound_gigantic_page()
> - Restricted the feature to x86_64 (more details in patch 5/5)
> - Added review-bys plus minor changelog changes
> 
> Introduction
> ------------
> 
> The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> runtime. This means that hugepages allocation during runtime is limited to
> MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> greater than MAX_ORDER), this in turn means that those pages can't be
> allocated at runtime.
> 
> HugeTLB supports gigantic page allocation during boottime, via the boot
> allocator. To this end the kernel provides the command-line options
> hugepagesz= and hugepages=, which can be used to instruct the kernel to
> allocate N gigantic pages during boot.
> 
> For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
> be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
> this has to be done at boot via the hugepagesz= and hugepages= command-line
> options.
> 
> Now, gigantic page allocation at boottime has two serious problems:
> 
>  1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
>     evenly distributes boottime allocated hugepages among nodes.
> 
>     For example, suppose you have a four-node NUMA machine and want
>     to allocate four 1G gigantic pages at boottime. The kernel will
>     allocate one gigantic page per node.
> 
>     On the other hand, we do have users who want to be able to specify
>     which NUMA node gigantic pages should allocated from. So that they
>     can place virtual machines on a specific NUMA node.
> 
>  2. Gigantic pages allocated at boottime can't be freed
> 
> At this point it's important to observe that regular hugepages allocated
> at runtime don't have those problems. This is so because HugeTLB interface
> for runtime allocation in sysfs supports NUMA and runtime allocated pages
> can be freed just fine via the buddy allocator.
> 
> This series adds support for allocating gigantic pages at runtime. It does
> so by allocating gigantic pages via CMA instead of the buddy allocator.
> Releasing gigantic pages is also supported via CMA. As this series builds
> on top of the existing HugeTLB interface, it makes gigantic page allocation
> and releasing just like regular sized hugepages. This also means that NUMA
> support just works.
> 
> For example, to allocate two 1G gigantic pages on node 1, one can do:
> 
>  # echo 2 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And, to release all gigantic pages on the same node:
> 
>  # echo 0 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> Please, refer to patch 5/5 for full technical details.
> 
> Finally, please note that this series is a follow up for a previous series
> that tried to extend the command-line options set to be NUMA aware:
> 
>  http://marc.info/?l=linux-mm&m=139593335312191&w=2
> 
> During the discussion of that series it was agreed that having runtime
> allocation support for gigantic pages was a better solution.
> 
> Luiz Capitulino (5):
>   hugetlb: prep_compound_gigantic_page(): drop __init marker
>   hugetlb: add hstate_is_gigantic()
>   hugetlb: update_and_free_page(): don't clear PG_reserved bit
>   hugetlb: move helpers up in the file
>   hugetlb: add support for gigantic page allocation at runtime

Thanks for doing this. It was on my todo list for some time.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
  2014-04-10 17:58   ` Luiz Capitulino
@ 2014-04-13 23:31     ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 37+ messages in thread
From: Yasuaki Ishimatsu @ 2014-04-13 23:31 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, akpm, andi,
	davidlohr, rientjes, yinghai, riel, n-horiguchi, kirill

(2014/04/11 2:58), Luiz Capitulino wrote:
> HugeTLB is limited to allocating hugepages whose size are less than
> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> the buddy allocator. Gigantic pages (that is, pages whose size is
> greater than MAX_ORDER order) have to be allocated at boottime.
> 
> However, boottime allocation has at least two serious problems. First,
> it doesn't support NUMA and second, gigantic pages allocated at
> boottime can't be freed.
> 
> This commit solves both issues by adding support for allocating gigantic
> pages during runtime. It works just like regular sized hugepages,
> meaning that the interface in sysfs is the same, it supports NUMA,
> and gigantic pages can be freed.
> 
> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> gigantic pages on node 1, one can do:
> 
>   # echo 2 > \
>     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And to free them all:
> 
>   # echo 0 > \
>     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> The one problem with gigantic page allocation at runtime is that it
> can't be serviced by the buddy allocator. To overcome that problem, this
> commit scans all zones from a node looking for a large enough contiguous
> region. When one is found, it's allocated by using CMA, that is, we call
> alloc_contig_range() to do the actual allocation. For example, on x86_64
> we scan all zones looking for a 1GB contiguous region. When one is found,
> it's allocated by alloc_contig_range().
> 
> One expected issue with that approach is that such gigantic contiguous
> regions tend to vanish as runtime goes by. The best way to avoid this for
> now is to make gigantic page allocations very early during system boot, say
> from a init script. Other possible optimization include using compaction,
> which is supported by CMA but is not explicitly used by this commit.
> 
> It's also important to note the following:
> 
>   1. Gigantic pages allocated at boottime by the hugepages= command-line
>      option can be freed at runtime just fine
> 
>   2. This commit adds support for gigantic pages only to x86_64. The
>      reason is that I don't have access to nor experience with other archs.
>      The code is arch indepedent though, so it should be simple to add
>      support to different archs
> 
>   3. I didn't add support for hugepage overcommit, that is allocating
>      a gigantic page on demand when
>     /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
>     think it's reasonable to do the hard and long work required for
>     allocating a gigantic page at fault time. But it should be simple
>     to add this if wanted
> 
> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> ---

Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Thanks,
Yasuaki Ishimatsu

>   mm/hugetlb.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 156 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6f1ca74..161dc39 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -680,11 +680,150 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>   		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>   		nr_nodes--)
>   
> +#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
> +static void destroy_compound_gigantic_page(struct page *page,
> +					unsigned long order)
> +{
> +	int i;
> +	int nr_pages = 1 << order;
> +	struct page *p = page + 1;
> +
> +	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +		__ClearPageTail(p);
> +		set_page_refcounted(p);
> +		p->first_page = NULL;
> +	}
> +
> +	set_compound_order(page, 0);
> +	__ClearPageHead(page);
> +}
> +
> +static void free_gigantic_page(struct page *page, unsigned order)
> +{
> +	free_contig_range(page_to_pfn(page), 1 << order);
> +}
> +
> +static int __alloc_gigantic_page(unsigned long start_pfn,
> +				unsigned long nr_pages)
> +{
> +	unsigned long end_pfn = start_pfn + nr_pages;
> +	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> +}
> +
> +static bool pfn_range_valid_gigantic(unsigned long start_pfn,
> +				unsigned long nr_pages)
> +{
> +	unsigned long i, end_pfn = start_pfn + nr_pages;
> +	struct page *page;
> +
> +	for (i = start_pfn; i < end_pfn; i++) {
> +		if (!pfn_valid(i))
> +			return false;
> +
> +		page = pfn_to_page(i);
> +
> +		if (PageReserved(page))
> +			return false;
> +
> +		if (page_count(page) > 0)
> +			return false;
> +
> +		if (PageHuge(page))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static bool zone_spans_last_pfn(const struct zone *zone,
> +			unsigned long start_pfn, unsigned long nr_pages)
> +{
> +	unsigned long last_pfn = start_pfn + nr_pages - 1;
> +	return zone_spans_pfn(zone, last_pfn);
> +}
> +
> +static struct page *alloc_gigantic_page(int nid, unsigned order)
> +{
> +	unsigned long nr_pages = 1 << order;
> +	unsigned long ret, pfn, flags;
> +	struct zone *z;
> +
> +	z = NODE_DATA(nid)->node_zones;
> +	for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
> +		spin_lock_irqsave(&z->lock, flags);
> +
> +		pfn = ALIGN(z->zone_start_pfn, nr_pages);
> +		while (zone_spans_last_pfn(z, pfn, nr_pages)) {
> +			if (pfn_range_valid_gigantic(pfn, nr_pages)) {
> +				/*
> +				 * We release the zone lock here because
> +				 * alloc_contig_range() will also lock the zone
> +				 * at some point. If there's an allocation
> +				 * spinning on this lock, it may win the race
> +				 * and cause alloc_contig_range() to fail...
> +				 */
> +				spin_unlock_irqrestore(&z->lock, flags);
> +				ret = __alloc_gigantic_page(pfn, nr_pages);
> +				if (!ret)
> +					return pfn_to_page(pfn);
> +				spin_lock_irqsave(&z->lock, flags);
> +			}
> +			pfn += nr_pages;
> +		}
> +
> +		spin_unlock_irqrestore(&z->lock, flags);
> +	}
> +
> +	return NULL;
> +}
> +
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
> +static void prep_compound_gigantic_page(struct page *page, unsigned long order);
> +
> +static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
> +{
> +	struct page *page;
> +
> +	page = alloc_gigantic_page(nid, huge_page_order(h));
> +	if (page) {
> +		prep_compound_gigantic_page(page, huge_page_order(h));
> +		prep_new_huge_page(h, page, nid);
> +	}
> +
> +	return page;
> +}
> +
> +static int alloc_fresh_gigantic_page(struct hstate *h,
> +				nodemask_t *nodes_allowed)
> +{
> +	struct page *page = NULL;
> +	int nr_nodes, node;
> +
> +	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> +		page = alloc_fresh_gigantic_page_node(h, node);
> +		if (page)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline bool gigantic_page_supported(void) { return true; }
> +#else
> +static inline bool gigantic_page_supported(void) { return false; }
> +static inline void free_gigantic_page(struct page *page, unsigned order) { }
> +static inline void destroy_compound_gigantic_page(struct page *page,
> +						unsigned long order) { }
> +static inline int alloc_fresh_gigantic_page(struct hstate *h,
> +					nodemask_t *nodes_allowed) { return 0; }
> +#endif
> +
>   static void update_and_free_page(struct hstate *h, struct page *page)
>   {
>   	int i;
>   
> -	VM_BUG_ON(hstate_is_gigantic(h));
> +	if (hstate_is_gigantic(h) && !gigantic_page_supported())
> +		return;
>   
>   	h->nr_huge_pages--;
>   	h->nr_huge_pages_node[page_to_nid(page)]--;
> @@ -697,8 +836,13 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>   	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
>   	set_compound_page_dtor(page, NULL);
>   	set_page_refcounted(page);
> -	arch_release_hugepage(page);
> -	__free_pages(page, huge_page_order(h));
> +	if (hstate_is_gigantic(h)) {
> +		destroy_compound_gigantic_page(page, huge_page_order(h));
> +		free_gigantic_page(page, huge_page_order(h));
> +	} else {
> +		arch_release_hugepage(page);
> +		__free_pages(page, huge_page_order(h));
> +	}
>   }
>   
>   struct hstate *size_to_hstate(unsigned long size)
> @@ -737,7 +881,7 @@ static void free_huge_page(struct page *page)
>   	if (restore_reserve)
>   		h->resv_huge_pages++;
>   
> -	if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
> +	if (h->surplus_huge_pages_node[nid]) {
>   		/* remove the page from active list */
>   		list_del(&page->lru);
>   		update_and_free_page(h, page);
> @@ -841,9 +985,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
>   {
>   	struct page *page;
>   
> -	if (hstate_is_gigantic(h))
> -		return NULL;
> -
>   	page = alloc_pages_exact_node(nid,
>   		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
>   						__GFP_REPEAT|__GFP_NOWARN,
> @@ -1477,7 +1618,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>   {
>   	unsigned long min_count, ret;
>   
> -	if (hstate_is_gigantic(h))
> +	if (hstate_is_gigantic(h) && !gigantic_page_supported())
>   		return h->max_huge_pages;
>   
>   	/*
> @@ -1504,7 +1645,11 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>   		 * and reducing the surplus.
>   		 */
>   		spin_unlock(&hugetlb_lock);
> -		ret = alloc_fresh_huge_page(h, nodes_allowed);
> +		if (hstate_is_gigantic(h)) {
> +			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> +		} else {
> +			ret = alloc_fresh_huge_page(h, nodes_allowed);
> +		}
>   		spin_lock(&hugetlb_lock);
>   		if (!ret)
>   			goto out;
> @@ -1604,7 +1749,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
>   		goto out;
>   
>   	h = kobj_to_hstate(kobj, &nid);
> -	if (hstate_is_gigantic(h)) {
> +	if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
>   		err = -EINVAL;
>   		goto out;
>   	}
> @@ -2112,7 +2257,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>   
>   	tmp = h->max_huge_pages;
>   
> -	if (write && hstate_is_gigantic(h))
> +	if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
>   		return -EINVAL;
>   
>   	table->data = &tmp;
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-14  7:31   ` Zhang Yanfei
  -1 siblings, 0 replies; 37+ messages in thread
From: Zhang Yanfei @ 2014-04-14  7:31 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, akpm, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

Clear explanation and implementation!

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

On 04/11/2014 01:58 AM, Luiz Capitulino wrote:
> [Full introduction right after the changelog]
> 
> Changelog
> ---------
> 
> v3
> 
> - Dropped unnecessary WARN_ON() call [Kirill]
> - Always check if the pfn range lies within a zone [Yasuaki]
> - Renamed some function arguments for consistency
> 
> v2
> 
> - Rewrote allocation loop to avoid scanning unless PFNs [Yasuaki]
> - Dropped incomplete multi-arch support [Naoya]
> - Added patch to drop __init from prep_compound_gigantic_page()
> - Restricted the feature to x86_64 (more details in patch 5/5)
> - Added review-bys plus minor changelog changes
> 
> Introduction
> ------------
> 
> The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> runtime. This means that hugepages allocation during runtime is limited to
> MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> greater than MAX_ORDER), this in turn means that those pages can't be
> allocated at runtime.
> 
> HugeTLB supports gigantic page allocation during boottime, via the boot
> allocator. To this end the kernel provides the command-line options
> hugepagesz= and hugepages=, which can be used to instruct the kernel to
> allocate N gigantic pages during boot.
> 
> For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
> be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
> this has to be done at boot via the hugepagesz= and hugepages= command-line
> options.
> 
> Now, gigantic page allocation at boottime has two serious problems:
> 
>  1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
>     evenly distributes boottime allocated hugepages among nodes.
> 
>     For example, suppose you have a four-node NUMA machine and want
>     to allocate four 1G gigantic pages at boottime. The kernel will
>     allocate one gigantic page per node.
> 
>     On the other hand, we do have users who want to be able to specify
>     which NUMA node gigantic pages should allocated from. So that they
>     can place virtual machines on a specific NUMA node.
> 
>  2. Gigantic pages allocated at boottime can't be freed
> 
> At this point it's important to observe that regular hugepages allocated
> at runtime don't have those problems. This is so because HugeTLB interface
> for runtime allocation in sysfs supports NUMA and runtime allocated pages
> can be freed just fine via the buddy allocator.
> 
> This series adds support for allocating gigantic pages at runtime. It does
> so by allocating gigantic pages via CMA instead of the buddy allocator.
> Releasing gigantic pages is also supported via CMA. As this series builds
> on top of the existing HugeTLB interface, it makes gigantic page allocation
> and releasing just like regular sized hugepages. This also means that NUMA
> support just works.
> 
> For example, to allocate two 1G gigantic pages on node 1, one can do:
> 
>  # echo 2 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And, to release all gigantic pages on the same node:
> 
>  # echo 0 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> Please, refer to patch 5/5 for full technical details.
> 
> Finally, please note that this series is a follow up for a previous series
> that tried to extend the command-line options set to be NUMA aware:
> 
>  http://marc.info/?l=linux-mm&m=139593335312191&w=2
> 
> During the discussion of that series it was agreed that having runtime
> allocation support for gigantic pages was a better solution.
> 
> Luiz Capitulino (5):
>   hugetlb: prep_compound_gigantic_page(): drop __init marker
>   hugetlb: add hstate_is_gigantic()
>   hugetlb: update_and_free_page(): don't clear PG_reserved bit
>   hugetlb: move helpers up in the file
>   hugetlb: add support for gigantic page allocation at runtime
> 
>  include/linux/hugetlb.h |   5 +
>  mm/hugetlb.c            | 336 ++++++++++++++++++++++++++++++++++--------------
>  2 files changed, 245 insertions(+), 96 deletions(-)
> 


-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-17 15:13   ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-17 15:13 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 10 Apr 2014 13:58:40 -0400
Luiz Capitulino <lcapitulino@redhat.com> wrote:

> [Full introduction right after the changelog]
> 
> Changelog
> ---------
> 
> v3
> 
> - Dropped unnecessary WARN_ON() call [Kirill]
> - Always check if the pfn range lies within a zone [Yasuaki]
> - Renamed some function arguments for consistency

Andrew, this series got four ACKs but it seems that you haven't picked it
up yet. Is there anything that still needs to be addressed?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-17 15:13   ` Luiz Capitulino
@ 2014-04-17 18:52     ` Andrew Morton
  -1 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2014-04-17 18:52 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 17 Apr 2014 11:13:05 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> On Thu, 10 Apr 2014 13:58:40 -0400
> Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > [Full introduction right after the changelog]
> > 
> > Changelog
> > ---------
> > 
> > v3
> > 
> > - Dropped unnecessary WARN_ON() call [Kirill]
> > - Always check if the pfn range lies within a zone [Yasuaki]
> > - Renamed some function arguments for consistency
> 
> Andrew, this series got four ACKs but it seems that you haven't picked
> it yet. Is there anything missing to be addressed?

I don't look at new features until after -rc1.  Then it takes a week or
more to work through the backlog.  We'll get there.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-17 18:52     ` Andrew Morton
@ 2014-04-17 19:09       ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-17 19:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 17 Apr 2014 11:52:42 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 17 Apr 2014 11:13:05 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > On Thu, 10 Apr 2014 13:58:40 -0400
> > Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > 
> > > [Full introduction right after the changelog]
> > > 
> > > Changelog
> > > ---------
> > > 
> > > v3
> > > 
> > > - Dropped unnecessary WARN_ON() call [Kirill]
> > > - Always check if the pfn range lies within a zone [Yasuaki]
> > > - Renamed some function arguments for consistency
> > 
> > Andrew, this series got four ACKs but it seems that you haven't picked
> > it yet. Is there anything missing to be addressed?
> 
> I don't look at new features until after -rc1.  Then it takes a week or
> more to work through the backlog.  We'll get there.

I see, just wanted to make sure it was on your radar. Thanks a lot.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
  2014-04-10 17:58   ` Luiz Capitulino
@ 2014-04-17 23:00     ` Andrew Morton
  -1 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2014-04-17 23:00 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 10 Apr 2014 13:58:45 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> HugeTLB is limited to allocating hugepages whose size are less than
> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> the buddy allocator. Gigantic pages (that is, pages whose size is
> greater than MAX_ORDER order) have to be allocated at boottime.
> 
> However, boottime allocation has at least two serious problems. First,
> it doesn't support NUMA and second, gigantic pages allocated at
> boottime can't be freed.
> 
> This commit solves both issues by adding support for allocating gigantic
> pages during runtime. It works just like regular sized hugepages,
> meaning that the interface in sysfs is the same, it supports NUMA,
> and gigantic pages can be freed.
> 
> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> gigantic pages on node 1, one can do:
> 
>  # echo 2 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And to free them all:
> 
>  # echo 0 > \
>    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> The one problem with gigantic page allocation at runtime is that it
> can't be serviced by the buddy allocator. To overcome that problem, this
> commit scans all zones from a node looking for a large enough contiguous
> region. When one is found, it's allocated by using CMA, that is, we call
> alloc_contig_range() to do the actual allocation. For example, on x86_64
> we scan all zones looking for a 1GB contiguous region. When one is found,
> it's allocated by alloc_contig_range().
> 
> One expected issue with that approach is that such gigantic contiguous
> regions tend to vanish as runtime goes by. The best way to avoid this for
> now is to make gigantic page allocations very early during system boot, say
> from a init script. Other possible optimization include using compaction,
> which is supported by CMA but is not explicitly used by this commit.

Why aren't we using compaction?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-10 17:58 ` Luiz Capitulino
@ 2014-04-17 23:01   ` Andrew Morton
  -1 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2014-04-17 23:01 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 10 Apr 2014 13:58:40 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> runtime. This means that hugepages allocation during runtime is limited to
> MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> greater than MAX_ORDER), this in turn means that those pages can't be
> allocated at runtime.

Dumb question: what's wrong with just increasing MAX_ORDER?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime
  2014-04-17 23:00     ` Andrew Morton
@ 2014-04-22 21:19       ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-22 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 17 Apr 2014 16:00:39 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 10 Apr 2014 13:58:45 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > HugeTLB is limited to allocating hugepages whose size is less than
> > MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> > the buddy allocator. Gigantic pages (that is, pages whose size is
> > greater than MAX_ORDER order) have to be allocated at boottime.
> > 
> > However, boottime allocation has at least two serious problems. First,
> > it doesn't support NUMA and second, gigantic pages allocated at
> > boottime can't be freed.
> > 
> > This commit solves both issues by adding support for allocating gigantic
> > pages during runtime. It works just like regular sized hugepages,
> > meaning that the interface in sysfs is the same, it supports NUMA,
> > and gigantic pages can be freed.
> > 
> > For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> > gigantic pages on node 1, one can do:
> > 
> >  # echo 2 > \
> >    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> > 
> > And to free them all:
> > 
> >  # echo 0 > \
> >    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> > 
> > The one problem with gigantic page allocation at runtime is that it
> > can't be serviced by the buddy allocator. To overcome that problem, this
> > commit scans all zones from a node looking for a large enough contiguous
> > region. When one is found, it's allocated by using CMA, that is, we call
> > alloc_contig_range() to do the actual allocation. For example, on x86_64
> > we scan all zones looking for a 1GB contiguous region. When one is found,
> > it's allocated by alloc_contig_range().
> > 
> > One expected issue with that approach is that such gigantic contiguous
> > regions tend to vanish as runtime goes by. The best way to avoid this for
> > now is to make gigantic page allocations very early during system boot, say,
> > from an init script. Other possible optimizations include using compaction,
> > which is supported by CMA but is not explicitly used by this commit.
> 
> Why aren't we using compaction?

The main reason is that I'm not sure of the best way to use it in the
context of a 1GB allocation. I mean, the most obvious way (which seems to
be what the DMA subsystem does) is trial and error, roughly sketched below:
just pass a gigantic PFN range to alloc_contig_range() and, if it fails, move
on to the next range (or retry in certain cases). This might work, but to be
honest I'm not sure what the implications of doing that are for a 1GB range,
especially because compaction (as implemented by CMA) is synchronous.

Since I see compaction as an optimization, I've opted to submit the
simplest implementation that works. I've tested this series on two NUMA
machines and it worked just fine. Future improvements can be done on top.

Also note that this is about HugeTLB making use of compaction automatically.
There's nothing in this series that prevents the user from manually compacting
memory by writing to /sys/devices/system/node/nodeN/compact. As HugeTLB
page reservation is a manual procedure anyway, I don't think that manually
starting compaction is that bad.
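
For the record, here is roughly what the scan described above boils down to.
This is only an illustrative sketch, not the code from patch 5/5: the helper
pfn_range_valid() is hypothetical and the real code differs in its details.

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/gfp.h>

/*
 * Sketch: walk a node's zones in gigantic-page-aligned steps and hand each
 * candidate PFN range to CMA's alloc_contig_range(). pfn_range_valid() is a
 * hypothetical placeholder for the per-range sanity checks real code needs.
 */
static unsigned long alloc_gigantic_range(int nid, unsigned long nr_pages)
{
        struct zone *zone;
        unsigned long pfn;

        for_each_zone(zone) {
                if (zone_to_nid(zone) != nid)
                        continue;

                pfn = ALIGN(zone->zone_start_pfn, nr_pages);
                for (; pfn + nr_pages <= zone_end_pfn(zone); pfn += nr_pages) {
                        if (!pfn_range_valid(zone, pfn, pfn + nr_pages))
                                continue;

                        /* CMA isolates and migrates whatever is movable here. */
                        if (!alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE))
                                return pfn;     /* success */
                }
        }

        return 0;       /* no suitable range on this node */
}

The trial-and-error flavor I mentioned is essentially this same loop; the open
question is whether to retry a range (after compaction has run) before moving
on, and what that costs for a 1GB range.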

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-17 23:01   ` Andrew Morton
@ 2014-04-22 21:37     ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-22 21:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Thu, 17 Apr 2014 16:01:10 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 10 Apr 2014 13:58:40 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> > runtime. This means that hugepages allocation during runtime is limited to
> > MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> > greater than MAX_ORDER), this in turn means that those pages can't be
> > allocated at runtime.
> 
> Dumb question: what's wrong with just increasing MAX_ORDER?

To be honest I'm not a buddy allocator expert, and I'm not familiar with
what is involved in increasing MAX_ORDER. What I do know, though, is that it's
not just a matter of bumping a macro's value. For example, for sparsemem
support we have this check (include/linux/mmzone.h:1084):

#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif

I _guess_ that's because we can't allocate more pages than fit within a
single section on sparsemem. Can sparsemem and the other affected code be
changed to accommodate a bigger MAX_ORDER? I don't know. Is it worth
increasing MAX_ORDER and doing all the required changes, given that a bigger
MAX_ORDER is only useful for HugeTLB and the archs supporting gigantic pages?
I'd guess not.
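
Just to put rough numbers on it (an illustration only, assuming the usual
x86_64 defaults: PAGE_SHIFT=12, MAX_ORDER=11 and 27-bit / 128MB sparsemem
sections):

#include <stdio.h>

int main(void)
{
        const int page_shift = 12;          /* 4KB base pages */
        const int max_order = 11;           /* buddy's default MAX_ORDER */
        const int section_size_bits = 27;   /* 128MB sparsemem sections on x86_64 */

        /* Largest block the buddy allocator can hand out: order MAX_ORDER-1. */
        unsigned long max_buddy = 1UL << (max_order - 1 + page_shift);

        /* A 1GB gigantic page is an order-(30 - PAGE_SHIFT) allocation. */
        int gb_order = 30 - page_shift;

        printf("largest buddy allocation: %lu MB\n", max_buddy >> 20);
        printf("sparsemem section size:   %lu MB\n",
               (1UL << section_size_bits) >> 20);
        printf("order of a 1GB page:      %d (buddy stops at order %d)\n",
               gb_order, max_order - 1);
        return 0;
}

So a 1GB page is an order-18 allocation while the buddy allocator currently
tops out at order 10, and raising MAX_ORDER far enough to cover order 18
would trip the check above with today's 128MB sections (18 + 12 > 27).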

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-22 21:37     ` Luiz Capitulino
@ 2014-04-22 21:55       ` Andrew Morton
  -1 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2014-04-22 21:55 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, yinghai, riel, n-horiguchi,
	kirill

On Tue, 22 Apr 2014 17:37:26 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> On Thu, 17 Apr 2014 16:01:10 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Thu, 10 Apr 2014 13:58:40 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > 
> > > The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> > > runtime. This means that hugepages allocation during runtime is limited to
> > > MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> > > greater than MAX_ORDER), this in turn means that those pages can't be
> > > allocated at runtime.
> > 
> > Dumb question: what's wrong with just increasing MAX_ORDER?
> 
> To be honest I'm not a buddy allocator expert and I'm not familiar with
> what is involved in increasing MAX_ORDER. What I do know though is that it's
> not just a matter of increasing a macro's value. For example, for sparsemem
> support we have this check (include/linux/mmzone.h:1084):
> 
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif
> 
> I _guess_ it's because we can't allocate more pages than what's within a
> section on sparsemem. Can sparsemem and the other stuff be changed to
> accommodate a bigger MAX_ORDER? I don't know. Is it worth it to increase
> MAX_ORDER and do all the required changes, given that a bigger MAX_ORDER is
> only useful for HugeTLB and the archs supporting gigantic pages? I'd guess not.

AFAICT we'd need to increase SECTION_SIZE_BITS to 29 or more to
accommodate 1G MAX_ORDER.  I assume this means that some machines with
sparse physical memory layout may not be able to use all (or as much)
of the physical memory.  Perhaps Yinghai can advise?

I do think we should fully explore this option before giving up and
adding new special-case code. 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime
  2014-04-22 21:55       ` Andrew Morton
@ 2014-04-25 20:18         ` Luiz Capitulino
  -1 siblings, 0 replies; 37+ messages in thread
From: Luiz Capitulino @ 2014-04-25 20:18 UTC (permalink / raw)
  To: yinghai
  Cc: linux-mm, linux-kernel, mtosatti, aarcange, mgorman, andi,
	davidlohr, rientjes, isimatu.yasuaki, Andrew Morton, riel,
	n-horiguchi, kirill

On Tue, 22 Apr 2014 14:55:46 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 22 Apr 2014 17:37:26 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > On Thu, 17 Apr 2014 16:01:10 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Thu, 10 Apr 2014 13:58:40 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > > 
> > > > The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> > > > runtime. This means that hugepages allocation during runtime is limited to
> > > > MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> > > > greater than MAX_ORDER), this in turn means that those pages can't be
> > > > allocated at runtime.
> > > 
> > > Dumb question: what's wrong with just increasing MAX_ORDER?
> > 
> > To be honest I'm not a buddy allocator expert and I'm not familiar with
> > what is involved in increasing MAX_ORDER. What I do know though is that it's
> > not just a matter of increasing a macro's value. For example, for sparsemem
> > support we have this check (include/linux/mmzone.h:1084):
> > 
> > #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> > #error Allocator MAX_ORDER exceeds SECTION_SIZE
> > #endif
> > 
> > I _guess_ it's because we can't allocate more pages than what's within a
> > section on sparsemem. Can sparsemem and the other stuff be changed to
> > accommodate a bigger MAX_ORDER? I don't know. Is it worth it to increase
> > MAX_ORDER and do all the required changes, given that a bigger MAX_ORDER is
> > only useful for HugeTLB and the archs supporting gigantic pages? I'd guess not.
> 
> AFAICT we'd need to increase SECTION_SIZE_BITS to 29 or more to
> accommodate 1G MAX_ORDER.  I assume this means that some machines with
> sparse physical memory layout may not be able to use all (or as much)
> of the physical memory.  Perhaps Yinghai can advise?

Yinghai?

> I do think we should fully explore this option before giving up and
> adding new special-case code. 

I'll look into that, but it may take a bit.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread

Thread overview: 37+ messages
2014-04-10 17:58 [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime Luiz Capitulino
2014-04-10 17:58 ` [PATCH 1/5] hugetlb: prep_compound_gigantic_page(): drop __init marker Luiz Capitulino
2014-04-10 17:58 ` [PATCH 2/5] hugetlb: add hstate_is_gigantic() Luiz Capitulino
2014-04-10 17:58 ` [PATCH 3/5] hugetlb: update_and_free_page(): don't clear PG_reserved bit Luiz Capitulino
2014-04-10 17:58 ` [PATCH 4/5] hugetlb: move helpers up in the file Luiz Capitulino
2014-04-10 17:58 ` [PATCH 5/5] hugetlb: add support for gigantic page allocation at runtime Luiz Capitulino
2014-04-13 23:31   ` Yasuaki Ishimatsu
2014-04-17 23:00   ` Andrew Morton
2014-04-22 21:19     ` Luiz Capitulino
2014-04-10 21:44 ` [PATCH v3 0/5] hugetlb: add support " Davidlohr Bueso
2014-04-11 12:08 ` Kirill A. Shutemov
2014-04-14  7:31 ` Zhang Yanfei
2014-04-17 15:13 ` Luiz Capitulino
2014-04-17 18:52   ` Andrew Morton
2014-04-17 19:09     ` Luiz Capitulino
2014-04-17 23:01 ` Andrew Morton
2014-04-22 21:37   ` Luiz Capitulino
2014-04-22 21:55     ` Andrew Morton
2014-04-25 20:18       ` Luiz Capitulino
