* [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages.
@ 2018-07-10 18:49 Cannon Matthews
  2018-07-10 20:46 ` Mike Kravetz
  2018-07-11 12:47 ` Michal Hocko
  0 siblings, 2 replies; 6+ messages in thread
From: Cannon Matthews @ 2018-07-10 18:49 UTC (permalink / raw)
  To: Andrew Morton, Mike Kravetz, Nadia Yvette Chambers
  Cc: linux-mm, linux-kernel, andreslc, pfeiner, dmatlack, gthelen,
	Cannon Matthews

When using 1GiB pages during early boot, use the new
memblock_virt_alloc_try_nid_raw() function to allocate memory without
zeroing it.  Zeroing out hundreds or thousands of GiB in a single core
memset() call is very slow, and can make early boot last upwards of
20-30 minutes on multi TiB machines.

To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes,
since that area is used as temporary storage for the bootmem page info
until gather_bootmem_prealloc() processes it later.

The rest of the memory does not need to be zero'd as the hugetlb pages
are always zero'd on page fault.

Tested: Booted with ~3800 1G pages, and it booted successfully in
roughly the same amount of time as with 0 pages, as opposed to the 25+
minutes it would take before.

Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
---
 mm/hugetlb.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3612fbb32e9d..c93a2c77e881 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
 	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
 		void *addr;

-		addr = memblock_virt_alloc_try_nid_nopanic(
+		addr = memblock_virt_alloc_try_nid_raw(
 				huge_page_size(h), huge_page_size(h),
 				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
 		if (addr) {
@@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
 			 * Use the beginning of the huge page to store the
 			 * huge_bootmem_page struct (until gather_bootmem
 			 * puts them into the mem_map).
+			 *
+			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
+			 * memory so zero out just enough for this struct, the
+			 * rest will be zero'd on page fault.
 			 */
+			memset(addr, 0, sizeof(struct huge_bootmem_page));
 			m = addr;
 			goto found;
 		}
--
2.18.0.203.gfac676dfb9-goog


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages.
  2018-07-10 18:49 [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages Cannon Matthews
@ 2018-07-10 20:46 ` Mike Kravetz
  2018-07-11 12:49   ` Michal Hocko
  2018-07-11 12:47 ` Michal Hocko
  1 sibling, 1 reply; 6+ messages in thread
From: Mike Kravetz @ 2018-07-10 20:46 UTC (permalink / raw)
  To: Cannon Matthews, Andrew Morton, Nadia Yvette Chambers
  Cc: linux-mm, linux-kernel, andreslc, pfeiner, dmatlack, gthelen

On 07/10/2018 11:49 AM, Cannon Matthews wrote:
> When using 1GiB pages during early boot, use the new
> memblock_virt_alloc_try_nid_raw() function to allocate memory without
> zeroing it.  Zeroing out hundreds or thousands of GiB in a single core
> memset() call is very slow, and can make early boot last upwards of
> 20-30 minutes on multi TiB machines.
> 
> To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes,
> since that area is used as temporary storage for the bootmem page info
> until gather_bootmem_prealloc() processes it later.
> 
> The rest of the memory does not need to be zero'd as the hugetlb pages
> are always zero'd on page fault.
> 
> Tested: Booted with ~3800 1G pages, and it booted successfully in
> roughly the same amount of time as with 0 pages, as opposed to the 25+
> minutes it would take before.
> 

Nice improvement!

> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> ---
>  mm/hugetlb.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3612fbb32e9d..c93a2c77e881 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
>  		void *addr;
> 
> -		addr = memblock_virt_alloc_try_nid_nopanic(
> +		addr = memblock_virt_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
>  		if (addr) {
> @@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  			 * Use the beginning of the huge page to store the
>  			 * huge_bootmem_page struct (until gather_bootmem
>  			 * puts them into the mem_map).
> +			 *
> +			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
> +			 * memory so zero out just enough for this struct, the
> +			 * rest will be zero'd on page fault.
>  			 */
> +			memset(addr, 0, sizeof(struct huge_bootmem_page));

This forced me to look at the usage of huge_bootmem_page.  It is defined as:
struct huge_bootmem_page {
	struct list_head list;
	struct hstate *hstate;
#ifdef CONFIG_HIGHMEM
	phys_addr_t phys;
#endif
};

The list and hstate fields are set immediately after allocating the memory
block here and elsewhere.  However, I can't find any code that sets phys,
although it is potentially used in gather_bootmem_prealloc().  It appears
powerpc used this field at one time, but no longer does.
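
The potential use is the CONFIG_HIGHMEM branch of gather_bootmem_prealloc(),
which looks roughly like this (paraphrased from memory, not an exact quote of
the current source) and reads the never-initialized phys field:

	list_for_each_entry(m, &huge_boot_pages, list) {
		struct hstate *h = m->hstate;
		struct page *page;

#ifdef CONFIG_HIGHMEM
		/* phys is read here, but nothing ever assigns it */
		page = pfn_to_page(m->phys >> PAGE_SHIFT);
		memblock_free_late(__pa(m),
				   sizeof(struct huge_bootmem_page));
#else
		page = virt_to_page(m);
#endif
		/* ... then turn the range into a proper hugetlb page ... */
	}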

Am I missing something?

Not an issue with this patch, rather existing code.  I'd prefer not to do
the memset() "just to be safe".  Unless I am missing something, I would
like to remove the phys field and its supporting code first.  Then, take
this patch without the memset.

-- 
Mike Kravetz

>  			m = addr;
>  			goto found;
>  		}
> --
> 2.18.0.203.gfac676dfb9-goog
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages.
  2018-07-10 18:49 [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages Cannon Matthews
  2018-07-10 20:46 ` Mike Kravetz
@ 2018-07-11 12:47 ` Michal Hocko
  2018-07-11 12:48   ` Michal Hocko
  1 sibling, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2018-07-11 12:47 UTC (permalink / raw)
  To: Cannon Matthews
  Cc: Andrew Morton, Mike Kravetz, Nadia Yvette Chambers, linux-mm,
	linux-kernel, andreslc, pfeiner, dmatlack, gthelen

On Tue 10-07-18 11:49:03, Cannon Matthews wrote:
> When using 1GiB pages during early boot, use the new
> memblock_virt_alloc_try_nid_raw() function to allocate memory without
> zeroing it.  Zeroing out hundreds or thousands of GiB in a single core
> memset() call is very slow, and can make early boot last upwards of
> 20-30 minutes on multi TiB machines.
> 
> To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes,
> since that area is used as temporary storage for the bootmem page info
> until gather_bootmem_prealloc() processes it later.
> 
> The rest of the memory does not need to be zero'd as the hugetlb pages
> are always zero'd on page fault.
> 
> Tested: Booted with ~3800 1G pages, and it booted successfully in
> roughly the same amount of time as with 0 pages, as opposed to the 25+
> minutes it would take before.

The patch makes perfect sense to me. I wasn't even aware that this path
was zeroing the memblock allocation. Thanks for spotting this and fixing it.

> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>

I just do not think we need to zero the huge_bootmem_page portion of it.
It should be sufficient to INIT_LIST_HEAD before list_add. We do
initialize the rest explicitly already.
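
Something along these lines in __alloc_bootmem_huge_page() should do (just a
sketch; the surrounding lines are paraphrased rather than quoted exactly):

found:
	BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
	/* Put them into a private list first because mem_map is not up yet */
	INIT_LIST_HEAD(&m->list);
	list_add(&m->list, &huge_boot_pages);
	m->hstate = h;
	return 1;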

> ---
>  mm/hugetlb.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3612fbb32e9d..c93a2c77e881 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
>  		void *addr;
> 
> -		addr = memblock_virt_alloc_try_nid_nopanic(
> +		addr = memblock_virt_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
>  		if (addr) {
> @@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  			 * Use the beginning of the huge page to store the
>  			 * huge_bootmem_page struct (until gather_bootmem
>  			 * puts them into the mem_map).
> +			 *
> +			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
> +			 * memory so zero out just enough for this struct, the
> +			 * rest will be zero'd on page fault.
>  			 */
> +			memset(addr, 0, sizeof(struct huge_bootmem_page));
>  			m = addr;
>  			goto found;
>  		}
> --
> 2.18.0.203.gfac676dfb9-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages.
  2018-07-11 12:47 ` Michal Hocko
@ 2018-07-11 12:48   ` Michal Hocko
  2018-07-11 16:47     ` Mike Kravetz
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2018-07-11 12:48 UTC (permalink / raw)
  To: Cannon Matthews
  Cc: Andrew Morton, Mike Kravetz, Nadia Yvette Chambers, linux-mm,
	linux-kernel, andreslc, pfeiner, dmatlack, gthelen

On Wed 11-07-18 14:47:11, Michal Hocko wrote:
> On Tue 10-07-18 11:49:03, Cannon Matthews wrote:
> > When using 1GiB pages during early boot, use the new
> > memblock_virt_alloc_try_nid_raw() function to allocate memory without
> > zeroing it.  Zeroing out hundreds or thousands of GiB in a single core
> > memset() call is very slow, and can make early boot last upwards of
> > 20-30 minutes on multi TiB machines.
> > 
> > To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes,
> > since that area is used as temporary storage for the bootmem page info
> > until gather_bootmem_prealloc() processes it later.
> > 
> > The rest of the memory does not need to be zero'd as the hugetlb pages
> > are always zero'd on page fault.
> > 
> > Tested: Booted with ~3800 1G pages, and it booted successfully in
> > roughly the same amount of time as with 0 pages, as opposed to the 25+
> > minutes it would take before.
> 
> The patch makes perfect sense to me. I wasn't even aware that this path
> was zeroing the memblock allocation. Thanks for spotting this and fixing it.
> 
> > Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> 
> I just do not think we need to zero the huge_bootmem_page portion of it.
> It should be sufficient to INIT_LIST_HEAD before list_add. We do
> initialize the rest explicitly already.

Forgot to mention that after that is addressed you can add
Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages.
  2018-07-10 20:46 ` Mike Kravetz
@ 2018-07-11 12:49   ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2018-07-11 12:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Cannon Matthews, Andrew Morton, Nadia Yvette Chambers, linux-mm,
	linux-kernel, andreslc, pfeiner, dmatlack, gthelen

On Tue 10-07-18 13:46:57, Mike Kravetz wrote:
> On 07/10/2018 11:49 AM, Cannon Matthews wrote:
> > When using 1GiB pages during early boot, use the new
> > memblock_virt_alloc_try_nid_raw() function to allocate memory without
> > zeroing it.  Zeroing out hundreds or thousands of GiB in a single core
> > memset() call is very slow, and can make early boot last upwards of
> > 20-30 minutes on multi TiB machines.
> > 
> > To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes,
> > since that area is used as temporary storage for the bootmem page info
> > until gather_bootmem_prealloc() processes it later.
> > 
> > The rest of the memory does not need to be zero'd as the hugetlb pages
> > are always zero'd on page fault.
> > 
> > Tested: Booted with ~3800 1G pages, and it booted successfully in
> > roughly the same amount of time as with 0 pages, as opposed to the 25+
> > minutes it would take before.
> > 
> 
> Nice improvement!
> 
> > Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> > ---
> >  mm/hugetlb.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3612fbb32e9d..c93a2c77e881 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2101,7 +2101,7 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> >  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> >  		void *addr;
> > 
> > -		addr = memblock_virt_alloc_try_nid_nopanic(
> > +		addr = memblock_virt_alloc_try_nid_raw(
> >  				huge_page_size(h), huge_page_size(h),
> >  				0, BOOTMEM_ALLOC_ACCESSIBLE, node);
> >  		if (addr) {
> > @@ -2109,7 +2109,12 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> >  			 * Use the beginning of the huge page to store the
> >  			 * huge_bootmem_page struct (until gather_bootmem
> >  			 * puts them into the mem_map).
> > +			 *
> > +			 * memblock_virt_alloc_try_nid_raw returns non-zero'd
> > +			 * memory so zero out just enough for this struct, the
> > +			 * rest will be zero'd on page fault.
> >  			 */
> > +			memset(addr, 0, sizeof(struct huge_bootmem_page));
> 
> This forced me to look at the usage of huge_bootmem_page.  It is defined as:
> struct huge_bootmem_page {
> 	struct list_head list;
> 	struct hstate *hstate;
> #ifdef CONFIG_HIGHMEM
> 	phys_addr_t phys;
> #endif
> };
> 
> The list and hstate fields are set immediately after allocating the memory
> block here and elsewhere.  However, I can't find any code that sets phys,
> although it is potentially used in gather_bootmem_prealloc().  It appears
> powerpc used this field at one time, but no longer does.
> 
> Am I missing something?

If yes, then I am missing it as well. phys is a cool name to grep for...
Anyway, does it really make any sense to allow gigantic pages on HIGHMEM
systems in the first place?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: hugetlb: don't zero 1GiB bootmem pages.
  2018-07-11 12:48   ` Michal Hocko
@ 2018-07-11 16:47     ` Mike Kravetz
  0 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2018-07-11 16:47 UTC (permalink / raw)
  To: Michal Hocko, Cannon Matthews
  Cc: Andrew Morton, Nadia Yvette Chambers, linux-mm, linux-kernel,
	andreslc, pfeiner, dmatlack, gthelen

On 07/11/2018 05:48 AM, Michal Hocko wrote:
> On Wed 11-07-18 14:47:11, Michal Hocko wrote:
>> On Tue 10-07-18 11:49:03, Cannon Matthews wrote:
>>> When using 1GiB pages during early boot, use the new
>>> memblock_virt_alloc_try_nid_raw() function to allocate memory without
>>> zeroing it.  Zeroing out hundreds or thousands of GiB in a single core
>>> memset() call is very slow, and can make early boot last upwards of
>>> 20-30 minutes on multi TiB machines.
>>>
>>> To be safe, still zero the first sizeof(struct huge_bootmem_page) bytes,
>>> since that area is used as temporary storage for the bootmem page info
>>> until gather_bootmem_prealloc() processes it later.
>>>
>>> The rest of the memory does not need to be zero'd as the hugetlb pages
>>> are always zero'd on page fault.
>>>
>>> Tested: Booted with ~3800 1G pages, and it booted successfully in
>>> roughly the same amount of time as with 0 pages, as opposed to the 25+
>>> minutes it would take before.
>>
>> The patch makes perfect sense to me. I wasn't even aware that this path
>> was zeroing the memblock allocation. Thanks for spotting this and fixing it.
>>
>>> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
>>
>> I just do not think we need to zero the huge_bootmem_page portion of it.
>> It should be sufficient to INIT_LIST_HEAD before list_add. We do
>> initialize the rest explicitly already.
> 
> Forgot to mention that after that is addressed you can add
> Acked-by: Michal Hocko <mhocko@suse.com>

Cannon,

How about this: you make the change suggested by Michal, and I will submit
a separate patch to revert the one that added the phys field to the
huge_bootmem_page structure.
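
With the phys field (and the CONFIG_HIGHMEM ifdef around it) gone, the
struct would presumably shrink to just:

struct huge_bootmem_page {
	struct list_head list;
	struct hstate *hstate;
};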

FWIW,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 6+ messages in thread
