All of lore.kernel.org
 help / color / mirror / Atom feed
From: Baoquan He <bhe@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
	Mel Gorman <mgorman@suse.de>, "Jin, Zhi" <zhi.jin@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH] mm/page_alloc: Skip non present sections on zone initialization
Date: Tue, 31 Dec 2019 09:33:18 +0800	[thread overview]
Message-ID: <20191231013318.GB26758@MiWiFi-R3L-srv> (raw)
In-Reply-To: <20191231012345.GA26758@MiWiFi-R3L-srv>

On 12/31/19 at 09:23am, Baoquan He wrote:
> On 12/30/19 at 12:38pm, Kirill A. Shutemov wrote:
> > memmap_init_zone() can be called on the ranges with holes during the
> > boot. It will skip any non-valid PFNs one-by-one. It works fine as long
> > as holes are not too big.
> > 
> > But huge holes in the memory map causes a problem. It takes over 20
> > seconds to walk 32TiB hole. x86-64 with 5-level paging allows for much
> > larger holes in the memory map which would practically hang the system.
> > 
> > Deferred struct page init doesn't help here. It only works on the
> > present ranges.
> > 
> > Skipping non-present sections would fix the issue.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> > 
> > The situation can be emulated using the following QEMU patch:
> > 
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index ac08e6360437..f5f2258092e1 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -1159,13 +1159,14 @@ void pc_memory_init(PCMachineState *pcms,
> >      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> >      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> >      if (x86ms->above_4g_mem_size > 0) {
> > +        int shift = 45;
> >          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> >          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g", ram,
> >                                   x86ms->below_4g_mem_size,
> >                                   x86ms->above_4g_mem_size);
> > -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> > +        memory_region_add_subregion(system_memory, 1ULL << shift,
> >                                      ram_above_4g);
> > -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> > +        e820_add_entry(1ULL << shift, x86ms->above_4g_mem_size, E820_RAM);
> >      }
> >  
> >      if (!pcmc->has_reserved_memory &&
> > diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> > index cde2a16b941a..694c26947bf6 100644
> > --- a/target/i386/cpu.h
> > +++ b/target/i386/cpu.h
> > @@ -1928,7 +1928,7 @@ uint64_t cpu_get_tsc(CPUX86State *env);
> >  /* XXX: This value should match the one returned by CPUID
> >   * and in exec.c */
> >  # if defined(TARGET_X86_64)
> > -# define TCG_PHYS_ADDR_BITS 40
> > +# define TCG_PHYS_ADDR_BITS 52
> >  # else
> >  # define TCG_PHYS_ADDR_BITS 36
> >  # endif
> > 
> > ---
> >  mm/page_alloc.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index df62a49cd09e..442dc0244bb4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5873,6 +5873,30 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> >  	return false;
> >  }
> >  
> > +#ifdef CONFIG_SPARSEMEM
> > +/* Skip PFNs that belong to non-present sections */
> > +static inline __meminit unsigned long next_pfn(unsigned long pfn)
> > +{
> > +	unsigned long section_nr;
> > +
> > +	section_nr = pfn_to_section_nr(++pfn);
> > +	if (present_section_nr(section_nr))
> > +		return pfn;
> > +
> > +	while (++section_nr <= __highest_present_section_nr) {
> > +		if (present_section_nr(section_nr))
> > +			return section_nr_to_pfn(section_nr);
> > +	}
> > +
> > +	return -1;
> > +}
> > +#else
> > +static inline __meminit unsigned long next_pfn(unsigned long pfn)
> > +{
> > +	return pfn++;
> > +}
> > +#endif
> > +
> >  /*
> >   * Initially all pages are reserved - free ones are freed
> >   * up by memblock_free_all() once the early boot process is
> > @@ -5912,8 +5936,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  		 * function.  They do not exist on hotplugged memory.
> >  		 */
> >  		if (context == MEMMAP_EARLY) {
> > -			if (!early_pfn_valid(pfn))
> > +			if (!early_pfn_valid(pfn)) {
> > +				pfn = next_pfn(pfn) - 1;
> 
> Just pass by, I think this is a necessary optimization. Wondering why
> next_pfn(pfn) is not put in for loop:
> -	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +	for (pfn = start_pfn; pfn < end_pfn; pfn=next_pfn(pfn)) {
> 
> 
> >  				continue;
> > +			}
> >  			if (!early_pfn_in_nid(pfn, nid))
> >  				continue;
> 
> Why the other two 'continue' don't need be worried on the huge hole
> case?

OK, I see. early_pfn_valid() may have encountered the huge hole case,
the check in patch sounds reasonable.

FWIW, looks good to me.

Reviewed-by: Baoquan He <bhe@redhat.com>

Thanks
Baoquan


  reply	other threads:[~2019-12-31  1:34 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-30  9:38 [PATCH] mm/page_alloc: Skip non present sections on zone initialization Kirill A. Shutemov
2019-12-31  1:23 ` Baoquan He
2019-12-31  1:33   ` Baoquan He [this message]
2020-01-08 14:40 ` Michal Hocko
2020-01-10 13:15   ` David Hildenbrand
2020-01-10 13:45     ` Kirill A. Shutemov
2020-01-10 14:34       ` David Hildenbrand
2020-01-10 14:47         ` Kirill A. Shutemov
2020-01-10 14:48           ` David Hildenbrand
2020-01-10 14:54             ` Kirill A. Shutemov
2020-01-10 14:56               ` David Hildenbrand
2020-01-10 17:55                 ` Kirill A. Shutemov
2020-01-10 18:05                   ` David Hildenbrand
2020-01-10 18:22                     ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191231013318.GB26758@MiWiFi-R3L-srv \
    --to=bhe@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=vbabka@suse.cz \
    --cc=zhi.jin@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.