[PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
diff mbox series

Message ID 1546848299-23628-1-git-send-email-kernelfans@gmail.com
State Superseded
Headers show
Series
  • [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr
Related show

Commit Message

Pingfan Liu Jan. 7, 2019, 8:04 a.m. UTC
Customer reported a bug on a high end server with many pcie devices, where
kernel bootup with crashkernel=384M, and kaslr is enabled. Even
though we still see much memory under 896 MB, the finding still failed
intermittently. Because currently we can only find region under 896 MB,
if w/0 ',high' specified. Then KASLR breaks 896 MB into several parts
randomly, and crashkernel reservation need be aligned to 128 MB, that's
why failure is found. It raises confusion to the end user that sometimes
crashkernel=X works while sometimes fails.
If want to make it succeed, customer can change kernel option to
"crashkernel=384M, high". Just this give "crashkernel=xx@yy" a very
limited space to behave even though its grammer looks more generic.
And we can't answer questions raised from customer that confidently:
1) why it doesn't succeed to reserve 896 MB;
2) what's wrong with memory region under 4G;
3) why I have to add ',high', I only require 384 MB, not 3840 MB.

This patch simplifies the method suggested in the mail [1]. It just goes
bottom-up to find a candidate region for crashkernel. The bottom-up may be
better compatible with the old reservation style, i.e. still want to get
memory region from 896 MB firstly, then [896 MB, 4G], finally above 4G.

There is one trivial thing about the compatibility with old kexec-tools:
if the reserved region is above 896M, then old tool will fail to load
bzImage. But without this patch, the old tool also fail since there is no
memory below 896M can be reserved for crashkernel.

[1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: yinghai@kernel.org,
Cc: vgoyal@redhat.com
Cc: linux-kernel@vger.kernel.org
---
v4 -> v5:
  add a wrapper of bottom up allocation func
v3 -> v4:
  instead of exporting the stage of parsing mem hotplug info, just using the bottom-up allocation func directly
 arch/x86/kernel/setup.c  |  8 ++++----
 include/linux/memblock.h |  3 +++
 mm/memblock.c            | 29 +++++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 4 deletions(-)

Comments

Mike Rapoport Jan. 8, 2019, 8:05 a.m. UTC | #1
On Mon, Jan 07, 2019 at 04:04:59PM +0800, Pingfan Liu wrote:
> Customer reported a bug on a high end server with many pcie devices, where
> kernel bootup with crashkernel=384M, and kaslr is enabled. Even
> though we still see much memory under 896 MB, the finding still failed
> intermittently. Because currently we can only find region under 896 MB,
> if w/0 ',high' specified. Then KASLR breaks 896 MB into several parts
> randomly, and crashkernel reservation need be aligned to 128 MB, that's
> why failure is found. It raises confusion to the end user that sometimes
> crashkernel=X works while sometimes fails.
> If want to make it succeed, customer can change kernel option to
> "crashkernel=384M, high". Just this give "crashkernel=xx@yy" a very
> limited space to behave even though its grammer looks more generic.
> And we can't answer questions raised from customer that confidently:
> 1) why it doesn't succeed to reserve 896 MB;
> 2) what's wrong with memory region under 4G;
> 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
> 
> This patch simplifies the method suggested in the mail [1]. It just goes
> bottom-up to find a candidate region for crashkernel. The bottom-up may be
> better compatible with the old reservation style, i.e. still want to get
> memory region from 896 MB firstly, then [896 MB, 4G], finally above 4G.
> 
> There is one trivial thing about the compatibility with old kexec-tools:
> if the reserved region is above 896M, then old tool will fail to load
> bzImage. But without this patch, the old tool also fail since there is no
> memory below 896M can be reserved for crashkernel.
> 
> [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Daniel Vacek <neelx@redhat.com>
> Cc: Mathieu Malaterre <malat@debian.org>
> Cc: Stefan Agner <stefan@agner.ch>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: yinghai@kernel.org,
> Cc: vgoyal@redhat.com
> Cc: linux-kernel@vger.kernel.org
> ---
> v4 -> v5:
>   add a wrapper of bottom up allocation func
> v3 -> v4:
>   instead of exporting the stage of parsing mem hotplug info, just using the bottom-up allocation func directly
>  arch/x86/kernel/setup.c  |  8 ++++----
>  include/linux/memblock.h |  3 +++
>  mm/memblock.c            | 29 +++++++++++++++++++++++++++++
>  3 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d494b9b..80e7923 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -546,10 +546,10 @@ static void __init reserve_crashkernel(void)
>  		 * as old kexec-tools loads bzImage below that, unless
>  		 * "crashkernel=size[KMG],high" is specified.
>  		 */
> -		crash_base = memblock_find_in_range(CRASH_ALIGN,
> -						    high ? CRASH_ADDR_HIGH_MAX
> -							 : CRASH_ADDR_LOW_MAX,
> -						    crash_size, CRASH_ALIGN);
> +		crash_base = memblock_find_range_bottom_up(CRASH_ALIGN,
> +			(max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN,
> +			NUMA_NO_NODE);
> +
>  		if (!crash_base) {
>  			pr_info("crashkernel reservation failed - No suitable area found.\n");
>  			return;
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a..a35ae17 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -116,6 +116,9 @@ phys_addr_t memblock_find_in_range_node(phys_addr_t size, phys_addr_t align,
>  					int nid, enum memblock_flags flags);
>  phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
>  				   phys_addr_t size, phys_addr_t align);
> +phys_addr_t __init_memblock
> +memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
> +	phys_addr_t size, phys_addr_t align, int nid);
>  void memblock_allow_resize(void);
>  int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
>  int memblock_add(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 81ae63c..f68287e 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -192,6 +192,35 @@ __memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
>  	return 0;
>  }
> 
> +phys_addr_t __init_memblock
> +memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
> +	phys_addr_t size, phys_addr_t align, int nid)
> +{
> +	phys_addr_t ret;
> +	enum memblock_flags flags = choose_memblock_flags();
> +
> +	/* pump up @end */
> +	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> +		end = memblock.current_limit;
> +
> +	/* avoid allocating the first page */
> +	start = max_t(phys_addr_t, start, PAGE_SIZE);
> +	end = max(start, end);
> +
> +again:
> +	ret = __memblock_find_range_bottom_up(start, end, size, align,
> +					    nid, flags);
> +
> +	if (!ret && (flags & MEMBLOCK_MIRROR)) {
> +		pr_warn("Could not allocate %pap bytes of mirrored memory\n",
> +			&size);
> +		flags &= ~MEMBLOCK_MIRROR;
> +		goto again;
> +	}

I'm not thrilled by duplicating this code (yet again).
I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
allocate [0, kernel_start) unconditionally. 
I'd just replace you first patch in v3 [2] with something like:

diff --git a/mm/memblock.c b/mm/memblock.c
index 7df468c..d1b30b9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -274,24 +274,14 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
 	 * try bottom-up allocation only when bottom-up mode
 	 * is set and @end is above the kernel image.
 	 */
-	if (memblock_bottom_up() && end > kernel_end) {
-		phys_addr_t bottom_up_start;
-
-		/* make sure we will allocate above the kernel */
-		bottom_up_start = max(start, kernel_end);
-
+	if (memblock_bottom_up()) {
 		/* ok, try bottom-up allocation first */
-		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
+		ret = __memblock_find_range_bottom_up(start, end,
 						      size, align, nid, flags);
 		if (ret)
 			return ret;
 
 		/*
-		 * we always limit bottom-up allocation above the kernel,
-		 * but top-down allocation doesn't have the limit, so
-		 * retrying top-down allocation may succeed when bottom-up
-		 * allocation failed.
-		 *
 		 * bottom-up allocation is expected to be fail very rarely,
 		 * so we use WARN_ONCE() here to see the stack trace if
 		 * fail happens.

[1] https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelfans@gmail.com/
[2] https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelfans@gmail.com/

> +
> +	return ret;
> +}
> +
>  /**
>   * __memblock_find_range_top_down - find free area utility, in top-down
>   * @start: start of candidate range
> -- 
> 2.7.4
>
Baoquan He Jan. 8, 2019, 9:01 a.m. UTC | #2
Hi Mike,

On 01/08/19 at 10:05am, Mike Rapoport wrote:
> I'm not thrilled by duplicating this code (yet again).
> I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> allocate [0, kernel_start) unconditionally. 
> I'd just replace you first patch in v3 [2] with something like:

In initmem_init(), we will restore the top-down allocation style anyway.
While reserve_crashkernel() is called after initmem_init(), it's not
appropriate to adjust memblock_find_in_range_node(), and we really want
to find region bottom up for crashkernel reservation, no matter where
kernel is loaded, better call __memblock_find_range_bottom_up().

Create a wrapper to do the necessary handling, then call
__memblock_find_range_bottom_up() directly, looks better.

Thanks
Baoquan

> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 7df468c..d1b30b9 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -274,24 +274,14 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>  	 * try bottom-up allocation only when bottom-up mode
>  	 * is set and @end is above the kernel image.
>  	 */
> -	if (memblock_bottom_up() && end > kernel_end) {
> -		phys_addr_t bottom_up_start;
> -
> -		/* make sure we will allocate above the kernel */
> -		bottom_up_start = max(start, kernel_end);
> -
> +	if (memblock_bottom_up()) {
>  		/* ok, try bottom-up allocation first */
> -		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> +		ret = __memblock_find_range_bottom_up(start, end,
>  						      size, align, nid, flags);
>  		if (ret)
>  			return ret;
>  
>  		/*
> -		 * we always limit bottom-up allocation above the kernel,
> -		 * but top-down allocation doesn't have the limit, so
> -		 * retrying top-down allocation may succeed when bottom-up
> -		 * allocation failed.
> -		 *
>  		 * bottom-up allocation is expected to be fail very rarely,
>  		 * so we use WARN_ONCE() here to see the stack trace if
>  		 * fail happens.
> 
> [1] https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelfans@gmail.com/
> [2] https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelfans@gmail.com/
> 
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * __memblock_find_range_top_down - find free area utility, in top-down
> >   * @start: start of candidate range
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Sincerely yours,
> Mike.
>
Mike Rapoport Jan. 8, 2019, 3:48 p.m. UTC | #3
On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> Hi Mike,
> 
> On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > I'm not thrilled by duplicating this code (yet again).
> > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > allocate [0, kernel_start) unconditionally. 
> > I'd just replace you first patch in v3 [2] with something like:
> 
> In initmem_init(), we will restore the top-down allocation style anyway.
> While reserve_crashkernel() is called after initmem_init(), it's not
> appropriate to adjust memblock_find_in_range_node(), and we really want
> to find region bottom up for crashkernel reservation, no matter where
> kernel is loaded, better call __memblock_find_range_bottom_up().
> 
> Create a wrapper to do the necessary handling, then call
> __memblock_find_range_bottom_up() directly, looks better.

What bothers me is 'the necessary handling' which is already done in
several places in memblock in a similar, but yet slightly different way.

memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
allocating from the specified node and does not retry when it falls back to
any node. And memblock_alloc_internal() has yet another set of fallbacks. 

So what should be the necessary handling in the wrapper for
__memblock_find_range_bottom_up() ?

BTW, even without any memblock modifications, retrying allocation in
reserve_crashkerenel() for different ranges, like the proposal at [1] would
also work, wouldn't it?

[1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
 
> Thanks
> Baoquan
> 
> > 
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 7df468c..d1b30b9 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> >  	 * try bottom-up allocation only when bottom-up mode
> >  	 * is set and @end is above the kernel image.
> >  	 */
> > -	if (memblock_bottom_up() && end > kernel_end) {
> > -		phys_addr_t bottom_up_start;
> > -
> > -		/* make sure we will allocate above the kernel */
> > -		bottom_up_start = max(start, kernel_end);
> > -
> > +	if (memblock_bottom_up()) {
> >  		/* ok, try bottom-up allocation first */
> > -		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > +		ret = __memblock_find_range_bottom_up(start, end,
> >  						      size, align, nid, flags);
> >  		if (ret)
> >  			return ret;
> >  
> >  		/*
> > -		 * we always limit bottom-up allocation above the kernel,
> > -		 * but top-down allocation doesn't have the limit, so
> > -		 * retrying top-down allocation may succeed when bottom-up
> > -		 * allocation failed.
> > -		 *
> >  		 * bottom-up allocation is expected to be fail very rarely,
> >  		 * so we use WARN_ONCE() here to see the stack trace if
> >  		 * fail happens.
> > 
> > [1] https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelfans@gmail.com/
> > [2] https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelfans@gmail.com/
> > 
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  /**
> > >   * __memblock_find_range_top_down - find free area utility, in top-down
> > >   * @start: start of candidate range
> > > -- 
> > > 2.7.4
> > > 
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> > 
>
Pingfan Liu Jan. 9, 2019, 1:02 p.m. UTC | #4
On Tue, Jan 8, 2019 at 11:49 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > Hi Mike,
> >
> > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > I'm not thrilled by duplicating this code (yet again).
> > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > > allocate [0, kernel_start) unconditionally.
> > > I'd just replace you first patch in v3 [2] with something like:
> >
> > In initmem_init(), we will restore the top-down allocation style anyway.
> > While reserve_crashkernel() is called after initmem_init(), it's not
> > appropriate to adjust memblock_find_in_range_node(), and we really want
> > to find region bottom up for crashkernel reservation, no matter where
> > kernel is loaded, better call __memblock_find_range_bottom_up().
> >
> > Create a wrapper to do the necessary handling, then call
> > __memblock_find_range_bottom_up() directly, looks better.
>
> What bothers me is 'the necessary handling' which is already done in
> several places in memblock in a similar, but yet slightly different way.
>
> memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> allocating from the specified node and does not retry when it falls back to
> any node. And memblock_alloc_internal() has yet another set of fallbacks.
>
> So what should be the necessary handling in the wrapper for
> __memblock_find_range_bottom_up() ?
>
Well, it is a hard choice.
> BTW, even without any memblock modifications, retrying allocation in
> reserve_crashkerenel() for different ranges, like the proposal at [1] would
> also work, wouldn't it?
>
Yes, it can work. Then is it worth to expose the bottom-up allocation
style beside for hotmovable purpose?

Thanks,
Pingfan
> [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
>
> > Thanks
> > Baoquan
> >
> > >
> > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > index 7df468c..d1b30b9 100644
> > > --- a/mm/memblock.c
> > > +++ b/mm/memblock.c
> > > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > >      * try bottom-up allocation only when bottom-up mode
> > >      * is set and @end is above the kernel image.
> > >      */
> > > -   if (memblock_bottom_up() && end > kernel_end) {
> > > -           phys_addr_t bottom_up_start;
> > > -
> > > -           /* make sure we will allocate above the kernel */
> > > -           bottom_up_start = max(start, kernel_end);
> > > -
> > > +   if (memblock_bottom_up()) {
> > >             /* ok, try bottom-up allocation first */
> > > -           ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > +           ret = __memblock_find_range_bottom_up(start, end,
> > >                                                   size, align, nid, flags);
> > >             if (ret)
> > >                     return ret;
> > >
> > >             /*
> > > -            * we always limit bottom-up allocation above the kernel,
> > > -            * but top-down allocation doesn't have the limit, so
> > > -            * retrying top-down allocation may succeed when bottom-up
> > > -            * allocation failed.
> > > -            *
> > >              * bottom-up allocation is expected to be fail very rarely,
> > >              * so we use WARN_ONCE() here to see the stack trace if
> > >              * fail happens.
> > >
> > > [1] https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelfans@gmail.com/
> > > [2] https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelfans@gmail.com/
> > >
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > >  /**
> > > >   * __memblock_find_range_top_down - find free area utility, in top-down
> > > >   * @start: start of candidate range
> > > > --
> > > > 2.7.4
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
> >
>
> --
> Sincerely yours,
> Mike.
>
Baoquan He Jan. 9, 2019, 2:25 p.m. UTC | #5
On 01/08/19 at 05:48pm, Mike Rapoport wrote:
> On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > Hi Mike,
> > 
> > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > I'm not thrilled by duplicating this code (yet again).
> > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > > allocate [0, kernel_start) unconditionally. 
> > > I'd just replace you first patch in v3 [2] with something like:
> > 
> > In initmem_init(), we will restore the top-down allocation style anyway.
> > While reserve_crashkernel() is called after initmem_init(), it's not
> > appropriate to adjust memblock_find_in_range_node(), and we really want
> > to find region bottom up for crashkernel reservation, no matter where
> > kernel is loaded, better call __memblock_find_range_bottom_up().
> > 
> > Create a wrapper to do the necessary handling, then call
> > __memblock_find_range_bottom_up() directly, looks better.
> 
> What bothers me is 'the necessary handling' which is already done in
> several places in memblock in a similar, but yet slightly different way.

The page aligning for start and the mirror flag setting, I suppose.
> 
> memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> allocating from the specified node and does not retry when it falls back to
> any node. And memblock_alloc_internal() has yet another set of fallbacks. 

Get what you mean, seems they are trying to allocate within mirrorred
memory region, if fail, try the non-mirrorred region. If kernel data
allocation failed, no need to care about if it's movable or not, it need
to live firstly. For the bottom-up allocation wrapper, maybe we need do
like this too?

> 
> So what should be the necessary handling in the wrapper for
> __memblock_find_range_bottom_up() ?
> 
> BTW, even without any memblock modifications, retrying allocation in
> reserve_crashkerenel() for different ranges, like the proposal at [1] would
> also work, wouldn't it?

Yes, it also looks good. This patch only calls once, seems a simpler
line adding. 

In fact, below one and this patch, both is fine to me, as long as it
fixes the problem customers are complaining about.

> 
> [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html

Thanks
Baoquan
Mike Rapoport Jan. 10, 2019, 7:56 a.m. UTC | #6
Hi Pingfan,

On Wed, Jan 09, 2019 at 09:02:41PM +0800, Pingfan Liu wrote:
> On Tue, Jan 8, 2019 at 11:49 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > > Hi Mike,
> > >
> > > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > > I'm not thrilled by duplicating this code (yet again).
> > > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > > > allocate [0, kernel_start) unconditionally.
> > > > I'd just replace you first patch in v3 [2] with something like:
> > >
> > > In initmem_init(), we will restore the top-down allocation style anyway.
> > > While reserve_crashkernel() is called after initmem_init(), it's not
> > > appropriate to adjust memblock_find_in_range_node(), and we really want
> > > to find region bottom up for crashkernel reservation, no matter where
> > > kernel is loaded, better call __memblock_find_range_bottom_up().
> > >
> > > Create a wrapper to do the necessary handling, then call
> > > __memblock_find_range_bottom_up() directly, looks better.
> >
> > What bothers me is 'the necessary handling' which is already done in
> > several places in memblock in a similar, but yet slightly different way.
> >
> > memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> > MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> > allocating from the specified node and does not retry when it falls back to
> > any node. And memblock_alloc_internal() has yet another set of fallbacks.
> >
> > So what should be the necessary handling in the wrapper for
> > __memblock_find_range_bottom_up() ?
> >
> Well, it is a hard choice.
> > BTW, even without any memblock modifications, retrying allocation in
> > reserve_crashkerenel() for different ranges, like the proposal at [1] would
> > also work, wouldn't it?
> >
> Yes, it can work. Then is it worth to expose the bottom-up allocation
> style beside for hotmovable purpose?

Some architectures use bottom-up as a "compatability" mode with bootmem.
And, I believe, powerpc and s390 use bottom-up to make some of the
allocations close to the kernel.
 
> Thanks,
> Pingfan
> > [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> >
> > > Thanks
> > > Baoquan
> > >
> > > >
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 7df468c..d1b30b9 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > > >      * try bottom-up allocation only when bottom-up mode
> > > >      * is set and @end is above the kernel image.
> > > >      */
> > > > -   if (memblock_bottom_up() && end > kernel_end) {
> > > > -           phys_addr_t bottom_up_start;
> > > > -
> > > > -           /* make sure we will allocate above the kernel */
> > > > -           bottom_up_start = max(start, kernel_end);
> > > > -
> > > > +   if (memblock_bottom_up()) {
> > > >             /* ok, try bottom-up allocation first */
> > > > -           ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > > +           ret = __memblock_find_range_bottom_up(start, end,
> > > >                                                   size, align, nid, flags);
> > > >             if (ret)
> > > >                     return ret;
> > > >
> > > >             /*
> > > > -            * we always limit bottom-up allocation above the kernel,
> > > > -            * but top-down allocation doesn't have the limit, so
> > > > -            * retrying top-down allocation may succeed when bottom-up
> > > > -            * allocation failed.
> > > > -            *
> > > >              * bottom-up allocation is expected to be fail very rarely,
> > > >              * so we use WARN_ONCE() here to see the stack trace if
> > > >              * fail happens.
> > > >
> > > > [1] https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelfans@gmail.com/
> > > > [2] https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelfans@gmail.com/
> > > >
> > > > > +
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * __memblock_find_range_top_down - find free area utility, in top-down
> > > > >   * @start: start of candidate range
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > > --
> > > > Sincerely yours,
> > > > Mike.
> > > >
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >
>
Pingfan Liu Jan. 11, 2019, 2:37 a.m. UTC | #7
On Thu, Jan 10, 2019 at 3:57 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> Hi Pingfan,
>
> On Wed, Jan 09, 2019 at 09:02:41PM +0800, Pingfan Liu wrote:
> > On Tue, Jan 8, 2019 at 11:49 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > >
> > > On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > > > Hi Mike,
> > > >
> > > > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > > > I'm not thrilled by duplicating this code (yet again).
> > > > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > > > > allocate [0, kernel_start) unconditionally.
> > > > > I'd just replace you first patch in v3 [2] with something like:
> > > >
> > > > In initmem_init(), we will restore the top-down allocation style anyway.
> > > > While reserve_crashkernel() is called after initmem_init(), it's not
> > > > appropriate to adjust memblock_find_in_range_node(), and we really want
> > > > to find region bottom up for crashkernel reservation, no matter where
> > > > kernel is loaded, better call __memblock_find_range_bottom_up().
> > > >
> > > > Create a wrapper to do the necessary handling, then call
> > > > __memblock_find_range_bottom_up() directly, looks better.
> > >
> > > What bothers me is 'the necessary handling' which is already done in
> > > several places in memblock in a similar, but yet slightly different way.
> > >
> > > memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> > > MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> > > allocating from the specified node and does not retry when it falls back to
> > > any node. And memblock_alloc_internal() has yet another set of fallbacks.
> > >
> > > So what should be the necessary handling in the wrapper for
> > > __memblock_find_range_bottom_up() ?
> > >
> > Well, it is a hard choice.
> > > BTW, even without any memblock modifications, retrying allocation in
> > > reserve_crashkerenel() for different ranges, like the proposal at [1] would
> > > also work, wouldn't it?
> > >
> > Yes, it can work. Then is it worth to expose the bottom-up allocation
> > style beside for hotmovable purpose?
>
> Some architectures use bottom-up as a "compatability" mode with bootmem.
> And, I believe, powerpc and s390 use bottom-up to make some of the
> allocations close to the kernel.
>
Ok, got it. Thanks.

Best regards,
Pingfan

> > Thanks,
> > Pingfan
> > > [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > >
> > > > Thanks
> > > > Baoquan
> > > >
> > > > >
> > > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > > index 7df468c..d1b30b9 100644
> > > > > --- a/mm/memblock.c
> > > > > +++ b/mm/memblock.c
> > > > > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
> > > > >      * try bottom-up allocation only when bottom-up mode
> > > > >      * is set and @end is above the kernel image.
> > > > >      */
> > > > > -   if (memblock_bottom_up() && end > kernel_end) {
> > > > > -           phys_addr_t bottom_up_start;
> > > > > -
> > > > > -           /* make sure we will allocate above the kernel */
> > > > > -           bottom_up_start = max(start, kernel_end);
> > > > > -
> > > > > +   if (memblock_bottom_up()) {
> > > > >             /* ok, try bottom-up allocation first */
> > > > > -           ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > > > +           ret = __memblock_find_range_bottom_up(start, end,
> > > > >                                                   size, align, nid, flags);
> > > > >             if (ret)
> > > > >                     return ret;
> > > > >
> > > > >             /*
> > > > > -            * we always limit bottom-up allocation above the kernel,
> > > > > -            * but top-down allocation doesn't have the limit, so
> > > > > -            * retrying top-down allocation may succeed when bottom-up
> > > > > -            * allocation failed.
> > > > > -            *
> > > > >              * bottom-up allocation is expected to be fail very rarely,
> > > > >              * so we use WARN_ONCE() here to see the stack trace if
> > > > >              * fail happens.
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelfans@gmail.com/
> > > > > [2] https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelfans@gmail.com/
> > > > >
> > > > > > +
> > > > > > + return ret;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * __memblock_find_range_top_down - find free area utility, in top-down
> > > > > >   * @start: start of candidate range
> > > > > > --
> > > > > > 2.7.4
> > > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours,
> > > > > Mike.
> > > > >
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
> >
>
> --
> Sincerely yours,
> Mike.
>
Pingfan Liu Jan. 11, 2019, 2:41 a.m. UTC | #8
On Wed, Jan 9, 2019 at 10:25 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 01/08/19 at 05:48pm, Mike Rapoport wrote:
> > On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > > Hi Mike,
> > >
> > > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > > I'm not thrilled by duplicating this code (yet again).
> > > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > > > allocate [0, kernel_start) unconditionally.
> > > > I'd just replace you first patch in v3 [2] with something like:
> > >
> > > In initmem_init(), we will restore the top-down allocation style anyway.
> > > While reserve_crashkernel() is called after initmem_init(), it's not
> > > appropriate to adjust memblock_find_in_range_node(), and we really want
> > > to find region bottom up for crashkernel reservation, no matter where
> > > kernel is loaded, better call __memblock_find_range_bottom_up().
> > >
> > > Create a wrapper to do the necessary handling, then call
> > > __memblock_find_range_bottom_up() directly, looks better.
> >
> > What bothers me is 'the necessary handling' which is already done in
> > several places in memblock in a similar, but yet slightly different way.
>
> The page aligning for start and the mirror flag setting, I suppose.
> >
> > memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> > MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> > allocating from the specified node and does not retry when it falls back to
> > any node. And memblock_alloc_internal() has yet another set of fallbacks.
>
> Get what you mean, seems they are trying to allocate within mirrorred
> memory region, if fail, try the non-mirrorred region. If kernel data
> allocation failed, no need to care about if it's movable or not, it need
> to live firstly. For the bottom-up allocation wrapper, maybe we need do
> like this too?
>
> >
> > So what should be the necessary handling in the wrapper for
> > __memblock_find_range_bottom_up() ?
> >
> > BTW, even without any memblock modifications, retrying allocation in
> > reserve_crashkerenel() for different ranges, like the proposal at [1] would
> > also work, wouldn't it?
>
> Yes, it also looks good. This patch only calls once, seems a simpler
> line adding.
>
> In fact, below one and this patch, both is fine to me, as long as it
> fixes the problem customers are complaining about.
>
It seems that there is divergence on opinion. Maybe it is easier to
fix this bug by dyoung's patch. I will repost his patch.

Thanks and regards,
Pingfan
> >
> > [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
>
> Thanks
> Baoquan

Patch
diff mbox series

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d494b9b..80e7923 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -546,10 +546,10 @@  static void __init reserve_crashkernel(void)
 		 * as old kexec-tools loads bzImage below that, unless
 		 * "crashkernel=size[KMG],high" is specified.
 		 */
-		crash_base = memblock_find_in_range(CRASH_ALIGN,
-						    high ? CRASH_ADDR_HIGH_MAX
-							 : CRASH_ADDR_LOW_MAX,
-						    crash_size, CRASH_ALIGN);
+		crash_base = memblock_find_range_bottom_up(CRASH_ALIGN,
+			(max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN,
+			NUMA_NO_NODE);
+
 		if (!crash_base) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
 			return;
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a..a35ae17 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -116,6 +116,9 @@  phys_addr_t memblock_find_in_range_node(phys_addr_t size, phys_addr_t align,
 					int nid, enum memblock_flags flags);
 phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
 				   phys_addr_t size, phys_addr_t align);
+phys_addr_t __init_memblock
+memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+	phys_addr_t size, phys_addr_t align, int nid);
 void memblock_allow_resize(void);
 int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
 int memblock_add(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 81ae63c..f68287e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -192,6 +192,35 @@  __memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
 	return 0;
 }
 
+phys_addr_t __init_memblock
+memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+	phys_addr_t size, phys_addr_t align, int nid)
+{
+	phys_addr_t ret;
+	enum memblock_flags flags = choose_memblock_flags();
+
+	/* pump up @end */
+	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+		end = memblock.current_limit;
+
+	/* avoid allocating the first page */
+	start = max_t(phys_addr_t, start, PAGE_SIZE);
+	end = max(start, end);
+
+again:
+	ret = __memblock_find_range_bottom_up(start, end, size, align,
+					    nid, flags);
+
+	if (!ret && (flags & MEMBLOCK_MIRROR)) {
+		pr_warn("Could not allocate %pap bytes of mirrored memory\n",
+			&size);
+		flags &= ~MEMBLOCK_MIRROR;
+		goto again;
+	}
+
+	return ret;
+}
+
 /**
  * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range