All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup
@ 2016-08-03 16:23 Jiri Olsa
  2016-08-10 11:41 ` Jiri Olsa
  2016-08-10 13:54 ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Jiri Olsa @ 2016-08-03 16:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay

Frank reported kernel panic when he disabled several cores in BIOS
via following option:

  Core Disable Bitmap(Hex)   [0]

with number 0xFFE, which leaves 16 CPUs in system (out of 48).

The kernel panic below goes along with following messages:

 smpboot: Max logical packages: 2^M
 smpboot: APIC(0) Converting physical 0 to logical package 0^M
 smpboot: APIC(20) Converting physical 1 to logical package 1^M
 smpboot: APIC(40) Package 2 exceeds logical package map^M
 smpboot: CPU 8 APICId 40 disabled^M
 smpboot: APIC(60) Package 3 exceeds logical package map^M
 smpboot: CPU 12 APICId 60 disabled^M
 ...
 general protection fault: 0000 [#1] SMP^M
 Modules linked in:^M
 CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
 Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
 task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
 RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
 ...
  [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
  [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
  [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
  [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
  [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
  [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
  [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
  [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
  [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
  [<ffffffff816dc190>] ? rest_init+0x80/0x80^M

The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).

The __max_logical_packages is computed as:

  DIV_ROUND_UP(total_cpus, ncpus);
  - ncpus being number of cores

With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12). 

Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.

The fix is to set __max_logical_packages directly to total_cpus,
which should be the maximum possible logical ID of the pkg in
any case.

Reported-by: Frank Ramsay <framsay@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 157bf0957219..484f7d357c77 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -340,7 +340,7 @@ static void __init smp_init_package_map(void)
 		ncpus = 1;
 	}
 
-	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
+	__max_logical_packages = total_cpus;
 
 	/*
 	 * Possibly larger than what we need as the number of apic ids per

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-03 16:23 [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup Jiri Olsa
@ 2016-08-10 11:41 ` Jiri Olsa
  2016-08-10 13:54 ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Jiri Olsa @ 2016-08-10 11:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, Peter Zijlstra, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay

ping

thanks,
jirka

On Wed, Aug 03, 2016 at 06:23:58PM +0200, Jiri Olsa wrote:
> Frank reported kernel panic when he disabled several cores in BIOS
> via following option:
> 
>   Core Disable Bitmap(Hex)   [0]
> 
> with number 0xFFE, which leaves 16 CPUs in system (out of 48).
> 
> The kernel panic below goes along with following messages:
> 
>  smpboot: Max logical packages: 2^M
>  smpboot: APIC(0) Converting physical 0 to logical package 0^M
>  smpboot: APIC(20) Converting physical 1 to logical package 1^M
>  smpboot: APIC(40) Package 2 exceeds logical package map^M
>  smpboot: CPU 8 APICId 40 disabled^M
>  smpboot: APIC(60) Package 3 exceeds logical package map^M
>  smpboot: CPU 12 APICId 60 disabled^M
>  ...
>  general protection fault: 0000 [#1] SMP^M
>  Modules linked in:^M
>  CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
>  Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
>  task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
>  RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
>  ...
>   [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
>   [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
>   [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
>   [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
>   [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
>   [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
>   [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
>   [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
>   [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
>   [<ffffffff816dc190>] ? rest_init+0x80/0x80^M
> 
> The reason for the panic is wrong value of __max_logical_packages,
> which lets logical_package_map uninitialized and the uncore code
> relying on this map being properly initialized (maybe we should
> add some safety checks there as well).
> 
> The __max_logical_packages is computed as:
> 
>   DIV_ROUND_UP(total_cpus, ncpus);
>   - ncpus being number of cores
> 
> With above BIOS setup we get total_cpus == 16 which set
> __max_logical_packages to 2 (ncpus is 12). 
> 
> Once topology_update_package_map processes CPU with logical
> pkg over 2 we display above messages and fail to initialize
> the physical_to_logical_pkg map, which makes the uncore code
> crash.
> 
> The fix is to set __max_logical_packages directly to total_cpus,
> which should be the maximum possible logical ID of the pkg in
> any case.
> 
> Reported-by: Frank Ramsay <framsay@redhat.com>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 157bf0957219..484f7d357c77 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -340,7 +340,7 @@ static void __init smp_init_package_map(void)
>  		ncpus = 1;
>  	}
>  
> -	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
> +	__max_logical_packages = total_cpus;
>  
>  	/*
>  	 * Possibly larger than what we need as the number of apic ids per

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-03 16:23 [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup Jiri Olsa
  2016-08-10 11:41 ` Jiri Olsa
@ 2016-08-10 13:54 ` Peter Zijlstra
  2016-08-10 14:00   ` Jiri Olsa
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-10 13:54 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay

On Wed, Aug 03, 2016 at 06:23:58PM +0200, Jiri Olsa wrote:
> Frank reported kernel panic when he disabled several cores in BIOS
> via following option:
> 
>   Core Disable Bitmap(Hex)   [0]
> 
> with number 0xFFE, which leaves 16 CPUs in system (out of 48).

That seems like a daft BIOS option. How wide spread is that? I can't
remember ever seeing that.

> The reason for the panic is wrong value of __max_logical_packages,
> which lets logical_package_map uninitialized and the uncore code
> relying on this map being properly initialized (maybe we should
> add some safety checks there as well).
> 
> The __max_logical_packages is computed as:
> 
>   DIV_ROUND_UP(total_cpus, ncpus);
>   - ncpus being number of cores
> 
> With above BIOS setup we get total_cpus == 16 which set
> __max_logical_packages to 2 (ncpus is 12). 
> 
> Once topology_update_package_map processes CPU with logical
> pkg over 2 we display above messages and fail to initialize
> the physical_to_logical_pkg map, which makes the uncore code
> crash.
> 
> The fix is to set __max_logical_packages directly to total_cpus,
> which should be the maximum possible logical ID of the pkg in
> any case.
> 
> Reported-by: Frank Ramsay <framsay@redhat.com>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 157bf0957219..484f7d357c77 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -340,7 +340,7 @@ static void __init smp_init_package_map(void)
>  		ncpus = 1;
>  	}
>  
> -	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
> +	__max_logical_packages = total_cpus;

This seems undesirable.. it would grow the bitmap unnecessarily big on
most setups.

Is there no way to detect the brain damage inflicted by that BIOS option
and fudge ncpus in that case?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-10 13:54 ` Peter Zijlstra
@ 2016-08-10 14:00   ` Jiri Olsa
  2016-08-10 14:15     ` Jiri Olsa
  0 siblings, 1 reply; 16+ messages in thread
From: Jiri Olsa @ 2016-08-10 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay

On Wed, Aug 10, 2016 at 03:54:17PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 06:23:58PM +0200, Jiri Olsa wrote:

SNIP

> > ---
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 157bf0957219..484f7d357c77 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -340,7 +340,7 @@ static void __init smp_init_package_map(void)
> >  		ncpus = 1;
> >  	}
> >  
> > -	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
> > +	__max_logical_packages = total_cpus;
> 
> This seems undesirable.. it would grow the bitmap unnecessarily big on
> most setups.
> 
> Is there no way to detect the brain damage inflicted by that BIOS option
> and fudge ncpus in that case?

yea, I have no idea.. that's why this is RFC ;-)

maybe we could also gradually allocate this bitmap
in topology_update_package_map.. I'll check on that

jirka

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-10 14:00   ` Jiri Olsa
@ 2016-08-10 14:15     ` Jiri Olsa
  2016-08-10 15:52       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Jiri Olsa @ 2016-08-10 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Wed, Aug 10, 2016 at 04:00:33PM +0200, Jiri Olsa wrote:
> On Wed, Aug 10, 2016 at 03:54:17PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 03, 2016 at 06:23:58PM +0200, Jiri Olsa wrote:
> 
> SNIP
> 
> > > ---
> > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > index 157bf0957219..484f7d357c77 100644
> > > --- a/arch/x86/kernel/smpboot.c
> > > +++ b/arch/x86/kernel/smpboot.c
> > > @@ -340,7 +340,7 @@ static void __init smp_init_package_map(void)
> > >  		ncpus = 1;
> > >  	}
> > >  
> > > -	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
> > > +	__max_logical_packages = total_cpus;
> > 
> > This seems undesirable.. it would grow the bitmap unnecessarily big on
> > most setups.
> > 
> > Is there no way to detect the brain damage inflicted by that BIOS option
> > and fudge ncpus in that case?
> 
> yea, I have no idea.. that's why this is RFC ;-)
> 
> maybe we could also gradually allocate this bitmap
> in topology_update_package_map.. I'll check on that

it's very likely I'm missing something, but seems to me
that attached patch (untested) might work as well

jirka


---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 157bf0957219..003208bc8e44 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -100,7 +100,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 /* Logical package management. We might want to allocate that dynamically */
 static int *physical_to_logical_pkg __read_mostly;
 static unsigned long *physical_package_map __read_mostly;;
-static unsigned long *logical_package_map  __read_mostly;
 static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
@@ -277,14 +276,8 @@ int topology_update_package_map(unsigned int apicid, unsigned int cpu)
 	if (test_and_set_bit(pkg, physical_package_map))
 		goto found;
 
-	new = find_first_zero_bit(logical_package_map, __max_logical_packages);
-	if (new >= __max_logical_packages) {
-		physical_to_logical_pkg[pkg] = -1;
-		pr_warn("APIC(%x) Package %u exceeds logical package map\n",
-			apicid, pkg);
-		return -ENOSPC;
-	}
-	set_bit(new, logical_package_map);
+	new = __max_logical_packages++;
+
 	pr_info("APIC(%x) Converting physical %u to logical package %u\n",
 		apicid, pkg, new);
 	physical_to_logical_pkg[pkg] = new;
@@ -340,7 +333,7 @@ static void __init smp_init_package_map(void)
 		ncpus = 1;
 	}
 
-	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
+	__max_logical_packages = 0;
 
 	/*
 	 * Possibly larger than what we need as the number of apic ids per
@@ -352,10 +345,6 @@ static void __init smp_init_package_map(void)
 	memset(physical_to_logical_pkg, 0xff, size);
 	size = BITS_TO_LONGS(max_physical_pkg_id) * sizeof(unsigned long);
 	physical_package_map = kzalloc(size, GFP_KERNEL);
-	size = BITS_TO_LONGS(__max_logical_packages) * sizeof(unsigned long);
-	logical_package_map = kzalloc(size, GFP_KERNEL);
-
-	pr_info("Max logical packages: %u\n", __max_logical_packages);
 
 	for_each_present_cpu(cpu) {
 		unsigned int apicid = apic->cpu_present_to_apicid(cpu);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-10 14:15     ` Jiri Olsa
@ 2016-08-10 15:52       ` Peter Zijlstra
  2016-08-10 16:14         ` [PATCH] " Jiri Olsa
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-10 15:52 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Wed, Aug 10, 2016 at 04:15:38PM +0200, Jiri Olsa wrote:
> it's very likely I'm missing something, but seems to me
> that attached patch (untested) might work as well

Hmm, I think you're right. Earlier code (which hasn't survived) tried to
match logical and physical if possible, but that was busted and got
removed.

The code as is should indeed allow for this.

Maybe one nit, the variable is no longer used for a max, so maybe rename
it too?

> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 157bf0957219..003208bc8e44 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c

> @@ -277,14 +276,8 @@ int topology_update_package_map(unsigned int apicid, unsigned int cpu)
>  	if (test_and_set_bit(pkg, physical_package_map))
>  		goto found;
>  
> +	new = __max_logical_packages++;
> +
>  	pr_info("APIC(%x) Converting physical %u to logical package %u\n",
>  		apicid, pkg, new);
>  	physical_to_logical_pkg[pkg] = new;
> @@ -340,7 +333,7 @@ static void __init smp_init_package_map(void)
>  		ncpus = 1;
>  	}
>  
> +	__max_logical_packages = 0;
>  
>  	/*
>  	 * Possibly larger than what we need as the number of apic ids per

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-10 15:52       ` Peter Zijlstra
@ 2016-08-10 16:14         ` Jiri Olsa
  2016-08-11 12:48           ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Jiri Olsa @ 2016-08-10 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Wed, Aug 10, 2016 at 05:52:05PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 10, 2016 at 04:15:38PM +0200, Jiri Olsa wrote:
> > it's very likely I'm missing something, but seems to me
> > that attached patch (untested) might work as well
> 
> Hmm, I think you're right. Earlier code (which hasn't survived) tried to
> match logical and physical if possible, but that was busted and got
> removed.
> 
> The code as is should indeed allow for this.
> 
> Maybe one nit, the variable is no longer used for a max, so maybe rename
> it too?

ok, renaming it to logical_packages then, attaching full patch

thanks,
jirka


---
Frank reported kernel panic when he disabled several cores in BIOS
via following option:

  Core Disable Bitmap(Hex)   [0]

with number 0xFFE, which leaves 16 CPUs in system (out of 48).

The kernel panic below goes along with following messages:

 smpboot: Max logical packages: 2^M
 smpboot: APIC(0) Converting physical 0 to logical package 0^M
 smpboot: APIC(20) Converting physical 1 to logical package 1^M
 smpboot: APIC(40) Package 2 exceeds logical package map^M
 smpboot: CPU 8 APICId 40 disabled^M
 smpboot: APIC(60) Package 3 exceeds logical package map^M
 smpboot: CPU 12 APICId 60 disabled^M
 ...
 general protection fault: 0000 [#1] SMP^M
 Modules linked in:^M
 CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
 Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
 task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
 RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
 ...
  [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
  [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
  [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
  [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
  [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
  [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
  [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
  [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
  [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
  [<ffffffff816dc190>] ? rest_init+0x80/0x80^M

The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).

The __max_logical_packages is computed as:

  DIV_ROUND_UP(total_cpus, ncpus);
  - ncpus being number of cores

With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12).

Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.

The fix is to remove logical_package_map bitmap completely
and just keep and update the logical_packages number.

Reported-by: Frank Ramsay <framsay@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/include/asm/topology.h |  4 ++--
 arch/x86/kernel/smpboot.c       | 21 +++++----------------
 2 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index cf75871d2f81..c28010088651 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -118,8 +118,8 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
 
-extern unsigned int __max_logical_packages;
-#define topology_max_packages()			(__max_logical_packages)
+extern unsigned int logical_packages;
+#define topology_max_packages()			(logical_packages)
 
 extern int __max_smt_threads;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 157bf0957219..b1b775dbede6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -100,10 +100,9 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 /* Logical package management. We might want to allocate that dynamically */
 static int *physical_to_logical_pkg __read_mostly;
 static unsigned long *physical_package_map __read_mostly;;
-static unsigned long *logical_package_map  __read_mostly;
 static unsigned int max_physical_pkg_id __read_mostly;
-unsigned int __max_logical_packages __read_mostly;
-EXPORT_SYMBOL(__max_logical_packages);
+unsigned int logical_packages __read_mostly;
+EXPORT_SYMBOL(logical_packages);
 
 /* Maximum number of SMT threads on any online core */
 int __max_smt_threads __read_mostly;
@@ -277,14 +276,8 @@ int topology_update_package_map(unsigned int apicid, unsigned int cpu)
 	if (test_and_set_bit(pkg, physical_package_map))
 		goto found;
 
-	new = find_first_zero_bit(logical_package_map, __max_logical_packages);
-	if (new >= __max_logical_packages) {
-		physical_to_logical_pkg[pkg] = -1;
-		pr_warn("APIC(%x) Package %u exceeds logical package map\n",
-			apicid, pkg);
-		return -ENOSPC;
-	}
-	set_bit(new, logical_package_map);
+	new = logical_packages++;
+
 	pr_info("APIC(%x) Converting physical %u to logical package %u\n",
 		apicid, pkg, new);
 	physical_to_logical_pkg[pkg] = new;
@@ -340,7 +333,7 @@ static void __init smp_init_package_map(void)
 		ncpus = 1;
 	}
 
-	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
+	logical_packages = 0;
 
 	/*
 	 * Possibly larger than what we need as the number of apic ids per
@@ -352,10 +345,6 @@ static void __init smp_init_package_map(void)
 	memset(physical_to_logical_pkg, 0xff, size);
 	size = BITS_TO_LONGS(max_physical_pkg_id) * sizeof(unsigned long);
 	physical_package_map = kzalloc(size, GFP_KERNEL);
-	size = BITS_TO_LONGS(__max_logical_packages) * sizeof(unsigned long);
-	logical_package_map = kzalloc(size, GFP_KERNEL);
-
-	pr_info("Max logical packages: %u\n", __max_logical_packages);
 
 	for_each_present_cpu(cpu) {
 		unsigned int apicid = apic->cpu_present_to_apicid(cpu);
-- 
2.4.11

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-10 16:14         ` [PATCH] " Jiri Olsa
@ 2016-08-11 12:48           ` Peter Zijlstra
  2016-08-11 13:05             ` Jiri Olsa
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-11 12:48 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Wed, Aug 10, 2016 at 06:14:18PM +0200, Jiri Olsa wrote:
> > Maybe one nit, the variable is no longer used for a max, so maybe rename
> > it too?

> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index cf75871d2f81..c28010088651 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -118,8 +118,8 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
>  #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
>  #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
>  
> -extern unsigned int __max_logical_packages;
> -#define topology_max_packages()			(__max_logical_packages)
> +extern unsigned int logical_packages;
> +#define topology_max_packages()			(logical_packages)
>  
>  extern int __max_smt_threads;
>  

Aaahh, I missed this bit yesterday..


Imagine a machine with physical hotplug, where we boot with half the
sockets populated (say 2), then topology_max_packages() will return 2
when we run the uncore driver init.

That driver will allocate resources based on 2.

Then we physically hotplug the remaining sockets (another 2), which will
result in topology_max_packages() to now return 4.

When we run the cpuhotplug hook, it will try and access resources based
on 4, which were not allocated.


blergh..

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-11 12:48           ` Peter Zijlstra
@ 2016-08-11 13:05             ` Jiri Olsa
  2016-08-11 13:46               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Jiri Olsa @ 2016-08-11 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Thu, Aug 11, 2016 at 02:48:39PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 10, 2016 at 06:14:18PM +0200, Jiri Olsa wrote:
> > > Maybe one nit, the variable is no longer used for a max, so maybe rename
> > > it too?
> 
> > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > index cf75871d2f81..c28010088651 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -118,8 +118,8 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
> >  #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
> >  #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
> >  
> > -extern unsigned int __max_logical_packages;
> > -#define topology_max_packages()			(__max_logical_packages)
> > +extern unsigned int logical_packages;
> > +#define topology_max_packages()			(logical_packages)
> >  
> >  extern int __max_smt_threads;
> >  
> 
> Aaahh, I missed this bit yesterday..
> 
> 
> Imagine a machine with physical hotplug, where we boot with half the
> sockets populated (say 2), then topology_max_packages() will return 2
> when we run the uncore driver init.
> 
> That driver will allocate resources based on 2.
> 
> Then we physically hotplug the remaining sockets (another 2), which will
> result in topology_max_packages() to now return 4.
> 
> When we run the cpuhotplug hook, it will try and access resources based
> on 4, which were not allocated.

hum, so we either need some acpi solution to get number of all
sockets or fix the uncore code to initialize pmu boxes on cpu
hotplug as well

jirka

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-11 13:05             ` Jiri Olsa
@ 2016-08-11 13:46               ` Peter Zijlstra
  2016-08-12 12:24                 ` Jiri Olsa
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-11 13:46 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Thu, Aug 11, 2016 at 03:05:21PM +0200, Jiri Olsa wrote:
> hum, so we either need some acpi solution to get number of all
> sockets or

This.. So the problem here is that the BIOS completely screws us over.

It wrecks the ACPI-ID table with that option to limit the number of CPUs
exposed to the OS (note that it didn't need to do that, it could have
enumerated them as empty, instead of not there at all) while keeping the
CPUID of the CPUs as reporting they have many (12? was it) cores.

This results in inconsistent state, and we're left with nothing useful.

> fix the uncore code to initialize pmu boxes on cpu hotplug as well

Can't.. it uses the boxes at STARTING time, and we can't do allocs
there. Not can we alloc earlier, because we don't know max_packages is
going to increase.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-11 13:46               ` Peter Zijlstra
@ 2016-08-12 12:24                 ` Jiri Olsa
  2016-08-12 13:12                   ` Jiri Olsa
  2016-08-15  9:04                   ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Jiri Olsa @ 2016-08-12 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Thu, Aug 11, 2016 at 03:46:51PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 11, 2016 at 03:05:21PM +0200, Jiri Olsa wrote:
> > hum, so we either need some acpi solution to get number of all
> > sockets or
> 
> This.. So the problem here is that the BIOS completely screws us over.
> 
> It wrecks the ACPI-ID table with that option to limit the number of CPUs
> exposed to the OS (note that it didn't need to do that, it could have
> enumerated them as empty, instead of not there at all) while keeping the
> CPUID of the CPUs as reporting they have many (12? was it) cores.
> 
> This results in inconsistent state, and we're left with nothing useful.
> 
> > fix the uncore code to initialize pmu boxes on cpu hotplug as well
> 
> Can't.. it uses the boxes at STARTING time, and we can't do allocs
> there. Not can we alloc earlier, because we don't know max_packages is
> going to increase.

I still need to test this, but would this be something
like you proposed on irc?

thanks,
jirka


---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2a6e84a30a54..4296beb8fdd3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -100,10 +100,11 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 /* Logical package management. We might want to allocate that dynamically */
 static int *physical_to_logical_pkg __read_mostly;
 static unsigned long *physical_package_map __read_mostly;;
-static unsigned long *logical_package_map  __read_mostly;
 static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
+static unsigned int logical_packages __read_mostly;
+static bool logical_packages_frozen __read_mostly;
 
 /* Maximum number of SMT threads on any online core */
 int __max_smt_threads __read_mostly;
@@ -277,14 +278,14 @@ int topology_update_package_map(unsigned int apicid, unsigned int cpu)
 	if (test_and_set_bit(pkg, physical_package_map))
 		goto found;
 
-	new = find_first_zero_bit(logical_package_map, __max_logical_packages);
-	if (new >= __max_logical_packages) {
+	if (logical_packages_frozen) {
 		physical_to_logical_pkg[pkg] = -1;
-		pr_warn("APIC(%x) Package %u exceeds logical package map\n",
+		pr_warn("APIC(%x) Package %u exceeds logical package max\n",
 			apicid, pkg);
 		return -ENOSPC;
 	}
-	set_bit(new, logical_package_map);
+
+	new = logical_packages++;
 	pr_info("APIC(%x) Converting physical %u to logical package %u\n",
 		apicid, pkg, new);
 	physical_to_logical_pkg[pkg] = new;
@@ -341,6 +342,7 @@ static void __init smp_init_package_map(void)
 	}
 
 	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
+	logical_packages = 0;
 
 	/*
 	 * Possibly larger than what we need as the number of apic ids per
@@ -352,10 +354,6 @@ static void __init smp_init_package_map(void)
 	memset(physical_to_logical_pkg, 0xff, size);
 	size = BITS_TO_LONGS(max_physical_pkg_id) * sizeof(unsigned long);
 	physical_package_map = kzalloc(size, GFP_KERNEL);
-	size = BITS_TO_LONGS(__max_logical_packages) * sizeof(unsigned long);
-	logical_package_map = kzalloc(size, GFP_KERNEL);
-
-	pr_info("Max logical packages: %u\n", __max_logical_packages);
 
 	for_each_present_cpu(cpu) {
 		unsigned int apicid = apic->cpu_present_to_apicid(cpu);
@@ -369,6 +367,15 @@ static void __init smp_init_package_map(void)
 		set_cpu_possible(cpu, false);
 		set_cpu_present(cpu, false);
 	}
+
+	if (logical_packages > __max_logical_packages) {
+		pr_warn("Detected more packages (%u), then computed by BIOS data (%u).\n",
+			logical_packages, __max_logical_packages);
+		logical_packages_frozen = true;
+		__max_logical_packages  = logical_packages;
+	}
+
+	pr_info("Max logical packages: %u\n", __max_logical_packages);
 }
 
 void __init smp_store_boot_cpu_info(void)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-12 12:24                 ` Jiri Olsa
@ 2016-08-12 13:12                   ` Jiri Olsa
  2016-08-15  9:04                   ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Jiri Olsa @ 2016-08-12 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Fri, Aug 12, 2016 at 02:24:57PM +0200, Jiri Olsa wrote:
> On Thu, Aug 11, 2016 at 03:46:51PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 11, 2016 at 03:05:21PM +0200, Jiri Olsa wrote:
> > > hum, so we either need some acpi solution to get number of all
> > > sockets or
> > 
> > This.. So the problem here is that the BIOS completely screws us over.
> > 
> > It wrecks the ACPI-ID table with that option to limit the number of CPUs
> > exposed to the OS (note that it didn't need to do that, it could have
> > enumerated them as empty, instead of not there at all) while keeping the
> > CPUID of the CPUs as reporting they have many (12? was it) cores.
> > 
> > This results in inconsistent state, and we're left with nothing useful.
> > 
> > > fix the uncore code to initialize pmu boxes on cpu hotplug as well
> > 
> > Can't.. it uses the boxes at STARTING time, and we can't do allocs
> > there. Not can we alloc earlier, because we don't know max_packages is
> > going to increase.
> 
> I still need to test this, but would this be something
> like you proposed on irc?

works on my test machine:

[    0.742505] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.749902] smpboot: APIC(20) Converting physical 1 to logical package 1
[    0.757390] smpboot: APIC(40) Converting physical 2 to logical package 2
[    0.764879] smpboot: APIC(60) Converting physical 3 to logical package 3
[    0.772368] smpboot: Detected more packages (4), then computed by BIOS data (1).
[    0.780630] smpboot: Max logical packages: 4


jirka

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-12 12:24                 ` Jiri Olsa
  2016-08-12 13:12                   ` Jiri Olsa
@ 2016-08-15  9:04                   ` Peter Zijlstra
  2016-08-15 10:17                     ` Jiri Olsa
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-15  9:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Fri, Aug 12, 2016 at 02:24:57PM +0200, Jiri Olsa wrote:
> I still need to test this, but would this be something
> like you proposed on irc?

Yep, looks good. Please post with Changelog etc..

Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-15  9:04                   ` Peter Zijlstra
@ 2016-08-15 10:17                     ` Jiri Olsa
  2016-08-15 11:46                       ` Prarit Bhargava
  2016-08-18 10:50                       ` [tip:x86/urgent] " tip-bot for Jiri Olsa
  0 siblings, 2 replies; 16+ messages in thread
From: Jiri Olsa @ 2016-08-15 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay, Prarit Bhargava

On Mon, Aug 15, 2016 at 11:04:34AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 12, 2016 at 02:24:57PM +0200, Jiri Olsa wrote:
> > I still need to test this, but would this be something
> > like you proposed on irc?
> 
> Yep, looks good. Please post with Changelog etc..

attached,

thanks,
jirka


---
Frank reported kernel panic when he disabled several cores in BIOS
via following option:

  Core Disable Bitmap(Hex)   [0]

with number 0xFFE, which leaves 16 CPUs in system (out of 48).

The kernel panic below goes along with following messages:

 smpboot: Max logical packages: 2^M
 smpboot: APIC(0) Converting physical 0 to logical package 0^M
 smpboot: APIC(20) Converting physical 1 to logical package 1^M
 smpboot: APIC(40) Package 2 exceeds logical package map^M
 smpboot: CPU 8 APICId 40 disabled^M
 smpboot: APIC(60) Package 3 exceeds logical package map^M
 smpboot: CPU 12 APICId 60 disabled^M
 ...
 general protection fault: 0000 [#1] SMP^M
 Modules linked in:^M
 CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
 Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
 task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
 RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
 ...
  [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
  [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
  [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
  [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
  [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
  [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
  [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
  [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
  [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
  [<ffffffff816dc190>] ? rest_init+0x80/0x80^M

The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).

The __max_logical_packages is computed as:

  DIV_ROUND_UP(total_cpus, ncpus);
  - ncpus being number of cores

With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12).

Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.

The fix is to remove logical_package_map bitmap completely
and keep and update the logical_packages number instead.

After we enumerate all the present cpus, we check if the
enumerated logical packages count is within its computed
maximum from BIOS data.

If it's not the case, we set this maximum to the new enumerated
value and freeze any new addition of logical packages.

The freeze is because lot of init code like uncore/rapl/cqm
depends on having maximum logical package value set to allocate
their data, so we can't change it later on.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Frank Ramsay <framsay@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/smpboot.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2a6e84a30a54..4296beb8fdd3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -100,10 +100,11 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 /* Logical package management. We might want to allocate that dynamically */
 static int *physical_to_logical_pkg __read_mostly;
 static unsigned long *physical_package_map __read_mostly;;
-static unsigned long *logical_package_map  __read_mostly;
 static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
+static unsigned int logical_packages __read_mostly;
+static bool logical_packages_frozen __read_mostly;
 
 /* Maximum number of SMT threads on any online core */
 int __max_smt_threads __read_mostly;
@@ -277,14 +278,14 @@ int topology_update_package_map(unsigned int apicid, unsigned int cpu)
 	if (test_and_set_bit(pkg, physical_package_map))
 		goto found;
 
-	new = find_first_zero_bit(logical_package_map, __max_logical_packages);
-	if (new >= __max_logical_packages) {
+	if (logical_packages_frozen) {
 		physical_to_logical_pkg[pkg] = -1;
-		pr_warn("APIC(%x) Package %u exceeds logical package map\n",
+		pr_warn("APIC(%x) Package %u exceeds logical package max\n",
 			apicid, pkg);
 		return -ENOSPC;
 	}
-	set_bit(new, logical_package_map);
+
+	new = logical_packages++;
 	pr_info("APIC(%x) Converting physical %u to logical package %u\n",
 		apicid, pkg, new);
 	physical_to_logical_pkg[pkg] = new;
@@ -341,6 +342,7 @@ static void __init smp_init_package_map(void)
 	}
 
 	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
+	logical_packages = 0;
 
 	/*
 	 * Possibly larger than what we need as the number of apic ids per
@@ -352,10 +354,6 @@ static void __init smp_init_package_map(void)
 	memset(physical_to_logical_pkg, 0xff, size);
 	size = BITS_TO_LONGS(max_physical_pkg_id) * sizeof(unsigned long);
 	physical_package_map = kzalloc(size, GFP_KERNEL);
-	size = BITS_TO_LONGS(__max_logical_packages) * sizeof(unsigned long);
-	logical_package_map = kzalloc(size, GFP_KERNEL);
-
-	pr_info("Max logical packages: %u\n", __max_logical_packages);
 
 	for_each_present_cpu(cpu) {
 		unsigned int apicid = apic->cpu_present_to_apicid(cpu);
@@ -369,6 +367,15 @@ static void __init smp_init_package_map(void)
 		set_cpu_possible(cpu, false);
 		set_cpu_present(cpu, false);
 	}
+
+	if (logical_packages > __max_logical_packages) {
+		pr_warn("Detected more packages (%u), then computed by BIOS data (%u).\n",
+			logical_packages, __max_logical_packages);
+		logical_packages_frozen = true;
+		__max_logical_packages  = logical_packages;
+	}
+
+	pr_info("Max logical packages: %u\n", __max_logical_packages);
 }
 
 void __init smp_store_boot_cpu_info(void)
-- 
2.4.11

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/smp: Fix __max_logical_packages value setup
  2016-08-15 10:17                     ` Jiri Olsa
@ 2016-08-15 11:46                       ` Prarit Bhargava
  2016-08-18 10:50                       ` [tip:x86/urgent] " tip-bot for Jiri Olsa
  1 sibling, 0 replies; 16+ messages in thread
From: Prarit Bhargava @ 2016-08-15 11:46 UTC (permalink / raw)
  To: Jiri Olsa, Peter Zijlstra
  Cc: Thomas Gleixner, Andi Kleen, linux-kernel, Andi Kleen, x86,
	Ingo Molnar, Frank Ramsay



On 08/15/2016 06:17 AM, Jiri Olsa wrote:
> On Mon, Aug 15, 2016 at 11:04:34AM +0200, Peter Zijlstra wrote:
>> On Fri, Aug 12, 2016 at 02:24:57PM +0200, Jiri Olsa wrote:
>>> I still need to test this, but would this be something
>>> like you proposed on irc?
>>
>> Yep, looks good. Please post with Changelog etc..
> 
> attached,
> 
> thanks,
> jirka
> 
> 
> ---
> Frank reported kernel panic when he disabled several cores in BIOS
> via following option:
> 
>   Core Disable Bitmap(Hex)   [0]
> 
> with number 0xFFE, which leaves 16 CPUs in system (out of 48).
> 
> The kernel panic below goes along with following messages:
> 
>  smpboot: Max logical packages: 2^M
>  smpboot: APIC(0) Converting physical 0 to logical package 0^M
>  smpboot: APIC(20) Converting physical 1 to logical package 1^M
>  smpboot: APIC(40) Package 2 exceeds logical package map^M
>  smpboot: CPU 8 APICId 40 disabled^M
>  smpboot: APIC(60) Package 3 exceeds logical package map^M
>  smpboot: CPU 12 APICId 60 disabled^M
>  ...
>  general protection fault: 0000 [#1] SMP^M
>  Modules linked in:^M
>  CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
>  Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
>  task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
>  RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
>  ...
>   [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
>   [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
>   [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
>   [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
>   [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
>   [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
>   [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
>   [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
>   [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
>   [<ffffffff816dc190>] ? rest_init+0x80/0x80^M
> 
> The reason for the panic is wrong value of __max_logical_packages,
> which lets logical_package_map uninitialized and the uncore code
> relying on this map being properly initialized (maybe we should
> add some safety checks there as well).
> 
> The __max_logical_packages is computed as:
> 
>   DIV_ROUND_UP(total_cpus, ncpus);
>   - ncpus being number of cores
> 
> With above BIOS setup we get total_cpus == 16 which set
> __max_logical_packages to 2 (ncpus is 12).
> 
> Once topology_update_package_map processes CPU with logical
> pkg over 2 we display above messages and fail to initialize
> the physical_to_logical_pkg map, which makes the uncore code
> crash.
> 
> The fix is to remove logical_package_map bitmap completely
> and keep and update the logical_packages number instead.
> 
> After we enumerate all the present cpus, we check if the
> enumerated logical packages count is within its computed
> maximum from BIOS data.
> 
> If it's not the case, we set this maximum to the new enumerated
> value and freeze any new addition of logical packages.
> 
> The freeze is because lot of init code like uncore/rapl/cqm
> depends on having maximum logical package value set to allocate
> their data, so we can't change it later on.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Reported-by: Frank Ramsay <framsay@redhat.com>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>

Reviewed-and-tested-by: Prarit Bhargava <prarit@redhat.com>


>From dmidecode:
        Core Count: 24
        Core Enabled: 24
        Thread Count: 48


Testing of patch below ...

Orig kernel output:

[    0.464981] smpboot: Max logical packages: 19
[    0.469861] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.477261] smpboot: APIC(40) Converting physical 1 to logical package 1
[    0.484760] smpboot: APIC(80) Converting physical 2 to logical package 2
[    0.492258] smpboot: APIC(c0) Converting physical 3 to logical package 3


1.  nr_cpus=8, should stop enumerating in package 0

[    0.533664] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.539596] smpboot: Max logical packages: 19


2.  max_cpus=8, should still enumerate all packages

[    0.526494] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.532428] smpboot: APIC(40) Converting physical 1 to logical package 1
[    0.538456] smpboot: APIC(80) Converting physical 2 to logical package 2
[    0.544486] smpboot: APIC(c0) Converting physical 3 to logical package 3
[    0.550524] smpboot: Max logical packages: 19

3.  nr_cpus=49 ( 2 socket + 1 core on 3rd socket), should stop enumerating in
package 2

[    0.521378] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.527314] smpboot: APIC(40) Converting physical 1 to logical package 1
[    0.533345] smpboot: APIC(80) Converting physical 2 to logical package 2
[    0.539368] smpboot: Max logical packages: 19

4.  maxcpus=49, should still enumerate all packages

[    0.525591] smpboot: APIC(0) Converting physical 0 to logical package 0
[    0.531525] smpboot: APIC(40) Converting physical 1 to logical package 1
[    0.537547] smpboot: APIC(80) Converting physical 2 to logical package 2
[    0.543579] smpboot: APIC(c0) Converting physical 3 to logical package 3
[    0.549624] smpboot: Max logical packages: 19

5.  kdump (nr_cpus=1) works.

P.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip:x86/urgent] x86/smp: Fix __max_logical_packages value setup
  2016-08-15 10:17                     ` Jiri Olsa
  2016-08-15 11:46                       ` Prarit Bhargava
@ 2016-08-18 10:50                       ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 16+ messages in thread
From: tip-bot for Jiri Olsa @ 2016-08-18 10:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jolsa, jolsa, linux-kernel, peterz, framsay, mingo, hpa,
	torvalds, prarit, tglx

Commit-ID:  7b0501b1e7cddd32b265178e32d332bdfbb532d4
Gitweb:     http://git.kernel.org/tip/7b0501b1e7cddd32b265178e32d332bdfbb532d4
Author:     Jiri Olsa <jolsa@redhat.com>
AuthorDate: Mon, 15 Aug 2016 12:17:00 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 10:14:48 +0200

x86/smp: Fix __max_logical_packages value setup

Frank reported kernel panic when he disabled several cores in BIOS
via following option:

  Core Disable Bitmap(Hex)   [0]

with number 0xFFE, which leaves 16 CPUs in system (out of 48).

The kernel panic below goes along with following messages:

 smpboot: Max logical packages: 2^M
 smpboot: APIC(0) Converting physical 0 to logical package 0^M
 smpboot: APIC(20) Converting physical 1 to logical package 1^M
 smpboot: APIC(40) Package 2 exceeds logical package map^M
 smpboot: CPU 8 APICId 40 disabled^M
 smpboot: APIC(60) Package 3 exceeds logical package map^M
 smpboot: CPU 12 APICId 60 disabled^M
 ...
 general protection fault: 0000 [#1] SMP^M
 Modules linked in:^M
 CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
 Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
 task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
 RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
 ...
  [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
  [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
  [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
  [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
  [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
  [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
  [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
  [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
  [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
  [<ffffffff816dc190>] ? rest_init+0x80/0x80^M

The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).

The __max_logical_packages is computed as:

  DIV_ROUND_UP(total_cpus, ncpus);
  - ncpus being number of cores

With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12).

Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.

The fix is to remove logical_package_map bitmap completely
and keep and update the logical_packages number instead.

After we enumerate all the present CPUs, we check if the
enumerated logical packages count is within its computed
maximum from BIOS data.

If it's not the case, we set this maximum to the new enumerated
value and freeze any new addition of logical packages.

The freeze is because lot of init code like uncore/rapl/cqm
depends on having maximum logical package value set to allocate
their data, so we can't change it later on.

Prarit Bhargava tested the patch and confirms that it solves
the problem:

  From dmidecode:
          Core Count: 24
          Core Enabled: 24
          Thread Count: 48

Orig kernel boot log:

 [    0.464981] smpboot: Max logical packages: 19
 [    0.469861] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.477261] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.484760] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.492258] smpboot: APIC(c0) Converting physical 3 to logical package 3

1.  nr_cpus=8, should stop enumerating in package 0:

 [    0.533664] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.539596] smpboot: Max logical packages: 19

2.  max_cpus=8, should still enumerate all packages:

 [    0.526494] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.532428] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.538456] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.544486] smpboot: APIC(c0) Converting physical 3 to logical package 3
 [    0.550524] smpboot: Max logical packages: 19

3.  nr_cpus=49 ( 2 socket + 1 core on 3rd socket), should stop enumerating in
    package 2:

 [    0.521378] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.527314] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.533345] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.539368] smpboot: Max logical packages: 19

4.  maxcpus=49, should still enumerate all packages:

 [    0.525591] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.531525] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.537547] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.543579] smpboot: APIC(c0) Converting physical 3 to logical package 3
 [    0.549624] smpboot: Max logical packages: 19

5.  kdump (nr_cpus=1) works as well.

Reported-by: Frank Ramsay <framsay@redhat.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160815101700.GA30090@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/smpboot.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2a6e84a..4296beb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -100,10 +100,11 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 /* Logical package management. We might want to allocate that dynamically */
 static int *physical_to_logical_pkg __read_mostly;
 static unsigned long *physical_package_map __read_mostly;;
-static unsigned long *logical_package_map  __read_mostly;
 static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
+static unsigned int logical_packages __read_mostly;
+static bool logical_packages_frozen __read_mostly;
 
 /* Maximum number of SMT threads on any online core */
 int __max_smt_threads __read_mostly;
@@ -277,14 +278,14 @@ int topology_update_package_map(unsigned int apicid, unsigned int cpu)
 	if (test_and_set_bit(pkg, physical_package_map))
 		goto found;
 
-	new = find_first_zero_bit(logical_package_map, __max_logical_packages);
-	if (new >= __max_logical_packages) {
+	if (logical_packages_frozen) {
 		physical_to_logical_pkg[pkg] = -1;
-		pr_warn("APIC(%x) Package %u exceeds logical package map\n",
+		pr_warn("APIC(%x) Package %u exceeds logical package max\n",
 			apicid, pkg);
 		return -ENOSPC;
 	}
-	set_bit(new, logical_package_map);
+
+	new = logical_packages++;
 	pr_info("APIC(%x) Converting physical %u to logical package %u\n",
 		apicid, pkg, new);
 	physical_to_logical_pkg[pkg] = new;
@@ -341,6 +342,7 @@ static void __init smp_init_package_map(void)
 	}
 
 	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
+	logical_packages = 0;
 
 	/*
 	 * Possibly larger than what we need as the number of apic ids per
@@ -352,10 +354,6 @@ static void __init smp_init_package_map(void)
 	memset(physical_to_logical_pkg, 0xff, size);
 	size = BITS_TO_LONGS(max_physical_pkg_id) * sizeof(unsigned long);
 	physical_package_map = kzalloc(size, GFP_KERNEL);
-	size = BITS_TO_LONGS(__max_logical_packages) * sizeof(unsigned long);
-	logical_package_map = kzalloc(size, GFP_KERNEL);
-
-	pr_info("Max logical packages: %u\n", __max_logical_packages);
 
 	for_each_present_cpu(cpu) {
 		unsigned int apicid = apic->cpu_present_to_apicid(cpu);
@@ -369,6 +367,15 @@ static void __init smp_init_package_map(void)
 		set_cpu_possible(cpu, false);
 		set_cpu_present(cpu, false);
 	}
+
+	if (logical_packages > __max_logical_packages) {
+		pr_warn("Detected more packages (%u), then computed by BIOS data (%u).\n",
+			logical_packages, __max_logical_packages);
+		logical_packages_frozen = true;
+		__max_logical_packages  = logical_packages;
+	}
+
+	pr_info("Max logical packages: %u\n", __max_logical_packages);
 }
 
 void __init smp_store_boot_cpu_info(void)

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-08-18 10:51 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-03 16:23 [RFC][PATCH] x86/smp: Fix __max_logical_packages value setup Jiri Olsa
2016-08-10 11:41 ` Jiri Olsa
2016-08-10 13:54 ` Peter Zijlstra
2016-08-10 14:00   ` Jiri Olsa
2016-08-10 14:15     ` Jiri Olsa
2016-08-10 15:52       ` Peter Zijlstra
2016-08-10 16:14         ` [PATCH] " Jiri Olsa
2016-08-11 12:48           ` Peter Zijlstra
2016-08-11 13:05             ` Jiri Olsa
2016-08-11 13:46               ` Peter Zijlstra
2016-08-12 12:24                 ` Jiri Olsa
2016-08-12 13:12                   ` Jiri Olsa
2016-08-15  9:04                   ` Peter Zijlstra
2016-08-15 10:17                     ` Jiri Olsa
2016-08-15 11:46                       ` Prarit Bhargava
2016-08-18 10:50                       ` [tip:x86/urgent] " tip-bot for Jiri Olsa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.