Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

* Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
@ 2014-07-16 14:55 Bruno Wolff III
  2014-07-16 15:17 ` Josh Boyer
  0 siblings, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-16 14:55 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, jwboyer

caffcdd8d27ba78730d5540396ce72ad022aff2c has been causing crashes early in 
the boot process on one of three machines I have been testing the kernel 
on. On that one machine it happens every boot. It happens before netconsole 
is functional.

A partial revert of the commit fixes the problem. I do not know why the 
commit is broken though.

I have filed https://bugzilla.kernel.org/show_bug.cgi?id=80251 for this 
issue.

The problem happens on both Fedora and Linus kernels.

git diff caffcdd8d27ba78730d5540396ce72ad022aff2c^ caffcdd8d27ba78730d5540396ce72ad022aff2c
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 45d077ed24fb..6340c601475d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5794,8 +5794,6 @@ build_sched_groups(struct sched_domain *sd, int cpu)
                        continue;

                group = get_group(i, sdd, &sg);
-               cpumask_clear(sched_group_cpus(sg));
-               sg->sgp->power = 0;
                cpumask_setall(sched_group_mask(sg));

                for_each_cpu(j, span) {

By rc5 the second line can't be added back because the structure has changed. 
However adding back cpumask_clear(sched_group_cpus(sg)); to rc5 got things 
working for me again.
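
A minimal userspace sketch of why the explicit clear can matter, assuming (as a
simplification of the loop that follows get_group()) that a group's mask is only
ever built up by setting one bit per member CPU. The names mask_t,
build_group_mask and the clear_first flag are hypothetical and exist only for
this illustration; a single unsigned long stands in for a struct cpumask.

```c
/* Simplified userspace model, not kernel code. */
typedef unsigned long mask_t;

/*
 * Build a group's CPU mask by OR-ing in each member CPU.
 * 'initial' is whatever the mask held before the loop ran;
 * 'clear_first' models the presence of cpumask_clear().
 */
mask_t build_group_mask(mask_t initial, const int *members, int n,
			int clear_first)
{
	mask_t m = initial;
	int i;

	if (clear_first)
		m = 0;				/* analogue of cpumask_clear() */

	for (i = 0; i < n; i++)
		m |= 1UL << members[i];		/* analogue of cpumask_set_cpu() */

	return m;
}
```

With members {0, 2}, an initially empty mask gives 0x5 with or without the
clear. A mask that already held bit 3 gives 0xd unless it is cleared first, so
in this model the removed line only matters when the mask is not empty on entry.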

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 14:55 Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Bruno Wolff III
@ 2014-07-16 15:17 ` Josh Boyer
  2014-07-16 19:17   ` Dietmar Eggemann
  0 siblings, 1 reply; 44+ messages in thread
From: Josh Boyer @ 2014-07-16 15:17 UTC (permalink / raw)
  To: Bruno Wolff III, Dietmar Eggemann; +Cc: mingo, peterz, linux-kernel

Adding Dietmar in since he is the original author.

josh

On Wed, Jul 16, 2014 at 09:55:46AM -0500, Bruno Wolff III wrote:
> caffcdd8d27ba78730d5540396ce72ad022aff2c has been causing crashes
> early in the boot process on one of three machines I have been
> testing the kernel on. On that one machine it happens every boot. It
> happens before netconsole is functional.
> 
> A partial revert of the commit fixes the problem. I do not know why
> the commit is broken though.
> 
> I have filed https://bugzilla.kernel.org/show_bug.cgi?id=80251 for
> this issue.
> 
> The problem happens on both Fedora and Linus kernels.
> 
> git diff caffcdd8d27ba78730d5540396ce72ad022aff2c^ caffcdd8d27ba78730d5540396ce72ad022aff2c
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 45d077ed24fb..6340c601475d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5794,8 +5794,6 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>                        continue;
> 
>                group = get_group(i, sdd, &sg);
> -               cpumask_clear(sched_group_cpus(sg));
> -               sg->sgp->power = 0;
>                cpumask_setall(sched_group_mask(sg));
> 
>                for_each_cpu(j, span) {
> 
> By rc5 the second line can't be added back because the structure has
> changed. However adding back cpumask_clear(sched_group_cpus(sg)); to
> rc5 got things working for me again.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 15:17 ` Josh Boyer
@ 2014-07-16 19:17   ` Dietmar Eggemann
  2014-07-16 19:54     ` Bruno Wolff III
  2014-07-17  4:28     ` Bruno Wolff III
  0 siblings, 2 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-16 19:17 UTC (permalink / raw)
  To: Josh Boyer, Bruno Wolff III; +Cc: mingo, peterz, linux-kernel

Hi Bruno and Josh,

On 16/07/14 17:17, Josh Boyer wrote:
> Adding Dietmar in since he is the original author.
>
> josh
>
> On Wed, Jul 16, 2014 at 09:55:46AM -0500, Bruno Wolff III wrote:
>> caffcdd8d27ba78730d5540396ce72ad022aff2c has been causing crashes
>> early in the boot process on one of three machines I have been
>> testing the kernel on. On that one machine it happens every boot. It
>> happens before netconsole is functional.

I tested this patch on two platforms (ARM TC2 and INTEL i5 M520) by 
replacing the two lines (already with the new sg->sgc->capacity instead 
of the old sg->sgp->power) by:

  BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
  BUG_ON(sg->sgc->capacity);

The memory for sg is allocated and zeroed out in __sdt_alloc() with:

sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
					GFP_KERNEL, cpu_to_node(j));

The related call chain:

build_sched_domains()
	__visit_domain_allocation_hell()
		__sdt_alloc()
	build_sched_groups()

>>
>> A partial revert of the commit fixes the problem. I do not know why
>> the commit is broken though.
>>
>> I have filed https://bugzilla.kernel.org/show_bug.cgi?id=80251 for
>> this issue.

From the issue, I see that the machine making trouble is a Xeon (2
processors w/ hyper-threading).

Could you please share:

  cat /proc/cpuinfo and
  cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)

from this machine.

I don't think it is SMT, since it's also there on my INTEL i5 M520
(arch/x86/configs/x86_64_defconfig).

Could you also put the two BUG_ON lines into build_sched_groups()
[kernel/sched/core.c], without the cpumask_clear() and without setting
sg->sgc->capacity to 0, and share the possible crash output as well?

>>
>> The problem happens on both Fedora and Linus kernels.
>>
>> git diff caffcdd8d27ba78730d5540396ce72ad022aff2c^ caffcdd8d27ba78730d5540396ce72ad022aff2c
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 45d077ed24fb..6340c601475d 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5794,8 +5794,6 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>>                         continue;
>>
>>                 group = get_group(i, sdd, &sg);
>> -               cpumask_clear(sched_group_cpus(sg));
>> -               sg->sgp->power = 0;
>>                 cpumask_setall(sched_group_mask(sg));
>>
>>                 for_each_cpu(j, span) {
>>
>> By rc5 the second line can't be added back because the structure has
>> changed. However adding back cpumask_clear(sched_group_cpus(sg)); to
>> rc5 got things working for me again.

That's because "sched: Let 'struct sched_group_power' care about CPU
capacity" (commit id 63b2ca30bdb3) changes the struct sched_group member
from struct sched_group_power *sgp to struct sched_group_capacity *sgc.

I.e. the second line becomes

  sg->sgc->capacity = 0;

Thanks,

-- Dietmar

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 19:17   ` Dietmar Eggemann
@ 2014-07-16 19:54     ` Bruno Wolff III
  2014-07-16 23:18       ` Dietmar Eggemann
  2014-07-17  4:28     ` Bruno Wolff III
  1 sibling, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-16 19:54 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Josh Boyer, mingo, peterz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 926 bytes --]

On Wed, Jul 16, 2014 at 21:17:32 +0200,
  Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>Hi Bruno and Josh,
>
>From the issue, I see that the machine making trouble is an Xeon (2 
>processors w/ hyper-threading).
>
>Could you please share:
>
> cat /proc/cpuinfo and

I have attached it to the bug and to this message.

> cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)

It looks like that isn't set for my previous builds and I'll need to 
set it for my next test build.

>Could you also put the two BUG_ON lines into build_sched_groups() 
>[kernel/sched/core.c] wo/ the cpumask_clear() and setting 
>sg->sgc->capacity to 0 and share the possible crash output as well?

I can try a new build with this. I can probably get results back tomorrow 
before I leave for work. The crashes happen too early in the boot process 
for me to easily capture output as text. I can slow things down to take 
pictures though.

[-- Attachment #2: cpuinfo.out --]
[-- Type: text/plain, Size: 2560 bytes --]

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 9
microcode	: 0x2d
cpu MHz		: 2657.830
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips	: 5315.66
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 32 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 9
microcode	: 0x2d
cpu MHz		: 2657.830
cache size	: 512 KB
physical id	: 3
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 6
initial apicid	: 6
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips	: 5314.67
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 32 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 9
microcode	: 0x2d
cpu MHz		: 2657.830
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips	: 5314.68
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 32 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 9
microcode	: 0x2d
cpu MHz		: 2657.830
cache size	: 512 KB
physical id	: 3
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 7
initial apicid	: 7
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips	: 5314.68
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 32 bits virtual
power management:


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 19:54     ` Bruno Wolff III
@ 2014-07-16 23:18       ` Dietmar Eggemann
  2014-07-17  3:09         ` Bruno Wolff III
  2014-07-17  4:21         ` Bruno Wolff III
  0 siblings, 2 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-16 23:18 UTC (permalink / raw)
  To: Bruno Wolff III; +Cc: Josh Boyer, mingo, peterz, linux-kernel

On 16/07/14 21:54, Bruno Wolff III wrote:
> On Wed, Jul 16, 2014 at 21:17:32 +0200,
>    Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> Hi Bruno and Josh,
>>
>>From the issue, I see that the machine making trouble is an Xeon (2
>> processors w/ hyper-threading).
>>
>> Could you please share:
>>
>> cat /proc/cpuinfo and
>
> I have attached it to the bug and to this message.
>
>> cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)
>
> It looks like that isn't set for my previous builds and I'll need to
> set it for my next test build.
>
>> Could you also put the two BUG_ON lines into build_sched_groups()
>> [kernel/sched/core.c] wo/ the cpumask_clear() and setting
>> sg->sgc->capacity to 0 and share the possible crash output as well?
>
> I can try a new build with this. I can probably get results back tomorrow
> before I leave for work. The crashes happen too early in the boot process
> for me to easily capture output as text. I can slow things down to take
> pictures though.
>

That would be helpful. Thanks. I saw that you have CONFIG_SCHED_DEBUG 
enabled.

So the output of

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/*

would be handy too.

The difference from the Intel machine I tested on is that yours is a 'dual
single core CPU with hyper-threading' and mine is a 'dual core with
hyper-threading'.

yours:
$ cat cpuinfo.out | grep '^physical\|^core\|^cpu cores'
physical id	: 0
core id		: 0
cpu cores	: 1
physical id	: 3
core id		: 0
cpu cores	: 1
physical id	: 0
core id		: 0
cpu cores	: 1
physical id	: 3
core id		: 0
cpu cores	: 1

mine:
$ cat /proc/cpuinfo | grep '^physical\|^core\|^cpu cores'
physical id	: 0
core id		: 0
cpu cores	: 2
physical id	: 0
core id		: 0
cpu cores	: 2
physical id	: 0
core id		: 1
cpu cores	: 2
physical id	: 0
core id		: 1
cpu cores	: 2

Just to make sure, you do have 'CONFIG_X86_32=y' and '# CONFIG_NUMA is 
not set' in your build?
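
To make the comparison concrete, here is a minimal userspace sketch that
derives sibling and package masks from the 'physical id' and 'core id' fields
listed above. It only illustrates how to read those cpuinfo fields and is not
how the kernel builds its topology masks. The struct and function names, and
the two table names, are hypothetical; the table values are copied from the
two listings above.

```c
#include <stddef.h>

/* One logical CPU, reduced to the two cpuinfo fields used here. */
struct cpu_topo {
	int physical_id;
	int core_id;
};

/* Values from the two listings above, indexed by logical CPU number. */
const struct cpu_topo xeon_cpus[4] = { {0, 0}, {3, 0}, {0, 0}, {3, 0} };
const struct cpu_topo i5_cpus[4]   = { {0, 0}, {0, 0}, {0, 1}, {0, 1} };

/*
 * Bitmask of logical CPUs sharing physical id and core id with 'cpu',
 * i.e. its hyper-threading siblings, including 'cpu' itself.
 * Assumes ncpus fits in the bits of an unsigned long.
 */
unsigned long smt_sibling_mask(const struct cpu_topo *t, size_t ncpus,
			       size_t cpu)
{
	unsigned long mask = 0;
	size_t i;

	for (i = 0; i < ncpus; i++)
		if (t[i].physical_id == t[cpu].physical_id &&
		    t[i].core_id == t[cpu].core_id)
			mask |= 1UL << i;
	return mask;
}

/* Bitmask of logical CPUs in the same package (same physical id). */
unsigned long package_mask(const struct cpu_topo *t, size_t ncpus, size_t cpu)
{
	unsigned long mask = 0;
	size_t i;

	for (i = 0; i < ncpus; i++)
		if (t[i].physical_id == t[cpu].physical_id)
			mask |= 1UL << i;
	return mask;
}
```

On the Xeon table every CPU's sibling mask equals its package mask (0x5 for
CPUs 0 and 2, 0xa for CPUs 1 and 3), since each package has a single core. On
the i5 table the sibling masks are 0x3 and 0xc inside one package mask of 0xf.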


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 23:18       ` Dietmar Eggemann
@ 2014-07-17  3:09         ` Bruno Wolff III
  2014-07-17  8:57           ` Dietmar Eggemann
  2014-07-17  4:21         ` Bruno Wolff III
  1 sibling, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-17  3:09 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Josh Boyer, mingo, peterz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 567 bytes --]

On Thu, Jul 17, 2014 at 01:18:36 +0200,
  Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>So the output of
>
>$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/*
>
>would be handy too.

Attached and added to the bug.

>Just to make sure, you do have 'CONFIG_X86_32=y' and '# CONFIG_NUMA is 
>not set' in your build?

Yes.

I probably won't be able to get /proc/schedstat on my next test since the 
system will probably crash right away. However, I probably will have a 
much faster rebuild and might still be able to get the info for you 
before I leave tomorrow.

[-- Attachment #2: sched_domain.out --]
[-- Type: text/plain, Size: 300 bytes --]

32
0
0
687
0
0
110
4
28353
2
SMT
0
0
32
2
1
4143
0
1
125
8
15370
4
DIE
0
0
32
0
0
687
0
0
110
4
21753
2
SMT
0
0
32
2
1
4143
0
1
125
8
12715
4
DIE
0
0
32
0
0
687
0
0
110
4
25097
2
SMT
0
0
32
2
1
4143
0
1
125
8
21624
4
DIE
0
0
32
0
0
687
0
0
110
4
23714
2
SMT
0
0
32
2
1
4143
0
1
125
8
14920
4
DIE
0
0

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 23:18       ` Dietmar Eggemann
  2014-07-17  3:09         ` Bruno Wolff III
@ 2014-07-17  4:21         ` Bruno Wolff III
  1 sibling, 0 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-17  4:21 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Josh Boyer, mingo, peterz, linux-kernel

>>>Could you also put the two BUG_ON lines into build_sched_groups()
>>>[kernel/sched/core.c] wo/ the cpumask_clear() and setting
>>>sg->sgc->capacity to 0 and share the possible crash output as well?
>>
>>I can try a new build with this. I can probably get results back tomorrow
>>before I leave for work. The crashes happen too early in the boot process
>>for me to easily capture output as text. I can slow things down to take
>>pictures though.
>>
>
>That would be helpful. Thanks. I saw that you have CONFIG_SCHED_DEBUG 
>enabled.

Well that didn't help much. It still crashed. Taking pictures didn't 
get a good capture. I wasn't able to use boot_delay to slow things down, 
as even a small value resulted in me only seeing one line of output 
before giving up after a minute or two. I used a serial console to 
slow things down, but it isn't enough to make it easy to take pictures 
with the camera I had. The crash was someplace inside the scheduler, but 
I don't know if there were messages from the BUG_ON lines.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-16 19:17   ` Dietmar Eggemann
  2014-07-16 19:54     ` Bruno Wolff III
@ 2014-07-17  4:28     ` Bruno Wolff III
  1 sibling, 0 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-17  4:28 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Josh Boyer, mingo, peterz, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 242 bytes --]

On Wed, Jul 16, 2014 at 21:17:32 +0200,
  Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>Could you please share:
>
> cat /proc/cpuinfo and
> cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)

/proc/schedstat output is attached.

[-- Attachment #2: schedstat.out --]
[-- Type: text/plain, Size: 1417 bytes --]

version 15
timestamp 4294858660
cpu0 12 0 85767 30027 61826 37767 15709950719 5620241067 53904
domain0 00000005 5408 5285 91 16777 39 2 1 5284 107 88 9 3926 11 0 0 88 23792 23076 349 87051 391 19 11 23065 3 0 3 0 0 0 0 0 0 18607 338 0
domain1 0000000f 4365 3913 399 64817 59 0 486 3427 59 50 3 1948 6 0 3 47 23422 21879 1368 206776 197 1 11697 10182 0 0 0 0 0 0 0 0 0 5434 164 0
cpu1 0 0 56596 21903 29921 13292 19364110947 8836735986 34640
domain0 0000000a 3239 3163 56 24775 27 6 0 3163 181 166 6 4460 9 1 0 166 20452 19845 272 90788 374 8 17 19828 4 0 4 0 0 0 0 0 0 10132 501 0
domain1 0000000f 2540 2279 207 69991 57 2 258 2021 99 90 2 2757 7 0 13 77 20103 19160 744 193572 228 4 5770 13390 1 1 0 0 0 0 0 0 0 6497 141 0
cpu2 120 0 58755 24874 18071 6797 16937128548 3947587861 33681
domain0 00000005 2940 2811 105 39819 35 13 0 2811 158 141 4 8878 13 0 0 141 18795 18156 336 189651 339 7 10 18146 8 0 8 0 0 0 0 0 0 5062 206 0
domain1 0000000f 2376 1903 437 82849 40 1 216 1687 35 32 1 6881 2 0 6 26 18491 17419 885 260774 216 6 2076 15343 0 0 0 0 0 0 0 0 0 6212 130 0
cpu3 0 0 54095 22291 28164 13979 19759585558 3515338364 31766
domain0 0000000a 4642 4135 495 61645 15 4 0 4135 157 147 7 1546 3 0 0 147 20473 19444 670 114383 394 7 15 19429 3 0 3 0 0 0 0 0 0 9962 468 0
domain1 0000000f 3104 2629 431 71143 49 2 326 2303 82 76 4 495 2 0 9 67 20105 18739 1168 207093 223 4 5545 13194 0 0 0 0 0 0 0 0 0 4223 157 0
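
For reading the hexadecimal masks that follow each domainN label above, a
minimal userspace sketch that turns such a mask into a CPU list may help. The
helper name mask_to_cpulist is hypothetical, and the sketch assumes the mask
fits in one unsigned long, which holds for this 4-CPU machine.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the set bits of 'mask' as a CPU list such as "0,2" or "0-3".
 * Returns buf; the output is truncated if buf is too small.
 */
char *mask_to_cpulist(unsigned long mask, char *buf, size_t len)
{
	int nbits = (int)(sizeof(mask) * CHAR_BIT);
	size_t off = 0;
	int cpu = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	while (cpu < nbits) {
		int start, end, n;

		if (!(mask & (1UL << cpu))) {
			cpu++;
			continue;
		}

		/* Extend the run while the next bit is also set. */
		start = cpu;
		while (cpu + 1 < nbits && (mask & (1UL << (cpu + 1))))
			cpu++;
		end = cpu++;

		if (start == end)
			n = snprintf(buf + off, len - off, "%s%d",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, len - off, "%s%d-%d",
				     off ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - off)
			break;		/* no room left, stop here */
		off += (size_t)n;
	}
	return buf;
}
```

Feeding it the values parsed with strtoul(..., 16) gives "0,2" for 00000005,
"1,3" for 0000000a and "0-3" for 0000000f.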

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17  3:09         ` Bruno Wolff III
@ 2014-07-17  8:57           ` Dietmar Eggemann
  2014-07-17  9:04             ` Peter Zijlstra
  2014-07-17 16:36             ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Bruno Wolff III
  0 siblings, 2 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-17  8:57 UTC (permalink / raw)
  To: Bruno Wolff III; +Cc: Josh Boyer, mingo, peterz, linux-kernel

On 17/07/14 05:09, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 01:18:36 +0200,
>    Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> So the output of
>>
>> $ cat /proc/sys/kernel/sched_domain/cpu*/domain*/*
>>
>> would be handy too.

Thanks, this was helpful.
I see from the sched domain layout that you have SMT (domain0) and DIE 
(domain1) level. So on this system, the MC level gets degenerated 
(sd_degenerate() in kernel/sched/core.c).
I fail so far to see how this can have an effect on the memory of the 
sched groups. But I can try to fake this situation on one of my platforms.

There is also the possibility that the memory for sched_group sg is not 
(completely) zeroed out:

   sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
			GFP_KERNEL, cpu_to_node(j));


   struct sched_group {
	...
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long cpumask[0];
};

so that the cpumask of a sched group is not 0 and can only be cured by 
an explicit cpumask_clear(sched_group_cpus(sg)) in build_sched_groups() 
on this kind of machine.
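
The allocation pattern in question can be sketched in userspace as follows,
with calloc() standing in for kzalloc_node() and a C99 flexible array member
standing in for cpumask[0]. The names struct group, mask_longs, group_alloc
and group_mask_empty are hypothetical and used only for this illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for a structure with a trailing variable-length mask. */
struct group {
	struct group *next;
	unsigned long cpumask[];	/* flexible array member */
};

/* Number of unsigned longs needed to hold nr_cpus bits. */
size_t mask_longs(size_t nr_cpus)
{
	size_t bits = sizeof(unsigned long) * CHAR_BIT;

	return (nr_cpus + bits - 1) / bits;
}

/* Allocate header plus mask as one zeroed block. */
struct group *group_alloc(size_t nr_cpus)
{
	return calloc(1, sizeof(struct group) +
			 mask_longs(nr_cpus) * sizeof(unsigned long));
}

/*
 * Rough analogue of cpumask_empty(): returns 1 if no bit is set.
 * It checks whole words, which is enough here because the whole
 * allocation starts out zeroed.
 */
int group_mask_empty(const struct group *g, size_t nr_cpus)
{
	size_t i;

	for (i = 0; i < mask_longs(nr_cpus); i++)
		if (g->cpumask[i])
			return 0;
	return 1;
}
```

In this model a freshly allocated group always passes group_mask_empty(), so a
non-empty mask at a later point means something wrote to it after allocation.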

>
> Attached and added to the bug.
>
>> Just to make sure, you do have 'CONFIG_X86_32=y' and '# CONFIG_NUMA is
>> not set' in your build?
>
> Yes.
>
> I probably won't be able to get /proc/schedstat on my next test since the
> system will probably crash right away. However, I probably will have a
> much faster rebuild and might still be able to get the info for you
> before I leave tomorrow.
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17  8:57           ` Dietmar Eggemann
@ 2014-07-17  9:04             ` Peter Zijlstra
  2014-07-17 11:23               ` Dietmar Eggemann
  2014-07-17 16:36             ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Bruno Wolff III
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-17  9:04 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Bruno Wolff III, Josh Boyer, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 824 bytes --]

On Thu, Jul 17, 2014 at 10:57:55AM +0200, Dietmar Eggemann wrote:
> There is also the possibility that the memory for sched_group sg is not
> (completely) zeroed out:
> 
>   sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
> 			GFP_KERNEL, cpu_to_node(j));
> 
> 
>   struct sched_group {
> 	...
> 	 * NOTE: this field is variable length. (Allocated dynamically
> 	 * by attaching extra space to the end of the structure,
> 	 * depending on how many CPUs the kernel has booted up with)
> 	 */
> 	unsigned long cpumask[0];

well kZalloc should Zero the entire allocated size, and the specified
size very much includes the cpumask size as per:
  sizeof(struct sched_group) + cpumask_size()

But yeah, I'm also a bit puzzled why this goes bang. Makes me worry we
scribble it somewhere or so.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17  9:04             ` Peter Zijlstra
@ 2014-07-17 11:23               ` Dietmar Eggemann
  2014-07-17 12:35                 ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-17 11:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Bruno Wolff III, Josh Boyer, mingo, linux-kernel

On 17/07/14 11:04, Peter Zijlstra wrote:
> On Thu, Jul 17, 2014 at 10:57:55AM +0200, Dietmar Eggemann wrote:
>> There is also the possibility that the memory for sched_group sg is not
>> (completely) zeroed out:
>>
>>    sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
>> 			GFP_KERNEL, cpu_to_node(j));
>>
>>
>>    struct sched_group {
>> 	...
>> 	 * NOTE: this field is variable length. (Allocated dynamically
>> 	 * by attaching extra space to the end of the structure,
>> 	 * depending on how many CPUs the kernel has booted up with)
>> 	 */
>> 	unsigned long cpumask[0];
>
> well kZalloc should Zero the entire allocated size, and the specified
> size very much includes the cpumask size as per:
>    sizeof(struct sched_group) + cpumask_size()

Yes, I think so too.

>
> But yeah, I'm also a bit puzzled why this goes bang. Makes me worry we
> scribble it somewhere or so.
>

But then, this must be happening in build_sched_domains() between 
__visit_domain_allocation_hell()->__sdt_alloc() and build_sched_groups().


I couldn't reproduce this phenomenon by adding a fake SMT level (just a copy of
the real MC level) to my ARM TC2 (dual cluster dual/triple core, no
hyper-threading) to provoke sd degenerate. It does not show the issue,
and the MC level gets degenerated nicely. It might not be a realistic example,
though, since SMT and MC are using the same cpu mask here.

@@ -281,6 +281,7 @@ static inline const int cpu_corepower_flags(void)
  }

  static struct sched_domain_topology_level arm_topology[] = {
+       { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(SMT) },
  #ifdef CONFIG_SCHED_MC
         { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
         { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },


Maybe by enabling sched_debug on the command line (earlyprintk=keep
sched_debug), Bruno could spot topology setup issues on his Xeon machine
which could lead to this problem unless the sg cpumask gets zeroed out
in build_sched_groups() a second time?

Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz dmesg snippet as an example when 
booted with 'earlyprintk=keep sched_debug':
  ...
  [    0.119737] CPU0 attaching sched-domain:
  [    0.119740]  domain 0: span 0-1 level SIBLING
  [    0.119742]   groups: 0 (cpu_power = 588) 1 (cpu_power = 588)
  [    0.119745]   domain 1: span 0-3 level MC
  [    0.119747]    groups: 0-1 (cpu_power = 1176) 2-3 (cpu_power = 1176)
  ...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17 11:23               ` Dietmar Eggemann
@ 2014-07-17 12:35                 ` Peter Zijlstra
  2014-07-18  5:34                   ` Bruno Wolff III
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-17 12:35 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Bruno Wolff III, Josh Boyer, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4313 bytes --]

On Thu, Jul 17, 2014 at 01:23:51PM +0200, Dietmar Eggemann wrote:
> On 17/07/14 11:04, Peter Zijlstra wrote:
> >On Thu, Jul 17, 2014 at 10:57:55AM +0200, Dietmar Eggemann wrote:
> >>There is also the possibility that the memory for sched_group sg is not
> >>(completely) zeroed out:
> >>
> >>   sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
> >>			GFP_KERNEL, cpu_to_node(j));
> >>
> >>
> >>   struct sched_group {
> >>	...
> >>	 * NOTE: this field is variable length. (Allocated dynamically
> >>	 * by attaching extra space to the end of the structure,
> >>	 * depending on how many CPUs the kernel has booted up with)
> >>	 */
> >>	unsigned long cpumask[0];
> >
> >well kZalloc should Zero the entire allocated size, and the specified
> >size very much includes the cpumask size as per:
> >   sizeof(struct sched_group) + cpumask_size()
> 
> Yes, I think so too.
> 
> >
> >But yeah, I'm also a bit puzzled why this goes bang. Makes me worry we
> >scribble it somewhere or so.
> >
> 
> But then, this must be happening in build_sched_domains() between
> __visit_domain_allocation_hell()->__sdt_alloc() and build_sched_groups().
> 
> 
> I couldn't reproduce this phenomenon by adding a fake SMT level (just a copy
> of the real MC level) to my ARM TC2 (dual cluster dual/triple core, no
> hyper-threading) to provoke sd degenerate. It does not show the issue, and
> the MC level gets degenerated nicely. It might not be a realistic example,
> though, since SMT and MC are using the same cpu mask here.

Yeah, obviously my machines didn't trigger this either, and afaik none
of Ingo's did either.

In any case, can someone who can trigger this run with the patch below? It's
'clean' for me, but supposedly you'll trigger a FAIL somewhere.

It includes the cpumask_clear() so your machines should boot, albeit
with a noisy dmesg :-)

---
 kernel/sched/core.c | 16 ++++++++++++++++
 lib/vsprintf.c      |  5 +++++
 2 files changed, 21 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..1c140057db12 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,17 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
+
+		if (!cpumask_empty(sched_group_cpus(sg)))
+			printk("%s: FAIL\n", __func__);
+
+		printk("%s: got group %p with cpus: %pc\n",
+				__func__,
+				sg,
+				sched_group_cpus(sg));
+
+		cpumask_clear(sched_group_cpus(sg));
+
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {
@@ -6418,6 +6429,11 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg)
 				return -ENOMEM;
 
+			printk("%s: allocated %p with cpus: %pc\n",
+					__func__,
+					sg,
+					sched_group_cpus(sg));
+
 			sg->next = sg;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6fe2c84eb055..ac22c46fd6d0 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -28,6 +28,7 @@
 #include <linux/ioport.h>
 #include <linux/dcache.h>
 #include <linux/cred.h>
+#include <linux/cpumask.h>
 #include <net/addrconf.h>
 
 #include <asm/page.h>		/* for PAGE_SIZE */
@@ -1250,6 +1251,7 @@ int kptr_restrict __read_mostly;
  *           (default assumed to be phys_addr_t, passed by reference)
  * - 'd[234]' For a dentry name (optionally 2-4 last components)
  * - 'D[234]' Same as 'd' but for a struct file
+ * - 'c' For a cpumask list
  *
  * Note: The difference between 'S' and 'F' is that on ia64 and ppc64
  * function pointers are really function descriptors, which contain a
@@ -1389,6 +1391,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
 		return dentry_name(buf, end,
 				   ((const struct file *)ptr)->f_path.dentry,
 				   spec, fmt);
+	case 'c':
+		return buf + cpulist_scnprintf(buf, end - buf, ptr);
 	}
 	spec.flags |= SMALL;
 	if (spec.field_width == -1) {
@@ -1635,6 +1639,7 @@ int format_decode(const char *fmt, struct printf_spec *spec)
  *   case.
  * %*ph[CDN] a variable-length hex string with a separator (supports up to 64
  *           bytes of the input)
+ * %pc print a cpumask as comma-separated list
  * %n is ignored
  *
  * ** Please update Documentation/printk-formats.txt when making changes **

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17  8:57           ` Dietmar Eggemann
  2014-07-17  9:04             ` Peter Zijlstra
@ 2014-07-17 16:36             ` Bruno Wolff III
  2014-07-17 18:43               ` Dietmar Eggemann
  1 sibling, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-17 16:36 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Josh Boyer, mingo, peterz, linux-kernel

I did a few quick boots this morning while taking a bunch of pictures. I have
gone through some of them and found one that shows the BUG_ON
was triggered at line 5850, which is from:
BUG_ON(!cpumask_empty(sched_group_cpus(sg)));

You can see the JPEG at:
https://bugzilla.kernel.org/attachment.cgi?id=143331

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17 16:36             ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Bruno Wolff III
@ 2014-07-17 18:43               ` Dietmar Eggemann
  2014-07-17 18:54                 ` Bruno Wolff III
  0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-17 18:43 UTC (permalink / raw)
  To: Bruno Wolff III; +Cc: Josh Boyer, mingo, peterz, linux-kernel

On 17/07/14 18:36, Bruno Wolff III wrote:
> I did a few quick boots this morning while taking a bunch of pictures. I have
> gone through some of them and found one that shows the BUG_ON
> was triggered at line 5850, which is from:
> BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
>
> You can see the JPEG at:
> https://bugzilla.kernel.org/attachment.cgi?id=143331
>

Many thanks for testing this, Bruno!

So the memory of the cpumask of some sched_group(s) in your system has 
been altered between __visit_domain_allocation_hell()->__sdt_alloc() and 
build_sched_groups().

In the meantime, PeterZ has posted a patch which barfs when this happens, 
prints out the sched groups with their related cpus, and still includes 
the cpumask_clear(), so your machine should still boot fine.

If you could apply the patch:

https://lkml.org/lkml/2014/7/17/288

and then run it on your machine, that would give us more details, i.e. 
the information on which sched_group(s) and in which sched domain level 
(SMT and/or DIE) this issue occurs.


Another thing which you could do is to boot with an extra 
'earlyprintk=keep sched_debug' in your command line options with a build 
containing the cpumask_clear() in build_sched_groups() and extract the 
dmesg output of the scheduler-setup code:

Example:

[    0.119737] CPU0 attaching sched-domain:
[    0.119740]  domain 0: span 0-1 level SIBLING
[    0.119742]   groups: 0 (cpu_power = 588) 1 (cpu_power = 588)
[    0.119745]   domain 1: span 0-3 level MC
[    0.119747]    groups: 0-1 (cpu_power = 1176) 2-3 (cpu_power = 1176)
[    0.119751] CPU1 attaching sched-domain:
[    0.119752]  domain 0: span 0-1 level SIBLING
[    0.119753]   groups: 1 (cpu_power = 588) 0 (cpu_power = 588)
[    0.119756]   domain 1: span 0-3 level MC
[    0.119757]    groups: 0-1 (cpu_power = 1176) 2-3 (cpu_power = 1176)
[    0.119759] CPU2 attaching sched-domain:
[    0.119760]  domain 0: span 2-3 level SIBLING
[    0.119761]   groups: 2 (cpu_power = 588) 3 (cpu_power = 588)
[    0.119764]   domain 1: span 0-3 level MC
[    0.119765]    groups: 2-3 (cpu_power = 1176) 0-1 (cpu_power = 1176)
[    0.119767] CPU3 attaching sched-domain:
[    0.119768]  domain 0: span 2-3 level SIBLING
[    0.119769]   groups: 3 (cpu_power = 588) 2 (cpu_power = 588)
[    0.119772]   domain 1: span 0-3 level MC
[    0.119773]    groups: 2-3 (cpu_power = 1176) 0-1 (cpu_power = 1176)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17 18:43               ` Dietmar Eggemann
@ 2014-07-17 18:54                 ` Bruno Wolff III
  0 siblings, 0 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-17 18:54 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Josh Boyer, mingo, peterz, linux-kernel

On Thu, Jul 17, 2014 at 20:43:16 +0200,
  Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
>If you could apply the patch:
>
>https://lkml.org/lkml/2014/7/17/288
>
>and then run it on your machine, that would give us more details, i.e. 
>the information on which sched_group(s) and in which sched domain 
>level (SMT and/or DIE) this issue occurs.
>
>
>Another thing which you could do is to boot with an extra 
>'earlyprintk=keep sched_debug' in your command line options with a 
>build containing the cpumask_clear() in build_sched_groups() and 
>extract the dmesg output of the scheduler-setup code:

I'll see what I can do. I have plans after work today and I don't know 
if I'll be awake enough when I get home to followup tonight or early 
tomorrow. Worst case will be over the weekend.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-17 12:35                 ` Peter Zijlstra
@ 2014-07-18  5:34                   ` Bruno Wolff III
  2014-07-18  9:28                     ` Dietmar Eggemann
  2014-07-18 10:16                     ` Peter Zijlstra
  0 siblings, 2 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-18  5:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel

On Thu, Jul 17, 2014 at 14:35:02 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>
>In any case, can someone who can trigger this run with the below; its
>'clean' for me, but supposedly you'll trigger a FAIL somewhere.

I got a couple of fail messages.

dmesg output is available in the bug as the following attachment:
https://bugzilla.kernel.org/attachment.cgi?id=143361

The part of interest is probably:

[    0.253354] build_sched_groups: got group f255b020 with cpus: 
[    0.253436] build_sched_groups: got group f255b120 with cpus: 
[    0.253519] build_sched_groups: got group f255b1a0 with cpus: 
[    0.253600] build_sched_groups: got group f255b2a0 with cpus: 
[    0.253681] build_sched_groups: got group f255b2e0 with cpus: 
[    0.253762] build_sched_groups: got group f255b320 with cpus: 
[    0.253843] build_sched_groups: got group f255b360 with cpus: 
[    0.254004] build_sched_groups: got group f255b0e0 with cpus: 
[    0.254087] build_sched_groups: got group f255b160 with cpus: 
[    0.254170] build_sched_groups: got group f255b1e0 with cpus: 
[    0.254252] build_sched_groups: FAIL
[    0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
[    0.255004] build_sched_groups: FAIL
[    0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

I also booted with early printk=keepsched_debug as requested by 
Dietmar.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18  5:34                   ` Bruno Wolff III
@ 2014-07-18  9:28                     ` Dietmar Eggemann
  2014-07-18 12:09                       ` Bruno Wolff III
  2014-07-18 10:16                     ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-18  9:28 UTC (permalink / raw)
  To: Bruno Wolff III, Peter Zijlstra; +Cc: Josh Boyer, mingo, linux-kernel

On 18/07/14 07:34, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 14:35:02 +0200,
>    Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> In any case, can someone who can trigger this run with the below; its
>> 'clean' for me, but supposedly you'll trigger a FAIL somewhere.
>
> I got a couple of fail messages.
>
> dmesg output is available in the bug as the following attachment:
> https://bugzilla.kernel.org/attachment.cgi?id=143361
>
> The part of interest is probably:
>
> [    0.253354] build_sched_groups: got group f255b020 with cpus:
> [    0.253436] build_sched_groups: got group f255b120 with cpus:
> [    0.253519] build_sched_groups: got group f255b1a0 with cpus:
> [    0.253600] build_sched_groups: got group f255b2a0 with cpus:
> [    0.253681] build_sched_groups: got group f255b2e0 with cpus:
> [    0.253762] build_sched_groups: got group f255b320 with cpus:
> [    0.253843] build_sched_groups: got group f255b360 with cpus:
> [    0.254004] build_sched_groups: got group f255b0e0 with cpus:
> [    0.254087] build_sched_groups: got group f255b160 with cpus:
> [    0.254170] build_sched_groups: got group f255b1e0 with cpus:
> [    0.254252] build_sched_groups: FAIL
> [    0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
> [    0.255004] build_sched_groups: FAIL
> [    0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

That (partly) explains it. f255b1a0 (5) and f255b1e0 (6) are reused 
here! This reuse doesn't happen on my machines.

But if they are used for a different cpu mask (not including cpu0 resp. 
cpu1), this would mess up their first usage?

I guess that the second time, cpu3 will be added to the cpumask of 
f255b1a0 and cpu4 to f255b1e0?

Maybe we can extend PeterZ's patch to print out cpu and span as well, and 
use this printk also in free_sched_domain(), to debug further if this is 
not enough evidence?

[    0.252059] __sdt_alloc: allocated f255b020 with cpus: (1)
[    0.252147] __sdt_alloc: allocated f255b0e0 with cpus: (2)
[    0.252229] __sdt_alloc: allocated f255b120 with cpus: (3)
[    0.252311] __sdt_alloc: allocated f255b160 with cpus: (4)
[    0.252395] __sdt_alloc: allocated f255b1a0 with cpus: (5)
[    0.252477] __sdt_alloc: allocated f255b1e0 with cpus: (6)
[    0.252559] __sdt_alloc: allocated f255b220 with cpus: (7) (not used)
[    0.252641] __sdt_alloc: allocated f255b260 with cpus: (8) (not used)
[    0.253013] __sdt_alloc: allocated f255b2a0 with cpus: (9)
[    0.253097] __sdt_alloc: allocated f255b2e0 with cpus: (10)
[    0.253184] __sdt_alloc: allocated f255b320 with cpus: (11)
[    0.253265] __sdt_alloc: allocated f255b360 with cpus: (12)

[    0.253354] build_sched_groups: got group f255b020 with cpus: (1)
[    0.253436] build_sched_groups: got group f255b120 with cpus: (3)
[    0.253519] build_sched_groups: got group f255b1a0 with cpus: (5)
[    0.253600] build_sched_groups: got group f255b2a0 with cpus: (9)
[    0.253681] build_sched_groups: got group f255b2e0 with cpus: (10)
[    0.253762] build_sched_groups: got group f255b320 with cpus: (11)
[    0.253843] build_sched_groups: got group f255b360 with cpus: (12)
[    0.254004] build_sched_groups: got group f255b0e0 with cpus: (2)
[    0.254087] build_sched_groups: got group f255b160 with cpus: (4)
[    0.254170] build_sched_groups: got group f255b1e0 with cpus: (6)
[    0.254252] build_sched_groups: FAIL
[    0.254331] build_sched_groups: got group f255b1a0 with cpus: 0 (5)
[    0.255004] build_sched_groups: FAIL
[    0.255084] build_sched_groups: got group f255b1e0 with cpus: 1 (6)
[    0.255365] devtmpfs: initialized
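
The annotated log above boils down to one check: does any group address
show up more than once in the "got group" sequence? Below is a minimal
userspace sketch of that check, not kernel code. The helper name
count_reused() is hypothetical; the addresses are copied from the dmesg
excerpt in the order build_sched_groups() reported them.

```c
#include <stddef.h>

/* Addresses in the order build_sched_groups() printed them above. */
static const unsigned long got_groups[] = {
	0xf255b020, 0xf255b120, 0xf255b1a0, 0xf255b2a0, 0xf255b2e0,
	0xf255b320, 0xf255b360, 0xf255b0e0, 0xf255b160, 0xf255b1e0,
	0xf255b1a0, 0xf255b1e0,
};

/*
 * Hypothetical helper: count entries that repeat an earlier address.
 * The quadratic scan is fine for a dozen log lines.
 */
static size_t count_reused(const unsigned long *addr, size_t n)
{
	size_t i, j, reused = 0;

	for (i = 1; i < n; i++) {
		for (j = 0; j < i; j++) {
			if (addr[j] == addr[i]) {
				reused++;
				break;
			}
		}
	}
	return reused;
}
```

Called on got_groups, this counts the two repeated entries (f255b1a0 and
f255b1e0), which are the two groups that precede a FAIL line.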

>
> I also booted with early printk=keepsched_debug as requested by
> Dietmar.
>

Didn't see what I was looking for in your dmesg output. Did you use
'earlyprintk=keep sched_debug'?








^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18  5:34                   ` Bruno Wolff III
  2014-07-18  9:28                     ` Dietmar Eggemann
@ 2014-07-18 10:16                     ` Peter Zijlstra
  2014-07-18 13:01                       ` Bruno Wolff III
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-18 10:16 UTC (permalink / raw)
  To: Bruno Wolff III; +Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel

On Fri, Jul 18, 2014 at 12:34:49AM -0500, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 14:35:02 +0200,
>  Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >In any case, can someone who can trigger this run with the below; its
> >'clean' for me, but supposedly you'll trigger a FAIL somewhere.
> 
> I got a couple of fail messages.
> 
> dmesg output is available in the bug as the following attachment:
> https://bugzilla.kernel.org/attachment.cgi?id=143361

Thanks!

[    0.252059] __sdt_alloc: allocated f255b020 with cpus: 
[    0.252147] __sdt_alloc: allocated f255b0e0 with cpus: 
[    0.252229] __sdt_alloc: allocated f255b120 with cpus: 
[    0.252311] __sdt_alloc: allocated f255b160 with cpus: 

[    0.252395] __sdt_alloc: allocated f255b1a0 with cpus: 
[    0.252477] __sdt_alloc: allocated f255b1e0 with cpus: 
[    0.252559] __sdt_alloc: allocated f255b220 with cpus: 
[    0.252641] __sdt_alloc: allocated f255b260 with cpus: 

[    0.253013] __sdt_alloc: allocated f255b2a0 with cpus: 
[    0.253097] __sdt_alloc: allocated f255b2e0 with cpus: 
[    0.253184] __sdt_alloc: allocated f255b320 with cpus: 
[    0.253265] __sdt_alloc: allocated f255b360 with cpus: 

[    0.253354] build_sched_groups: got group f255b020 with cpus: 
[    0.253436] build_sched_groups: got group f255b120 with cpus: 
[    0.253519] build_sched_groups: got group f255b1a0 with cpus: 
[    0.253600] build_sched_groups: got group f255b2a0 with cpus: 
[    0.253681] build_sched_groups: got group f255b2e0 with cpus: 

[    0.253762] build_sched_groups: got group f255b320 with cpus: 
[    0.253843] build_sched_groups: got group f255b360 with cpus: 
[    0.254004] build_sched_groups: got group f255b0e0 with cpus: 
[    0.254087] build_sched_groups: got group f255b160 with cpus: 
[    0.254170] build_sched_groups: got group f255b1e0 with cpus: 
[    0.254252] build_sched_groups: FAIL
[    0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
[    0.255004] build_sched_groups: FAIL
[    0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

So from previous msgs we know:

	CPU0	CPU1	CPU2	CPU3

D0	*		*		SMT
		*		*

D2	*	*	*	*	DIE


This gives us (from __sdt_alloc):

	020	0e0	120	160	SMT
	1a0	1e0	220	260	MC
	2a0	2e0	320	360	DIE

Given that you have a DIE domain, and MC is found degenerate, I'll
conclude that you do not have the shared L3 possible for your machine
and only have the dual socket, with 2 threads per socket.

So the domains _should_ look like:

D0	0,2	1,3	0,2	1,3
D1	0,2	1,3	0,2	1,3
D2	0,1,2,3 0,1,2,3	0,1,2,3	0,1,2,3

Assuming that, build_sched_groups(), which gets called for each cpu, for
each domain, we get:

D0g	020(0)		120(2)
D1g	1a0(0,2)
D2g	2a0(0,2)

So far so good, at this point we're in build_sched_groups, we have a
.cpu=0 @span=0-3 @covered=0,2 @i=0 and we're just about to start the
loop for @i=1.

	1 is not set in covered

	get_group(i=1, sdd, &sg)
	  @sd = *per_cpu_ptr(sdd->sd, 1); /* should be D2 for CPU1 */
	  @child = sd->child; /* should be D1 for CPU1: 1,3 */
	  @cpu = 1
	  @sg = *per_cpu_ptr(sdd->sg, 1); /* should be: 2e0 */

But instead we get 320 !?

The 2e0 group would cover 1,3, thereby increasing @covered to 0-3 and
we're done for CPU0. Instead things go on to return 360, more WTF!

So it looks like the actual domain tree is broken, and not what we
assumed it was.

Could I bother you to run with the below instead? It should also print
out the sched domain masks so we don't need to guess about them.

(make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)

> I also booted with early printk=keepsched_debug as requested by Dietmar.

can you make that: sched_debug ?

---
 kernel/sched/core.c | 22 ++++++++++++++++++++++
 lib/vsprintf.c      |  5 +++++
 2 files changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..4babcbbc11b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,17 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
+
+		if (!cpumask_empty(sched_group_cpus(sg)))
+			printk("%s: FAIL\n", __func__);
+
+		printk("%s: got group %p with cpus: %pc\n",
+				__func__,
+				sg,
+				sched_group_cpus(sg));
+
+		cpumask_clear(sched_group_cpus(sg));
+
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {
@@ -6418,6 +6429,11 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg)
 				return -ENOMEM;
 
+			printk("%s: allocated %p with cpus: %pc\n",
+					__func__,
+					sg,
+					sched_group_cpus(sg));
+
 			sg->next = sg;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
@@ -6474,6 +6490,12 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 	if (!sd)
 		return child;
 
+	printk("%s: cpu: %d level: %s cpu_map: %pc tl->mask: %pc\n",
+			__func__,
+			cpu, tl->name,
+			cpu_map,
+			tl->mask(cpu));
+
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6fe2c84eb055..ac22c46fd6d0 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -28,6 +28,7 @@
 #include <linux/ioport.h>
 #include <linux/dcache.h>
 #include <linux/cred.h>
+#include <linux/cpumask.h>
 #include <net/addrconf.h>
 
 #include <asm/page.h>		/* for PAGE_SIZE */
@@ -1250,6 +1251,7 @@ int kptr_restrict __read_mostly;
  *           (default assumed to be phys_addr_t, passed by reference)
  * - 'd[234]' For a dentry name (optionally 2-4 last components)
  * - 'D[234]' Same as 'd' but for a struct file
+ * - 'c' For a cpumask list
  *
  * Note: The difference between 'S' and 'F' is that on ia64 and ppc64
  * function pointers are really function descriptors, which contain a
@@ -1389,6 +1391,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
 		return dentry_name(buf, end,
 				   ((const struct file *)ptr)->f_path.dentry,
 				   spec, fmt);
+	case 'c':
+		return buf + cpulist_scnprintf(buf, end - buf, ptr);
 	}
 	spec.flags |= SMALL;
 	if (spec.field_width == -1) {
@@ -1635,6 +1639,7 @@ int format_decode(const char *fmt, struct printf_spec *spec)
  *   case.
  * %*ph[CDN] a variable-length hex string with a separator (supports up to 64
  *           bytes of the input)
+ * %pc print a cpumask as comma-separated list
  * %n is ignored
  *
  * ** Please update Documentation/printk-formats.txt when making changes **
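
For readers unfamiliar with the cpu list notation that the debug patch
prints ("0-3", "0,2", or nothing for an empty mask), here is a small
userspace sketch of that formatting. It is an illustration only and does
not reproduce the kernel's cpulist_scnprintf(): format_cpulist() is a
hypothetical name, and the mask is a plain unsigned int limited to 32
CPUs instead of a struct cpumask.

```c
#include <stdio.h>

/*
 * Format the set bits of @mask as a cpu list ("0-3", "0,2", "" if empty).
 * Returns the number of characters written, not counting the NUL.
 * Userspace illustration only; @mask covers at most 32 CPUs.
 */
static int format_cpulist(char *buf, size_t len, unsigned int mask)
{
	int written = 0;
	int cpu = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	while (cpu < 32) {
		int first, last, n;

		if (!(mask & (1u << cpu))) {
			cpu++;
			continue;
		}

		/* Extend the run of consecutive set bits starting at @cpu. */
		first = cpu;
		while (cpu + 1 < 32 && (mask & (1u << (cpu + 1))))
			cpu++;
		last = cpu++;

		if (first == last)
			n = snprintf(buf + written, len - written, "%s%d",
				     written ? "," : "", first);
		else
			n = snprintf(buf + written, len - written, "%s%d-%d",
				     written ? "," : "", first, last);

		if (n < 0 || (size_t)n >= len - written)
			break;	/* did not fit; the buffer keeps what fit */
		written += n;
	}
	return written;
}
```

With this sketch, a mask of 0xF formats as "0-3" and 0x5 as "0,2". An
empty mask yields an empty string, which matches the blank "with cpus:"
lines in the logs.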

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18  9:28                     ` Dietmar Eggemann
@ 2014-07-18 12:09                       ` Bruno Wolff III
  0 siblings, 0 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-18 12:09 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: Peter Zijlstra, Josh Boyer, mingo, linux-kernel

On Fri, Jul 18, 2014 at 11:28:14 +0200,
  Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
>Didn't see what I was looking for in your dmesg output. Did you use
>'earlyprintk=keep sched_debug'

I was missing a space. I'll get it on the next run.
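
Why the missing space matters can be sketched as follows, assuming the
command line is treated as a list of space-separated words. This is a
simplified userspace model, not the kernel's parameter parser, and
cmdline_has_word() is a hypothetical helper.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if @name appears in @cmdline as a whole space-separated
 * word. Simplified model: no quoting, and "name=value" does not match.
 */
static bool cmdline_has_word(const char *cmdline, const char *name)
{
	size_t name_len = strlen(name);
	const char *p = cmdline;

	while (*p) {
		size_t word_len;

		p += strspn(p, " ");		/* skip separators */
		word_len = strcspn(p, " ");	/* length of this word */

		if (word_len == name_len && strncmp(p, name, name_len) == 0)
			return true;
		p += word_len;
	}
	return false;
}
```

Under that assumption, "earlyprintk=keep sched_debug" contains the word
sched_debug, while a string such as "earlyprintk=keepsched_debug" is one
single word and never enables it.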

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18 10:16                     ` Peter Zijlstra
@ 2014-07-18 13:01                       ` Bruno Wolff III
  2014-07-18 14:16                         ` Dietmar Eggemann
  2014-07-18 14:16                         ` Peter Zijlstra
  0 siblings, 2 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-18 13:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel

On Fri, Jul 18, 2014 at 12:16:33 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>So it looks like the actual domain tree is broken, and not what we
>assumed it was.
>
>Could I bother you to run with the below instead? It should also print
>out the sched domain masks so we don't need to guess about them.

The full dmesg output is at:
https://bugzilla.kernel.org/attachment.cgi?id=143381

>(make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)
>
>> I also booted with early printk=keepsched_debug as requested by Dietmar.
>
>can you make that: sched_debug ?

I think I've fixed that.

I think the part you are most interested in contains the following:
[    0.252280] smpboot: Total of 4 processors activated (21438.11 BogoMIPS)
[    0.253058] __sdt_alloc: allocated f255b020 with cpus: 
[    0.253146] __sdt_alloc: allocated f255b0e0 with cpus: 
[    0.253227] __sdt_alloc: allocated f255b120 with cpus: 
[    0.253308] __sdt_alloc: allocated f255b160 with cpus: 
[    0.253390] __sdt_alloc: allocated f255b1a0 with cpus: 
[    0.253471] __sdt_alloc: allocated f255b1e0 with cpus: 
[    0.253551] __sdt_alloc: allocated f255b220 with cpus: 
[    0.253632] __sdt_alloc: allocated f255b260 with cpus: 
[    0.254009] __sdt_alloc: allocated f255b2a0 with cpus: 
[    0.254092] __sdt_alloc: allocated f255b2e0 with cpus: 
[    0.254181] __sdt_alloc: allocated f255b320 with cpus: 
[    0.254262] __sdt_alloc: allocated f255b360 with cpus: 
[    0.254350] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
[    0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
[    0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
[    0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
[    0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
[    0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
[    0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
[    0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
[    0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.256006] build_sched_groups: got group f255b020 with cpus: 
[    0.256089] build_sched_groups: got group f255b120 with cpus: 
[    0.256171] build_sched_groups: got group f255b1a0 with cpus: 
[    0.256252] build_sched_groups: got group f255b2a0 with cpus: 
[    0.256333] build_sched_groups: got group f255b2e0 with cpus: 
[    0.256414] build_sched_groups: got group f255b320 with cpus: 
[    0.256495] build_sched_groups: got group f255b360 with cpus: 
[    0.256576] build_sched_groups: got group f255b0e0 with cpus: 
[    0.256657] build_sched_groups: got group f255b160 with cpus: 
[    0.256740] build_sched_groups: got group f255b1e0 with cpus: 
[    0.256821] build_sched_groups: FAIL
[    0.257004] build_sched_groups: got group f255b1a0 with cpus: 0
[    0.257087] build_sched_groups: FAIL
[    0.257167] build_sched_groups: got group f255b1e0 with cpus: 1


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18 13:01                       ` Bruno Wolff III
@ 2014-07-18 14:16                         ` Dietmar Eggemann
  2014-07-18 14:16                         ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-18 14:16 UTC (permalink / raw)
  To: Bruno Wolff III, Peter Zijlstra; +Cc: Josh Boyer, mingo, linux-kernel

On 18/07/14 15:01, Bruno Wolff III wrote:
> On Fri, Jul 18, 2014 at 12:16:33 +0200,
>    Peter Zijlstra <peterz@infradead.org> wrote:
>> So it looks like the actual domain tree is broken, and not what we
>> assumed it was.
>>
>> Could I bother you to run with the below instead? It should also print
>> out the sched domain masks so we don't need to guess about them.
>
> The full dmesg output is at:
> https://bugzilla.kernel.org/attachment.cgi?id=143381
>
>> (make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)
>>
>>> I also booted with early printk=keepsched_debug as requested by Dietmar.
>>
>> can you make that: sched_debug ?
>
> I think I've fixed that.
>
> I think the part you are most interested in contains the following:
> [    0.252280] smpboot: Total of 4 processors activated (21438.11 BogoMIPS)
> [    0.253058] __sdt_alloc: allocated f255b020 with cpus:
> [    0.253146] __sdt_alloc: allocated f255b0e0 with cpus:
> [    0.253227] __sdt_alloc: allocated f255b120 with cpus:
> [    0.253308] __sdt_alloc: allocated f255b160 with cpus:
> [    0.253390] __sdt_alloc: allocated f255b1a0 with cpus:
> [    0.253471] __sdt_alloc: allocated f255b1e0 with cpus:
> [    0.253551] __sdt_alloc: allocated f255b220 with cpus:
> [    0.253632] __sdt_alloc: allocated f255b260 with cpus:
> [    0.254009] __sdt_alloc: allocated f255b2a0 with cpus:
> [    0.254092] __sdt_alloc: allocated f255b2e0 with cpus:
> [    0.254181] __sdt_alloc: allocated f255b320 with cpus:
> [    0.254262] __sdt_alloc: allocated f255b360 with cpus:
> [    0.254350] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [    0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0

So the MC level cpu mask function is wrong on this machine. Should be 
0,2 here, right?
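
One possible reading of how an MC mask that is smaller than the SMT mask
leads to the two FAIL lines, written as a simplified userspace model.
This is not the kernel code: masks are plain unsigned ints, NR_CPUS is
fixed at 4, and first_cpu() and count_group_reuse() are hypothetical
names. The model assumes that a group is selected by the first cpu of
the child domain's span, as in the get_group() walk-through earlier in
the thread.

```c
#define NR_CPUS 4

/* Index of the lowest set bit; callers never pass an empty mask. */
static int first_cpu(unsigned int mask)
{
	int cpu = 0;

	while (!(mask & (1u << cpu)))
		cpu++;
	return cpu;
}

/*
 * Simplified model of the group setup at one topology level.
 * @span[cpu]:  domain span of @cpu at this level
 * @child[cpu]: span of the child domain of @cpu (the level below)
 * Returns how often a group was handed out with a non-empty cpu mask,
 * i.e. how often the debug patch would print FAIL.
 */
static int count_group_reuse(const unsigned int span[NR_CPUS],
			     const unsigned int child[NR_CPUS])
{
	unsigned int group_cpus[NR_CPUS] = { 0 };
	int cpu, i, j, fails = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		unsigned int covered = 0;

		/* Only the first cpu of a span builds that span's groups. */
		if (cpu != first_cpu(span[cpu]))
			continue;

		for (i = 0; i < NR_CPUS; i++) {
			int group;

			if (!(span[cpu] & (1u << i)) || (covered & (1u << i)))
				continue;

			/* The group is picked by the first cpu of the child span. */
			group = first_cpu(child[i]);
			if (group_cpus[group])
				fails++;
			group_cpus[group] = 0;	/* models the cpumask_clear() */

			for (j = 0; j < NR_CPUS; j++) {
				if (!(span[cpu] & (1u << j)))
					continue;
				if (first_cpu(child[j]) != group)
					continue;
				covered |= 1u << j;
				group_cpus[group] |= 1u << j;
			}
		}
	}
	return fails;
}
```

Feeding the reported MC masks (0, 1, 2, 3 as single-cpu spans) with the
SMT masks (0,2 and 1,3) as children gives two reuses in this model: cpu2
gets the group that already holds cpu0, and cpu3 the one that already
holds cpu1. With MC masks equal to the SMT masks there is no reuse.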

The cpu_capacity values look strange too (probably a subsequent error).

[    0.257260] CPU0 attaching sched-domain:
[    0.257264]  domain 0: span 0,2 level SMT
[    0.257268]   groups: 0 (cpu_capacity = 586) 2 (cpu_capacity = 587)
[    0.257275]   domain 1: span 0-3 level DIE
[    0.257278]    groups: 0 (cpu_capacity = 587) 1 (cpu_capacity = 588) 2 (cpu_capacity = 587) 3 (cpu_capacity = 588)

> [    0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [    0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> [    0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [    0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> [    0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [    0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> [    0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.256006] build_sched_groups: got group f255b020 with cpus:
> [    0.256089] build_sched_groups: got group f255b120 with cpus:
> [    0.256171] build_sched_groups: got group f255b1a0 with cpus:
> [    0.256252] build_sched_groups: got group f255b2a0 with cpus:
> [    0.256333] build_sched_groups: got group f255b2e0 with cpus:
> [    0.256414] build_sched_groups: got group f255b320 with cpus:
> [    0.256495] build_sched_groups: got group f255b360 with cpus:
> [    0.256576] build_sched_groups: got group f255b0e0 with cpus:
> [    0.256657] build_sched_groups: got group f255b160 with cpus:
> [    0.256740] build_sched_groups: got group f255b1e0 with cpus:
> [    0.256821] build_sched_groups: FAIL
> [    0.257004] build_sched_groups: got group f255b1a0 with cpus: 0
> [    0.257087] build_sched_groups: FAIL
> [    0.257167] build_sched_groups: got group f255b1e0 with cpus: 1
>
>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18 13:01                       ` Bruno Wolff III
  2014-07-18 14:16                         ` Dietmar Eggemann
@ 2014-07-18 14:16                         ` Peter Zijlstra
  2014-07-18 14:50                           ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-18 14:16 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Fri, Jul 18, 2014 at 08:01:26AM -0500, Bruno Wolff III wrote:
> build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [    0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
> [    0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [    0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> [    0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [    0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> [    0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [    0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [    0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> [    0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3

*blink*...

That's, shall we say, unexpected. Let me ponder that a bit. HPA any clue
why a machine might report such a weird topology? AFAIK threads _always_
share cache.  So how can cpu_coregroup_mask be a subset (instead of a
superset) of topology_thread_cpumask?

Let me go stare at the x86 topology mask setup code.
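
The subset/superset expectation can be written down as a small check. A
minimal sketch, assuming plain unsigned int bitmasks in place of struct
cpumask; first_unnested_cpu() is a hypothetical helper and not part of
the kernel.

```c
#include <stddef.h>

/*
 * Return the first cpu whose thread (SMT) mask is not contained in its
 * core-group (MC) mask, or -1 if every thread mask nests inside the
 * corresponding core-group mask.
 */
static int first_unnested_cpu(const unsigned int *thread_mask,
			      const unsigned int *core_mask, size_t nr_cpus)
{
	size_t cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* Bits set in the thread mask but missing from the core mask. */
		if (thread_mask[cpu] & ~core_mask[cpu])
			return (int)cpu;
	}
	return -1;
}
```

With the masks from the dmesg above (SMT 0,2 / 1,3 / 0,2 / 1,3 and MC
0 / 1 / 2 / 3) the check already trips on cpu 0, since cpu 2 is in its
thread mask but not in its core-group mask.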

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18 14:16                         ` Peter Zijlstra
@ 2014-07-18 14:50                           ` Peter Zijlstra
  2014-07-18 16:16                             ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-18 14:50 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Fri, Jul 18, 2014 at 04:16:48PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 18, 2014 at 08:01:26AM -0500, Bruno Wolff III wrote:
> > build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > [    0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
> > [    0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > [    0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > [    0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> > [    0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > [    0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > [    0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> > [    0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > [    0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > [    0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> > [    0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
> 
> *blink*...
> 
> That's, shall we say, unexpected. Let me ponder that a bit. HPA any clue
> why a machine might report such a weird topology? AFAIK threads _always_
> share cache.  So how can cpu_coregroup_mask be a subset (instead of a
> superset) of topology_thread_cpumask?
> 
> Let me go stare at the x86 topology mask setup code.

Possibly something like so, but I'm not too sure. Anybody?

---
 arch/x86/kernel/smpboot.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798930ef..5eefa9abc2a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -338,9 +338,15 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
-	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
+	if (cpu_has_topoext) {
+		if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
+		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
+			return topology_sane(c, o, "llc");
+
+	} else if (c->phys_proc_id == o->phys_proc_id &&
+		   c->cpu_core_id == o->cpu_core_id) {
 		return topology_sane(c, o, "llc");
+	}
 
 	return false;
 }

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18 14:50                           ` Peter Zijlstra
@ 2014-07-18 16:16                             ` Peter Zijlstra
  2014-07-21 16:35                               ` Bruno Wolff III
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-18 16:16 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Fri, Jul 18, 2014 at 04:50:40PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 18, 2014 at 04:16:48PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 18, 2014 at 08:01:26AM -0500, Bruno Wolff III wrote:
> > > build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > > [    0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
> > > [    0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > > [    0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > > [    0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> > > [    0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > > [    0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > > [    0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> > > [    0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > > [    0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > > [    0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> > > [    0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > 
> > *blink*...
> > 
> > That's, shall we say, unexpected. Let me ponder that a bit. HPA any clue
> > why a machine might report such a weird topology? AFAIK threads _always_
> > share cache.  So how can cpu_coregroup_mask be a subset (instead of a
> > superset) of topology_thread_cpumask?
> > 
> > Let me go stare at the x86 topology mask setup code.
> 
> Possibly something like so, but I'm not too sure. Anybody?

OK, Borislav says topoext is AMD only, so that's not the problem. In
which case the problem must be that cpu_llc_id is wrong.

This gets set in init_intel_cacheinfo() but that's hurting my brain for
the moment. There's plenty P4 specific cruft in there though.


* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-18 16:16                             ` Peter Zijlstra
@ 2014-07-21 16:35                               ` Bruno Wolff III
  2014-07-21 16:52                                 ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-21 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

Is there more I can do to help with this now? Or should I just wait for 
patches to test?


* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-21 16:35                               ` Bruno Wolff III
@ 2014-07-21 16:52                                 ` Peter Zijlstra
  2014-07-22  9:47                                   ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-21 16:52 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
> Is there more I can do to help with this now? Or should I just wait for
> patches to test?

Yeah, sorry, was wiped out today. I'll go stare harder at the P4
topology setup code tomorrow. Something fishy there.


* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-21 16:52                                 ` Peter Zijlstra
@ 2014-07-22  9:47                                   ` Peter Zijlstra
  2014-07-22 10:38                                     ` Peter Zijlstra
                                                       ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-22  9:47 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
> > Is there more I can do to help with this now? Or should I just wait for
> > patches to test?
> 
> Yeah, sorry, was wiped out today. I'll go stare harder at the P4
> topology setup code tomorrow. Something fishy there.

Does this make your machine boot again (while giving an error)?

It tries to robustify the topology setup a bit, crashing on crap input
should be avoided if possible of course.

I'll go stare at the x86/P4 topology code like promised.

---
Subject: sched: Robustify topology setup
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Jul 21 23:07:06 CEST 2014

We hard assume that higher topology levels are strict supersets of
lower levels.

Detect, warn and try to fixup when we encounter this violated.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-cgp9j2tk0qnunhtpps3udsom@git.kernel.org
---
 kernel/sched/core.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6480,6 +6480,20 @@ struct sched_domain *build_sched_domain(
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);
 		child->parent = sd;
 		sd->child = child;
+
+		if (!cpumask_subset(sched_domain_span(child),
+				    sched_domain_span(sd))) {
+			pr_err("BUG: arch topology borken\n");
+#ifdef CONFIG_SCHED_DEBUG
+			pr_err("     the %s domain not a subset of the %s domain\n",
+					child->name, sd->name);
+#endif
+			/* Fixup, ensure @sd has at least @child cpus. */
+			cpumask_or(sched_domain_span(sd),
+				   sched_domain_span(sd),
+				   sched_domain_span(child));
+		}
+
 	}
 	set_domain_attribute(sd, attr);
 

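As a rough illustration of what the patch above checks, here is a minimal
user-space sketch. It is not the kernel code: mask_t, mask_subset() and
fixup_parent_span() are hypothetical stand-ins for struct cpumask,
cpumask_subset() and the cpumask_or() fixup, and the example masks are taken
from the build_sched_domain output quoted earlier in the thread (SMT span 0,2
and MC span 0 for cpu 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 32 CPUs. */
typedef unsigned int mask_t;

/* True when every CPU in @a is also in @b (non-strict subset). */
static bool mask_subset(mask_t a, mask_t b)
{
	return (a & ~b) == 0;
}

/*
 * Same idea as the check added to build_sched_domain(): if the child span
 * is not contained in the parent span, complain and widen the parent.
 * Returns the (possibly fixed up) parent span.
 */
static mask_t fixup_parent_span(mask_t child, mask_t parent)
{
	if (!mask_subset(child, parent)) {
		fprintf(stderr, "BUG: arch topology borken\n");
		/* Fixup: make sure the parent has at least the child's CPUs. */
		parent |= child;
	}

	/* Postcondition: the parent now covers the child. */
	assert(mask_subset(child, parent));
	return parent;
}
```

With the reported topology, fixup_parent_span(0x5, 0x1) (SMT span 0,2 as
child, MC span 0 as parent) prints the warning and returns 0x5, while
fixup_parent_span(0x5, 0xf) leaves the span 0-3 untouched.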

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22  9:47                                   ` Peter Zijlstra
@ 2014-07-22 10:38                                     ` Peter Zijlstra
  2014-07-22 12:10                                       ` Bruno Wolff III
  2014-07-22 12:12                                     ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Dietmar Eggemann
                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-22 10:38 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 11:47:40AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
> > > Is there more I can do to help with this now? Or should I just wait for
> > > patches to test?
> > 
> > Yeah, sorry, was wiped out today. I'll go stare harder at the P4
> > topology setup code tomorrow. Something fishy there.
> 
> Does this make your machine boot again (while giving an error)?
> 
> It tries to robustify the topology setup a bit, crashing on crap input
> should be avoided if possible of course.
> 
> I'll go stare at the x86/P4 topology code like promised.

Could you provide the output of cpuid and cpuid -r for your machine?
This code is magic and I've no idea what your machine is telling it to
do :/




* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 10:38                                     ` Peter Zijlstra
@ 2014-07-22 12:10                                       ` Bruno Wolff III
  2014-07-22 13:03                                         ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-22 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 316 bytes --]

On Tue, Jul 22, 2014 at 12:38:57 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>
>Could you provide the output of cpuid and cpuid -r for your machine?
>This code is magic and I've no idea what your machine is telling it to
>do :/

I am attaching both sets of output. (I also added copies to the bug report.)

[-- Attachment #2: cpuid.out --]
[-- Type: text/plain, Size: 20500 bytes --]

CPU 0:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x2 (2)
      stepping id     = 0x9 (9)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = Intel Pentium 4 (Northwood D1) / Xeon (Prestonia D1) / Mobile Pentium 4 (Northwood D1) / Mobile Pentium 4 Processor-M (Northwood D1) / Celeron 478-pin (Northwood D1), .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count                      = 0x2 (2)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0xb (11)
   brand id = 0x0b (11): Intel Xeon, .13um
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = true
      thermal monitor and clock ctrl         = true
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = true
      hyper-threading / multi-core supported = true
      therm. monitor                         = true
      IA64                                   = false
      pending break event                    = true
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = false
      PCLMULDQ instruction                    = false
      64-bit debug store                      = false
      MONITOR/MWAIT                           = false
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      thermal monitor 2                       = false
      SSSE3 extensions                        = false
      context ID: adaptive or shared L1 data  = true
      FMA instruction                         = false
      CMPXCHG16B instruction                  = false
      xTPR disable                            = true
      perfmon and debug                       = false
      process context identifiers             = false
      direct cache access                     = false
      SSE4.1 extensions                       = false
      SSE4.2 extensions                       = false
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = false
      time stamp counter deadline             = false
      AES instruction                         = false
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   cache and TLB information (2):
      0x50: instruction TLB: 4K & 2M/4M pages, 64 entries
      0x5b: data TLB: 4K & 4M pages, 64 entries
      0x66: L1 data cache: 8K, 4-way, 64 byte lines
      0x40: No L3 cache
      0x70: Trace cache: 12K-uop, 8-way
      0x7b: L2 cache: 512K, 8-way, sectored, 64 byte lines
   extended feature flags (0x80000001/edx):
      SYSCALL and SYSRET instructions        = false
      execution disable                      = false
      1-GB large page support                = false
      RDTSCP                                 = false
      64-bit extensions technology available = false
   Intel feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = false
      LZCNT advanced bit manipulation        = false
      3DNow! PREFETCH/PREFETCHW instructions = false
   brand = "                  Intel(R) Xeon(TM) CPU 2.66GHz"
   (multi-processing synth): hyper-threaded (t=2)
   (multi-processing method): Intel leaf 1
   (synth) = Intel Xeon (Prestonia D1), .13um
CPU 1:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x2 (2)
      stepping id     = 0x9 (9)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = Intel Pentium 4 (Northwood D1) / Xeon (Prestonia D1) / Mobile Pentium 4 (Northwood D1) / Mobile Pentium 4 Processor-M (Northwood D1) / Celeron 478-pin (Northwood D1), .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x6 (6)
      cpu count                      = 0x2 (2)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0xb (11)
   brand id = 0x0b (11): Intel Xeon, .13um
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = true
      thermal monitor and clock ctrl         = true
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = true
      hyper-threading / multi-core supported = true
      therm. monitor                         = true
      IA64                                   = false
      pending break event                    = true
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = false
      PCLMULDQ instruction                    = false
      64-bit debug store                      = false
      MONITOR/MWAIT                           = false
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      thermal monitor 2                       = false
      SSSE3 extensions                        = false
      context ID: adaptive or shared L1 data  = true
      FMA instruction                         = false
      CMPXCHG16B instruction                  = false
      xTPR disable                            = true
      perfmon and debug                       = false
      process context identifiers             = false
      direct cache access                     = false
      SSE4.1 extensions                       = false
      SSE4.2 extensions                       = false
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = false
      time stamp counter deadline             = false
      AES instruction                         = false
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   cache and TLB information (2):
      0x50: instruction TLB: 4K & 2M/4M pages, 64 entries
      0x5b: data TLB: 4K & 4M pages, 64 entries
      0x66: L1 data cache: 8K, 4-way, 64 byte lines
      0x40: No L3 cache
      0x70: Trace cache: 12K-uop, 8-way
      0x7b: L2 cache: 512K, 8-way, sectored, 64 byte lines
   extended feature flags (0x80000001/edx):
      SYSCALL and SYSRET instructions        = false
      execution disable                      = false
      1-GB large page support                = false
      RDTSCP                                 = false
      64-bit extensions technology available = false
   Intel feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = false
      LZCNT advanced bit manipulation        = false
      3DNow! PREFETCH/PREFETCHW instructions = false
   brand = "                  Intel(R) Xeon(TM) CPU 2.66GHz"
   (multi-processing synth): hyper-threaded (t=2)
   (multi-processing method): Intel leaf 1
   (synth) = Intel Xeon (Prestonia D1), .13um
CPU 2:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x2 (2)
      stepping id     = 0x9 (9)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = Intel Pentium 4 (Northwood D1) / Xeon (Prestonia D1) / Mobile Pentium 4 (Northwood D1) / Mobile Pentium 4 Processor-M (Northwood D1) / Celeron 478-pin (Northwood D1), .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x1 (1)
      cpu count                      = 0x2 (2)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0xb (11)
   brand id = 0x0b (11): Intel Xeon, .13um
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = true
      thermal monitor and clock ctrl         = true
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = true
      hyper-threading / multi-core supported = true
      therm. monitor                         = true
      IA64                                   = false
      pending break event                    = true
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = false
      PCLMULDQ instruction                    = false
      64-bit debug store                      = false
      MONITOR/MWAIT                           = false
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      thermal monitor 2                       = false
      SSSE3 extensions                        = false
      context ID: adaptive or shared L1 data  = true
      FMA instruction                         = false
      CMPXCHG16B instruction                  = false
      xTPR disable                            = true
      perfmon and debug                       = false
      process context identifiers             = false
      direct cache access                     = false
      SSE4.1 extensions                       = false
      SSE4.2 extensions                       = false
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = false
      time stamp counter deadline             = false
      AES instruction                         = false
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   cache and TLB information (2):
      0x50: instruction TLB: 4K & 2M/4M pages, 64 entries
      0x5b: data TLB: 4K & 4M pages, 64 entries
      0x66: L1 data cache: 8K, 4-way, 64 byte lines
      0x40: No L3 cache
      0x70: Trace cache: 12K-uop, 8-way
      0x7b: L2 cache: 512K, 8-way, sectored, 64 byte lines
   extended feature flags (0x80000001/edx):
      SYSCALL and SYSRET instructions        = false
      execution disable                      = false
      1-GB large page support                = false
      RDTSCP                                 = false
      64-bit extensions technology available = false
   Intel feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = false
      LZCNT advanced bit manipulation        = false
      3DNow! PREFETCH/PREFETCHW instructions = false
   brand = "                  Intel(R) Xeon(TM) CPU 2.66GHz"
   (multi-processing synth): hyper-threaded (t=2)
   (multi-processing method): Intel leaf 1
   (synth) = Intel Xeon (Prestonia D1), .13um
CPU 3:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model           = 0x2 (2)
      stepping id     = 0x9 (9)
      extended family = 0x0 (0)
      extended model  = 0x0 (0)
      (simple synth)  = Intel Pentium 4 (Northwood D1) / Xeon (Prestonia D1) / Mobile Pentium 4 (Northwood D1) / Mobile Pentium 4 Processor-M (Northwood D1) / Celeron 478-pin (Northwood D1), .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x7 (7)
      cpu count                      = 0x2 (2)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0xb (11)
   brand id = 0x0b (11): Intel Xeon, .13um
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = true
      thermal monitor and clock ctrl         = true
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = true
      hyper-threading / multi-core supported = true
      therm. monitor                         = true
      IA64                                   = false
      pending break event                    = true
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = false
      PCLMULDQ instruction                    = false
      64-bit debug store                      = false
      MONITOR/MWAIT                           = false
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      thermal monitor 2                       = false
      SSSE3 extensions                        = false
      context ID: adaptive or shared L1 data  = true
      FMA instruction                         = false
      CMPXCHG16B instruction                  = false
      xTPR disable                            = true
      perfmon and debug                       = false
      process context identifiers             = false
      direct cache access                     = false
      SSE4.1 extensions                       = false
      SSE4.2 extensions                       = false
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = false
      time stamp counter deadline             = false
      AES instruction                         = false
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   cache and TLB information (2):
      0x50: instruction TLB: 4K & 2M/4M pages, 64 entries
      0x5b: data TLB: 4K & 4M pages, 64 entries
      0x66: L1 data cache: 8K, 4-way, 64 byte lines
      0x40: No L3 cache
      0x70: Trace cache: 12K-uop, 8-way
      0x7b: L2 cache: 512K, 8-way, sectored, 64 byte lines
   extended feature flags (0x80000001/edx):
      SYSCALL and SYSRET instructions        = false
      execution disable                      = false
      1-GB large page support                = false
      RDTSCP                                 = false
      64-bit extensions technology available = false
   Intel feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = false
      LZCNT advanced bit manipulation        = false
      3DNow! PREFETCH/PREFETCHW instructions = false
   brand = "                  Intel(R) Xeon(TM) CPU 2.66GHz"
   (multi-processing synth): hyper-threaded (t=2)
   (multi-processing method): Intel leaf 1
   (synth) = Intel Xeon (Prestonia D1), .13um

[-- Attachment #3: cpuidr.out --]
[-- Type: text/plain, Size: 3228 bytes --]

CPU 0:
   0x00000000 0x00: eax=0x00000002 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x00000f29 ebx=0x0002080b ecx=0x00004400 edx=0xbfebfbff
   0x00000002 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0x80000000 0x00: eax=0x80000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000002 0x00: eax=0x20202020 ebx=0x20202020 ecx=0x20202020 edx=0x20202020
   0x80000003 0x00: eax=0x6e492020 ebx=0x286c6574 ecx=0x58202952 edx=0x286e6f65
   0x80000004 0x00: eax=0x20294d54 ebx=0x20555043 ecx=0x36362e32 edx=0x007a4847
   0x80860000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0xc0000000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
CPU 1:
   0x00000000 0x00: eax=0x00000002 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x00000f29 ebx=0x0602080b ecx=0x00004400 edx=0xbfebfbff
   0x00000002 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0x80000000 0x00: eax=0x80000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000002 0x00: eax=0x20202020 ebx=0x20202020 ecx=0x20202020 edx=0x20202020
   0x80000003 0x00: eax=0x6e492020 ebx=0x286c6574 ecx=0x58202952 edx=0x286e6f65
   0x80000004 0x00: eax=0x20294d54 ebx=0x20555043 ecx=0x36362e32 edx=0x007a4847
   0x80860000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0xc0000000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
CPU 2:
   0x00000000 0x00: eax=0x00000002 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x00000f29 ebx=0x0102080b ecx=0x00004400 edx=0xbfebfbff
   0x00000002 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0x80000000 0x00: eax=0x80000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000002 0x00: eax=0x20202020 ebx=0x20202020 ecx=0x20202020 edx=0x20202020
   0x80000003 0x00: eax=0x6e492020 ebx=0x286c6574 ecx=0x58202952 edx=0x286e6f65
   0x80000004 0x00: eax=0x20294d54 ebx=0x20555043 ecx=0x36362e32 edx=0x007a4847
   0x80860000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0xc0000000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
CPU 3:
   0x00000000 0x00: eax=0x00000002 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x00000f29 ebx=0x0702080b ecx=0x00004400 edx=0xbfebfbff
   0x00000002 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0x80000000 0x00: eax=0x80000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000002 0x00: eax=0x20202020 ebx=0x20202020 ecx=0x20202020 edx=0x20202020
   0x80000003 0x00: eax=0x6e492020 ebx=0x286c6574 ecx=0x58202952 edx=0x286e6f65
   0x80000004 0x00: eax=0x20294d54 ebx=0x20555043 ecx=0x36362e32 edx=0x007a4847
   0x80860000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
   0xc0000000 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040
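
For readers comparing the raw dump with the decoded one, the following is a
small sketch of how the leaf 1 ebx values above break down. The field layout
(brand index in bits 7:0, CLFLUSH line size in 8-byte units in bits 15:8,
logical processor count in bits 23:16, initial APIC ID in bits 31:24) is the
documented one; struct leaf1_ebx, decode_leaf1_ebx() and package_id() are
hypothetical names, and the package computation (shift the APIC ID right by
the number of bits needed to count the siblings) is only an assumption about
how the legacy HT detection groups threads, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of CPUID leaf 1, register ebx. */
struct leaf1_ebx {
	unsigned int brand_index;      /* bits  7:0  */
	unsigned int clflush_bytes;    /* bits 15:8, stored in units of 8 bytes */
	unsigned int logical_count;    /* bits 23:16 */
	unsigned int initial_apic_id;  /* bits 31:24 */
};

static struct leaf1_ebx decode_leaf1_ebx(uint32_t ebx)
{
	struct leaf1_ebx d;

	d.brand_index     = ebx & 0xff;
	d.clflush_bytes   = ((ebx >> 8) & 0xff) * 8;
	d.logical_count   = (ebx >> 16) & 0xff;
	d.initial_apic_id = ebx >> 24;
	return d;
}

/*
 * Assumed grouping rule: drop the low APIC ID bits that number the logical
 * processors inside one package; what remains identifies the package.
 */
static unsigned int package_id(struct leaf1_ebx d)
{
	unsigned int shift = 0;

	assert(d.logical_count != 0);
	while ((1u << shift) < d.logical_count)
		shift++;
	return d.initial_apic_id >> shift;
}
```

Applied to the four ebx values in the dump (0x0002080b, 0x0602080b,
0x0102080b, 0x0702080b), this gives APIC IDs 0, 6, 1, 7 and package ids
0, 3, 0, 3, so CPUs 0 and 2 pair up, as do CPUs 1 and 3. That matches the
SMT masks 0,2 and 1,3 in the build_sched_domain output.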


* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22  9:47                                   ` Peter Zijlstra
  2014-07-22 10:38                                     ` Peter Zijlstra
@ 2014-07-22 12:12                                     ` Dietmar Eggemann
  2014-07-22 12:57                                     ` Bruno Wolff III
  2014-07-28  8:28                                     ` [tip:sched/core] sched: Robustify topology setup tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2014-07-22 12:12 UTC (permalink / raw)
  To: Peter Zijlstra, Bruno Wolff III
  Cc: Josh Boyer, mingo, linux-kernel, H. Peter Anvin, Thomas Gleixner

On 22/07/14 10:47, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
>>> Is there more I can do to help with this now? Or should I just wait for
>>> patches to test?
>>
>> Yeah, sorry, was wiped out today. I'll go stare harder at the P4
>> topology setup code tomorrow. Something fishy there.
> 
> Does this make your machine boot again (while giving an error)?
> 
> It tries to robustify the topology setup a bit, crashing on crap input
> should be avoided if possible of course.
> 
> I'll go stare at the x86/P4 topology code like promised.
> 
> ---
> Subject: sched: Robustify topology setup
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Jul 21 23:07:06 CEST 2014
> 
> We hard assume that higher topology levels are strict supersets of
> lower levels.

IMHO, we require only that higher topology levels are supersets of lower
levels, not strict (proper) supersets.

AFAICS, the patch itself requires only supersets, i.e. on ARM TC2 with
the following change in cpu_corepower_mask:

 const struct cpumask *cpu_corepower_mask(int cpu)
 {
-       return &cpu_topology[cpu].thread_sibling;
+       return &cpu_topology[cpu].core_sibling;
 }

I get:

...
build_sched_domain: cpu: 0 level: GMC cpu_map: 0-4 tl->mask: 0-1
build_sched_domain: cpu: 0 level: MC cpu_map: 0-4 tl->mask: 0-1
build_sched_domain: cpu: 0 level: DIE cpu_map: 0-4 tl->mask: 0-4
...

without hitting the newly introduced pr_err's.
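
The superset versus strict-superset distinction can be illustrated with a small user-space sketch. This is not kernel code: mask_t and the helper names below are made up for illustration, with a plain 32-bit word standing in for struct cpumask. Only mask_subset() is meant to mirror what cpumask_subset() tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct cpumask: bit n set means CPU n is in the mask. */
typedef uint32_t mask_t;

/* Build a mask covering CPUs first..last inclusive (last < 32). */
static mask_t cpu_range(unsigned first, unsigned last)
{
	assert(first <= last && last < 32);
	mask_t upto_last = (last == 31) ? UINT32_MAX : ((1u << (last + 1)) - 1);
	return upto_last & ~((1u << first) - 1);
}

/* Same meaning as cpumask_subset(a, b): every CPU in a is also in b. */
static bool mask_subset(mask_t a, mask_t b)
{
	return (a & ~b) == 0;
}

/* A proper (strict) subset additionally requires the two masks to differ. */
static bool mask_proper_subset(mask_t a, mask_t b)
{
	return mask_subset(a, b) && a != b;
}
```

With the TC2 masks above, GMC (0-1) against MC (0-1) passes mask_subset() but fails mask_proper_subset(), which is why no pr_err fires even though the two levels are identical.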

> 
> Detect, warn and try to fixup when we encounter this violated.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/n/tip-cgp9j2tk0qnunhtpps3udsom@git.kernel.org
> ---
>  kernel/sched/core.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6480,6 +6480,20 @@ struct sched_domain *build_sched_domain(
>  		sched_domain_level_max = max(sched_domain_level_max, sd->level);
>  		child->parent = sd;
>  		sd->child = child;
> +
> +		if (!cpumask_subset(sched_domain_span(child),
> +				    sched_domain_span(sd))) {
> +			pr_err("BUG: arch topology borken\n");
> +#ifdef CONFIG_SCHED_DEBUG
> +			pr_err("     the %s domain not a subset of the %s domain\n",
> +					child->name, sd->name);
> +#endif
> +			/* Fixup, ensure @sd has at least @child cpus. */
> +			cpumask_or(sched_domain_span(sd),
> +				   sched_domain_span(sd),
> +				   sched_domain_span(child));

This fixup will (probably) heal Bruno's issue with its wrong
cpu_coregroup_mask() function.

If I exchange cpu_corepower_mask with cpu_coregroup_mask in arm_topology[]:

 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-       { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-       { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+       { cpu_coregroup_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
+       { cpu_corepower_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif

I get:

...
build_sched_domain: cpu: 0 level: GMC cpu_map: 0-4 tl->mask: 0-1
build_sched_domain: cpu: 0 level: MC cpu_map: 0-4 tl->mask: 0
BUG: arch topology borken
     the GMC domain not a subset of the MC domain
build_sched_domain: cpu: 0 level: DIE cpu_map: 0-4 tl->mask: 0-4
...

cat /proc/schedstat
...
cpu0 0 0 16719 6169 10937 6392 5348510220 2935348625 10448
domain0 03 19190 19168 9 10265 13 0 0 19168 16 16 0 0 0 0 0 16 1196 1080
46 43570 75 0 0 1080 0 0 0 0 0 0 0 0 0 2947 280 0
domain1 1f 18768 18763 3 3006 2 0 9 18055 6 6 0 0 0 0 0 1 1125 996 94
81038 43 0 18 978 0 0 0 0 0 0 0 0 0 1582 172 0

# cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
GMC
DIE

so MC level gets changed to mask 0-1.
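
The detect-and-fixup step can be modelled outside the kernel roughly as follows. This is a sketch under assumptions: struct toy_domain and check_and_fixup() are hypothetical names, and a 32-bit word replaces the real sched_domain span. The logic follows the hunk quoted above: test for subset, warn, then OR the child span into the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t mask_t;

/* Hypothetical, much reduced stand-in for struct sched_domain. */
struct toy_domain {
	const char *name;
	mask_t span;	/* bit n set means CPU n is in the domain */
};

/*
 * If the child span is not contained in the parent span, warn and widen
 * the parent so that it has at least the child's CPUs.
 * Returns true when a fixup was applied.
 */
static bool check_and_fixup(const struct toy_domain *child,
			    struct toy_domain *sd)
{
	assert(child && sd);
	if ((child->span & ~sd->span) == 0)
		return false;

	fprintf(stderr, "the %s domain not a subset of the %s domain\n",
		child->name, sd->name);
	sd->span |= child->span;
	return true;
}
```

Feeding it GMC = 0x3 (CPUs 0-1) and MC = 0x1 (CPU 0) widens MC to 0x3, the same 0-1 mask reported above; a second call with the same pair is then a no-op.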

> +		}
> +
>  	}
>  	set_domain_attribute(sd, attr);
>  
> 

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22  9:47                                   ` Peter Zijlstra
  2014-07-22 10:38                                     ` Peter Zijlstra
  2014-07-22 12:12                                     ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Dietmar Eggemann
@ 2014-07-22 12:57                                     ` Bruno Wolff III
  2014-07-28  8:28                                     ` [tip:sched/core] sched: Robustify topology setup tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-22 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 11:47:40 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
>> > Is there more I can do to help with this now? Or should I just wait for
>> > patches to test?
>>
>> Yeah, sorry, was wiped out today. I'll go stare harder at the P4
>> topology setup code tomorrow. Something fishy there.
>
>Does this make your machine boot again (while giving an error)?

I won't be able to actually test this until after work. 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 12:10                                       ` Bruno Wolff III
@ 2014-07-22 13:03                                         ` Peter Zijlstra
  2014-07-22 13:26                                           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-22 13:03 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 07:10:01AM -0500, Bruno Wolff III wrote:
> On Tue, Jul 22, 2014 at 12:38:57 +0200,
>  Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >Could you provide the output of cpuid and cpuid -r for your machine?
> >This code is magic and I've no idea what your machine is telling it to
> >do :/
> 
> I am attaching both sets of output. (I also added copies to the bug report.)

Thanks! and yes I now see (and I should have seen before) what is
'broken'.

>    0x00000000 0x00: eax=0x00000002 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69

This gives us cpuid_level=0x02

>    0x00000002 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040

Which means that init_intel_cacheinfo() will not use cpuid4 for
cacheinfo and we revert to cpuid2, which translates into:

>    cache and TLB information (2):
>       0x50: instruction TLB: 4K & 2M/4M pages, 64 entries
>       0x5b: data TLB: 4K & 4M pages, 64 entries
>       0x66: L1 data cache: 8K, 4-way, 64 byte lines
>       0x40: No L3 cache
>       0x70: Trace cache: 12K-uop, 8-way
>       0x7b: L2 cache: 512K, 8-way, sectored, 64 byte lines
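
For reference, the descriptor list above can be recovered from the raw cpuid2 registers with a few lines of C. This is a user-space sketch, not the init_intel_cacheinfo() code; the function name is made up. It relies on the documented leaf 2 layout: each register holds up to four one-byte descriptors, bit 31 set marks a register as not holding descriptors, and the low byte of eax is a repeat count rather than a descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the non-zero descriptor bytes from one CPUID leaf 2 result.
 * regs[] holds eax, ebx, ecx, edx in that order.
 * Returns the number of descriptors written to out[] (at most 15).
 */
static size_t cpuid2_descriptors(const uint32_t regs[4], uint8_t out[15])
{
	size_t n = 0;

	assert(regs && out);
	for (int r = 0; r < 4; r++) {
		/* Bit 31 set: this register carries no valid descriptors. */
		if (regs[r] & 0x80000000u)
			continue;

		for (int b = 0; b < 4; b++) {
			uint8_t desc = (regs[r] >> (8 * b)) & 0xff;

			/* The low byte of eax is a repeat count, skip it. */
			if (r == 0 && b == 0)
				continue;
			if (desc != 0)
				out[n++] = desc;
		}
	}
	return n;
}
```

For eax=0x665b5001 ebx=0 ecx=0 edx=0x007b7040 this yields 0x50, 0x5b, 0x66, 0x40, 0x70 and 0x7b, matching the six entries listed by the cpuid tool.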

Now the problem is that cpu_llc_id is only set on new_l[23], and set to
l[23]_id. Both new_l[23] and l[23]_id are only set in the cpuid4 case.

So for this P4 cpu_llc_id remains unset.

Furthermore cpuid2 does not include cpu masks, so we need to use cpuid1:

>    (multi-processing method): Intel leaf 1

>   0x00000001 0x00: eax=0x00000f29 ebx=0x0002080b ecx=0x00004400 edx=0xbfebfbff

to reconstruct the topology, with the added assumption that SMT threads
share all caches.
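
As a rough illustration of what cpuid1 offers here, the sketch below pulls the sibling count and initial APIC ID out of ebx and derives a package id by shifting the SMT bits away. It is an approximation of the legacy detection, not a copy of detect_ht(); struct leaf1_topo, order_of() and decode_leaf1_ebx() are hypothetical names, and the sibling field is only meaningful when the HTT bit (edx bit 28) is set, as it is in edx=0xbfebfbff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of interest in CPUID.1:EBX. */
struct leaf1_topo {
	unsigned siblings;	 /* EBX[23:16]: logical CPUs per package */
	unsigned initial_apicid; /* EBX[31:24] */
	unsigned pkg_id;	 /* APIC ID with the SMT bits shifted out */
};

/* Smallest shift such that (1 << shift) >= n, for n >= 1. */
static unsigned order_of(unsigned n)
{
	unsigned shift = 0;

	assert(n >= 1);
	while ((1u << shift) < n)
		shift++;
	return shift;
}

static struct leaf1_topo decode_leaf1_ebx(uint32_t ebx)
{
	struct leaf1_topo t;

	t.siblings = (ebx >> 16) & 0xff;
	t.initial_apicid = (ebx >> 24) & 0xff;
	/* Treat a reported count of 0 as a single logical CPU. */
	t.pkg_id = t.initial_apicid >> order_of(t.siblings ? t.siblings : 1);
	return t;
}
```

With ebx=0x0002080b this gives 2 siblings, APIC ID 0 and package 0; the CPU 3 value ebx=0x0702080b gives 2 siblings, APIC ID 7 and package 3.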

Oh, of course we do SMP detection and setup after the cache setup...
lovely.

/me goes bang head against wall


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 13:03                                         ` Peter Zijlstra
@ 2014-07-22 13:26                                           ` Peter Zijlstra
  2014-07-22 13:35                                             ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-22 13:26 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 03:03:43PM +0200, Peter Zijlstra wrote:
> Oh, of course we do SMP detection and setup after the cache setup...
> lovely.
> 
> /me goes bang head against wall

hpa, could we move the legacy cpuid1/cpuid4 topology detection muck up,
preferably right after detect_extended_topology()?

I need c->phys_proc_id in init_intel_cacheinfo() for machines with
cpuid_level < 4.
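
The ordering dependency can be seen in a toy model: a fallback that copies phys_proc_id into the LLC id only helps if phys_proc_id is already filled in when the cache code runs. Everything below is illustrative. struct toy_cpu, TOY_BAD_ID (standing in for BAD_APICID) and both helpers are made-up names, and the package ids used in the example are assumptions chosen so that CPUs 0,2 and 1,3 pair up.

```c
#include <assert.h>
#include <stdint.h>

#define NR_TOY_CPUS	4
#define TOY_BAD_ID	0xFFFFu	/* stand-in for BAD_APICID */

/* Hypothetical per-CPU record holding the two ids of interest. */
struct toy_cpu {
	uint16_t phys_proc_id;
	uint16_t llc_id;
};

/* If no cache leaf provided an LLC id, fall back to the package id. */
static void llc_fallback(struct toy_cpu *c)
{
	assert(c);
	if (c->llc_id == TOY_BAD_ID)
		c->llc_id = c->phys_proc_id;
}

/*
 * Mask of CPUs sharing a last-level cache with @cpu. A CPU whose LLC id
 * is still unset is treated as sharing with nobody but itself.
 */
static uint32_t llc_mask(const struct toy_cpu cpus[NR_TOY_CPUS], int cpu)
{
	uint32_t mask;

	assert(cpu >= 0 && cpu < NR_TOY_CPUS);
	mask = 1u << cpu;
	if (cpus[cpu].llc_id == TOY_BAD_ID)
		return mask;

	for (int i = 0; i < NR_TOY_CPUS; i++)
		if (cpus[i].llc_id == cpus[cpu].llc_id)
			mask |= 1u << i;
	return mask;
}
```

With package ids {0, 3, 0, 3} and every llc_id unset, llc_mask() for CPU 0 is just 0x1, narrower than the SMT pair 0,2. After llc_fallback() on each CPU it becomes 0x5 (CPUs 0,2), so the LLC mask covers the SMT siblings again.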

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 13:26                                           ` Peter Zijlstra
@ 2014-07-22 13:35                                             ` Peter Zijlstra
  2014-07-22 14:09                                               ` Bruno Wolff III
                                                                 ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-22 13:35 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 03:03:43PM +0200, Peter Zijlstra wrote:
> > Oh, of course we do SMP detection and setup after the cache setup...
> > lovely.
> > 
> > /me goes bang head against wall
> 
> hpa, could we move the legacy cpuid1/cpuid4 topology detection muck up,
> preferably right after detect_extended_topology()?
> 
> I need c->phys_proc_id in init_intel_cacheinfo() for machines with
> cpuid_level < 4.

Something like so.. anything obviously broken?

---
 arch/x86/kernel/cpu/intel.c           | 22 +++++++++++-----------
 arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0fd955778f35..9483ee5b3991 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x86 *c)
 	 */
 	detect_extended_topology(c);
 
+	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
+		/*
+		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
+		 * detection.
+		 */
+		c->x86_max_cores = intel_num_cpu_cores(c);
+#ifdef CONFIG_X86_32
+		detect_ht(c);
+#endif
+	}
+
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9) {
 		unsigned eax = cpuid_eax(10);
@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x86 *c)
 		set_cpu_cap(c, X86_FEATURE_P3);
 #endif
 
-	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
-		/*
-		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
-		 * detection.
-		 */
-		c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
-		detect_ht(c);
-#endif
-	}
-
 	/* Work around errata */
 	srat_detect_node(c);
 
diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index a952e9c85b6f..9c8f7394c612 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c)
 #endif
 	}
 
+#ifdef CONFIG_X86_HT
+	/*
+	 * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
+	 * turns means that the only possibility is SMT (as indicated in
+	 * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
+	 * that SMT shares all caches, we can unconditionally set cpu_llc_id to
+	 * c->phys_proc_id.
+	 */
+	if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
+		per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+#endif
+
 	c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
 
 	return l2;

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 13:35                                             ` Peter Zijlstra
@ 2014-07-22 14:09                                               ` Bruno Wolff III
  2014-07-22 14:18                                                 ` Peter Zijlstra
  2014-07-22 17:05                                               ` H. Peter Anvin
                                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-22 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 15:35:14 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
>
>Something like so.. anything obviously broken?

Do you want me to test this change instead of, or combined with the other 
patch you wanted tested earlier?

>
>---
> arch/x86/kernel/cpu/intel.c           | 22 +++++++++++-----------
> arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
> 2 files changed, 23 insertions(+), 11 deletions(-)
>
>diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
>index 0fd955778f35..9483ee5b3991 100644
>--- a/arch/x86/kernel/cpu/intel.c
>+++ b/arch/x86/kernel/cpu/intel.c
>@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x86 *c)
> 	 */
> 	detect_extended_topology(c);
>
>+	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
>+		/*
>+		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
>+		 * detection.
>+		 */
>+		c->x86_max_cores = intel_num_cpu_cores(c);
>+#ifdef CONFIG_X86_32
>+		detect_ht(c);
>+#endif
>+	}
>+
> 	l2 = init_intel_cacheinfo(c);
> 	if (c->cpuid_level > 9) {
> 		unsigned eax = cpuid_eax(10);
>@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x86 *c)
> 		set_cpu_cap(c, X86_FEATURE_P3);
> #endif
>
>-	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
>-		/*
>-		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
>-		 * detection.
>-		 */
>-		c->x86_max_cores = intel_num_cpu_cores(c);
>-#ifdef CONFIG_X86_32
>-		detect_ht(c);
>-#endif
>-	}
>-
> 	/* Work around errata */
> 	srat_detect_node(c);
>
>diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
>index a952e9c85b6f..9c8f7394c612 100644
>--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
>+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
>@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c)
> #endif
> 	}
>
>+#ifdef CONFIG_X86_HT
>+	/*
>+	 * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
>+	 * turns means that the only possibility is SMT (as indicated in
>+	 * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
>+	 * that SMT shares all caches, we can unconditionally set cpu_llc_id to
>+	 * c->phys_proc_id.
>+	 */
>+	if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
>+		per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
>+#endif
>+
> 	c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
>
> 	return l2;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 14:09                                               ` Bruno Wolff III
@ 2014-07-22 14:18                                                 ` Peter Zijlstra
  2014-07-23  1:37                                                   ` Bruno Wolff III
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-22 14:18 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 09:09:12AM -0500, Bruno Wolff III wrote:
> On Tue, Jul 22, 2014 at 15:35:14 +0200,
>  Peter Zijlstra <peterz@infradead.org> wrote:
> >On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
> >
> >Something like so.. anything obviously broken?
> 
> Do you want me to test this change instead of, or combined with the other
> patch you wanted tested earlier?

You can put this on top of them. I hope that this will make the pr_err()
introduced in the robustify patch go away.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 13:35                                             ` Peter Zijlstra
  2014-07-22 14:09                                               ` Bruno Wolff III
@ 2014-07-22 17:05                                               ` H. Peter Anvin
  2014-07-23 15:11                                               ` Peter Zijlstra
  2014-07-23 15:39                                               ` [tip:x86/urgent] x86, cpu: Fix cache topology for early P4-SMT tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2014-07-22 17:05 UTC (permalink / raw)
  To: Peter Zijlstra, Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel, Thomas Gleixner

On 07/22/2014 06:35 AM, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
>> On Tue, Jul 22, 2014 at 03:03:43PM +0200, Peter Zijlstra wrote:
>>> Oh, of course we do SMP detection and setup after the cache setup...
>>> lovely.
>>>
>>> /me goes bang head against wall
>>
>> hpa, could we move the legacy cpuid1/cpuid4 topology detection muck up,
>> preferably right after detect_extended_topology()?
>>
>> I need c->phys_proc_id in init_intel_cacheinfo() for machines with
>> cpuid_level < 4.
> 
> Something like so.. anything obviously broken?
> 

Nothing *obvious*.  I should stare a pass at the code.

	-hpa



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 14:18                                                 ` Peter Zijlstra
@ 2014-07-23  1:37                                                   ` Bruno Wolff III
  2014-07-23  6:51                                                     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-23  1:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 16:18:55 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>
>You can put this on top of them. I hope that this will make the pr_err()
>introduced in the robustify patch go away.

I went to 3.16-rc6 and then reapplied three patches from your previous 
email messages. The dmesg output and the diff from 3.16-rc6 have been 
added to https://bugzilla.kernel.org/show_bug.cgi?id=80251 .
The dmesg output is at: https://bugzilla.kernel.org/attachment.cgi?id=143961
The combined diff is at: https://bugzilla.kernel.org/attachment.cgi?id=143971

What I think you are probably looking for in the dmesg output:
[    0.251061] __sdt_alloc: allocated f515b020 with cpus: 
[    0.251149] __sdt_alloc: allocated f515b0e0 with cpus: 
[    0.251231] __sdt_alloc: allocated f515b120 with cpus: 
[    0.251313] __sdt_alloc: allocated f515b160 with cpus: 
[    0.251397] __sdt_alloc: allocated f515b1a0 with cpus: 
[    0.251479] __sdt_alloc: allocated f515b1e0 with cpus: 
[    0.251561] __sdt_alloc: allocated f515b220 with cpus: 
[    0.251643] __sdt_alloc: allocated f515b260 with cpus: 
[    0.252011] __sdt_alloc: allocated f515b2a0 with cpus: 
[    0.252095] __sdt_alloc: allocated f515b2e0 with cpus: 
[    0.252184] __sdt_alloc: allocated f515b320 with cpus: 
[    0.252266] __sdt_alloc: allocated f515b360 with cpus: 
[    0.252355] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
[    0.252441] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0,2
[    0.252526] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.252611] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
[    0.252696] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1,3
[    0.252781] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.252866] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
[    0.252951] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 0,2
[    0.253005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.253091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
[    0.253176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 1,3
[    0.253261] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
[    0.254004] build_sched_groups: got group f515b020 with cpus: 
[    0.254088] build_sched_groups: got group f515b120 with cpus: 
[    0.254170] build_sched_groups: got group f515b1a0 with cpus: 
[    0.254253] build_sched_groups: got group f515b2a0 with cpus: 
[    0.254336] build_sched_groups: got group f515b2e0 with cpus: 
[    0.254419] build_sched_groups: got group f515b0e0 with cpus: 
[    0.254502] build_sched_groups: got group f515b160 with cpus: 
[    0.254585] build_sched_groups: got group f515b1e0 with cpus: 
[    0.254680] CPU0 attaching sched-domain:
[    0.254684]  domain 0: span 0,2 level SMT
[    0.254687]   groups: 0 (cpu_capacity = 586) 2 (cpu_capacity = 588)
[    0.254695]   domain 1: span 0-3 level DIE
[    0.254698]    groups: 0,2 (cpu_capacity = 1174) 1,3 (cpu_capacity = 1176)
[    0.254709] CPU1 attaching sched-domain:
[    0.254711]  domain 0: span 1,3 level SMT
[    0.254714]   groups: 1 (cpu_capacity = 588) 3 (cpu_capacity = 588)
[    0.254721]   domain 1: span 0-3 level DIE
[    0.254724]    groups: 1,3 (cpu_capacity = 1176) 0,2 (cpu_capacity = 1174)
[    0.254733] CPU2 attaching sched-domain:
[    0.254735]  domain 0: span 0,2 level SMT
[    0.254738]   groups: 2 (cpu_capacity = 588) 0 (cpu_capacity = 586)
[    0.254745]   domain 1: span 0-3 level DIE
[    0.254747]    groups: 0,2 (cpu_capacity = 1174) 1,3 (cpu_capacity = 1176)
[    0.254756] CPU3 attaching sched-domain:
[    0.254758]  domain 0: span 1,3 level SMT
[    0.254761]   groups: 3 (cpu_capacity = 588) 1 (cpu_capacity = 588)
[    0.254768]   domain 1: span 0-3 level DIE
[    0.254770]    groups: 1,3 (cpu_capacity = 1176) 0,2 (cpu_capacity = 1174)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-23  1:37                                                   ` Bruno Wolff III
@ 2014-07-23  6:51                                                     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-23  6:51 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Tue, Jul 22, 2014 at 08:37:19PM -0500, Bruno Wolff III wrote:
>                build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [    0.252441] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0,2
> [    0.252526] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3

W00t! that seems to have cured it.

Thanks Bruno, I'll go write up a proper changelog and such.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-22 13:35                                             ` Peter Zijlstra
  2014-07-22 14:09                                               ` Bruno Wolff III
  2014-07-22 17:05                                               ` H. Peter Anvin
@ 2014-07-23 15:11                                               ` Peter Zijlstra
  2014-07-23 15:12                                                 ` H. Peter Anvin
  2014-07-24  1:45                                                 ` Bruno Wolff III
  2014-07-23 15:39                                               ` [tip:x86/urgent] x86, cpu: Fix cache topology for early P4-SMT tip-bot for Peter Zijlstra
  3 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2014-07-23 15:11 UTC (permalink / raw)
  To: Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner


OK, so that's become the below patch. I'll feed it to Ingo if that's OK
with hpa.

---
Subject: x86: Fix cache topology for early P4-SMT
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 22 Jul 2014 15:35:14 +0200

P4 systems with cpuid level < 4 can have SMT, but the cache topology
description available (cpuid2) does not include SMP information.

Now we know that SMT shares all cache levels, and therefore we can
mark all available cache levels as shared.

We do this by setting cpu_llc_id to ->phys_proc_id, since that's
the same for each SMT thread. We can do this unconditionally since if
there's no SMT it's still true: the one CPU shares cache with only
itself.

This fixes a problem where such CPUs report an incorrect LLC CPU mask.

This in turn fixes a crash in the scheduler where the topology was
built wrong; it assumes the LLC mask to include at least the SMT CPUs.

Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140722133514.GM12054@laptop.lan
---
 arch/x86/kernel/cpu/intel.c           |   22 +++++++++++-----------
 arch/x86/kernel/cpu/intel_cacheinfo.c |   12 ++++++++++++
 2 files changed, 23 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x8
 	 */
 	detect_extended_topology(c);
 
+	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
+		/*
+		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
+		 * detection.
+		 */
+		c->x86_max_cores = intel_num_cpu_cores(c);
+#ifdef CONFIG_X86_32
+		detect_ht(c);
+#endif
+	}
+
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9) {
 		unsigned eax = cpuid_eax(10);
@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x8
 		set_cpu_cap(c, X86_FEATURE_P3);
 #endif
 
-	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
-		/*
-		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
-		 * detection.
-		 */
-		c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
-		detect_ht(c);
-#endif
-	}
-
 	/* Work around errata */
 	srat_detect_node(c);
 
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct
 #endif
 	}
 
+#ifdef CONFIG_X86_HT
+	/*
+	 * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
+	 * turns means that the only possibility is SMT (as indicated in
+	 * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
+	 * that SMT shares all caches, we can unconditionally set cpu_llc_id to
+	 * c->phys_proc_id.
+	 */
+	if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
+		per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+#endif
+
 	c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
 
 	return l2;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-23 15:11                                               ` Peter Zijlstra
@ 2014-07-23 15:12                                                 ` H. Peter Anvin
  2014-07-24  1:45                                                 ` Bruno Wolff III
  1 sibling, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2014-07-23 15:12 UTC (permalink / raw)
  To: Peter Zijlstra, Bruno Wolff III
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel, Thomas Gleixner

On 07/23/2014 08:11 AM, Peter Zijlstra wrote:
> 
> OK, so that's become the below patch. I'll feed it to Ingo if that's OK
> with hpa.
> 

I'll grab it directly, it is a bit quicker that way.

	-hpa



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [tip:x86/urgent] x86, cpu: Fix cache topology for early P4-SMT
  2014-07-22 13:35                                             ` Peter Zijlstra
                                                                 ` (2 preceding siblings ...)
  2014-07-23 15:11                                               ` Peter Zijlstra
@ 2014-07-23 15:39                                               ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-07-23 15:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, dietmar.eggemann, peterz, jwboyer, bruno, tglx

Commit-ID:  2a2261553dd1472ca574acadbd93e12f44c4e6d5
Gitweb:     http://git.kernel.org/tip/2a2261553dd1472ca574acadbd93e12f44c4e6d5
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 22 Jul 2014 15:35:14 +0200
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Wed, 23 Jul 2014 08:16:17 -0700

x86, cpu: Fix cache topology for early P4-SMT

P4 systems with cpuid level < 4 can have SMT, but the cache topology
description available (cpuid2) does not include SMP information.

Now we know that SMT shares all cache levels, and therefore we can
mark all available cache levels as shared.

We do this by setting cpu_llc_id to ->phys_proc_id, since that's
the same for each SMT thread. We can do this unconditionally since if
there's no SMT it's still true: the one CPU shares cache with only
itself.

This fixes a problem where such CPUs report an incorrect LLC CPU mask.

This in turn fixes a crash in the scheduler where the topology was
built wrong; it assumes the LLC mask to include at least the SMT CPUs.

Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140722133514.GM12054@laptop.lan
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/kernel/cpu/intel.c           | 22 +++++++++++-----------
 arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index a800290..f9e4fdd 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x86 *c)
 	 */
 	detect_extended_topology(c);
 
+	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
+		/*
+		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
+		 * detection.
+		 */
+		c->x86_max_cores = intel_num_cpu_cores(c);
+#ifdef CONFIG_X86_32
+		detect_ht(c);
+#endif
+	}
+
 	l2 = init_intel_cacheinfo(c);
 	if (c->cpuid_level > 9) {
 		unsigned eax = cpuid_eax(10);
@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x86 *c)
 		set_cpu_cap(c, X86_FEATURE_P3);
 #endif
 
-	if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
-		/*
-		 * let's use the legacy cpuid vector 0x1 and 0x4 for topology
-		 * detection.
-		 */
-		c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
-		detect_ht(c);
-#endif
-	}
-
 	/* Work around errata */
 	srat_detect_node(c);
 
diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index a952e9c..9c8f739 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c)
 #endif
 	}
 
+#ifdef CONFIG_X86_HT
+	/*
+	 * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
+	 * turns means that the only possibility is SMT (as indicated in
+	 * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
+	 * that SMT shares all caches, we can unconditionally set cpu_llc_id to
+	 * c->phys_proc_id.
+	 */
+	if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
+		per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+#endif
+
 	c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
 
 	return l2;
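
The fallback added to init_intel_cacheinfo() above can be read in isolation as the sketch below. It is a user-space illustration only, not kernel code: struct cpu_topo, SKETCH_BAD_ID and llc_id_fallback() are hypothetical names standing in for the per-CPU cpu_llc_id variable, the BAD_APICID sentinel and the two added lines, and the sentinel value is an assumption chosen for the example.

```c
#include <stdint.h>

/* Hypothetical sentinel meaning "LLC id not detected"; stands in for BAD_APICID. */
#define SKETCH_BAD_ID 0xFFFFu

/* Hypothetical per-CPU record; the kernel keeps these values elsewhere. */
struct cpu_topo {
	uint16_t phys_proc_id;	/* physical package the CPU belongs to */
	uint16_t llc_id;	/* id of the last-level cache domain */
};

/*
 * If cache enumeration left llc_id unset, assume all threads of the
 * package share every cache and reuse the package id as the LLC id.
 * An llc_id that was already detected is left untouched.
 */
static void llc_id_fallback(struct cpu_topo *t)
{
	if (t->llc_id == SKETCH_BAD_ID)
		t->llc_id = t->phys_proc_id;
}
```

With this fallback, two SMT siblings in the same package end up with the same LLC id, so the cache-level span covers at least the SMT span.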


* Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c
  2014-07-23 15:11                                               ` Peter Zijlstra
  2014-07-23 15:12                                                 ` H. Peter Anvin
@ 2014-07-24  1:45                                                 ` Bruno Wolff III
  1 sibling, 0 replies; 44+ messages in thread
From: Bruno Wolff III @ 2014-07-24  1:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Josh Boyer, mingo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner

On Wed, Jul 23, 2014 at 17:11:40 +0200,
  Peter Zijlstra <peterz@infradead.org> wrote:
>
>OK, so that's become the below patch. I'll feed it to Ingo if that's OK
>with hpa.

I tested this patch on 3 machines and it continued to fix the one that 
was broken and didn't seem to break anything on the two that weren't 
broken.

Thanks for developing this patch so quickly.


* [tip:sched/core] sched: Robustify topology setup
  2014-07-22  9:47                                   ` Peter Zijlstra
                                                       ` (2 preceding siblings ...)
  2014-07-22 12:57                                     ` Bruno Wolff III
@ 2014-07-28  8:28                                     ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-07-28  8:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, dietmar.eggemann, peterz,
	jwboyer, bruno, tglx

Commit-ID:  6ae72dff37596f736b795426306404f0793e4b1b
Gitweb:     http://git.kernel.org/tip/6ae72dff37596f736b795426306404f0793e4b1b
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 22 Jul 2014 11:47:40 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 28 Jul 2014 10:04:13 +0200

sched: Robustify topology setup

We hard assume that higher topology levels are supersets of lower
levels.

Detect, warn and try to fixup when we encounter this violated.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Bruno Wolff III <bruno@wolff.to>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140722094740.GJ12054@laptop.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 415ab02..2a36a74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6481,6 +6481,20 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);
 		child->parent = sd;
 		sd->child = child;
+
+		if (!cpumask_subset(sched_domain_span(child),
+				    sched_domain_span(sd))) {
+			pr_err("BUG: arch topology borken\n");
+#ifdef CONFIG_SCHED_DEBUG
+			pr_err("     the %s domain not a subset of the %s domain\n",
+					child->name, sd->name);
+#endif
+			/* Fixup, ensure @sd has at least @child cpus. */
+			cpumask_or(sched_domain_span(sd),
+				   sched_domain_span(sd),
+				   sched_domain_span(child));
+		}
+
 	}
 	set_domain_attribute(sd, attr);
 

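The check added to build_sched_domain() above can be sketched in user space with plain bitmasks. This is only an illustration of the subset test and the OR fixup, assuming at most 64 CPUs; cpu_span_t, span_is_subset() and fixup_parent_span() are hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_span_t;

/* True if every CPU set in @child is also set in @parent. */
static bool span_is_subset(cpu_span_t child, cpu_span_t parent)
{
	return (child & ~parent) == 0;
}

/*
 * Return the parent span, widened if needed so that it contains the
 * child span. *fixed reports whether a repair was necessary.
 */
static cpu_span_t fixup_parent_span(cpu_span_t child, cpu_span_t parent,
				    bool *fixed)
{
	*fixed = !span_is_subset(child, parent);
	if (*fixed) {
		fprintf(stderr,
			"topology: child span %#llx not a subset of parent span %#llx\n",
			(unsigned long long)child,
			(unsigned long long)parent);
		/* Fixup: make the parent cover at least the child's CPUs. */
		parent |= child;
	}
	return parent;
}
```

For example, a child span of CPUs 0-1 (0x3) under a parent span of CPU 0 only (0x1) is reported and the parent is widened to 0x3; a parent that already contains the child is returned unchanged.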

end of thread, other threads:[~2014-07-28  8:30 UTC | newest]

Thread overview: 44+ messages
2014-07-16 14:55 Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Bruno Wolff III
2014-07-16 15:17 ` Josh Boyer
2014-07-16 19:17   ` Dietmar Eggemann
2014-07-16 19:54     ` Bruno Wolff III
2014-07-16 23:18       ` Dietmar Eggemann
2014-07-17  3:09         ` Bruno Wolff III
2014-07-17  8:57           ` Dietmar Eggemann
2014-07-17  9:04             ` Peter Zijlstra
2014-07-17 11:23               ` Dietmar Eggemann
2014-07-17 12:35                 ` Peter Zijlstra
2014-07-18  5:34                   ` Bruno Wolff III
2014-07-18  9:28                     ` Dietmar Eggemann
2014-07-18 12:09                       ` Bruno Wolff III
2014-07-18 10:16                     ` Peter Zijlstra
2014-07-18 13:01                       ` Bruno Wolff III
2014-07-18 14:16                         ` Dietmar Eggemann
2014-07-18 14:16                         ` Peter Zijlstra
2014-07-18 14:50                           ` Peter Zijlstra
2014-07-18 16:16                             ` Peter Zijlstra
2014-07-21 16:35                               ` Bruno Wolff III
2014-07-21 16:52                                 ` Peter Zijlstra
2014-07-22  9:47                                   ` Peter Zijlstra
2014-07-22 10:38                                     ` Peter Zijlstra
2014-07-22 12:10                                       ` Bruno Wolff III
2014-07-22 13:03                                         ` Peter Zijlstra
2014-07-22 13:26                                           ` Peter Zijlstra
2014-07-22 13:35                                             ` Peter Zijlstra
2014-07-22 14:09                                               ` Bruno Wolff III
2014-07-22 14:18                                                 ` Peter Zijlstra
2014-07-23  1:37                                                   ` Bruno Wolff III
2014-07-23  6:51                                                     ` Peter Zijlstra
2014-07-22 17:05                                               ` H. Peter Anvin
2014-07-23 15:11                                               ` Peter Zijlstra
2014-07-23 15:12                                                 ` H. Peter Anvin
2014-07-24  1:45                                                 ` Bruno Wolff III
2014-07-23 15:39                                               ` [tip:x86/urgent] x86, cpu: Fix cache topology for early P4-SMT tip-bot for Peter Zijlstra
2014-07-22 12:12                                     ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Dietmar Eggemann
2014-07-22 12:57                                     ` Bruno Wolff III
2014-07-28  8:28                                     ` [tip:sched/core] sched: Robustify topology setup tip-bot for Peter Zijlstra
2014-07-17 16:36             ` Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c Bruno Wolff III
2014-07-17 18:43               ` Dietmar Eggemann
2014-07-17 18:54                 ` Bruno Wolff III
2014-07-17  4:21         ` Bruno Wolff III
2014-07-17  4:28     ` Bruno Wolff III