All of lore.kernel.org
 help / color / mirror / Atom feed
* 50 Watt idle power regression bisected to Linux-3.10
@ 2013-12-07  8:00 Len Brown
  2013-12-07  8:39 ` Mike Galbraith
                   ` (3 more replies)
  0 siblings, 4 replies; 96+ messages in thread
From: Len Brown @ 2013-12-07  8:00 UTC (permalink / raw)
  To: tglx, Peter Zijlstra; +Cc: Linux PM list, linux-kernel, Jeremy Eder, x86

Hello Thomas,

An idle WSM-EX box (40 Xeon cores) runs 50 Watts hotter after this patch:

commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Mar 21 22:50:03 2013 +0100

    x86: Use generic idle loop

ie. the commit before this patch (aba92c9e2cf3042bf6efc68fa2e4235ba01bf499)
runs at 50 watts less, as do Linux 3.7, 3.8 and 3.9.

The difference is that the good kernels allow about 98% residence
in the package C6 state, while the bad kernel is so noisy that it
gets into pc6 0% of the time.
(indeed, even core C6 is reduced to about 50% from over 99%)

No, Linux-3.13-rc3 does not fix this issue, even though it contains
the following patch, claiming to address an issue with the commit above:

commit ea8117478918a4734586d35ff530721b682425be
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Sep 11 12:43:13 2013 +0200

    sched, idle: Fix the idle polling state logic

    Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
    regressed several workloads and caused excessive reschedule
    interrupts.

    The patch in question failed to notice that the x86 code had an
    inverted sense of the polling state versus the new generic code (x86:
    default polling, generic: default !polling).

    Fix the two prominent x86 mwait based idle drivers and introduce a few
    new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
    usage).

    Also switch the idle routines to using tif_need_resched() which is an
    immediate TIF_NEED_RESCHED test as opposed to need_resched which will
    end up being slightly different.

    Reported-by: Mike Galbraith <bitbucket@online.de>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Cc: lenb@kernel.org
    Cc: tglx@linutronix.de
    Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

How shall we proceed?

thanks,
-Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
@ 2013-12-07  8:39 ` Mike Galbraith
  2013-12-07 16:01   ` Len Brown
  2013-12-07 12:54 ` Thomas Gleixner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-07  8:39 UTC (permalink / raw)
  To: Len Brown
  Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86

On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote:

> No, Linux-3.13-rc3 does not fix this issue, even though it contains
> the following patch, claiming to address an issue with the commit above:
> 
> commit ea8117478918a4734586d35ff530721b682425be
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Wed Sep 11 12:43:13 2013 +0200
> 
>     sched, idle: Fix the idle polling state logic
> 
>     Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
>     regressed several workloads and caused excessive reschedule
>     interrupts.

It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
beloved old box doesn't play hints game, so it continues to flog itself.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
  2013-12-07  8:39 ` Mike Galbraith
@ 2013-12-07 12:54 ` Thomas Gleixner
  2013-12-08  4:57 ` Mike Galbraith
  2013-12-18 21:44   ` Len Brown
  3 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2013-12-07 12:54 UTC (permalink / raw)
  To: Len Brown; +Cc: Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86

Len,

On Sat, 7 Dec 2013, Len Brown wrote:
>
> How shall we proceed?

Can you please gather a function trace so I can see what the system is
doing? Preferrably with 3.13-rc3 so we have Peters fix included.

Please send it offlist or put it somehwere for download.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:39 ` Mike Galbraith
@ 2013-12-07 16:01   ` Len Brown
  2013-12-07 16:45     ` Len Brown
  0 siblings, 1 reply; 96+ messages in thread
From: Len Brown @ 2013-12-07 16:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86

On Sat, Dec 7, 2013 at 3:39 AM, Mike Galbraith <bitbucket@online.de> wrote:
> On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote:
>
>> No, Linux-3.13-rc3 does not fix this issue, even though it contains
>> the following patch, claiming to address an issue with the commit above:
>>
>> commit ea8117478918a4734586d35ff530721b682425be
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Wed Sep 11 12:43:13 2013 +0200
>>
>>     sched, idle: Fix the idle polling state logic
>>
>>     Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
>>     regressed several workloads and caused excessive reschedule
>>     interrupts.
>
> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
> beloved old box doesn't play hints game, so it continues to flog itself.

Hi Mike,
What C-states are available on your Q6600 box?
Is it running HALT only?
(if so, I would expect C1 MWAIT be about the same, if it still existed)

But I would expect the box to have deeper C-states than just C1, so...

Can you show me the output of
dmesg | grep idle
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*

Please elaborate on exactly what symptom you see.
Do you have a power meter?

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07 16:01   ` Len Brown
@ 2013-12-07 16:45     ` Len Brown
  2013-12-07 19:17       ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Len Brown @ 2013-12-07 16:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86

>> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
>> beloved old box doesn't play hints game, so it continues to flog itself.


Thanks for pointing this out, Mike!

A Q6600 is a Kentsfield.  I dug one of those up.
Indeed, the only idle capabilities it has are  HALT
and old style MWAIT, and the latter is much more effective.
running 3.8 it idles at 75 watts.
running 3.8 with idle=nomwait it idles at 100 watts,
which is what it will do with 3.9 and later due to the patch below.

commit 69fb3676df3329a7142803bb3502fa59dc0db2e3
Author: Len Brown <len.brown@intel.com>
Date:   Sun Feb 10 01:38:39 2013 -0500

    x86 idle: remove mwait_idle() and "idle=mwait" cmdline param

Kentsfield proves that patch was based on a fault assumption.
Sweet box in its day, ECC memory and everything -- probably still
a fair number of them running...

Plus, I've found another machine that depends on having an idle=mwait
idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in
https://bugzilla.kernel.org/show_bug.cgi?id=60770)

So it looks like I need to (also) restore the simple idle=mwait idle loop
to make some machines happy.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07 16:45     ` Len Brown
@ 2013-12-07 19:17       ` Mike Galbraith
  2013-12-10 11:41         ` Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-07 19:17 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86

On Sat, 2013-12-07 at 11:45 -0500, Len Brown wrote: 
> >> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
> >> beloved old box doesn't play hints game, so it continues to flog itself.
> 
> 
> Thanks for pointing this out, Mike!
> 
> A Q6600 is a Kentsfield.  I dug one of those up.
> Indeed, the only idle capabilities it has are  HALT
> and old style MWAIT, and the latter is much more effective.
> running 3.8 it idles at 75 watts.
> running 3.8 with idle=nomwait it idles at 100 watts,
> which is what it will do with 3.9 and later due to the patch below.
> 
> commit 69fb3676df3329a7142803bb3502fa59dc0db2e3
> Author: Len Brown <len.brown@intel.com>
> Date:   Sun Feb 10 01:38:39 2013 -0500
> 
>     x86 idle: remove mwait_idle() and "idle=mwait" cmdline param
> 
> Kentsfield proves that patch was based on a fault assumption.
> Sweet box in its day, ECC memory and everything -- probably still
> a fair number of them running...
> 
> Plus, I've found another machine that depends on having an idle=mwait
> idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in
> https://bugzilla.kernel.org/show_bug.cgi?id=60770)
> 
> So it looks like I need to (also) restore the simple idle=mwait idle loop
> to make some machines happy.

Cool, box will definitely be happier.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
  2013-12-07  8:39 ` Mike Galbraith
  2013-12-07 12:54 ` Thomas Gleixner
@ 2013-12-08  4:57 ` Mike Galbraith
  2013-12-08 20:40   ` Len Brown
  2013-12-18 21:44   ` Len Brown
  3 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-08  4:57 UTC (permalink / raw)
  To: Len Brown
  Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86

On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote: 
> Hello Thomas,
> 
> An idle WSM-EX box (40 Xeon cores) runs 50 Watts hotter after this patch:
> 
> commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Thu Mar 21 22:50:03 2013 +0100
> 
>     x86: Use generic idle loop
> 
> ie. the commit before this patch (aba92c9e2cf3042bf6efc68fa2e4235ba01bf499)
> runs at 50 watts less, as do Linux 3.7, 3.8 and 3.9.
> 
> The difference is that the good kernels allow about 98% residence
> in the package C6 state, while the bad kernel is so noisy that it
> gets into pc6 0% of the time.
> (indeed, even core C6 is reduced to about 50% from over 99%)

FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per
powertop, deepest state it has.  It took a big hit from 69a37bea as well
as 7d1a9417, but with revert of 69a37bea, and addition of ea811747, it's
fairly close to 3.7 in fastpath weight, and greenery _seems_ fine.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-08  4:57 ` Mike Galbraith
@ 2013-12-08 20:40   ` Len Brown
  2013-12-09  3:16     ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Len Brown @ 2013-12-08 20:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86

> FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per
> powertop, deepest state it has.

Please run turbostat, which always describes what the hardware does.
Depending on the version of powertop and the hardware involved,
it may be describing that the OS requested -- which is not always the same.

The hardware reserves the right to demote any requested C-state to a
shallower one.  Also, it is the hardware decision whether to promote
core c-states requested by the OS into power saving package C-states.

cheers,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-08 20:40   ` Len Brown
@ 2013-12-09  3:16     ` Mike Galbraith
  2013-12-10  5:17       ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-09  3:16 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86

On Sun, 2013-12-08 at 15:40 -0500, Len Brown wrote: 
> > FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per
> > powertop, deepest state it has.
> 
> Please run turbostat, which always describes what the hardware does.
> Depending on the version of powertop and the hardware involved,
> it may be describing that the OS requested -- which is not always the same.
> 
> The hardware reserves the right to demote any requested C-state to a
> shallower one.  Also, it is the hardware decision whether to promote
> core c-states requested by the OS into power saving package C-states.

Hrm, turbostat says rtbox has a C6, and is using it.  Ah, BIOS says ACPI
C3 == Intel C6, so I suppose Intel C3 == ACPI C2.  Whatever, both tools
seem to agree that these particular boxen (x3550/E5620 and DL980/X7560)
are snoozing peacefully.

rtbox:~ # turbostat -i 60
cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
          0.18 1.73 2.40   0   0.19   0.02  99.61   25   0.08  92.74
  0   0   0.47 1.60 2.40   0   0.62   0.08  98.83   18   0.08  92.74
  1   1   0.03 1.61 2.40   0   0.05   0.00  99.92   25
  9   2   0.05 1.73 2.40   0   0.06   0.00  99.89   16
 10   3   0.15 2.13 2.40   0   0.05   0.01  99.79   20

vogelweide:~/:[0]# turbostat -i 60
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.06 2.03 2.26   0   0.76  99.18   0.00   49  94.97   0.00
 0   0   0   0.40 1.57 2.26   3  25.28  74.32   0.00   49  64.24   0.00
 0   1   1   0.08 2.16 2.26   3  19.28  80.64   0.00   48
 0   2   2   0.09 2.20 2.26   3   0.00  99.91   0.00   49
 0   3   3   0.09 2.20 2.26   3   0.00  99.91   0.00   49
 0   8   4   0.08 2.21 2.26   3   0.00  99.91   0.00   45
 0   9   5   0.08 2.22 2.26   3   0.00  99.92   0.00   46
 0  10   6   0.08 2.23 2.26   3   0.00  99.92   0.00   45
 0  11   7   0.08 2.23 2.26   3   0.00  99.92   0.00   45
 1   0   8   0.10 1.94 2.26   3   0.01  99.89   0.00   38  99.88   0.00
 1   1   9   0.08 2.11 2.26   3   0.00  99.92   0.00   39
 1   2  10   0.08 2.12 2.26   3   0.00  99.92   0.00   41
 1   3  11   0.07 2.14 2.26   3   0.00  99.92   0.00   41
 1   8  12   0.08 2.15 2.26   3   0.00  99.92   0.00   42
 1   9  13   0.07 2.17 2.26   3   0.00  99.92   0.00   43
 1  10  14   0.07 2.19 2.26   3   0.00  99.92   0.00   41
 1  11  15   0.07 2.20 2.26   3   0.00  99.93   0.00   41
 2   0  16   0.09 1.93 2.26   3   0.01  99.90   0.00   41  99.89   0.00
 2   1  17   0.07 2.09 2.26   3   0.00  99.92   0.00   40
 2   2  18   0.07 2.11 2.26   3   0.00  99.93   0.00   41
 2   3  19   0.07 2.12 2.26   3   0.00  99.93   0.00   42
 2   8  20   0.07 2.14 2.26   3   0.00  99.93   0.00   41
 2   9  21   0.07 2.16 2.26   3   0.00  99.93   0.00   41
 2  10  22   0.07 2.17 2.26   3   0.00  99.93   0.00   37
 2  11  23   0.07 2.20 2.26   3   0.00  99.93   0.00   39
 3   0  24   0.08 1.91 2.26   3   0.01  99.91   0.00   41  99.88   0.00
 3   1  25   0.06 2.07 2.26   3   0.00  99.93   0.00   42
 3   2  26   0.07 2.09 2.26   3   0.02  99.92   0.00   41
 3   3  27   0.06 2.11 2.26   3   0.00  99.94   0.00   42
 3   8  28   0.06 2.12 2.26   3   0.00  99.94   0.00   47
 3   9  29   0.06 2.15 2.26   3   0.00  99.94   0.00   46
 3  10  30   0.06 2.18 2.26   3   0.00  99.94   0.00   44
 3  11  31   0.06 2.19 2.26   3   0.02  99.92   0.00   46
 4   0  32   0.07 1.92 2.26   3   0.01  99.92   0.00   36  96.25   0.00
 4   1  33   0.06 2.07 2.26   3   0.00  99.94   0.00   36
 4   2  34   0.06 2.08 2.26   3   0.00  99.94   0.00   39
 4   3  35   0.05 2.11 2.26   3   0.00  99.94   0.00   39
 4   8  36   0.05 2.12 2.26   3   3.67  96.28   0.00   40
 4   9  37   0.05 2.15 2.26   3   0.00  99.94   0.00   42
 4  10  38   0.05 2.18 2.26   3   0.00  99.95   0.00   39
 4  11  39   0.05 2.20 2.26   3   0.00  99.95   0.00   40
 5   0  40   0.07 1.73 2.26   3   0.02  99.91   0.00   35  99.90   0.00
 5   1  41   0.05 2.00 2.26   3   0.00  99.95   0.00   36
 5   2  42   0.05 2.04 2.26   3   0.00  99.95   0.00   39
 5   3  43   0.04 2.04 2.26   3   0.00  99.95   0.00   38
 5   8  44   0.05 2.09 2.26   3   0.00  99.95   0.00   40
 5   9  45   0.04 2.10 2.26   3   0.00  99.95   0.00   41
 5  10  46   0.04 2.13 2.26   3   0.00  99.96   0.00   37
 5  11  47   0.04 2.17 2.26   3   0.00  99.96   0.00   37
 6   0  48   0.08 1.61 2.26   3   0.02  99.90   0.00   35  99.87   0.00
 6   1  49   0.05 1.79 2.26   3   0.01  99.94   0.00   36
 6   2  50   0.04 1.98 2.26   3   0.00  99.96   0.00   36
 6   3  51   0.04 2.00 2.26   3   0.00  99.96   0.00   37
 6   8  52   0.04 2.01 2.26   3   0.00  99.96   0.00   38
 6   9  53   0.03 2.07 2.26   3   0.00  99.96   0.00   36
 6  10  54   0.03 2.08 2.26   3   0.00  99.96   0.00   36
 6  11  55   0.03 2.14 2.26   3   0.00  99.96   0.00   36
 7   0  56   0.05 1.69 2.26   3   0.01  99.94   0.00   41  99.89   0.00
 7   1  57   0.03 1.87 2.26   3   0.00  99.97   0.00   41
 7   2  58   0.03 1.89 2.26   3   0.00  99.97   0.00   44
 7   3  59   0.03 1.93 2.26   3   0.00  99.97   0.00   44
 7   8  60   0.03 1.97 2.26   3   0.05  99.92   0.00   41
 7   9  61   0.03 2.02 2.26   3   0.00  99.97   0.00   38
 7  10  62   0.02 2.06 2.26   3   0.00  99.97   0.00   39
 7  11  63   0.03 2.12 2.26   3   0.00  99.97   0.00   36


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-09  3:16     ` Mike Galbraith
@ 2013-12-10  5:17       ` Mike Galbraith
  2013-12-10 11:45         ` Ingo Molnar
  2013-12-10 14:29         ` Thomas Gleixner
  0 siblings, 2 replies; 96+ messages in thread
From: Mike Galbraith @ 2013-12-10  5:17 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86

Hi Len,

I'm unable to reproduce those DL980 results.  I updated the kernel and
config yesterday, and happened to run turbostat again.. and the box was
nowhere near as quiet.  I ended up spending all day futzing with the
darn thing, checking various config/kernel combos, and none produced the
previous result.

ATM, running the same exact updated kernel as the x3550 is running
(still happily) with only a couple needed drivers added:

vogelweide:~/:[130]# turbostat -P -i 60
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
 0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
 1   0   8   0.04 2.02 2.26   5  63.10  36.86   0.00   41  13.31   0.00
 2   0  16   0.04 1.90 2.26   5  35.70  64.25   0.00   43  37.88   0.00
 3   0  24   0.03 2.08 2.26   5  39.78  60.19   0.00   42  29.00   0.00
 4   0  32   0.03 1.95 2.26   5  14.64  85.33   0.00   37  65.00   0.00
 5   0  40   0.03 1.95 2.26   5  15.96  84.02   0.00   36  74.34   0.00
 6   0  48   0.02 1.99 2.26   5  36.97  63.01   0.00   37  40.20   0.00
 7   0  56   0.02 2.08 2.26   5  57.54  42.44   0.00   44   0.08   0.00
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.03 2.08 2.26   0  31.24  68.73   0.00   53  44.26   0.00
 0   0   0   0.25 1.76 2.26   8  52.96  46.78   0.00   51  31.99   0.00
 1   0   8   0.03 1.96 2.26   8  50.69  49.28   0.00   41  30.80   0.00
 2   0  16   0.04 1.91 2.26   8  36.10  63.85   0.00   42  44.37   0.00
 3   0  24   0.03 1.96 2.26   8  24.15  75.82   0.00   42  59.14   0.00
 4   0  32   0.03 1.94 2.26   8  14.68  85.29   0.00   37  71.64   0.00
 5   0  40   0.03 1.94 2.26   8  16.01  83.96   0.00   36  78.62   0.00
 6   0  48   0.02 2.18 2.26   8  46.79  53.18   0.00   36  11.05   0.00
 7   0  56   0.02 2.01 2.26   8  50.84  49.14   0.00   45  26.50   0.00
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.03 2.10 2.26   0  30.88  69.09   0.00   51  48.09   0.00
 0   0   0   0.23 1.84 2.26   8  51.98  47.79   0.00   50  31.06   0.00
 1   0   8   0.04 2.05 2.26   8  59.44  40.52   0.00   41  21.86   0.00
 2   0  16   0.04 1.73 2.26   8  22.87  77.09   0.00   41  62.26   0.00
 3   0  24   0.03 1.93 2.26   8   9.83  90.14   0.00   41  77.33   0.00
 4   0  32   0.03 1.90 2.26   8   8.66  91.31   0.00   37  89.16   0.00
 5   0  40   0.03 2.05 2.26   8  27.30  72.67   0.00   35  50.62   0.00
 6   0  48   0.02 2.07 2.26   8  44.81  55.17   0.00   36  18.97   0.00
 7   0  56   0.02 2.00 2.26   8  48.06  51.91   0.00   42  33.44   0.00

That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
dunno, kernel called itself master.  Whatever, I don't like test bogons,
and this is one, so I thought I should let you know.  DL980 refuses to
come close to the previous result.

'nuff of that, off to the slave pits before yet another day evaporates.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07 19:17       ` Mike Galbraith
@ 2013-12-10 11:41         ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-10 11:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel, Jeremy Eder, x86


* Mike Galbraith <bitbucket@online.de> wrote:

> On Sat, 2013-12-07 at 11:45 -0500, Len Brown wrote: 
> > >> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
> > >> beloved old box doesn't play hints game, so it continues to flog itself.
> > 
> > 
> > Thanks for pointing this out, Mike!
> > 
> > A Q6600 is a Kentsfield.  I dug one of those up.
> > Indeed, the only idle capabilities it has are  HALT
> > and old style MWAIT, and the latter is much more effective.
> > running 3.8 it idles at 75 watts.
> > running 3.8 with idle=nomwait it idles at 100 watts,
> > which is what it will do with 3.9 and later due to the patch below.
> > 
> > commit 69fb3676df3329a7142803bb3502fa59dc0db2e3
> > Author: Len Brown <len.brown@intel.com>
> > Date:   Sun Feb 10 01:38:39 2013 -0500
> > 
> >     x86 idle: remove mwait_idle() and "idle=mwait" cmdline param
> > 
> > Kentsfield proves that patch was based on a fault assumption.
> > Sweet box in its day, ECC memory and everything -- probably still
> > a fair number of them running...
> > 
> > Plus, I've found another machine that depends on having an idle=mwait
> > idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in
> > https://bugzilla.kernel.org/show_bug.cgi?id=60770)
> > 
> > So it looks like I need to (also) restore the simple idle=mwait idle loop
> > to make some machines happy.
> 
> Cool, box will definitely be happier.

I assume old-style MWAIT will be activated automatically on such 
boxes, there's no need to pass in idle=mwait on the boot command line, 
correct?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10  5:17       ` Mike Galbraith
@ 2013-12-10 11:45         ` Ingo Molnar
  2013-12-10 14:29         ` Thomas Gleixner
  1 sibling, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-10 11:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel, Jeremy Eder, x86


* Mike Galbraith <bitbucket@online.de> wrote:

> Hi Len,
> 
> I'm unable to reproduce those DL980 results.  I updated the kernel and
> config yesterday, and happened to run turbostat again.. and the box was
> nowhere near as quiet.  I ended up spending all day futzing with the
> darn thing, checking various config/kernel combos, and none produced the
> previous result.
> 
> ATM, running the same exact updated kernel as the x3550 is running
> (still happily) with only a couple needed drivers added:
> 
> vogelweide:~/:[130]# turbostat -P -i 60
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
>  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
>  1   0   8   0.04 2.02 2.26   5  63.10  36.86   0.00   41  13.31   0.00
>  2   0  16   0.04 1.90 2.26   5  35.70  64.25   0.00   43  37.88   0.00
>  3   0  24   0.03 2.08 2.26   5  39.78  60.19   0.00   42  29.00   0.00
>  4   0  32   0.03 1.95 2.26   5  14.64  85.33   0.00   37  65.00   0.00
>  5   0  40   0.03 1.95 2.26   5  15.96  84.02   0.00   36  74.34   0.00
>  6   0  48   0.02 1.99 2.26   5  36.97  63.01   0.00   37  40.20   0.00
>  7   0  56   0.02 2.08 2.26   5  57.54  42.44   0.00   44   0.08   0.00
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.03 2.08 2.26   0  31.24  68.73   0.00   53  44.26   0.00
>  0   0   0   0.25 1.76 2.26   8  52.96  46.78   0.00   51  31.99   0.00
>  1   0   8   0.03 1.96 2.26   8  50.69  49.28   0.00   41  30.80   0.00
>  2   0  16   0.04 1.91 2.26   8  36.10  63.85   0.00   42  44.37   0.00
>  3   0  24   0.03 1.96 2.26   8  24.15  75.82   0.00   42  59.14   0.00
>  4   0  32   0.03 1.94 2.26   8  14.68  85.29   0.00   37  71.64   0.00
>  5   0  40   0.03 1.94 2.26   8  16.01  83.96   0.00   36  78.62   0.00
>  6   0  48   0.02 2.18 2.26   8  46.79  53.18   0.00   36  11.05   0.00
>  7   0  56   0.02 2.01 2.26   8  50.84  49.14   0.00   45  26.50   0.00
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.03 2.10 2.26   0  30.88  69.09   0.00   51  48.09   0.00
>  0   0   0   0.23 1.84 2.26   8  51.98  47.79   0.00   50  31.06   0.00
>  1   0   8   0.04 2.05 2.26   8  59.44  40.52   0.00   41  21.86   0.00
>  2   0  16   0.04 1.73 2.26   8  22.87  77.09   0.00   41  62.26   0.00
>  3   0  24   0.03 1.93 2.26   8   9.83  90.14   0.00   41  77.33   0.00
>  4   0  32   0.03 1.90 2.26   8   8.66  91.31   0.00   37  89.16   0.00
>  5   0  40   0.03 2.05 2.26   8  27.30  72.67   0.00   35  50.62   0.00
>  6   0  48   0.02 2.07 2.26   8  44.81  55.17   0.00   36  18.97   0.00
>  7   0  56   0.02 2.00 2.26   8  48.06  51.91   0.00   42  33.44   0.00
> 
> That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What 
> happened, dunno, kernel called itself master.  Whatever, I don't 
> like test bogons, and this is one, so I thought I should let you 
> know.  DL980 refuses to come close to the previous result.
> 
> 'nuff of that, off to the slave pits before yet another day 
> evaporates.

An ftrace function trace of the anomaly would be awfully useful.

(or a trace generated of a dozen strategically placed trace_printk()s 
debug lines tracing the guts of cpuidle.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10  5:17       ` Mike Galbraith
  2013-12-10 11:45         ` Ingo Molnar
@ 2013-12-10 14:29         ` Thomas Gleixner
  2013-12-10 15:06           ` Ingo Molnar
  2013-12-11  2:05           ` Thomas Gleixner
  1 sibling, 2 replies; 96+ messages in thread
From: Thomas Gleixner @ 2013-12-10 14:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86

On Tue, 10 Dec 2013, Mike Galbraith wrote:
> vogelweide:~/:[130]# turbostat -P -i 60
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
>  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00

...
 
> That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
> dunno, kernel called itself master.  Whatever, I don't like test bogons,
> and this is one, so I thought I should let you know.  DL980 refuses to
> come close to the previous result.

Hmm, that's very similar to the issue Len is seing on his quad socket WM.

But I can't see it on my dual socket SNB, neither does PeterZ on his
dual socket WM machine. Heisenbugs ....

Can you offline all cpus except core 0, enable all trace events (nop
tracer), run "sleep 100" and stop and grab the trace? Please make sure
that your tracebuffer is big enough to cover the full thing.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10 14:29         ` Thomas Gleixner
@ 2013-12-10 15:06           ` Ingo Molnar
  2013-12-11  2:05           ` Thomas Gleixner
  1 sibling, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-10 15:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel, Jeremy Eder, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, 10 Dec 2013, Mike Galbraith wrote:
> > vogelweide:~/:[130]# turbostat -P -i 60
> > pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
> >              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
> >  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
> 
> ...
>  
> > That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
> > dunno, kernel called itself master.  Whatever, I don't like test bogons,
> > and this is one, so I thought I should let you know.  DL980 refuses to
> > come close to the previous result.
> 
> Hmm, that's very similar to the issue Len is seing on his quad 
> socket WM.
> 
> But I can't see it on my dual socket SNB, neither does PeterZ on his 
> dual socket WM machine. Heisenbugs ....
> 
> Can you offline all cpus except core 0, enable all trace events (nop 
> tracer), run "sleep 100" and stop and grab the trace? Please make 
> sure that your tracebuffer is big enough to cover the full thing.

I guess because we are facing a Heisenbug it might be prudent to check 
that the bug is still present with all those cpus offlined and with 
tracing enabled :-/

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10 14:29         ` Thomas Gleixner
  2013-12-10 15:06           ` Ingo Molnar
@ 2013-12-11  2:05           ` Thomas Gleixner
  2013-12-11  3:21             ` Mike Galbraith
  1 sibling, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2013-12-11  2:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86, Borislav Petkov

On Tue, 10 Dec 2013, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Mike Galbraith wrote:
> > vogelweide:~/:[130]# turbostat -P -i 60
> > pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
> >              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
> >  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
> 
> ...
>  
> > That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
> > dunno, kernel called itself master.  Whatever, I don't like test bogons,
> > and this is one, so I thought I should let you know.  DL980 refuses to
> > come close to the previous result.
> 
> Hmm, that's very similar to the issue Len is seing on his quad socket WM.
> 
> But I can't see it on my dual socket SNB, neither does PeterZ on his
> dual socket WM machine. Heisenbugs ....
> 
> Can you offline all cpus except core 0, enable all trace events (nop
> tracer), run "sleep 100" and stop and grab the trace? Please make sure
> that your tracebuffer is big enough to cover the full thing.

So Boris found a quad socket WM which shows the issue and he also
confirmed that dual socket ones cannot reproduce it...

The trace Boris provided shows similar bogosity as you and Len are
seing. Decoding it shows:

Sleep time	state=1	state=2	state=3	state=4
  0 -  49 ms:	202	  6	 32	948
 50 -  99 ms:	 15	  0	  0	 66
100 - 149 ms:	  3	  0	  0	 12
150 - 199 ms:	  1	  0	  0	  0

So we actually sleep for more than 150ms in C1!?! Lets look at the
trace:

          <idle>-0     [000] d...  1977.728444: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728446: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728448: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728450: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728452: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728453: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728455: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728456: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728458: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728460: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728462: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728463: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728465: cpu_idle: state=1 cpu_id=0
          <idle>-0     [000] d.h.  1977.891539: local_timer_entry: vector=239

So what happens is that SIX consecutive calls into the idle machinery
with state=4 return right away. That means either need_resched() is
set, which is nowhere indicated by the trace and would be causing a
reschedule in between. And after that the idle governor decides to go
into state=1 where we stay for a whopping 163ms.

Now I wanted to analyze that a bit more and asked Boris to apply the
debug patch below.

Drumroll, then laugther and then banging the head against the wall:

Sleep time	state=1	state=2	state=3	state=4
  0 -  49 ms:	  8	  3	112	2849
 50 -  99 ms:	  0	  0	  0 	 419
100 - 149 ms:	  0	  0	  0	  84
150 - 199 ms:	  0	  0	  0	  21
200 - 249 ms:	  0	  0	  0	   3

Also turbostat confirms that the machine is staying in its deepest
power state as much as it can.

I neither can observe similar massive bouncing in and out of a state
as in the above trace. There are a few lonely ones, but they do not
do any damage.

Now what does this patch change?

Nothing in the logic, it merily changes the timing of that idle
code by adding trace_printks.

And what did the x86 -> generic idle code change do ?

 Lets ignore the subtle SMP issues vs. the polling flag which are
 fixed in current mainline and completely irrelevant for an idle UP
 machine. And all the traces are from an UP machine (all cores
 offlined except 0)

 It changed the timing as well.

So the trace addons making the issue go away AND the fact that nobody
can reproduce the issue on a dual socket machine makes me wonder even
more.

Now there is an important data point:

  The problem is timing sensitive and the issue is only exposed on
  quad socket machines.

So I looked at the code pathes once more and the only way we can
bounce out of that is that the mwait in intel_idle() comes back
prematurely.

Now what makes mwait come back early? Especially in a situation where
only a single core is involved in code execution?

Definitely not a change in the idle code, which is relating to UP
exacly ZERO, except the fact that we touch the cpu local nmi watchdog
variable via touch_nmi_watchdog() inside the irq disabled region
instead of the previous irq enabled region.

So what makes mwait come back right away?

Lets look at a snippet of the trace which was taken with the debug
patch applied:

<idle>-0  [000] d... 1747.667184: menu_select: expected_us: 39875 predicted_us: 22697

So we have according to the nohz code 39875 usec idle time ahead. The
governer heuristics make that: 22697 predicted time.

<idle>-0  [000] d... 1747.667187: menu_select: Trying idle state 1 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667188: menu_select: residency 6
<idle>-0  [000] d... 1747.667188: menu_select: exit_latency 3 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667189: menu_select: Trying idle state 2 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667190: menu_select: residency 20
<idle>-0  [000] d... 1747.667191: menu_select: exit_latency 10 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667191: menu_select: Trying idle state 3 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667192: menu_select: residency 80
<idle>-0  [000] d... 1747.667193: menu_select: exit_latency 20 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667194: menu_select: Trying idle state 4 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667194: menu_select: residency 800
<idle>-0  [000] d... 1747.667195: menu_select: exit_latency 200 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667195: cpu_idle: state=4 cpu_id=0

So the governor settles on state=4 which is the deepest one on this machine.

<idle>-0  [000] .... 1747.667197: cpuidle_idle_call: coupled 0: entered state 4

And TWO micro seconds later we are out of that state, but
need_resched() is NOT set and we did not get an interrupt and nothing
could have fiddled with the thread_info->state cacheline simply
because there is only ONE cpu online, i.e. CPU0 the very cpu which
trace we are staring at.

<idle>-0  [000] .... 1747.667198: cpu_idle: state=4294967295 cpu_id=0
<idle>-0  [000] d... 1747.667199: menu_select: expected_us: 39875 predicted_us: 19860

So we still have an expected 39875 usec sleep time ahead, which is
nonsense, but understandable because the nohz idle code did not go
through an update. But of course the immediate return takes the
residence of about 0 usec in state=4 into account and reduces the
predicted_us to 19860 usec

<idle>-0  [000] d... 1747.667201: menu_select: Trying idle state 1 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667201: menu_select: residency 6
<idle>-0  [000] d... 1747.667202: menu_select: exit_latency 3 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667203: menu_select: Trying idle state 2 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667204: menu_select: residency 20
<idle>-0  [000] d... 1747.667204: menu_select: exit_latency 10 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667205: menu_select: Trying idle state 3 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667206: menu_select: residency 80
<idle>-0  [000] d... 1747.667206: menu_select: exit_latency 20 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667207: menu_select: Trying idle state 4 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667208: menu_select: residency 800
<idle>-0  [000] d... 1747.667208: menu_select: exit_latency 200 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667209: cpu_idle: state=4 cpu_id=0

Now here we get lucky for whatever reasons and actually stay in state
4 until a real reason for leaving it happens.

<idle>-0  [000] d.h.  1747.669851: irq_handler_entry: irq=120 name=eth0-TxRx-0

So now the subtle difference between the old x86 idle code and the
generic one is that the new code fiddles with the polling state in
thread_info->state more than necessary. We need to fix that, but
nevertheless this whole mwait thing seems to be broken in some way or
the other and ever worked by chance.

So the difference boils down to:

+ modify(idle->thread_info->status)
  
  <some random code>
 
  mwait(idle->thread_info->status)

So on an UP machine the only CPU which cares about that state is that
very same idle thread which goes to sleep. Now there are a few
questions:

1) Why does writing to that cacheline from the very same cpu just
   before going to sleep cause an issue?

2) Why does it come back less often if the timing is different,
   e.g.when the debug patch is active?

3) Why does adding a mb() instead of the trace printks make the thing
   NOT go away?

   Is it really just timing related? I'm going to ask Boris to replace
   the mb() with a plain udelay(10) tomorrow.

4) Why can't we observe that on dual socket machines?

And for the record, I have analyzed traces before the switch to the
generic idle code which show the same drop in/ drop out issue on an UP
environment:

  <idle>-0     [000] d...   307.256288: cpu_idle: state=4 cpu_id=0
  <idle>-0     [000] ....   307.256289: cpu_idle: state=4294967295 cpu_id=0
  <idle>-0     [000] d...   307.256291: cpu_idle: state=4 cpu_id=0
  <idle>-0     [000] d.h.   307.257337: irq_handler_entry: irq=21 name=ata_piix

Yes, it does not cause the same issue as you can observe with the
current code, but definitely there is something extremly fishy in an
UP situation also there.

So it boils down to the following:

If something fiddles with the cache line which is monitored by mwait()
close before the mwait() happens then mwait() is playing silly
buggers. And that sillyness depends on the number of sockets.

And what's even worse is that we get a regression report for probably
exposing hardware sillyness. Something which should have been known
already - but unfortunately never has been documented as a requirement
for this mwait() stuff to work proper.

Sorry to be blunt about that. If stuff like this happens:

  <idle>-0     [000] d...   307.256288: cpu_idle: state=4 cpu_id=0
  <idle>-0     [000] ....   307.256289: cpu_idle: state=4294967295 cpu_id=0
  <idle>-0     [000] d...   307.256291: cpu_idle: state=4 cpu_id=0

on a kernel prior to that change in an UP scenario, then someone
should have investigated that years ago. But the preferred solution
was obviously to ignore that because it did not do substantial damage.

Now that the damage comes to light you cry murder for the completely
wrong reason.

The change exposes a shortcoming in mwait() rather than the more
obvious performance issue which mwait() is supposed to avoid in the
first place. Mike DID notice it and it got fixed, but of course it did
not heal the now obvious shortcomings of mwait()

Thanks,

	tglx
---

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..b6399af 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -118,7 +118,7 @@ int cpuidle_idle_call(void)
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv;
 	int next_state, entered_state;
-	bool broadcast;
+	bool broadcast, coupled = false;
 
 	if (off || !initialized)
 		return -ENODEV;
@@ -147,15 +147,18 @@ int cpuidle_idle_call(void)
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
 
-	if (cpuidle_state_is_coupled(dev, drv, next_state))
+	if (cpuidle_state_is_coupled(dev, drv, next_state)) {
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
 							    next_state);
-	else
+		coupled = true;
+	} else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);
 
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
 
+	trace_printk("coupled %d: entered state %d\n", coupled, entered_state);
+
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
 	/* give the governor an opportunity to reflect on the outcome */
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index cf7f2f0..9de7ee2 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -309,7 +309,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	data->expected_us =
 		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
 
-
 	data->bucket = which_bucket(data->expected_us);
 
 	multiplier = performance_multiplier();
@@ -330,6 +329,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 					 data->correction_factor[data->bucket],
 					 RESOLUTION * DECAY);
 
+	trace_printk("expected_us: %d predicted_us: %d\n", data->expected_us,
+		     data->predicted_us);
+
 	get_typical_interval(data);
 
 	/*
@@ -349,10 +351,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		struct cpuidle_state *s = &drv->states[i];
 		struct cpuidle_state_usage *su = &dev->states_usage[i];
 
+		trace_printk("Trying idle state %d s->dis %d su->dis %d\n", i,
+			     s->disabled, su->disable);
 		if (s->disabled || su->disable)
 			continue;
+		trace_printk("residency %d\n", s->target_residency);
 		if (s->target_residency > data->predicted_us)
 			continue;
+		trace_printk("exit_latency %d vs. %d multiplier %d\n",
+			     s->exit_latency, latency_req, multiplier);
 		if (s->exit_latency > latency_req)
 			continue;
 		if (s->exit_latency * multiplier > data->predicted_us)


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11  2:05           ` Thomas Gleixner
@ 2013-12-11  3:21             ` Mike Galbraith
  2013-12-11 11:28               ` Thomas Gleixner
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-11  3:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86, Borislav Petkov

Alakazam..

pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.17 2.01 2.26   0   0.02  99.82   0.00   49  99.55   0.00
 0   0   0   0.95 1.45 2.26   2   0.43  98.62   0.00   48  98.48   0.00
 1   0   8   0.24 1.99 2.26   2   0.02  99.75   0.00   38  99.68   0.00
 2   0  16   0.17 1.97 2.26   2   0.02  99.81   0.00   40  99.65   0.00
 3   0  24   0.18 1.92 2.26   2   0.02  99.80   0.00   41  99.68   0.00
 4   0  32   0.18 1.95 2.26   2   0.02  99.80   0.00   36  99.66   0.00
 5   0  40   0.15 1.85 2.26   0   0.03  99.83   0.00   35  99.70   0.00
 6   0  48   0.10 1.83 2.26   0   0.01  99.89   0.00   36  99.79   0.00
 7   0  56   0.10 1.97 2.26   0   0.01  99.89   0.00   43  99.75   0.00

Yup, magical gremlin repellent works on 8 socket DL980 too.

> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a55e68f..b6399af 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -118,7 +118,7 @@ int cpuidle_idle_call(void)
>  	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
>  	struct cpuidle_driver *drv;
>  	int next_state, entered_state;
> -	bool broadcast;
> +	bool broadcast, coupled = false;
>  
>  	if (off || !initialized)
>  		return -ENODEV;
> @@ -147,15 +147,18 @@ int cpuidle_idle_call(void)
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
>  
> -	if (cpuidle_state_is_coupled(dev, drv, next_state))
> +	if (cpuidle_state_is_coupled(dev, drv, next_state)) {
>  		entered_state = cpuidle_enter_state_coupled(dev, drv,
>  							    next_state);
> -	else
> +		coupled = true;
> +	} else
>  		entered_state = cpuidle_enter_state(dev, drv, next_state);
>  
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
>  
> +	trace_printk("coupled %d: entered state %d\n", coupled, entered_state);
> +
>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
>  
>  	/* give the governor an opportunity to reflect on the outcome */
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index cf7f2f0..9de7ee2 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -309,7 +309,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	data->expected_us =
>  		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
>  
> -
>  	data->bucket = which_bucket(data->expected_us);
>  
>  	multiplier = performance_multiplier();
> @@ -330,6 +329,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  					 data->correction_factor[data->bucket],
>  					 RESOLUTION * DECAY);
>  
> +	trace_printk("expected_us: %d predicted_us: %d\n", data->expected_us,
> +		     data->predicted_us);
> +
>  	get_typical_interval(data);
>  
>  	/*
> @@ -349,10 +351,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  		struct cpuidle_state *s = &drv->states[i];
>  		struct cpuidle_state_usage *su = &dev->states_usage[i];
>  
> +		trace_printk("Trying idle state %d s->dis %d su->dis %d\n", i,
> +			     s->disabled, su->disable);
>  		if (s->disabled || su->disable)
>  			continue;
> +		trace_printk("residency %d\n", s->target_residency);
>  		if (s->target_residency > data->predicted_us)
>  			continue;
> +		trace_printk("exit_latency %d vs. %d multiplier %d\n",
> +			     s->exit_latency, latency_req, multiplier);
>  		if (s->exit_latency > latency_req)
>  			continue;
>  		if (s->exit_latency * multiplier > data->predicted_us)
> 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11  3:21             ` Mike Galbraith
@ 2013-12-11 11:28               ` Thomas Gleixner
  2013-12-11 11:38                 ` Borislav Petkov
  0 siblings, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2013-12-11 11:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel,
	Jeremy Eder, x86, Borislav Petkov

On Wed, 11 Dec 2013, Mike Galbraith wrote:
> Alakazam..
> Yup, magical gremlin repellent works on 8 socket DL980 too.

Now here is a less magical version of the gremlin repellent.

And just for the amusement value: The erratum for the series 7400
says:

AAI65. MONITOR/MWAIT May Have Excessive False Wakeups

Problem:	Normally, if MWAIT is used to enter a C-state that is
		C1 or higher, a store to the address range armed by
		the MONITOR instruction will cause the processor to
		exit MWAIT. Due to this erratum, false wakeups may
		occur when the monitored address range was recently
		written prior to executing the MONITOR instruction.

Implication:	Due to this erratum, performance and power savings may
		be impacted due to excessive false wakeups.

Workaround: 	Execute a CLFLUSH Instruction immediately before every
		MONITOR instruction when the monitored location may
		have been recently written.

Now that looks like the very same issue on these westmere EX
machines.

These false wakeups can be observed already before the idle changes
and now they are just more prominent.

Adding that clflush() unconditionally fixes the issue at least on
Boris machine.

Mike, can you retest on that 8 socket monstrum, please?

So it looks like the idle power regression is actually a software
change which exhibits a hardware "regression".

So much for proper validated advertising which promises core power 0W
at idle for these beasts :)

Thanks,

	tglx
---
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..50299ad 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -376,7 +376,7 @@ static int intel_idle(struct cpuidle_device *dev,
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
 	if (!current_set_polling_and_test()) {
-
+		clflush(&current_thread_info()->flags);
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:28               ` Thomas Gleixner
@ 2013-12-11 11:38                 ` Borislav Petkov
  2013-12-11 11:52                   ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Borislav Petkov @ 2013-12-11 11:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 12:28:36PM +0100, Thomas Gleixner wrote:
> On Wed, 11 Dec 2013, Mike Galbraith wrote:
> > Alakazam..
> > Yup, magical gremlin repellent works on 8 socket DL980 too.
> 
> Now here is a less magical version of the gremlin repellent.
> 
> And just for the amusement value: The erratum for the series 7400
> says:
> 
> AAI65. MONITOR/MWAIT May Have Excessive False Wakeups
> 
> Problem:	Normally, if MWAIT is used to enter a C-state that is
> 		C1 or higher, a store to the address range armed by
> 		the MONITOR instruction will cause the processor to
> 		exit MWAIT. Due to this erratum, false wakeups may
> 		occur when the monitored address range was recently
> 		written prior to executing the MONITOR instruction.
> 
> Implication:	Due to this erratum, performance and power savings may
> 		be impacted due to excessive false wakeups.
> 
> Workaround: 	Execute a CLFLUSH Instruction immediately before every
> 		MONITOR instruction when the monitored location may
> 		have been recently written.
> 
> Now that looks like the very same issue on these westmere EX
> machines.
> 
> These false wakeups can be observed already before the idle changes
> and now they are just more prominent.
> 
> Adding that clflush() unconditionally fixes the issue at least on
> Boris machine.
> 
> Mike, can you retest on that 8 socket monstrum, please?
> 
> So it looks like the idle power regression is actually a software
> change which exhibits a hardware "regression".

Right, if it turns out that this is really the case and that this
erratum hasn't been fixed for models later than 29 - we'd need the
additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:38                 ` Borislav Petkov
@ 2013-12-11 11:52                   ` Peter Zijlstra
  2013-12-11 12:29                     ` Mike Galbraith
  2013-12-11 21:43                     ` Len Brown
  0 siblings, 2 replies; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-11 11:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Mike Galbraith, Len Brown, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> Right, if it turns out that this is really the case and that this
> erratum hasn't been fixed for models later than 29 - we'd need the
> additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.

You also need: https://lkml.org/lkml/2013/11/19/143

Because obviously not all mwait idle loops check that cpu bit.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:52                   ` Peter Zijlstra
@ 2013-12-11 12:29                     ` Mike Galbraith
  2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 21:43                     ` Len Brown
  1 sibling, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-11 12:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > Right, if it turns out that this is really the case and that this
> > erratum hasn't been fixed for models later than 29 - we'd need the
> > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> 
> You also need: https://lkml.org/lkml/2013/11/19/143
> 
> Because obviously not all mwait idle loops check that cpu bit.

I had tried that patch, to see if it would magically make the thing
start working, nope.  I had also tried...

---
 drivers/idle/intel_idle.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/idle/intel_idle.c
===================================================================
--- linux-2.6.orig/drivers/idle/intel_idle.c
+++ linux-2.6/drivers/idle/intel_idle.c
@@ -376,11 +376,14 @@ static int intel_idle(struct cpuidle_dev
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
 	if (!current_set_polling_and_test()) {
-
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())
 			__mwait(eax, ecx);
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
 	}
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))

..a cflush before _and_ after, among other (shazam!.. darn) guesses, but
nogo.  Turning that into the tglx one liner indeed did fix the thing, as
did adding this to your patch.

---
 arch/x86/include/asm/mwait.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/mwait.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/mwait.h
+++ linux-2.6/arch/x86/include/asm/mwait.h
@@ -43,7 +43,7 @@ static inline void __mwait(unsigned long
 static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
 {
 	if (!current_set_polling_and_test()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+//		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);



Grrr.

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2
ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm dtherm
tpr_shadow vnmi flexpriority ept vpid

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:29                     ` Mike Galbraith
@ 2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 13:10                         ` Mike Galbraith
                                           ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-11 12:43 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote:
> On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > > Right, if it turns out that this is really the case and that this
> > > erratum hasn't been fixed for models later than 29 - we'd need the
> > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> > 
> > You also need: https://lkml.org/lkml/2013/11/19/143
> > 
> > Because obviously not all mwait idle loops check that cpu bit.
> 
> I had tried that patch, to see if it would magically make the thing
> start working, nope.  I had also tried...

> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +			clflush((void *)&current_thread_info()->flags);

Yeah, you need a bit extra to enable that feature bit for your CPU as
bpetkov said.

Something like the below.. someone needs to double check and possibly
add SNB/IVB EX parts if they're already available.

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0dff939..015642eed045 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,8 +387,15 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
-		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
+	if (c->x86 == 6 && cpu_has_clflush) {
+		switch (c->x86_model) {
+		case 29: /* Core2 EX */
+		case 46: /* NHM EX */
+		case 47: /* WSM EX */
+			set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
+			break;
+		}
+	}
 
 #ifdef CONFIG_X86_64
 	if (c->x86 == 15)

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:43                       ` Peter Zijlstra
@ 2013-12-11 13:10                         ` Mike Galbraith
  2013-12-11 13:40                         ` Borislav Petkov
  2013-12-11 14:42                         ` Ingo Molnar
  2 siblings, 0 replies; 96+ messages in thread
From: Mike Galbraith @ 2013-12-11 13:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, 2013-12-11 at 13:43 +0100, Peter Zijlstra wrote: 
> On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote:
> > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > > > Right, if it turns out that this is really the case and that this
> > > > erratum hasn't been fixed for models later than 29 - we'd need the
> > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> > > 
> > > You also need: https://lkml.org/lkml/2013/11/19/143
> > > 
> > > Because obviously not all mwait idle loops check that cpu bit.
> > 
> > I had tried that patch, to see if it would magically make the thing
> > start working, nope.  I had also tried...
> 
> > +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > +			clflush((void *)&current_thread_info()->flags);
> 
> Yeah, you need a bit extra to enable that feature bit for your CPU as
> bpetkov said.

Works for me, one more for the stable bucket.

So as soon as Len resurrects mwait_idle for Q6600 (and other core2 when
booted max_cstates=1 so tsc clocksource is used instead of pos hpet),
all the (known) idle regressions should be history.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 13:10                         ` Mike Galbraith
@ 2013-12-11 13:40                         ` Borislav Petkov
  2013-12-11 14:56                           ` Ingo Molnar
  2013-12-11 14:42                         ` Ingo Molnar
  2 siblings, 1 reply; 96+ messages in thread
From: Borislav Petkov @ 2013-12-11 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> Something like the below.. someone needs to double check and possibly
> add SNB/IVB EX parts if they're already available.

Right, our friends at Intel would need to tell us which families/models
does AAI65 span... if, this is actually the case.

But now that Mike's box is fixed too, it very much looks like it.

Oh, and AAIxx errata are Intel Xeon Processor 7400 Series ones, we would
also need to know whether other incarnations are affected too.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 13:10                         ` Mike Galbraith
  2013-12-11 13:40                         ` Borislav Petkov
@ 2013-12-11 14:42                         ` Ingo Molnar
  2013-12-11 15:02                           ` Thomas Gleixner
  2013-12-11 16:44                           ` Peter Zijlstra
  2 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-11 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote:
> > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > > > Right, if it turns out that this is really the case and that this
> > > > erratum hasn't been fixed for models later than 29 - we'd need the
> > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> > > 
> > > You also need: https://lkml.org/lkml/2013/11/19/143
> > > 
> > > Because obviously not all mwait idle loops check that cpu bit.
> > 
> > I had tried that patch, to see if it would magically make the thing
> > start working, nope.  I had also tried...
> 
> > +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > +			clflush((void *)&current_thread_info()->flags);
> 
> Yeah, you need a bit extra to enable that feature bit for your CPU as
> bpetkov said.
> 
> Something like the below.. someone needs to double check and possibly
> add SNB/IVB EX parts if they're already available.
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0dff939..015642eed045 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -387,8 +387,15 @@ static void init_intel(struct cpuinfo_x86 *c)
>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>  	}
>  
> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
> -		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
> +	if (c->x86 == 6 && cpu_has_clflush) {
> +		switch (c->x86_model) {
> +		case 29: /* Core2 EX */
> +		case 46: /* NHM EX */
> +		case 47: /* WSM EX */
> +			set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
> +			break;
> +		}
> +	}

Another thing that is required I think is to issue a write barrier 
before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
the documentation CLFLUSH does not appear to be ordered (at all), so 
it might execute before the modification to the affected memory?


So something like:

		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
			smp_wmb(); /* order CLFLUSH */
			clflush(&current_thread_info()->flags);
		}

possibly put behind some utility function, smp_clflush() or so, hiding 
the CPU feature bit check as well:

		smp_clflush(&current_thread_info()->flags);

or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 13:40                         ` Borislav Petkov
@ 2013-12-11 14:56                           ` Ingo Molnar
  2013-12-11 16:02                             ` Borislav Petkov
  2013-12-11 16:43                             ` Peter Zijlstra
  0 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-11 14:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86


* Borislav Petkov <bp@alien8.de> wrote:

> On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> > Something like the below.. someone needs to double check and possibly
> > add SNB/IVB EX parts if they're already available.
> 
> Right, our friends at Intel would need to tell us which families/models
> does AAI65 span... if, this is actually the case.

I think CLFLUSH should be pretty universally available, IIRC graphics 
drivers were using it rather heavily in combination with 
write-combining MTRRs, both on Linux and on Windows.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:42                         ` Ingo Molnar
@ 2013-12-11 15:02                           ` Thomas Gleixner
  2013-12-11 15:09                             ` Ingo Molnar
  2013-12-11 16:44                           ` Peter Zijlstra
  1 sibling, 1 reply; 96+ messages in thread
From: Thomas Gleixner @ 2013-12-11 15:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, 11 Dec 2013, Ingo Molnar wrote:
> 
> Another thing that is required I think is to issue a write barrier 
> before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> the documentation CLFLUSH does not appear to be ordered (at all), so 
> it might execute before the modification to the affected memory?

We already have a write barrier in the code which modifies
current_thread_info()->flags.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 15:02                           ` Thomas Gleixner
@ 2013-12-11 15:09                             ` Ingo Molnar
  2013-12-11 16:44                               ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-11 15:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 11 Dec 2013, Ingo Molnar wrote:
> > 
> > Another thing that is required I think is to issue a write barrier 
> > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > it might execute before the modification to the affected memory?
> 
> We already have a write barrier in the code which modifies
> current_thread_info()->flags.

Which code is that, could you please cite it? (I tried finding it but 
my grep-fu failed me.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:56                           ` Ingo Molnar
@ 2013-12-11 16:02                             ` Borislav Petkov
  2013-12-11 16:43                             ` Peter Zijlstra
  1 sibling, 0 replies; 96+ messages in thread
From: Borislav Petkov @ 2013-12-11 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote:
> I think CLFLUSH should be pretty universally available, IIRC
> graphics drivers were using it rather heavily in combination with
> write-combining MTRRs, both on Linux and on Windows.

... and it is also very expensive. So I don't think it would be in
Intel's best interest to do CLFLUSH unconditionally on all families but
only on those which really need to.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:56                           ` Ingo Molnar
  2013-12-11 16:02                             ` Borislav Petkov
@ 2013-12-11 16:43                             ` Peter Zijlstra
  2013-12-11 17:50                               ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-11 16:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote:
> 
> * Borislav Petkov <bp@alien8.de> wrote:
> 
> > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> > > Something like the below.. someone needs to double check and possibly
> > > add SNB/IVB EX parts if they're already available.
> > 
> > Right, our friends at Intel would need to tell us which families/models
> > does AAI65 span... if, this is actually the case.
> 
> I think CLFLUSH should be pretty universally available, IIRC graphics 
> drivers were using it rather heavily in combination with 
> write-combining MTRRs, both on Linux and on Windows.

The availability isn't the problem; the cost is. We shouldn't issue one
if its not required. Only 'broken' EX hardware needs it.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:42                         ` Ingo Molnar
  2013-12-11 15:02                           ` Thomas Gleixner
@ 2013-12-11 16:44                           ` Peter Zijlstra
  2013-12-11 17:47                             ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-11 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:42:38PM +0100, Ingo Molnar wrote:
> Another thing that is required I think is to issue a write barrier 
> before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> the documentation CLFLUSH does not appear to be ordered (at all), so 
> it might execute before the modification to the affected memory?
> 
> 
> So something like:
> 
> 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> 			smp_wmb(); /* order CLFLUSH */
> 			clflush(&current_thread_info()->flags);
> 		}

smp_wmb() is a NO-OP on x86 remember :-)

Also, a wmb doesn't actually need to flush the store buffers.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 15:09                             ` Ingo Molnar
@ 2013-12-11 16:44                               ` Peter Zijlstra
  2013-12-11 17:48                                 ` Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-11 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 04:09:11PM +0100, Ingo Molnar wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 11 Dec 2013, Ingo Molnar wrote:
> > > 
> > > Another thing that is required I think is to issue a write barrier 
> > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > > it might execute before the modification to the affected memory?
> > 
> > We already have a write barrier in the code which modifies
> > current_thread_info()->flags.
> 
> Which code is that, could you please cite it? (I tried finding it but 
> my grep-fu failed me.)

current_set_polling_and_test().

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 16:44                           ` Peter Zijlstra
@ 2013-12-11 17:47                             ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-11 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 03:42:38PM +0100, Ingo Molnar wrote:
> > Another thing that is required I think is to issue a write barrier 
> > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > it might execute before the modification to the affected memory?
> > 
> > 
> > So something like:
> > 
> > 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> > 			smp_wmb(); /* order CLFLUSH */
> > 			clflush(&current_thread_info()->flags);
> > 		}
> 
> smp_wmb() is a NO-OP on x86 remember :-)

Well, it's a compiler barrier but yes - I suspect a smp_mb() might be 
needed - at least according to the CLFLUSH documentation it has no 
implicit guaranteed ordering wrt. preceding writes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 16:44                               ` Peter Zijlstra
@ 2013-12-11 17:48                                 ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-11 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 04:09:11PM +0100, Ingo Molnar wrote:
> > 
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > On Wed, 11 Dec 2013, Ingo Molnar wrote:
> > > > 
> > > > Another thing that is required I think is to issue a write barrier 
> > > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > > > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > > > it might execute before the modification to the affected memory?
> > > 
> > > We already have a write barrier in the code which modifies
> > > current_thread_info()->flags.
> > 
> > Which code is that, could you please cite it? (I tried finding it but 
> > my grep-fu failed me.)
> 
> current_set_polling_and_test().

Thx, those variants have the right barrier indeed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 16:43                             ` Peter Zijlstra
@ 2013-12-11 17:50                               ` Ingo Molnar
  2013-12-11 23:08                                 ` H. Peter Anvin
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-11 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote:
> > 
> > * Borislav Petkov <bp@alien8.de> wrote:
> > 
> > > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> > > > Something like the below.. someone needs to double check and possibly
> > > > add SNB/IVB EX parts if they're already available.
> > > 
> > > Right, our friends at Intel would need to tell us which 
> > > families/models does AAI65 span... if, this is actually the 
> > > case.
> > 
> > I think CLFLUSH should be pretty universally available, IIRC 
> > graphics drivers were using it rather heavily in combination with 
> > write-combining MTRRs, both on Linux and on Windows.
> 
> The availability isn't the problem; the cost is. We shouldn't issue 
> one if its not required. Only 'broken' EX hardware needs it.

Well, availability could be a problem too, if some CPU (real or 
virtual) implements MWAIT but not CLFLUSH.

In theory we could make mwait an alternatives variant and patch in the 
right combination of instructions? The CLFLUSH goes to the same 
address as on which the monitoring happens, so it could be considered 
one meta-instruction.

Thansk,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:52                   ` Peter Zijlstra
  2013-12-11 12:29                     ` Mike Galbraith
@ 2013-12-11 21:43                     ` Len Brown
  2013-12-11 22:22                       ` Thomas Gleixner
  1 sibling, 1 reply; 96+ messages in thread
From: Len Brown @ 2013-12-11 21:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Thomas Gleixner, Mike Galbraith, Linux PM list,
	linux-kernel, Jeremy Eder, x86

For the record...

intel_idle doesn't check that bit because it doesn't run on model 29 --
the Xeon 7400 was the "Dunnington" 4-socket generation based on Core2.
Until now, i was not aware that this issue might apply to models other
than that one.

Checking w/ the HW guys...

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 21:43                     ` Len Brown
@ 2013-12-11 22:22                       ` Thomas Gleixner
  0 siblings, 0 replies; 96+ messages in thread
From: Thomas Gleixner @ 2013-12-11 22:22 UTC (permalink / raw)
  To: Len Brown
  Cc: Peter Zijlstra, Borislav Petkov, Mike Galbraith, Linux PM list,
	linux-kernel, Jeremy Eder, x86

On Wed, 11 Dec 2013, Len Brown wrote:

> For the record...
> 
> intel_idle doesn't check that bit because it doesn't run on model 29 --
> the Xeon 7400 was the "Dunnington" 4-socket generation based on Core2.

Right, we wondered about the restricted model check
already. Interesting that this is a 4 socket issue as well.

> Until now, i was not aware that this issue might apply to models other
> than that one.
> 
> Checking w/ the HW guys...

I hope they are not equally surprised :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 17:50                               ` Ingo Molnar
@ 2013-12-11 23:08                                 ` H. Peter Anvin
  2013-12-11 23:14                                   ` Borislav Petkov
  2013-12-12  8:51                                   ` Peter Zijlstra
  0 siblings, 2 replies; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-11 23:08 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On 12/11/2013 09:50 AM, Ingo Molnar wrote:
> 
> Well, availability could be a problem too, if some CPU (real or 
> virtual) implements MWAIT but not CLFLUSH.
> 
> In theory we could make mwait an alternatives variant and patch in the 
> right combination of instructions? The CLFLUSH goes to the same 
> address as on which the monitoring happens, so it could be considered 
> one meta-instruction.
> 

The first thing to do is probably to drop the use of thread_info as a
wakeup doorbell.  It seemed like a good idea at the time -- after all,
there is one for each thread -- but it is extremely likely to be dirty
in the cache, which is (presumably) what causes these kinds of bugs to
be maximally likely.  Even if we don't do the CLFLUSH it is likely that
the hardware has to do something expensive behind the scenes.

So I would like to propose that we switch to using a percpu variable
which is a single cache line of nothing at all.  It would only ever be
touched by MONITOR and for explicit wakeup.  Hopefully that will resolve
this problem without the need for the CLFLUSH.

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 23:08                                 ` H. Peter Anvin
@ 2013-12-11 23:14                                   ` Borislav Petkov
  2013-12-12  0:52                                     ` H. Peter Anvin
  2013-12-12  8:51                                   ` Peter Zijlstra
  1 sibling, 1 reply; 96+ messages in thread
From: Borislav Petkov @ 2013-12-11 23:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> So I would like to propose that we switch to using a percpu variable
> which is a single cache line of nothing at all. It would only ever
> be touched by MONITOR and for explicit wakeup. Hopefully that will
> resolve this problem without the need for the CLFLUSH.

Yep, makes a lot of sense to me to have an exclusive (overloaded meaning
here :-)) cacheline only for that. And, if it works, we'll save us the
penalty from the CLFLUSH too, cool.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 23:14                                   ` Borislav Petkov
@ 2013-12-12  0:52                                     ` H. Peter Anvin
  2013-12-12  4:25                                       ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-12  0:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

[-- Attachment #1: Type: text/plain, Size: 824 bytes --]

On 12/11/2013 03:14 PM, Borislav Petkov wrote:
> On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
>> So I would like to propose that we switch to using a percpu variable
>> which is a single cache line of nothing at all. It would only ever
>> be touched by MONITOR and for explicit wakeup. Hopefully that will
>> resolve this problem without the need for the CLFLUSH.
> 
> Yep, makes a lot of sense to me to have an exclusive (overloaded meaning
> here :-)) cacheline only for that. And, if it works, we'll save us the
> penalty from the CLFLUSH too, cool.
> 

Here is a POC patch... anyone willing to test it out?

Two obvious things to watch out for:

1. I couldn't actually spot any obvious cases of a deliberate monitor
   trigger.

2. Should we do cpu_relax() for all users, not just powerclamp?

	-hpa


[-- Attachment #2: 0001-x86-mwait-Use-a-dedicated-percpu-area-for-mwait-door.patch --]
[-- Type: text/x-patch, Size: 9373 bytes --]

>From 20a54d54ea06f050650ab8923b7d9ee1d21b5317 Mon Sep 17 00:00:00 2001
From: "H. Peter Anvin" <hpa@linux.intel.com>
Date: Wed, 11 Dec 2013 16:31:04 -0800
Subject: [PATCH] x86, mwait: Use a dedicated percpu area for mwait doorbell

We have used the cache line that includes thread_info.flags as the
mwait doorbell.  However, this is liable to be dirty in the cache,
which may trigger hardware errata, plus it might cause false wakeups.
Instead, use a dedicated wakeup doorbell area.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h  |  1 -
 arch/x86/include/asm/mwait.h       | 46 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   | 23 -------------------
 arch/x86/kernel/acpi/cstate.c      |  6 +----
 arch/x86/kernel/cpu/common.c       | 17 ++++++++++++++
 arch/x86/kernel/cpu/intel.c        |  3 ---
 arch/x86/kernel/setup_percpu.c     |  3 +++
 arch/x86/kernel/smpboot.c          | 19 +---------------
 drivers/acpi/acpi_pad.c            |  3 +--
 drivers/idle/intel_idle.c          |  4 +---
 drivers/thermal/intel_powerclamp.c |  2 +-
 11 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e099f95..cdc77f3 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -96,7 +96,6 @@
 #define X86_FEATURE_XTOPOLOGY	(3*32+22) /* cpu topology enum extensions */
 #define X86_FEATURE_TSC_RELIABLE (3*32+23) /* TSC is known to be reliable */
 #define X86_FEATURE_NONSTOP_TSC	(3*32+24) /* TSC does not stop in C states */
-#define X86_FEATURE_CLFLUSH_MONITOR (3*32+25) /* "" clflush reqd with monitor */
 #define X86_FEATURE_EXTD_APICID	(3*32+26) /* has extended APICID (8 bits) */
 #define X86_FEATURE_AMD_DCM     (3*32+27) /* multi-node processor */
 #define X86_FEATURE_APERFMPERF	(3*32+28) /* APERFMPERF */
diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 2f366d0..4c82863 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -13,4 +13,50 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+#ifndef __ASSEMBLY__
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
+{
+	trace_hardirqs_on();
+	/* "mwait %eax, %ecx;" */
+	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+extern char __percpu *mwait_doorbell;
+
+void __init setup_mwait_doorbell(void);
+
+static inline void x86_monitor_doorbell(unsigned long ecx, unsigned long edx)
+{
+	__monitor(__this_cpu_ptr(mwait_doorbell), ecx, edx);
+	mb();
+}
+
+static inline void x86_mwait_doorbell_bing_bong(int cpu)
+{
+	ACCESS_ONCE(*per_cpu_ptr(mwait_doorbell, cpu)) = 0;
+}
+
+#endif /* !__ASSEMBLY__ */
+
 #endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b7845a1..a51a79e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -723,29 +723,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27..aec26c5 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
+		x86_monitor_doorbell(0, 0);
 		if (!need_resched())
 			__mwait(ax, cx);
 	}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6abc172..b6b19ab 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -24,6 +24,7 @@
 #include <linux/cpumask.h>
 #include <asm/pgtable.h>
 #include <linux/atomic.h>
+#include <asm/mwait.h>
 #include <asm/proto.h>
 #include <asm/setup.h>
 #include <asm/apic.h>
@@ -63,6 +64,22 @@ void __init setup_cpu_local_masks(void)
 	alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
 }
 
+/* allocate percpu area for mwait doorbell */
+char __percpu *mwait_doorbell;
+
+void __init setup_mwait_doorbell(void)
+{
+	if (boot_cpu_has(X86_FEATURE_MWAIT)) {
+		mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size,
+						boot_cpu_data.clflush_size);
+
+		if (!mwait_doorbell) {
+			/* This should never happen... */
+			panic("Unable to allocate mwait doorbell!\n");
+		}
+	}
+}
+
 static void default_init(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..47b4e7b 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,9 +387,6 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
-		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
-
 #ifdef CONFIG_X86_64
 	if (c->x86 == 15)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 5cdff03..917e1af 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -284,4 +284,7 @@ void __init setup_per_cpu_areas(void)
 
 	/* Setup cpu initialized, callin, callout masks */
 	setup_cpu_local_masks();
+
+	/* Setup MWAIT doorbell */
+	setup_mwait_doorbell();
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85dc05a..558e097 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1368,7 +1368,6 @@ static inline void mwait_play_dead(void)
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
-	void *mwait_ptr;
 	int i;
 
 	if (!this_cpu_has(X86_FEATURE_MWAIT))
@@ -1400,26 +1399,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
-	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors.  The actual
-	 * content is immaterial as it is not actually modified in any way.
-	 */
-	mwait_ptr = &current_thread_info()->flags;
-
 	wbinvd();
 
 	while (1) {
-		/*
-		 * The CLFLUSH is a workaround for erratum AAI65 for
-		 * the Xeon 7400 series.  It's not clear it is actually
-		 * needed, but it should be harmless in either case.
-		 * The WBINVD is insufficient due to the spurious-wakeup
-		 * case where we return around the loop.
-		 */
-		clflush(mwait_ptr);
-		__monitor(mwait_ptr, 0, 0);
-		mb();
+		x86_monitor_doorbell(0, 0);
 		__mwait(eax, 0);
 		/*
 		 * If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008f..38aaa8b 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,8 +193,7 @@ static int power_saving_thread(void *data)
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
+			x86_monitor_doorbell(0, 0);
 			if (!need_resched())
 				__mwait(power_saving_mwait_eax, 1);
 
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..6fef6d9 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -376,9 +376,7 @@ static int intel_idle(struct cpuidle_device *dev,
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
 	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
+		x86_monitor_doorbell(0, 0);
 		if (!need_resched())
 			__mwait(eax, ecx);
 	}
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3..15cf013 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,7 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
+			x86_monitor_doorbell(0, 0);
 			cpu_relax(); /* allow HT sibling to run */
 			__mwait(eax, ecx);
 			start_critical_timings();
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  0:52                                     ` H. Peter Anvin
@ 2013-12-12  4:25                                       ` Mike Galbraith
  2013-12-12  4:49                                         ` H. Peter Anvin
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-12  4:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, 2013-12-11 at 16:52 -0800, H. Peter Anvin wrote: 
> On 12/11/2013 03:14 PM, Borislav Petkov wrote:
> > On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> >> So I would like to propose that we switch to using a percpu variable
> >> which is a single cache line of nothing at all. It would only ever
> >> be touched by MONITOR and for explicit wakeup. Hopefully that will
> >> resolve this problem without the need for the CLFLUSH.
> > 
> > Yep, makes a lot of sense to me to have an exclusive (overloaded meaning
> > here :-)) cacheline only for that. And, if it works, we'll save us the
> > penalty from the CLFLUSH too, cool.
> > 
> 
> Here is a POC patch... anyone willing to test it out?

Got it built, but it went boom on boot.  Off to rummage. 

[    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:8
[    0.000000] PERCPU: Embedded 26 pages/cpu @ffff88027ee00000 s75904 r8192 d22400 u131072
[    0.000000] pcpu-alloc: s75904 r8192 d22400 u131072 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 
[    0.000000] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
[    0.000000] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 
[    0.000000] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 
[    0.000000] BUG: unable to handle kernel paging request at 000000000000b8a0
[    0.000000] IP: [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38
[    0.000000] PGD 0 
[    0.000000] Oops: 0002 [#1] SMP 
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0-master #185
[    0.000000] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[    0.000000] task: ffffffff81a10460 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[    0.000000] RIP: 0010:[<ffffffff81a9d072>]  [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38
[    0.000000] RSP: 0000:ffffffff81a01f28  EFLAGS: 00010002
[    0.000000] RAX: 0000000000014880 RBX: 0000000000000040 RCX: 0000000000000000
[    0.000000] RDX: 0000000000000040 RSI: 0000000000000040 RDI: ffffffff81a38e60
[    0.000000] RBP: ffffffff81a01f28 R08: 0000000000000040 R09: 0000000000000000
[    0.000000] R10: ffff88027f5f4880 R11: 0000000000000001 R12: 000000000000b850
[    0.000000] R13: 000000000000b026 R14: 000000000000b024 R15: 000000000000b020
[    0.000000] FS:  0000000000000000(0000) GS:ffff88027ee00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 000000000000b8a0 CR3: 0000000001a0b000 CR4: 00000000000000b0
[    0.000000] Stack:
[    0.000000]  ffffffff81a01f78 ffffffff81aa3641 ffffffff81a01f98 000000000000cd48
[    0.000000]  ffff88027ee00000 0000000000000000 0000000000000000 0000000000000000
[    0.000000]  0000000000000000 0000000000000000 ffffffff81a01fa8 ffffffff81a96d89
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff81aa3641>] setup_per_cpu_areas+0x233/0x242
[    0.000000]  [<ffffffff81a96d89>] start_kernel+0x84/0x370
[    0.000000]  [<ffffffff81a964cc>] x86_64_start_reservations+0x1b/0x35
[    0.000000]  [<ffffffff81a96614>] x86_64_start_kernel+0x12e/0x135
[    0.000000] Code: 40 8f a7 81 e8 f6 fe ff ff c9 c3 55 48 8b 05 0a bf fd ff 48 89 e5 a8 08 75 02 c9 c3 0f b7 3d 84 bf fd ff 48 89 fe e8 fe dc 64 ff <48> 89 05 27 e8 56 7e 48 85 c0 75 e3 48 c7 c7 f0 83 78 81 e8 55 
[    0.000000] RIP  [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38
[    0.000000]  RSP <ffffffff81a01f28>
[    0.000000] CR2: 000000000000b8a0
[    0.000000] ---[ end trace f6e32c58e0729292 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

Build delta.

---
 arch/x86/include/asm/mwait.h   |    4 ++--
 arch/x86/kernel/cpu/common.c   |    7 ++++---
 arch/x86/kernel/setup_percpu.c |    1 +
 3 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/common.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/common.c
+++ linux-2.6/arch/x86/kernel/cpu/common.c
@@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
 }
 
 /* allocate percpu area for mwait doorbell */
-char __percpu *mwait_doorbell;
+DEFINE_PER_CPU(char *, mwait_doorbell);
+EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
 
 void __init setup_mwait_doorbell(void)
 {
 	if (boot_cpu_has(X86_FEATURE_MWAIT)) {
-		mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size,
-						boot_cpu_data.clflush_size);
+		mwait_doorbell = __alloc_percpu(boot_cpu_data.x86_clflush_size,
+						boot_cpu_data.x86_clflush_size);
 
 		if (!mwait_doorbell) {
 			/* This should never happen... */
Index: linux-2.6/arch/x86/kernel/setup_percpu.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup_percpu.c
+++ linux-2.6/arch/x86/kernel/setup_percpu.c
@@ -20,6 +20,7 @@
 #include <asm/cpumask.h>
 #include <asm/cpu.h>
 #include <asm/stackprotector.h>
+#include <asm/mwait.h>
 
 DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
 EXPORT_PER_CPU_SYMBOL(cpu_number);
Index: linux-2.6/arch/x86/include/asm/mwait.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/mwait.h
+++ linux-2.6/arch/x86/include/asm/mwait.h
@@ -42,9 +42,9 @@ static inline void __sti_mwait(unsigned
 		     :: "a" (eax), "c" (ecx));
 }
 
-extern char __percpu *mwait_doorbell;
+DECLARE_PER_CPU(char *, mwait_doorbell);
 
-void __init setup_mwait_doorbell(void);
+extern void __init setup_mwait_doorbell(void);
 
 static inline void x86_monitor_doorbell(unsigned long ecx, unsigned long edx)
 {




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  4:25                                       ` Mike Galbraith
@ 2013-12-12  4:49                                         ` H. Peter Anvin
  2013-12-12  4:59                                           ` Mike Galbraith
  2013-12-12  5:37                                           ` Mike Galbraith
  0 siblings, 2 replies; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-12  4:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On 12/11/2013 08:25 PM, Mike Galbraith wrote:
>  arch/x86/include/asm/mwait.h   |    4 ++--
>  arch/x86/kernel/cpu/common.c   |    7 ++++---
>  arch/x86/kernel/setup_percpu.c |    1 +
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/arch/x86/kernel/cpu/common.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> +++ linux-2.6/arch/x86/kernel/cpu/common.c
> @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
>  }
>  
>  /* allocate percpu area for mwait doorbell */
> -char __percpu *mwait_doorbell;
> +DEFINE_PER_CPU(char *, mwait_doorbell);
> +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
>  

Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer to
a percpu allocation, but the variable itself is not a percpu variable.
This explains your boom.

>  void __init setup_mwait_doorbell(void)
>  {
>  	if (boot_cpu_has(X86_FEATURE_MWAIT)) {
> -		mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size,
> -						boot_cpu_data.clflush_size);
> +		mwait_doorbell = __alloc_percpu(boot_cpu_data.x86_clflush_size,
> +						boot_cpu_data.x86_clflush_size);
>  
>  		if (!mwait_doorbell) {
>  			/* This should never happen... */

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  4:49                                         ` H. Peter Anvin
@ 2013-12-12  4:59                                           ` Mike Galbraith
  2013-12-12  5:37                                           ` Mike Galbraith
  1 sibling, 0 replies; 96+ messages in thread
From: Mike Galbraith @ 2013-12-12  4:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: 
> On 12/11/2013 08:25 PM, Mike Galbraith wrote:
> >  arch/x86/include/asm/mwait.h   |    4 ++--
> >  arch/x86/kernel/cpu/common.c   |    7 ++++---
> >  arch/x86/kernel/setup_percpu.c |    1 +
> >  3 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6/arch/x86/kernel/cpu/common.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> > +++ linux-2.6/arch/x86/kernel/cpu/common.c
> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
> >  }
> >  
> >  /* allocate percpu area for mwait doorbell */
> > -char __percpu *mwait_doorbell;
> > +DEFINE_PER_CPU(char *, mwait_doorbell);
> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
> >  
> 
> Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer to
> a percpu allocation, but the variable itself is not a percpu variable.
> This explains your boom.

Yeah, I know, I already slapped myself upside the head.

(what were you thinking mikie...la la la la la:)

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  4:49                                         ` H. Peter Anvin
  2013-12-12  4:59                                           ` Mike Galbraith
@ 2013-12-12  5:37                                           ` Mike Galbraith
  2013-12-12  5:45                                             ` H. Peter Anvin
  1 sibling, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-12  5:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: 
> On 12/11/2013 08:25 PM, Mike Galbraith wrote:
> >  arch/x86/include/asm/mwait.h   |    4 ++--
> >  arch/x86/kernel/cpu/common.c   |    7 ++++---
> >  arch/x86/kernel/setup_percpu.c |    1 +
> >  3 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6/arch/x86/kernel/cpu/common.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> > +++ linux-2.6/arch/x86/kernel/cpu/common.c
> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
> >  }
> >  
> >  /* allocate percpu area for mwait doorbell */
> > -char __percpu *mwait_doorbell;
> > +DEFINE_PER_CPU(char *, mwait_doorbell);
> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
> >  
> 
> Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer to
> a percpu allocation, but the variable itself is not a percpu variable.
> This explains your boom.

With that fixed, it boots, but is not quite perfect.

... 
[  258.560079] fbcon: radeondrmfb (fb0) is primary device
[  258.722483] Console: switching to colour frame buffer device 128x48
[  258.847076] radeon 0000:01:03.0: fb0: radeondrmfb frame buffer device
[  258.911991] radeon 0000:01:03.0: registered panic notifier
[  258.968772] [drm] Initialized radeon 2.35.0 20080528 for 0000:01:03.0 on minor 0
...
[  469.738604] netxen_nic 0000:04:00.3: using msi-x interrupts
[  469.739078] netxen_nic 0000:04:00.3: eth5: GbE port initialized
[  469.830512] ipmi_si 00:01: Found new BMC (man_id: 0x00000b, prod_id: 0x2000, dev_id: 0x13)
[  469.830524] ipmi_si 00:01: IPMI kcs interface initialized
[  473.729862] iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
...
[  711.636741] fuse init (API version 7.22)

...zzzz ok box, doctor appointment is in an hour away.

-Mike 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  5:37                                           ` Mike Galbraith
@ 2013-12-12  5:45                                             ` H. Peter Anvin
  2013-12-12  5:57                                               ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-12  5:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

As in it hangs at that point?

Mike Galbraith <bitbucket@online.de> wrote:
>On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: 
>> On 12/11/2013 08:25 PM, Mike Galbraith wrote:
>> >  arch/x86/include/asm/mwait.h   |    4 ++--
>> >  arch/x86/kernel/cpu/common.c   |    7 ++++---
>> >  arch/x86/kernel/setup_percpu.c |    1 +
>> >  3 files changed, 7 insertions(+), 5 deletions(-)
>> > 
>> > Index: linux-2.6/arch/x86/kernel/cpu/common.c
>> > ===================================================================
>> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
>> > +++ linux-2.6/arch/x86/kernel/cpu/common.c
>> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
>> >  }
>> >  
>> >  /* allocate percpu area for mwait doorbell */
>> > -char __percpu *mwait_doorbell;
>> > +DEFINE_PER_CPU(char *, mwait_doorbell);
>> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
>> >  
>> 
>> Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer
>to
>> a percpu allocation, but the variable itself is not a percpu
>variable.
>> This explains your boom.
>
>With that fixed, it boots, but is not quite perfect.
>
>... 
>[  258.560079] fbcon: radeondrmfb (fb0) is primary device
>[  258.722483] Console: switching to colour frame buffer device 128x48
>[  258.847076] radeon 0000:01:03.0: fb0: radeondrmfb frame buffer
>device
>[  258.911991] radeon 0000:01:03.0: registered panic notifier
>[  258.968772] [drm] Initialized radeon 2.35.0 20080528 for
>0000:01:03.0 on minor 0
>...
>[  469.738604] netxen_nic 0000:04:00.3: using msi-x interrupts
>[  469.739078] netxen_nic 0000:04:00.3: eth5: GbE port initialized
>[  469.830512] ipmi_si 00:01: Found new BMC (man_id: 0x00000b, prod_id:
>0x2000, dev_id: 0x13)
>[  469.830524] ipmi_si 00:01: IPMI kcs interface initialized
>[  473.729862] iTCO_wdt: unable to reset NO_REBOOT flag, device
>disabled by hardware/BIOS
>...
>[  711.636741] fuse init (API version 7.22)
>
>...zzzz ok box, doctor appointment is in an hour away.
>
>-Mike 

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  5:45                                             ` H. Peter Anvin
@ 2013-12-12  5:57                                               ` Mike Galbraith
  2013-12-12  6:05                                                 ` Mike Galbraith
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-12  5:57 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: 
> As in it hangs at that point?

Nope, it's still going.

[1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz

Funny, continents move faster :)  Maybe missing a write or two.

-Mike


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  5:57                                               ` Mike Galbraith
@ 2013-12-12  6:05                                                 ` Mike Galbraith
  2013-12-12  7:57                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 96+ messages in thread
From: Mike Galbraith @ 2013-12-12  6:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Thu, 2013-12-12 at 06:57 +0100, Mike Galbraith wrote: 
> On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: 
> > As in it hangs at that point?
> 
> Nope, it's still going.
> 
> [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz
> 
> Funny, continents move faster :)  Maybe missing a write or two.

When I get back it may be done booting.  I'm gonna let it try for grins
while I'm away, then take a peek, see if I can spot it.

[ 1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz
                                                                      done
Starting HAL daemon                                                   done


Setting up (localfs) network interfaces:
    lo        
    lo        IP address: 127.0.0.1/8   
              IP address: 127.0.0.2/8                                 done
    eth0      device: Broadcom Corporation NetXtreme II BCM5709 Gig
              No configuration found for eth0                         unused
    eth1      device: Broadcom Corporation NetXtreme II BCM5709 Gig
              No configuration found for eth1                         unused
    eth2      device: NetXen Incorporated NX3031 Multifunction 1/10
[ 2457.114007] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[ 2457.114455] netxen_nic: eth2 NIC Link is up
[ 2457.223582] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
    eth2      IP address: 0.0.0.0/32


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  6:05                                                 ` Mike Galbraith
@ 2013-12-12  7:57                                                   ` H. Peter Anvin
  0 siblings, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-12  7:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

I'm guessing there is an implicit or explicit wake-up somewhere that I managed to miss.  If it is implicit it could be hard to catch.

Mike Galbraith <bitbucket@online.de> wrote:
>On Thu, 2013-12-12 at 06:57 +0100, Mike Galbraith wrote: 
>> On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: 
>> > As in it hangs at that point?
>> 
>> Nope, it's still going.
>> 
>> [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency
>limits: 1064 MHz, 2266 MHz
>> 
>> Funny, continents move faster :)  Maybe missing a write or two.
>
>When I get back it may be done booting.  I'm gonna let it try for grins
>while I'm away, then take a peek, see if I can spot it.
>
>[ 1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency
>limits: 1064 MHz, 2266 MHz
>                                                                   done
>Starting HAL daemon                                                  
>done
>
>
>Setting up (localfs) network interfaces:
>    lo        
>    lo        IP address: 127.0.0.1/8   
>           IP address: 127.0.0.2/8                                 done
>    eth0      device: Broadcom Corporation NetXtreme II BCM5709 Gig
>         No configuration found for eth0                         unused
>    eth1      device: Broadcom Corporation NetXtreme II BCM5709 Gig
>         No configuration found for eth1                         unused
>    eth2      device: NetXen Incorporated NX3031 Multifunction 1/10
>[ 2457.114007] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
>[ 2457.114455] netxen_nic: eth2 NIC Link is up
>[ 2457.223582] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
>    eth2      IP address: 0.0.0.0/32

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 23:08                                 ` H. Peter Anvin
  2013-12-11 23:14                                   ` Borislav Petkov
@ 2013-12-12  8:51                                   ` Peter Zijlstra
  2013-12-12 13:28                                     ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-12  8:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Borislav Petkov, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> On 12/11/2013 09:50 AM, Ingo Molnar wrote:
> > 
> > Well, availability could be a problem too, if some CPU (real or 
> > virtual) implements MWAIT but not CLFLUSH.
> > 
> > In theory we could make mwait an alternatives variant and patch in the 
> > right combination of instructions? The CLFLUSH goes to the same 
> > address as on which the monitoring happens, so it could be considered 
> > one meta-instruction.
> > 
> 
> The first thing to do is probably to drop the use of thread_info as a
> wakeup doorbell.  It seemed like a good idea at the time -- after all,
> there is one for each thread -- but it is extremely likely to be dirty
> in the cache, which is (presumably) what causes these kinds of bugs to
> be maximally likely.  Even if we don't do the CLFLUSH it is likely that
> the hardware has to do something expensive behind the scenes.
> 
> So I would like to propose that we switch to using a percpu variable
> which is a single cache line of nothing at all.  It would only ever be
> touched by MONITOR and for explicit wakeup.  Hopefully that will resolve
> this problem without the need for the CLFLUSH.

The reason we use thread_info::flags is because we need to write
TIF_NEED_RESCHED into it to wake up anyhow.

Using another cacheline would mean the wakeup path would need to write a
second cross cpu cacheline -- that is badness too.

So no, I don't think we want to listen to another line.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  8:51                                   ` Peter Zijlstra
@ 2013-12-12 13:28                                     ` Ingo Molnar
  2013-12-12 15:06                                       ` H. Peter Anvin
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-12 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Borislav Petkov, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> > On 12/11/2013 09:50 AM, Ingo Molnar wrote:
> > > 
> > > Well, availability could be a problem too, if some CPU (real or 
> > > virtual) implements MWAIT but not CLFLUSH.
> > > 
> > > In theory we could make mwait an alternatives variant and patch in the 
> > > right combination of instructions? The CLFLUSH goes to the same 
> > > address as on which the monitoring happens, so it could be considered 
> > > one meta-instruction.
> > > 
> > 
> > The first thing to do is probably to drop the use of thread_info as a
> > wakeup doorbell.  It seemed like a good idea at the time -- after all,
> > there is one for each thread -- but it is extremely likely to be dirty
> > in the cache, which is (presumably) what causes these kinds of bugs to
> > be maximally likely.  Even if we don't do the CLFLUSH it is likely that
> > the hardware has to do something expensive behind the scenes.
> > 
> > So I would like to propose that we switch to using a percpu variable
> > which is a single cache line of nothing at all.  It would only ever be
> > touched by MONITOR and for explicit wakeup.  Hopefully that will resolve
> > this problem without the need for the CLFLUSH.
> 
> The reason we use thread_info::flags is because we need to write
> TIF_NEED_RESCHED into it to wake up anyhow.
> 
> Using another cacheline would mean the wakeup path would need to write a
> second cross cpu cacheline -- that is badness too.
> 
> So no, I don't think we want to listen to another line.

Seconded ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12 13:28                                     ` Ingo Molnar
@ 2013-12-12 15:06                                       ` H. Peter Anvin
  2013-12-12 15:51                                         ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel, Jeremy Eder, x86

On 12/12/2013 05:28 AM, Ingo Molnar wrote:
>>
>> The reason we use thread_info::flags is because we need to write
>> TIF_NEED_RESCHED into it to wake up anyhow.
>>
>> Using another cacheline would mean the wakeup path would need to write a
>> second cross cpu cacheline -- that is badness too.
>>
>> So no, I don't think we want to listen to another line.
> 

Right, okay, so that's the implicit wakeup.  However, I would think the
CLFLUSH would hurt a lot more.

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12 15:06                                       ` H. Peter Anvin
@ 2013-12-12 15:51                                         ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-12 15:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Borislav Petkov, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86

On Thu, Dec 12, 2013 at 07:06:57AM -0800, H. Peter Anvin wrote:
> On 12/12/2013 05:28 AM, Ingo Molnar wrote:
> >>
> >> The reason we use thread_info::flags is because we need to write
> >> TIF_NEED_RESCHED into it to wake up anyhow.
> >>
> >> Using another cacheline would mean the wakeup path would need to write a
> >> second cross cpu cacheline -- that is badness too.
> >>
> >> So no, I don't think we want to listen to another line.
> > 
> 
> Right, okay, so that's the implicit wakeup.  However, I would think the
> CLFLUSH would hurt a lot more.

Maybe, but still, who cares? Its only a few broken cpus that actually
need the clflush, normal cpus do not. We should not optimize for the
broken case.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH] x86 idle: repair large-server 50-watt idle-power regression
@ 2013-12-18 21:44   ` Len Brown
  0 siblings, 0 replies; 96+ messages in thread
From: Len Brown @ 2013-12-18 21:44 UTC (permalink / raw)
  To: x86; +Cc: linux-pm, linux-kernel, Len Brown, stable

From: Len Brown <len.brown@intel.com>

Linux 3.10 changed the timing of how thread_info->flags is touched:

	x86: Use generic idle loop
	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.

Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.

Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.

While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.

ref: https://lkml.org/lkml/2013/12/7/22
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
---
 arch/x86/kernel/cpu/intel.c | 3 ++-
 drivers/idle/intel_idle.c   | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..ea04b34 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
+	if (c->x86 == 6 && cpu_has_clflush &&
+	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
 		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
 
 #ifdef CONFIG_X86_64
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..f80b700 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())
-- 
1.8.5.1.19.gdaad3aa


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH] x86 idle: repair large-server 50-watt idle-power regression
@ 2013-12-18 21:44   ` Len Brown
  0 siblings, 0 replies; 96+ messages in thread
From: Len Brown @ 2013-12-18 21:44 UTC (permalink / raw)
  To: x86; +Cc: linux-pm, linux-kernel, Len Brown, stable

From: Len Brown <len.brown@intel.com>

Linux 3.10 changed the timing of how thread_info->flags is touched:

	x86: Use generic idle loop
	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.

Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.

Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.

While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.

ref: https://lkml.org/lkml/2013/12/7/22
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
---
 arch/x86/kernel/cpu/intel.c | 3 ++-
 drivers/idle/intel_idle.c   | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..ea04b34 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
+	if (c->x86 == 6 && cpu_has_clflush &&
+	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
 		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
 
 #ifdef CONFIG_X86_64
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..f80b700 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())
-- 
1.8.5.1.19.gdaad3aa

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-18 21:44   ` Len Brown
  (?)
@ 2013-12-19 12:22   ` Ingo Molnar
  2013-12-19 14:40     ` H. Peter Anvin
  2013-12-19 15:55     ` H. Peter Anvin
  -1 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 12:22 UTC (permalink / raw)
  To: Len Brown
  Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov


* Len Brown <lenb@kernel.org> wrote:

> From: Len Brown <len.brown@intel.com>
> 
> Linux 3.10 changed the timing of how thread_info->flags is touched:
> 
> 	x86: Use generic idle loop
> 	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
> 
> This caused Intel NHM-EX and WSM-EX servers to experience a large number
> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
> from deep C-states to shallow C-states, which caused these platforms
> to experience a significant increase in idle power.
> 
> Note that this issue was already present before the commit above,
> however, it wasn't seen often enough to be noticed in power measurements.
> 
> Here we extend an errata workaround from the Core2 EX "Dunnington"
> to extend to NHM-EX and WSM-EX, to prevent these immediate
> returns from MWAIT, reducing idle power on these platforms.
> 
> While only acpi_idle ran on Dunnington, intel_idle
> may also run on these two newer systems.
> As of today, there are no other models that are known
> to need this tweak.
> 
> ref: https://lkml.org/lkml/2013/12/7/22
> Signed-off-by: Len Brown <len.brown@intel.com>
> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
> ---
>  arch/x86/kernel/cpu/intel.c | 3 ++-
>  drivers/idle/intel_idle.c   | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0d..ea04b34 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>  	}
>  
> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
> +	if (c->x86 == 6 && cpu_has_clflush &&
> +	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
>  		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
>  
>  #ifdef CONFIG_X86_64
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 92d1206..f80b700 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>  
>  	if (!current_set_polling_and_test()) {
>  
> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +			clflush((void *)&current_thread_info()->flags);
> +
>  		__monitor((void *)&current_thread_info()->flags, 0, 0);

I don't think either of these casts to '(void *)' is needed, both the 
clflush() and __monitor() will take pointers.

Looks good to me otherwise - except that maybe the best way to 
represent this quirk would be for the CLFLUSH+MONITOR sequence to be a 
single 'instruction' which is patched in dynamically during bootup, 
using our usual alternatives framework.

On non-affected CPUs a NOP would remain in place of the CLFLUSH, 
eliminating the branch above.

So the whole thing could be thought of as a slightly more complex 
'monitor' instruction - not exposing the quirk details to actual usage 
sites.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 12:22   ` Ingo Molnar
@ 2013-12-19 14:40     ` H. Peter Anvin
  2013-12-19 15:45       ` Borislav Petkov
  2013-12-19 15:55     ` H. Peter Anvin
  1 sibling, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 14:40 UTC (permalink / raw)
  To: Ingo Molnar, Len Brown
  Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov

... or just use static_cpu_has() maybe?

Ingo Molnar <mingo@kernel.org> wrote:
>
>* Len Brown <lenb@kernel.org> wrote:
>
>> From: Len Brown <len.brown@intel.com>
>> 
>> Linux 3.10 changed the timing of how thread_info->flags is touched:
>> 
>> 	x86: Use generic idle loop
>> 	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
>> 
>> This caused Intel NHM-EX and WSM-EX servers to experience a large
>number
>> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to
>demote
>> from deep C-states to shallow C-states, which caused these platforms
>> to experience a significant increase in idle power.
>> 
>> Note that this issue was already present before the commit above,
>> however, it wasn't seen often enough to be noticed in power
>measurements.
>> 
>> Here we extend an errata workaround from the Core2 EX "Dunnington"
>> to extend to NHM-EX and WSM-EX, to prevent these immediate
>> returns from MWAIT, reducing idle power on these platforms.
>> 
>> While only acpi_idle ran on Dunnington, intel_idle
>> may also run on these two newer systems.
>> As of today, there are no other models that are known
>> to need this tweak.
>> 
>> ref: https://lkml.org/lkml/2013/12/7/22
>> Signed-off-by: Len Brown <len.brown@intel.com>
>> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
>> ---
>>  arch/x86/kernel/cpu/intel.c | 3 ++-
>>  drivers/idle/intel_idle.c   | 3 +++
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/kernel/cpu/intel.c
>b/arch/x86/kernel/cpu/intel.c
>> index dc1ec0d..ea04b34 100644
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
>>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>>  	}
>>  
>> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
>> +	if (c->x86 == 6 && cpu_has_clflush &&
>> +	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model ==
>47))
>>  		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
>>  
>>  #ifdef CONFIG_X86_64
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 92d1206..f80b700 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>>  
>>  	if (!current_set_polling_and_test()) {
>>  
>> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>> +			clflush((void *)&current_thread_info()->flags);
>> +
>>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
>
>I don't think either of these casts to '(void *)' is needed, both the 
>clflush() and __monitor() will take pointers.
>
>Looks good to me otherwise - except that maybe the best way to 
>represent this quirk would be for the CLFLUSH+MONITOR sequence to be a 
>single 'instruction' which is patched in dynamically during bootup, 
>using our usual alternatives framework.
>
>On non-affected CPUs a NOP would remain in place of the CLFLUSH, 
>eliminating the branch above.
>
>So the whole thing could be thought of as a slightly more complex 
>'monitor' instruction - not exposing the quirk details to actual usage 
>sites.
>
>Thanks,
>
>	Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 14:40     ` H. Peter Anvin
@ 2013-12-19 15:45       ` Borislav Petkov
  0 siblings, 0 replies; 96+ messages in thread
From: Borislav Petkov @ 2013-12-19 15:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith

On Thu, Dec 19, 2013 at 06:40:41AM -0800, H. Peter Anvin wrote:
> ... or just use static_cpu_has() maybe?

Right, if we can get the compiler to generate the shortest CLFLUSH of 3
bytes with register indirect addressing for the operand, then stomping
over those 3 bytes with the alternatives would be the fastest solution.

If, of course, three NOPs which get discarded in the front end are
cheaper than the near JMP from static_cpu_has.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 12:22   ` Ingo Molnar
  2013-12-19 14:40     ` H. Peter Anvin
@ 2013-12-19 15:55     ` H. Peter Anvin
  2013-12-19 16:02       ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 15:55 UTC (permalink / raw)
  To: Ingo Molnar, Len Brown
  Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov

On 12/19/2013 04:22 AM, Ingo Molnar wrote:
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 92d1206..f80b700 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>>  
>>  	if (!current_set_polling_and_test()) {
>>  
>> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>> +			clflush((void *)&current_thread_info()->flags);
>> +
>>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> 
> I don't think either of these casts to '(void *)' is needed, both the 
> clflush() and __monitor() will take pointers.

__monitor() currently doesn't, which is idiotic.

	-hpa


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 15:55     ` H. Peter Anvin
@ 2013-12-19 16:02       ` Ingo Molnar
  2013-12-19 16:09         ` H. Peter Anvin
  2013-12-19 16:13         ` H. Peter Anvin
  0 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 16:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 04:22 AM, Ingo Molnar wrote:
> >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> >> index 92d1206..f80b700 100644
> >> --- a/drivers/idle/intel_idle.c
> >> +++ b/drivers/idle/intel_idle.c
> >> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
> >>  
> >>  	if (!current_set_polling_and_test()) {
> >>  
> >> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> >> +			clflush((void *)&current_thread_info()->flags);
> >> +
> >>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> > 
> > I don't think either of these casts to '(void *)' is needed, both the 
> > clflush() and __monitor() will take pointers.
> 
> __monitor() currently doesn't, which is idiotic.

Hm, __monitor() seems to take a void *:

 arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx,

So writing:

		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
			clflush(&current_thread_info()->flags);

  		__monitor(&current_thread_info()->flags, 0, 0);

ought to work just fine.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:02       ` Ingo Molnar
@ 2013-12-19 16:09         ` H. Peter Anvin
  2013-12-19 16:13         ` H. Peter Anvin
  1 sibling, 0 replies; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 16:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 08:02 AM, Ingo Molnar wrote:
>>
>> __monitor() currently doesn't, which is idiotic.
> 
> Hm, __monitor() seems to take a void *:
> 
>  arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx,
> 
> So writing:
> 
> 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> 			clflush(&current_thread_info()->flags);
> 
>   		__monitor(&current_thread_info()->flags, 0, 0);
> 

Yes, brain failure on my part.

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:02       ` Ingo Molnar
  2013-12-19 16:09         ` H. Peter Anvin
@ 2013-12-19 16:13         ` H. Peter Anvin
  2013-12-19 16:21           ` Peter Zijlstra
  1 sibling, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 16:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]

How does this look?  Completely untested, of course.

I do wonder if we need more memory barriers, though.

An alternative would be to move everything into mwait_idle_with_hints().

	-hpa


[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1626 bytes --]

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4057f9..6dce588f94b4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
 		     :: "a" (eax), "c" (ecx));
 }
 
+/*
+ * Issue a clflush in preparation for a monitor instruction if the CPU
+ * needs it.  We force the address into the ax register to get a fixed
+ * length for the instruction, however, this is what the monitor instruction
+ * is going to need anyway, so it shouldn't add any additional code.
+ */
+static inline void clflush_monitor(const void *addr, unsigned long ecx,
+				   unsigned long edx)
+{
+	alternative_input(ASM_NOP3,
+			  "clflush (%0)",
+			  X86_FEATURE_CLFLUSH_MONITOR,
+			  "a" (addr));
+	__monitor(addr, eax, edx);
+	smp_mb();
+}
+
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27781bc..b14d02354134 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
+		clflush_monitor(&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__mwait(ax, cx);
 	}

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:13         ` H. Peter Anvin
@ 2013-12-19 16:21           ` Peter Zijlstra
  2013-12-19 16:50             ` H. Peter Anvin
  0 siblings, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-19 16:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 08:13:21AM -0800, H. Peter Anvin wrote:
> How does this look?  Completely untested, of course.
> 
> I do wonder if we need more memory barriers, though.
> 
> An alternative would be to move everything into mwait_idle_with_hints().
> 
> 	-hpa
> 

> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7b034a4057f9..6dce588f94b4 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
>  		     :: "a" (eax), "c" (ecx));
>  }
>  
> +/*
> + * Issue a clflush in preparation for a monitor instruction if the CPU
> + * needs it.  We force the address into the ax register to get a fixed
> + * length for the instruction, however, this is what the monitor instruction
> + * is going to need anyway, so it shouldn't add any additional code.
> + */
> +static inline void clflush_monitor(const void *addr, unsigned long ecx,
> +				   unsigned long edx)
> +{
> +	alternative_input(ASM_NOP3,
> +			  "clflush (%0)",
> +			  X86_FEATURE_CLFLUSH_MONITOR,
> +			  "a" (addr));
> +	__monitor(addr, eax, edx);
> +	smp_mb();
> +}

What's that mb for?

Also, can you please merge:

  http://marc.info/?l=linux-kernel&m=138685838420632

Thomas said he would pick up that series, but seems to have gone missing
the past week or so.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:21           ` Peter Zijlstra
@ 2013-12-19 16:50             ` H. Peter Anvin
  2013-12-19 17:07               ` Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> 
> What's that mb for?
> 

It already exists in mwait_idle_with_hints(); I just moved it into this
common function.  It is a bit odd, I have to admit; it seems like it
should be *before* the monitor (and possibly we should have one after
the CLFLUSH as well?)

> Also, can you please merge:
> 
>   http://marc.info/?l=linux-kernel&m=138685838420632
> 
> Thomas said he would pick up that series, but seems to have gone missing
> the past week or so.
> 

We need to get the immediate regression fixed.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:50             ` H. Peter Anvin
@ 2013-12-19 17:07               ` Ingo Molnar
  2013-12-19 17:25                 ` Peter Zijlstra
  2013-12-19 18:09                 ` H. Peter Anvin
  0 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 17:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > 
> > What's that mb for?
> > 
> 
> It already exists in mwait_idle_with_hints(); I just moved it into 
> this common function.  It is a bit odd, I have to admit; it seems 
> like it should be *before* the monitor (and possibly we should have 
> one after the CLFLUSH as well?)

Yes, I think we need a barrier before the CLFLUSH, because according 
to my reading of the Intel documentation CLFLUSH has no implicit 
ordering so it might get reordered with the store to ->flags in 
current_set_polling_and_test(), which might result in spurious wakeup 
problems again.

(And CLFLUSH is a store in a sense, so special in that the regular 
ordering for stores does not apply.)

Likewise, having a barrier before the MONITOR looks sensible as well. 
Having it _after_ monitor looks weird and is probably wrong. [It might 
have been the effects of someone seeing the spurious wakeup problems 
with realizing the true source, or so.]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:07               ` Ingo Molnar
@ 2013-12-19 17:25                 ` Peter Zijlstra
  2013-12-19 17:36                   ` Peter Zijlstra
                                     ` (2 more replies)
  2013-12-19 18:09                 ` H. Peter Anvin
  1 sibling, 3 replies; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-19 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov

On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> 
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
> > On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > > 
> > > What's that mb for?
> > > 
> > 
> > It already exists in mwait_idle_with_hints(); I just moved it into 
> > this common function.  It is a bit odd, I have to admit; it seems 
> > like it should be *before* the monitor (and possibly we should have 
> > one after the CLFLUSH as well?)
> 
> Yes, I think we need a barrier before the CLFLUSH, because according 
> to my reading of the Intel documentation CLFLUSH has no implicit 
> ordering so it might get reordered with the store to ->flags in 
> current_set_polling_and_test(), which might result in spurious wakeup 
> problems again.

No it cannot; since current_set_polling_and_test() already has a barrier
to prevent that.

Also, the location patched by hpa doesn't actually call that at all.

That said, I would find it very strange indeed if a CLFLUSH doesn't also
flush the store buffer.

> (And CLFLUSH is a store in a sense, so special in that the regular 
> ordering for stores does not apply.)
> 
> Likewise, having a barrier before the MONITOR looks sensible as well. 
> Having it _after_ monitor looks weird and is probably wrong. [It might 
> have been the effects of someone seeing the spurious wakeup problems 
> with realizing the true source, or so.]

I again have to disagree, one would expect monitor to flush all that is
required to start the monitor -- and it actually does so. As is
testified by this extra CLFLUSH being called a bug workaround.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:25                 ` Peter Zijlstra
@ 2013-12-19 17:36                   ` Peter Zijlstra
  2013-12-19 18:05                     ` H. Peter Anvin
  2013-12-19 17:50                   ` Peter Zijlstra
  2013-12-19 18:10                   ` Ingo Molnar
  2 siblings, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-19 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov

On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> That said, I would find it very strange indeed if a CLFLUSH doesn't also
> flush the store buffer.

OK, it explicitly states it does not do that and you indeed need an
mfence before the clflush.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:25                 ` Peter Zijlstra
  2013-12-19 17:36                   ` Peter Zijlstra
@ 2013-12-19 17:50                   ` Peter Zijlstra
  2013-12-19 18:18                     ` Ingo Molnar
  2013-12-19 18:10                   ` Ingo Molnar
  2 siblings, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-19 17:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov

On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > Likewise, having a barrier before the MONITOR looks sensible as well. 
> 
> I again have to disagree, one would expect monitor to flush all that is
> required to start the monitor -- and it actually does so. As is
> testified by this extra CLFLUSH being called a bug workaround.

SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot pass
a previous STORE to the same address.

That said; there's enough holes in there to swim a titanic through,
seeing how MONITOR stares at an entire cacheline and LOAD/STORE order is
specified on location, whatever that means.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:36                   ` Peter Zijlstra
@ 2013-12-19 18:05                     ` H. Peter Anvin
  2013-12-19 18:14                       ` Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov

On 12/19/2013 09:36 AM, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
>> That said, I would find it very strange indeed if a CLFLUSH doesn't also
>> flush the store buffer.
> 
> OK, it explicitly states it does not do that and you indeed need an
> mfence before the clflush.
> 

So, MONITOR is defined to be ordered as a load, which I think should be
adequate, but I really wonder if we should have mfence on both sides of
clflush.  This now is up to 9 bytes, and perhaps pushing it a bit with
how much we would be willing to patch out.

On the other hand - the CLFLUSH seems to have worked well enough by
itself, and this is all probabilistic anyway, so perhaps we should just
leave the naked CLFLUSH in and not worry about it unless measurements
say otherwise?

	-hpa


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:07               ` Ingo Molnar
  2013-12-19 17:25                 ` Peter Zijlstra
@ 2013-12-19 18:09                 ` H. Peter Anvin
  2013-12-19 18:19                   ` H. Peter Anvin
  2013-12-19 18:19                   ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar
  1 sibling, 2 replies; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 18:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov

On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> 
> Likewise, having a barrier before the MONITOR looks sensible as well. 
> Having it _after_ monitor looks weird and is probably wrong. [It might 
> have been the effects of someone seeing the spurious wakeup problems 
> with realizing the true source, or so.]
> 

Does anyone know the history of this barrier after the monitor?  I know
Len is looking for a minimal patchset that can go into -stable, and it
seems prudent to not preturb the code more than necessary, but going
forward it would be nice to know...

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:25                 ` Peter Zijlstra
  2013-12-19 17:36                   ` Peter Zijlstra
  2013-12-19 17:50                   ` Peter Zijlstra
@ 2013-12-19 18:10                   ` Ingo Molnar
  2 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > 
> > * H. Peter Anvin <hpa@zytor.com> wrote:
> > 
> > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > > > 
> > > > What's that mb for?
> > > > 
> > > 
> > > It already exists in mwait_idle_with_hints(); I just moved it into 
> > > this common function.  It is a bit odd, I have to admit; it seems 
> > > like it should be *before* the monitor (and possibly we should have 
> > > one after the CLFLUSH as well?)
> > 
> > Yes, I think we need a barrier before the CLFLUSH, because according 
> > to my reading of the Intel documentation CLFLUSH has no implicit 
> > ordering so it might get reordered with the store to ->flags in 
> > current_set_polling_and_test(), which might result in spurious wakeup 
> > problems again.
> 
> No it cannot; since current_set_polling_and_test() already has a 
> barrier to prevent that.

See below:

> Also, the location patched by hpa doesn't actually call that at all.
> 
> That said, I would find it very strange indeed if a CLFLUSH doesn't 
> also flush the store buffer.

So, the Intel documentation says (sorry about the lazy-link):

  http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc31.htm

 "CLFLUSH is only ordered by the MFENCE instruction. It is not 
  guaranteed to be ordered by any other fencing, serializing or other 
  CLFLUSH instruction. For example, software can use an MFENCE 
  instruction to insure that previous stores are included in the 
  write-back."

So a specific MFENCE barrier is needed.

Also note that this wording excludes implicit serialization such as 
LOCK prefix or XCHG barriers. As it happens 
current_set_polling_and_test() uses smp_mb(), which happens to map to 
MFENCE on all CPUs that can do CLFLUSH, but that's really just an 
accident and in no way engineered.

_At minimum_ we need a prominent comment at the clflush usage site 
that we rely on the MFENCE in current_set_polling_and_test() ...

> > (And CLFLUSH is a store in a sense, so special in that the regular 
> > ordering for stores does not apply.)
> > 
> > Likewise, having a barrier before the MONITOR looks sensible as 
> > well. Having it _after_ monitor looks weird and is probably wrong. 
> > [It might have been the effects of someone seeing the spurious 
> > wakeup problems with realizing the true source, or so.]
> 
> I again have to disagree, one would expect monitor to flush all that 
> is required to start the monitor -- and it actually does so. As is 
> testified by this extra CLFLUSH being called a bug workaround.

This assumption would be safer - although AFAICS the Intel 
MONITOR/MWAIT documentation is quiet about this aspect.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:05                     ` H. Peter Anvin
@ 2013-12-19 18:14                       ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 09:36 AM, Peter Zijlstra wrote:
> > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> >> That said, I would find it very strange indeed if a CLFLUSH doesn't also
> >> flush the store buffer.
> > 
> > OK, it explicitly states it does not do that and you indeed need 
> > an mfence before the clflush.
> 
> So, MONITOR is defined to be ordered as a load, which I think should 
> be adequate, but I really wonder if we should have mfence on both 
> sides of clflush.  This now is up to 9 bytes, and perhaps pushing it 
> a bit with how much we would be willing to patch out.
> 
> On the other hand - the CLFLUSH seems to have worked well enough by 
> itself, and this is all probabilistic anyway, so perhaps we should 
> just leave the naked CLFLUSH in and not worry about it unless 
> measurements say otherwise?

So I think the window of breakage was rather large here, and since it 
seems to trigger on rare types of hardware I think we'd be better off 
by erring on the side of robustness this time around ...

This is the 'go to idle' path, which isn't as time-critical as the 
'get out of idle' code path.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:50                   ` Peter Zijlstra
@ 2013-12-19 18:18                     ` Ingo Molnar
  2013-12-19 21:05                       ` H. Peter Anvin
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > > Likewise, having a barrier before the MONITOR looks sensible as well. 
> > 
> > I again have to disagree, one would expect monitor to flush all that is
> > required to start the monitor -- and it actually does so. As is
> > testified by this extra CLFLUSH being called a bug workaround.
> 
> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot 
> pass a previous STORE to the same address.

Yes ... but you could argue that CLFLUSH is neither a load nor a 
store, it's a _cache sync_ operation, with its special ordering 
properties.

> That said; there's enough holes in there to swim a titanic through, 
> seeing how MONITOR stares at an entire cacheline and LOAD/STORE 
> order is specified on location, whatever that means.

I think assuming that MONITOR is ordered as a load or better is a 
pretty safe one (and in fact the Intel documentation seems to say so) 
- I'd say MONITOR is in micro-code and essentially snoops on cache 
events on that specific cache line, and loads the cache line on a 
snoop hit?

Btw., what state is the cache line after a MONITOR instruction, is it 
loaded as shared, or as excusive? (exclusive would probably be better 
for performance.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:09                 ` H. Peter Anvin
@ 2013-12-19 18:19                   ` H. Peter Anvin
  2013-12-19 18:23                     ` Ingo Molnar
  2013-12-19 18:19                   ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 18:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov

On 12/19/2013 10:09 AM, H. Peter Anvin wrote:
> On 12/19/2013 09:07 AM, Ingo Molnar wrote:
>>
>> Likewise, having a barrier before the MONITOR looks sensible as well. 
>> Having it _after_ monitor looks weird and is probably wrong. [It might 
>> have been the effects of someone seeing the spurious wakeup problems 
>> with realizing the true source, or so.]
>>
> 
> Does anyone know the history of this barrier after the monitor?  I know
> Len is looking for a minimal patchset that can go into -stable, and it
> seems prudent to not preturb the code more than necessary, but going
> forward it would be nice to know...
> 

Hmm... it *looks* like it is intended to be part of the construct:

	smp_mb();
	if (!need_resched())
		...

I found a note in the HLT variant of the function saying:

/*
 * TS_POLLING-cleared state must be visible before we
 * test NEED_RESCHED:
 */

... which presumably has been copied elsewhere.

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:09                 ` H. Peter Anvin
  2013-12-19 18:19                   ` H. Peter Anvin
@ 2013-12-19 18:19                   ` Ingo Molnar
  2013-12-19 19:22                     ` H. Peter Anvin
  1 sibling, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> > 
> > Likewise, having a barrier before the MONITOR looks sensible as 
> > well. Having it _after_ monitor looks weird and is probably wrong. 
> > [It might have been the effects of someone seeing the spurious 
> > wakeup problems with realizing the true source, or so.]
> 
> Does anyone know the history of this barrier after the monitor?  I 
> know Len is looking for a minimal patchset that can go into -stable, 
> and it seems prudent to not preturb the code more than necessary, 
> but going forward it would be nice to know...

For the minimal fix I don't think we should change it - but for v3.14 
it looks like a speedup for the from-idle codepath, which is 
performance sensitive.

( It would also be nice to know whether MONITOR loads that cacheline 
  into the CPUs cache, and in what state it loads it. )

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:19                   ` H. Peter Anvin
@ 2013-12-19 18:23                     ` Ingo Molnar
       [not found]                       ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 10:09 AM, H. Peter Anvin wrote:
> > On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> >>
> >> Likewise, having a barrier before the MONITOR looks sensible as well. 
> >> Having it _after_ monitor looks weird and is probably wrong. [It might 
> >> have been the effects of someone seeing the spurious wakeup problems 
> >> with realizing the true source, or so.]
> >>
> > 
> > Does anyone know the history of this barrier after the monitor?  I know
> > Len is looking for a minimal patchset that can go into -stable, and it
> > seems prudent to not preturb the code more than necessary, but going
> > forward it would be nice to know...
> > 
> 
> Hmm... it *looks* like it is intended to be part of the construct:
> 
> 	smp_mb();
> 	if (!need_resched())
> 		...
> 
> I found a note in the HLT variant of the function saying:
> 
> /*
>  * TS_POLLING-cleared state must be visible before we
>  * test NEED_RESCHED:
>  */

Yes, that makes sense: the need_resched test is a load, and MONITOR is 
a load as well. Can the two ever cross, or does the CPU guarantee that 
because it's the same address, the loads don't cross?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
       [not found]                       ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
@ 2013-12-19 18:43                           ` Ingo Molnar
  2013-12-19 20:09                         ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin
  2013-12-19 20:33                         ` [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() tip-bot for H. Peter Anvin
  2 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	linux-pm, x86, Peter Zijlstra, Len Brown, H. Peter Anvin, stable,
	linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The x86 memory rules are that two loads always execute in order (ie 
> rmb is a no-op).
> 
> So I see no reason for a memory barrier after the monitor. [...]

Yes, I'm leaning towards that interpretation as well, but the reason 
I'm a bit catious is the somewhat curious (to me!) wording of the 
MONITOR instruction:

  Sets up a linear address range to be monitored by hardware and 
  activates the monitor.

  ...

  The MONITOR instruction is ordered as a load operation with respect 
  to other memory trans­actions. The instruction can be used at all 
  privilege levels and is subject to all permission checking and 
  faults associated with a byte load. Like a load, the MONITOR 
  instruction sets the A-bit but not the D-bit in page tables.

Where apparently the 'range' means 'full cache line surrounding the 
memory address in question'.

We have no other load instructions that operate on such a large 
'range' of addresses, and I wanted to make sure it's a true (single 
byte) load for that specific address. The documentation does not 
appear to explicitly state that it's a load for that address - only 
that it's ordered as a load.

The reason I'm asking is because 'flags' itself might not be at the 
beginning of the cache line, as it's in the middle of thread_info:

 struct thread_info {
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   flags;          /* low level flags */

while 'MONITOR' appears to work on the cache line. So are all 
addresses within that cache line ordered? Only the specific address 
given to the instruction itself? Only the first word of the cacheline 
itself?

The documentation is a bit vague, at least in my reading, and 
depending on which actual word the instruction reads (if it reads any 
word at all ... it's probably just setting up an address for MWAIT) 
from that cacheline, its ordering properties might be surprising.

> [...] But both sides of clflush sounds sane, and as mentioned the 
> "go to sleep" side isn't as critical as the "wake up" side if the 
> monitor.

Yeah.

> Please let's just make that pre-monitor hack be a static key, and do 
> mfence explicitly around the clflush inside that conditional 
> section.

Agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
@ 2013-12-19 18:43                           ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	linux-pm, x86, Peter Zijlstra, Len Brown, H. Peter Anvin, stable,
	linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The x86 memory rules are that two loads always execute in order (ie 
> rmb is a no-op).
> 
> So I see no reason for a memory barrier after the monitor. [...]

Yes, I'm leaning towards that interpretation as well, but the reason 
I'm a bit catious is the somewhat curious (to me!) wording of the 
MONITOR instruction:

  Sets up a linear address range to be monitored by hardware and 
  activates the monitor.

  ...

  The MONITOR instruction is ordered as a load operation with respect 
  to other memory trans�actions. The instruction can be used at all 
  privilege levels and is subject to all permission checking and 
  faults associated with a byte load. Like a load, the MONITOR 
  instruction sets the A-bit but not the D-bit in page tables.

Where apparently the 'range' means 'full cache line surrounding the 
memory address in question'.

We have no other load instructions that operate on such a large 
'range' of addresses, and I wanted to make sure it's a true (single 
byte) load for that specific address. The documentation does not 
appear to explicitly state that it's a load for that address - only 
that it's ordered as a load.

The reason I'm asking is because 'flags' itself might not be at the 
beginning of the cache line, as it's in the middle of thread_info:

 struct thread_info {
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   flags;          /* low level flags */

while 'MONITOR' appears to work on the cache line. So are all 
addresses within that cache line ordered? Only the specific address 
given to the instruction itself? Only the first word of the cacheline 
itself?

The documentation is a bit vague, at least in my reading, and 
depending on which actual word the instruction reads (if it reads any 
word at all ... it's probably just setting up an address for MWAIT) 
from that cacheline, its ordering properties might be surprising.

> [...] But both sides of clflush sounds sane, and as mentioned the 
> "go to sleep" side isn't as critical as the "wake up" side if the 
> monitor.

Yeah.

> Please let's just make that pre-monitor hack be a static key, and do 
> mfence explicitly around the clflush inside that conditional 
> section.

Agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:19                   ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar
@ 2013-12-19 19:22                     ` H. Peter Anvin
  2013-12-19 19:27                       ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 19:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov

On 12/19/2013 10:19 AM, Ingo Molnar wrote:
> 
> ( It would also be nice to know whether MONITOR loads that cacheline 
>   into the CPUs cache, and in what state it loads it. )
> 

I would assume that is implementation-dependent.  However, one plausible
implementation is to load the cache line into the cache in shared state
and monitor for evictions.

	-hpa



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 19:22                     ` H. Peter Anvin
@ 2013-12-19 19:27                       ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-19 19:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 11:22:01AM -0800, H. Peter Anvin wrote:
> On 12/19/2013 10:19 AM, Ingo Molnar wrote:
> > 
> > ( It would also be nice to know whether MONITOR loads that cacheline 
> >   into the CPUs cache, and in what state it loads it. )
> > 
> 
> I would assume that is implementation-dependent.  However, one plausible
> implementation is to load the cache line into the cache in shared state
> and monitor for evictions.

I suppose the monitor part is the important part because certain C
states drop all cache, presumably including the cacheline we're actually
monitoring.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2013-12-18 21:44   ` Len Brown
  (?)
  (?)
@ 2013-12-19 19:51   ` tip-bot for Len Brown
  2014-03-18  0:20     ` Davidlohr Bueso
  -1 siblings, 1 reply; 96+ messages in thread
From: tip-bot for Len Brown @ 2013-12-19 19:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, tglx, hpa, len.brown

Commit-ID:  40e2d7f9b5dae048789c64672bf3027fbb663ffa
Gitweb:     http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa
Author:     Len Brown <len.brown@intel.com>
AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Thu, 19 Dec 2013 11:47:39 -0800

x86 idle: Repair large-server 50-watt idle-power regression

Linux 3.10 changed the timing of how thread_info->flags is touched:

	x86: Use generic idle loop
	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.

Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.

Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.

While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.

Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel.c | 3 ++-
 drivers/idle/intel_idle.c   | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..ea04b34 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
+	if (c->x86 == 6 && cpu_has_clflush &&
+	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
 		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
 
 #ifdef CONFIG_X86_64
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..f80b700 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
       [not found]                       ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
  2013-12-19 18:43                           ` Ingo Molnar
@ 2013-12-19 20:09                         ` tip-bot for H. Peter Anvin
  2013-12-19 20:40                           ` Ingo Molnar
  2013-12-19 20:33                         ` [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() tip-bot for H. Peter Anvin
  2 siblings, 1 reply; 96+ messages in thread
From: tip-bot for H. Peter Anvin @ 2013-12-19 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, tglx, hpa, len.brown

Commit-ID:  7e98b71920464b8d15fa95c74366416cd3c88861
Gitweb:     http://git.kernel.org/tip/7e98b71920464b8d15fa95c74366416cd3c88861
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Thu, 19 Dec 2013 11:58:16 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Thu, 19 Dec 2013 11:58:16 -0800

x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers

Use static_cpu_has() to conditionalize the CLFLUSH workaround, and add
memory barriers around it since the documentation is explicit that
CLFLUSH is only ordered with respect to MFENCE.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com
---
 arch/x86/include/asm/mwait.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 361b02e..19b71c4 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -43,8 +43,11 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
 static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
 {
 	if (!current_set_polling_and_test()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
+			mb();
 			clflush((void *)&current_thread_info()->flags);
+			mb();
+		}
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		if (!need_resched())

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead()
       [not found]                       ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
  2013-12-19 18:43                           ` Ingo Molnar
  2013-12-19 20:09                         ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin
@ 2013-12-19 20:33                         ` tip-bot for H. Peter Anvin
  2 siblings, 0 replies; 96+ messages in thread
From: tip-bot for H. Peter Anvin @ 2013-12-19 20:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, tglx, hpa, len.brown

Commit-ID:  7d590cca7cd2cce4ed7c47d221d6f90566653ba8
Gitweb:     http://git.kernel.org/tip/7d590cca7cd2cce4ed7c47d221d6f90566653ba8
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Thu, 19 Dec 2013 12:30:03 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Thu, 19 Dec 2013 12:30:03 -0800

x86, idle: Add memory barriers around clflush in mwait_play_dead()

For consistency with mwait_idle_with_hints().  Not sure they help, but
they really won't hurt...

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com
---
 arch/x86/kernel/smpboot.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85dc05a..f5252c4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1417,7 +1417,9 @@ static inline void mwait_play_dead(void)
 		 * The WBINVD is insufficient due to the spurious-wakeup
 		 * case where we return around the loop.
 		 */
+		mb();
 		clflush(mwait_ptr);
+		mb();
 		__monitor(mwait_ptr, 0, 0);
 		mb();
 		__mwait(eax, 0);

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
  2013-12-19 20:09                         ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin
@ 2013-12-19 20:40                           ` Ingo Molnar
  2013-12-19 20:46                             ` Linus Torvalds
  2013-12-19 21:55                             ` Peter Zijlstra
  0 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 20:40 UTC (permalink / raw)
  To: hpa, linux-kernel, torvalds, peterz, tglx, hpa, len.brown
  Cc: linux-tip-commits


* tip-bot for H. Peter Anvin <tipbot@zytor.com> wrote:

> Commit-ID:  7e98b71920464b8d15fa95c74366416cd3c88861
> Gitweb:     http://git.kernel.org/tip/7e98b71920464b8d15fa95c74366416cd3c88861
> Author:     H. Peter Anvin <hpa@linux.intel.com>
> AuthorDate: Thu, 19 Dec 2013 11:58:16 -0800
> Committer:  H. Peter Anvin <hpa@linux.intel.com>
> CommitDate: Thu, 19 Dec 2013 11:58:16 -0800
> 
> x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
> 
> Use static_cpu_has() to conditionalize the CLFLUSH workaround, and add
> memory barriers around it since the documentation is explicit that
> CLFLUSH is only ordered with respect to MFENCE.
> 
> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Len Brown <len.brown@intel.com>
> Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com
> ---
>  arch/x86/include/asm/mwait.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 361b02e..19b71c4 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -43,8 +43,11 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
>  static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
>  {
>  	if (!current_set_polling_and_test()) {
> -		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> +			mb();
>  			clflush((void *)&current_thread_info()->flags);
> +			mb();
> +		}

So since this is a v3.14 commit, lets get it right, ok?

The first mb() looks superfluous, because see 
current_set_polling_and_test():

static inline bool __must_check current_set_polling_and_test(void)
{
        __current_set_polling();

        /*
         * Polling state must be visible before we test NEED_RESCHED,
         * paired by resched_task()
         */
        smp_mb();

        return unlikely(tif_need_resched());
}

So it already has a (MFENCE) barrier after ->flags is modified.

So using a comment - something like the patch below, should be enough 
as well, agreed?

Thanks,

	Ingo

Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 19b71c4..d4f5c9a 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -44,7 +44,10 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
 {
 	if (!current_set_polling_and_test()) {
 		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
-			mb();
+			/*
+			 * There's no need for an extra barrier here: current_set_polling_and_test()
+			 * already has an smp_mb() after ->flags gets modified, see sched.h.
+			 */
 			clflush((void *)&current_thread_info()->flags);
 			mb();
 		}

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
  2013-12-19 20:40                           ` Ingo Molnar
@ 2013-12-19 20:46                             ` Linus Torvalds
  2013-12-19 21:14                               ` Ingo Molnar
  2013-12-19 21:55                             ` Peter Zijlstra
  1 sibling, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2013-12-19 20:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra,
	Thomas Gleixner, Peter Anvin, Len Brown, linux-tip-commits

On Fri, Dec 20, 2013 at 5:40 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> The first mb() looks superfluous, because see
> current_set_polling_and_test():
>
> static inline bool __must_check current_set_polling_and_test(void)
> {
>         __current_set_polling();
>
>         /*
>          * Polling state must be visible before we test NEED_RESCHED,
>          * paired by resched_task()
>          */
>         smp_mb();
>
>         return unlikely(tif_need_resched());
> }
>
> So it already has a (MFENCE) barrier after ->flags is modified.

So what?

The mb() isn't necessarily against the *write*, it is also against the *read*.

If the cflush is needed before the monitor, it's likely because the
monitor instruction has some bug with already-existing cachelines in
shared state or whatever.

Which means that we want to make sure that the cflush is ordered wrt
*reads* from that cacheline too.

cflush has nothing specifically to do with writes.

             Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:18                     ` Ingo Molnar
@ 2013-12-19 21:05                       ` H. Peter Anvin
  2013-12-19 21:17                         ` Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: H. Peter Anvin @ 2013-12-19 21:05 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov

CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE according to the SDM.

Ingo Molnar <mingo@kernel.org> wrote:
>
>* Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
>> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
>> > > Likewise, having a barrier before the MONITOR looks sensible as
>well. 
>> > 
>> > I again have to disagree, one would expect monitor to flush all
>that is
>> > required to start the monitor -- and it actually does so. As is
>> > testified by this extra CLFLUSH being called a bug workaround.
>> 
>> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot 
>> pass a previous STORE to the same address.
>
>Yes ... but you could argue that CLFLUSH is neither a load nor a 
>store, it's a _cache sync_ operation, with its special ordering 
>properties.
>
>> That said; there's enough holes in there to swim a titanic through, 
>> seeing how MONITOR stares at an entire cacheline and LOAD/STORE 
>> order is specified on location, whatever that means.
>
>I think assuming that MONITOR is ordered as a load or better is a 
>pretty safe one (and in fact the Intel documentation seems to say so) 
>- I'd say MONITOR is in micro-code and essentially snoops on cache 
>events on that specific cache line, and loads the cache line on a 
>snoop hit?
>
>Btw., what state is the cache line after a MONITOR instruction, is it 
>loaded as shared, or as excusive? (exclusive would probably be better 
>for performance.)
>
>Thanks,
>
>	Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
  2013-12-19 20:46                             ` Linus Torvalds
@ 2013-12-19 21:14                               ` Ingo Molnar
  2013-12-19 21:25                                 ` Linus Torvalds
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra,
	Thomas Gleixner, Peter Anvin, Len Brown, linux-tip-commits


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Dec 20, 2013 at 5:40 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > The first mb() looks superfluous, because see
> > current_set_polling_and_test():
> >
> > static inline bool __must_check current_set_polling_and_test(void)
> > {
> >         __current_set_polling();
> >
> >         /*
> >          * Polling state must be visible before we test NEED_RESCHED,
> >          * paired by resched_task()
> >          */
> >         smp_mb();
> >
> >         return unlikely(tif_need_resched());
> > }
> >
> > So it already has a (MFENCE) barrier after ->flags is modified.
> 
> So what?
> 
> The mb() isn't necessarily against the *write*, it is also against 
> the *read*.
> 
> If the cflush is needed before the monitor, it's likely because the 
> monitor instruction has some bug with already-existing cachelines in 
> shared state or whatever.

So, my thinking was that maybe it's the other way around: the problem 
is with the write not being drained to cache yet.

One possibility would be that the bug is that MONITOR will not see the 
previous write in the store buffer and when shortly afterwards the 
store hits the cache, it falsely 'wakes up' the MWAIT.

(If so then the race window roughly depends on the the number of 
cycles the current_set_polling_and_test() modification retires and 
explains how small reorganizations of the code triggered the hw bug.)

The CLFLUSH ensures that the modification is visible in the cache 
before MONITOR is run.

If this guess is true then the need_resched() read is immaterial to 
the bug. On the flip side, if my guess is wrong then I'm leaving 
another subtle race window in the code...

So yeah I agree that without more information from Intel we are better 
off with a more conservative approach, the bug took relatively long to 
find, Thomas did the original 7d1a941731fab back in March, 3 kernel 
releases ago ...

If only there were Intel employees on Cc: who could tell us more about 
the background of the bug ;-)

> Which means that we want to make sure that the cflush is ordered wrt 
> *reads* from that cacheline too.
> 
> cflush has nothing specifically to do with writes.

Yes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 21:05                       ` H. Peter Anvin
@ 2013-12-19 21:17                         ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-19 21:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel,
	Len Brown, stable, Linus Torvalds, Thomas Gleixner,
	Mike Galbraith, Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE 
> according to the SDM.

Yes, I quoted the relevant text a couple of hours ago, CLFLUSH is 
pretty special, so it's not ordered against atomics or serializing 
instructions for example - only ordered against explict MFENCE calls, 
which on x86 is mb()/smp_mb().

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
  2013-12-19 21:14                               ` Ingo Molnar
@ 2013-12-19 21:25                                 ` Linus Torvalds
  0 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2013-12-19 21:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra,
	Thomas Gleixner, Peter Anvin, Len Brown, linux-tip-commits

On Fri, Dec 20, 2013 at 6:14 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> So, my thinking was that maybe it's the other way around: the problem
> is with the write not being drained to cache yet.

That is possible. I haven't seen the exact details of the monitor
errata . If it's just about the write buffer and cache interaction,
then I agree that the barrier after the write in
current_set_polling_and_test() is sufficient.  But in the absense of
knowledge, I think we should have it after the read too.

              Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
  2013-12-19 20:40                           ` Ingo Molnar
  2013-12-19 20:46                             ` Linus Torvalds
@ 2013-12-19 21:55                             ` Peter Zijlstra
  2013-12-20  8:47                               ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2013-12-19 21:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: hpa, linux-kernel, torvalds, tglx, hpa, len.brown, linux-tip-commits

On Thu, Dec 19, 2013 at 09:40:51PM +0100, Ingo Molnar wrote:
> > +		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> > +			mb();
> >  			clflush((void *)&current_thread_info()->flags);
> > +			mb();
> > +		}

>  	if (!current_set_polling_and_test()) {
>  		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> -			mb();
> +			/*
> +			 * There's no need for an extra barrier here: current_set_polling_and_test()
> +			 * already has an smp_mb() after ->flags gets modified, see sched.h.
> +			 */
>  			clflush((void *)&current_thread_info()->flags);
>  			mb();
>  		}

So almost, except that smp_mb() will disappear on an UP kernel whereas
this errata might just always need the MFENCE.

Then again, you'd be a pretty big nutter to run an UP kernel on an EX
part.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
  2013-12-19 21:55                             ` Peter Zijlstra
@ 2013-12-20  8:47                               ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2013-12-20  8:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hpa, linux-kernel, torvalds, tglx, hpa, len.brown, linux-tip-commits


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 09:40:51PM +0100, Ingo Molnar wrote:
> > > +		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> > > +			mb();
> > >  			clflush((void *)&current_thread_info()->flags);
> > > +			mb();
> > > +		}
> 
> >  	if (!current_set_polling_and_test()) {
> >  		if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> > -			mb();
> > +			/*
> > +			 * There's no need for an extra barrier here: current_set_polling_and_test()
> > +			 * already has an smp_mb() after ->flags gets modified, see sched.h.
> > +			 */
> >  			clflush((void *)&current_thread_info()->flags);
> >  			mb();
> >  		}
> 
> So almost, except that smp_mb() will disappear on an UP kernel whereas
> this errata might just always need the MFENCE.

Yeah. Sigh, so I noticed that and even thought through the UP case and 
concluded: on UP the race cannot happen - which is bogus. Yesterday 
was so much fail, maybe today's a better day ;-)

Thanks,

	ngo

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2013-12-19 19:51   ` [tip:x86/urgent] x86 idle: Repair " tip-bot for Len Brown
@ 2014-03-18  0:20     ` Davidlohr Bueso
  2014-03-18  9:16       ` Peter Zijlstra
  2014-04-08 21:43       ` Brown, Len
  0 siblings, 2 replies; 96+ messages in thread
From: Davidlohr Bueso @ 2014-03-18  0:20 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, tglx, hpa, len.brown, Peter Zijlstra
  Cc: linux-tip-commits

On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote:
> Commit-ID:  40e2d7f9b5dae048789c64672bf3027fbb663ffa
> Gitweb:     http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa
> Author:     Len Brown <len.brown@intel.com>
> AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500
> Committer:  H. Peter Anvin <hpa@linux.intel.com>
> CommitDate: Thu, 19 Dec 2013 11:47:39 -0800
> 
> x86 idle: Repair large-server 50-watt idle-power regression

FYI this commit can cause some non trivial performance regressions for
larger core count systems. While not surprising because of the nature of
the change, having intel_idle do more cacheline invalidations, I still
wanted to let you guys know. For instance, on a 160 core Westmere
system, aim7 throughput can go down in a number of tests, anywhere from
-10% to -25%.

I guess it comes down to one of those performance vs energy things. And
sure, max_cstate can be set to overcome this, but it's still something
that was previously taken for granted.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2014-03-18  0:20     ` Davidlohr Bueso
@ 2014-03-18  9:16       ` Peter Zijlstra
  2014-03-19  2:14         ` Jason Low
  2014-04-08 21:43       ` Brown, Len
  1 sibling, 1 reply; 96+ messages in thread
From: Peter Zijlstra @ 2014-03-18  9:16 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, hpa, linux-kernel, tglx, hpa, len.brown, linux-tip-commits

On Mon, Mar 17, 2014 at 05:20:10PM -0700, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote:
> > Commit-ID:  40e2d7f9b5dae048789c64672bf3027fbb663ffa
> > Gitweb:     http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa
> > Author:     Len Brown <len.brown@intel.com>
> > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500
> > Committer:  H. Peter Anvin <hpa@linux.intel.com>
> > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800
> > 
> > x86 idle: Repair large-server 50-watt idle-power regression
> 
> FYI this commit can cause some non trivial performance regressions for
> larger core count systems. While not surprising because of the nature of
> the change, having intel_idle do more cacheline invalidations, I still
> wanted to let you guys know. For instance, on a 160 core Westmere
> system, aim7 throughput can go down in a number of tests, anywhere from
> -10% to -25%.
> 
> I guess it comes down to one of those performance vs energy things. And
> sure, max_cstate can be set to overcome this, but it's still something
> that was previously taken for granted.

-10% to -25% seems a lot for a single cacheline flush. Also I would
expect the expected idle time to be very short while running aim7. So
could it be the cacheflush is actually taking longer than the expected
idle time?



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2014-03-18  9:16       ` Peter Zijlstra
@ 2014-03-19  2:14         ` Jason Low
  2014-03-19  6:42           ` Peter Zijlstra
  0 siblings, 1 reply; 96+ messages in thread
From: Jason Low @ 2014-03-19  2:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, hpa, len.brown,
	linux-tip-commits

On Tue, Mar 18, 2014 at 2:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 17, 2014 at 05:20:10PM -0700, Davidlohr Bueso wrote:
>> On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote:
>> > Commit-ID:  40e2d7f9b5dae048789c64672bf3027fbb663ffa
>> > Gitweb:     http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa
>> > Author:     Len Brown <len.brown@intel.com>
>> > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500
>> > Committer:  H. Peter Anvin <hpa@linux.intel.com>
>> > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800
>> >
>> > x86 idle: Repair large-server 50-watt idle-power regression
>>
>> FYI this commit can cause some non trivial performance regressions for
>> larger core count systems. While not surprising because of the nature of
>> the change, having intel_idle do more cacheline invalidations, I still
>> wanted to let you guys know. For instance, on a 160 core Westmere
>> system, aim7 throughput can go down in a number of tests, anywhere from
>> -10% to -25%.
>>
>> I guess it comes down to one of those performance vs energy things. And
>> sure, max_cstate can be set to overcome this, but it's still something
>> that was previously taken for granted.
>
> -10% to -25% seems a lot for a single cacheline flush. Also I would
> expect the expected idle time to be very short while running aim7. So
> could it be the cacheflush is actually taking longer than the expected
> idle time?

Can we consider conditionally skipping the cacheline flush if the
approximate average CPU idle time is very short, for instance,
something along the lines of skipping if CPU average idle time <
(sched migration cost or "cacheline_flush_penalty")?

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2014-03-19  2:14         ` Jason Low
@ 2014-03-19  6:42           ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2014-03-19  6:42 UTC (permalink / raw)
  To: Jason Low
  Cc: Davidlohr Bueso, Ingo Molnar, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, hpa, len.brown,
	linux-tip-commits

On Tue, Mar 18, 2014 at 07:14:04PM -0700, Jason Low wrote:
> On Tue, Mar 18, 2014 at 2:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Mar 17, 2014 at 05:20:10PM -0700, Davidlohr Bueso wrote:
> >> On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote:
> >> > Commit-ID:  40e2d7f9b5dae048789c64672bf3027fbb663ffa
> >> > Gitweb:     http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa
> >> > Author:     Len Brown <len.brown@intel.com>
> >> > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500
> >> > Committer:  H. Peter Anvin <hpa@linux.intel.com>
> >> > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800
> >> >
> >> > x86 idle: Repair large-server 50-watt idle-power regression
> >>
> >> FYI this commit can cause some non trivial performance regressions for
> >> larger core count systems. While not surprising because of the nature of
> >> the change, having intel_idle do more cacheline invalidations, I still
> >> wanted to let you guys know. For instance, on a 160 core Westmere
> >> system, aim7 throughput can go down in a number of tests, anywhere from
> >> -10% to -25%.
> >>
> >> I guess it comes down to one of those performance vs energy things. And
> >> sure, max_cstate can be set to overcome this, but it's still something
> >> that was previously taken for granted.
> >
> > -10% to -25% seems a lot for a single cacheline flush. Also I would
> > expect the expected idle time to be very short while running aim7. So
> > could it be the cacheflush is actually taking longer than the expected
> > idle time?
> 
> Can we consider conditionally skipping the cacheline flush if the
> approximate average CPU idle time is very short, for instance,
> something along the lines of skipping if CPU average idle time <
> (sched migration cost or "cacheline_flush_penalty")?

I'd say its too early for proposing solutions; first we need to find
facts about what is happening.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* RE: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2014-03-18  0:20     ` Davidlohr Bueso
  2014-03-18  9:16       ` Peter Zijlstra
@ 2014-04-08 21:43       ` Brown, Len
  2014-04-09  8:18         ` Peter Zijlstra
  2014-04-15  3:27         ` Davidlohr Bueso
  1 sibling, 2 replies; 96+ messages in thread
From: Brown, Len @ 2014-04-08 21:43 UTC (permalink / raw)
  To: Davidlohr Bueso, mingo, hpa, linux-kernel, tglx, hpa, Peter Zijlstra
  Cc: linux-tip-commits

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1809 bytes --]

Davidlohr,

Thanks for the note.

Ideally (on Linux in general, and on servers, in particular) we strive
for the performance impact of power saving features to be small enough
to be considered in "measurement noise".

Your report for 160 core Westmere AIM numbers being hit at 10-25%
shows 15% measurement noise?  But even if true, this looks bad.

Any chance you can re-run, with the following two tweaks,
one at a time?

I'd be curious if you can wrap the invocation in turbostat -v
and capture that output to how what states we are seeing
during the benchmark run.

thanks,
-Len



#1: skip flush for C1
	
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f80b700..6027d06 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,7 +377,7 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+		if ((eax > 0) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);


#2: skip flush for C1 and C1E
	
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f80b700..6027d06 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,7 +377,7 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+		if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);


ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2014-04-08 21:43       ` Brown, Len
@ 2014-04-09  8:18         ` Peter Zijlstra
  2014-04-15  3:27         ` Davidlohr Bueso
  1 sibling, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2014-04-09  8:18 UTC (permalink / raw)
  To: Brown, Len
  Cc: Davidlohr Bueso, mingo, hpa, linux-kernel, tglx, hpa, linux-tip-commits

On Tue, Apr 08, 2014 at 09:43:54PM +0000, Brown, Len wrote:
> #1: skip flush for C1
> #2: skip flush for C1 and C1E

One stray thought I had: IFF the erratum is due to the cache flushing of
the higher C states, these two patches should still be correct while
also (possibly) avoiding the performance penalty, since C1* doesn't
flush caches yet.

Of course, I've no clue if this is indeed the root cause of this
particular erratum and I'm not sure you guys are allowed to say even if
you would know :/

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression
  2014-04-08 21:43       ` Brown, Len
  2014-04-09  8:18         ` Peter Zijlstra
@ 2014-04-15  3:27         ` Davidlohr Bueso
  1 sibling, 0 replies; 96+ messages in thread
From: Davidlohr Bueso @ 2014-04-15  3:27 UTC (permalink / raw)
  To: Brown, Len
  Cc: mingo, hpa, linux-kernel, tglx, hpa, Peter Zijlstra, linux-tip-commits

On Tue, 2014-04-08 at 21:43 +0000, Brown, Len wrote:
> Davidlohr,
> 
> Thanks for the note.
> 
> Ideally (on Linux in general, and on servers, in particular) we strive
> for the performance impact of power saving features to be small enough
> to be considered in "measurement noise".
> 
> Your report for 160 core Westmere AIM numbers being hit at 10-25%
> shows 15% measurement noise?  But even if true, this looks bad.
> 
> Any chance you can re-run, with the following two tweaks,
> one at a time?
> 
> I'd be curious if you can wrap the invocation in turbostat -v
> and capture that output to how what states we are seeing
> during the benchmark run.

Hi Len, apologies for the delay, I've been having to use my machine for
other things and haven't gotten time to get around to this yet. I'll get
you the requested information sometime this week.


^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2014-04-15  3:27 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
2013-12-07  8:39 ` Mike Galbraith
2013-12-07 16:01   ` Len Brown
2013-12-07 16:45     ` Len Brown
2013-12-07 19:17       ` Mike Galbraith
2013-12-10 11:41         ` Ingo Molnar
2013-12-07 12:54 ` Thomas Gleixner
2013-12-08  4:57 ` Mike Galbraith
2013-12-08 20:40   ` Len Brown
2013-12-09  3:16     ` Mike Galbraith
2013-12-10  5:17       ` Mike Galbraith
2013-12-10 11:45         ` Ingo Molnar
2013-12-10 14:29         ` Thomas Gleixner
2013-12-10 15:06           ` Ingo Molnar
2013-12-11  2:05           ` Thomas Gleixner
2013-12-11  3:21             ` Mike Galbraith
2013-12-11 11:28               ` Thomas Gleixner
2013-12-11 11:38                 ` Borislav Petkov
2013-12-11 11:52                   ` Peter Zijlstra
2013-12-11 12:29                     ` Mike Galbraith
2013-12-11 12:43                       ` Peter Zijlstra
2013-12-11 13:10                         ` Mike Galbraith
2013-12-11 13:40                         ` Borislav Petkov
2013-12-11 14:56                           ` Ingo Molnar
2013-12-11 16:02                             ` Borislav Petkov
2013-12-11 16:43                             ` Peter Zijlstra
2013-12-11 17:50                               ` Ingo Molnar
2013-12-11 23:08                                 ` H. Peter Anvin
2013-12-11 23:14                                   ` Borislav Petkov
2013-12-12  0:52                                     ` H. Peter Anvin
2013-12-12  4:25                                       ` Mike Galbraith
2013-12-12  4:49                                         ` H. Peter Anvin
2013-12-12  4:59                                           ` Mike Galbraith
2013-12-12  5:37                                           ` Mike Galbraith
2013-12-12  5:45                                             ` H. Peter Anvin
2013-12-12  5:57                                               ` Mike Galbraith
2013-12-12  6:05                                                 ` Mike Galbraith
2013-12-12  7:57                                                   ` H. Peter Anvin
2013-12-12  8:51                                   ` Peter Zijlstra
2013-12-12 13:28                                     ` Ingo Molnar
2013-12-12 15:06                                       ` H. Peter Anvin
2013-12-12 15:51                                         ` Peter Zijlstra
2013-12-11 14:42                         ` Ingo Molnar
2013-12-11 15:02                           ` Thomas Gleixner
2013-12-11 15:09                             ` Ingo Molnar
2013-12-11 16:44                               ` Peter Zijlstra
2013-12-11 17:48                                 ` Ingo Molnar
2013-12-11 16:44                           ` Peter Zijlstra
2013-12-11 17:47                             ` Ingo Molnar
2013-12-11 21:43                     ` Len Brown
2013-12-11 22:22                       ` Thomas Gleixner
2013-12-18 21:44 ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Len Brown
2013-12-18 21:44   ` Len Brown
2013-12-19 12:22   ` Ingo Molnar
2013-12-19 14:40     ` H. Peter Anvin
2013-12-19 15:45       ` Borislav Petkov
2013-12-19 15:55     ` H. Peter Anvin
2013-12-19 16:02       ` Ingo Molnar
2013-12-19 16:09         ` H. Peter Anvin
2013-12-19 16:13         ` H. Peter Anvin
2013-12-19 16:21           ` Peter Zijlstra
2013-12-19 16:50             ` H. Peter Anvin
2013-12-19 17:07               ` Ingo Molnar
2013-12-19 17:25                 ` Peter Zijlstra
2013-12-19 17:36                   ` Peter Zijlstra
2013-12-19 18:05                     ` H. Peter Anvin
2013-12-19 18:14                       ` Ingo Molnar
2013-12-19 17:50                   ` Peter Zijlstra
2013-12-19 18:18                     ` Ingo Molnar
2013-12-19 21:05                       ` H. Peter Anvin
2013-12-19 21:17                         ` Ingo Molnar
2013-12-19 18:10                   ` Ingo Molnar
2013-12-19 18:09                 ` H. Peter Anvin
2013-12-19 18:19                   ` H. Peter Anvin
2013-12-19 18:23                     ` Ingo Molnar
     [not found]                       ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
2013-12-19 18:43                         ` Ingo Molnar
2013-12-19 18:43                           ` Ingo Molnar
2013-12-19 20:09                         ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin
2013-12-19 20:40                           ` Ingo Molnar
2013-12-19 20:46                             ` Linus Torvalds
2013-12-19 21:14                               ` Ingo Molnar
2013-12-19 21:25                                 ` Linus Torvalds
2013-12-19 21:55                             ` Peter Zijlstra
2013-12-20  8:47                               ` Ingo Molnar
2013-12-19 20:33                         ` [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() tip-bot for H. Peter Anvin
2013-12-19 18:19                   ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar
2013-12-19 19:22                     ` H. Peter Anvin
2013-12-19 19:27                       ` Peter Zijlstra
2013-12-19 19:51   ` [tip:x86/urgent] x86 idle: Repair " tip-bot for Len Brown
2014-03-18  0:20     ` Davidlohr Bueso
2014-03-18  9:16       ` Peter Zijlstra
2014-03-19  2:14         ` Jason Low
2014-03-19  6:42           ` Peter Zijlstra
2014-04-08 21:43       ` Brown, Len
2014-04-09  8:18         ` Peter Zijlstra
2014-04-15  3:27         ` Davidlohr Bueso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.