* 50 Watt idle power regression bisected to Linux-3.10 @ 2013-12-07 8:00 Len Brown 2013-12-07 8:39 ` Mike Galbraith ` (3 more replies) 0 siblings, 4 replies; 96+ messages in thread From: Len Brown @ 2013-12-07 8:00 UTC (permalink / raw) To: tglx, Peter Zijlstra; +Cc: Linux PM list, linux-kernel, Jeremy Eder, x86 Hello Thomas, An idle WSM-EX box (40 Xeon cores) runs 50 Watts hotter after this patch: commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5 Author: Thomas Gleixner <tglx@linutronix.de> Date: Thu Mar 21 22:50:03 2013 +0100 x86: Use generic idle loop ie. the commit before this patch (aba92c9e2cf3042bf6efc68fa2e4235ba01bf499) runs at 50 watts less, as do Linux 3.7, 3.8 and 3.9. The difference is that the good kernels allow about 98% residence in the package C6 state, while the bad kernel is so noisy that it gets into pc6 0% of the time. (indeed, even core C6 is reduced to about 50% from over 99%) No, Linux-3.13-rc3 does not fix this issue, even though it contains the following patch, claiming to address an issue with the commit above: commit ea8117478918a4734586d35ff530721b682425be Author: Peter Zijlstra <peterz@infradead.org> Date: Wed Sep 11 12:43:13 2013 +0200 sched, idle: Fix the idle polling state logic Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") regressed several workloads and caused excessive reschedule interrupts. The patch in question failed to notice that the x86 code had an inverted sense of the polling state versus the new generic code (x86: default polling, generic: default !polling). Fix the two prominent x86 mwait based idle drivers and introduce a few new generic polling helpers (fixing the wrong smp_mb__after_clear_bit usage). Also switch the idle routines to using tif_need_resched() which is an immediate TIF_NEED_RESCHED test as opposed to need_resched which will end up being slightly different. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: lenb@kernel.org Cc: tglx@linutronix.de Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> How shall we proceed? thanks, -Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown @ 2013-12-07 8:39 ` Mike Galbraith 2013-12-07 16:01 ` Len Brown 2013-12-07 12:54 ` Thomas Gleixner ` (2 subsequent siblings) 3 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-07 8:39 UTC (permalink / raw) To: Len Brown Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote: > No, Linux-3.13-rc3 does not fix this issue, even though it contains > the following patch, claiming to address an issue with the commit above: > > commit ea8117478918a4734586d35ff530721b682425be > Author: Peter Zijlstra <peterz@infradead.org> > Date: Wed Sep 11 12:43:13 2013 +0200 > > sched, idle: Fix the idle polling state logic > > Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") > regressed several workloads and caused excessive reschedule > interrupts. It fixes that, except for my Q6600 box. Too bad mwait_idle() went away, beloved old box doesn't play hints game, so it continues to flog itself. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 8:39 ` Mike Galbraith @ 2013-12-07 16:01 ` Len Brown 2013-12-07 16:45 ` Len Brown 0 siblings, 1 reply; 96+ messages in thread From: Len Brown @ 2013-12-07 16:01 UTC (permalink / raw) To: Mike Galbraith Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Sat, Dec 7, 2013 at 3:39 AM, Mike Galbraith <bitbucket@online.de> wrote: > On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote: > >> No, Linux-3.13-rc3 does not fix this issue, even though it contains >> the following patch, claiming to address an issue with the commit above: >> >> commit ea8117478918a4734586d35ff530721b682425be >> Author: Peter Zijlstra <peterz@infradead.org> >> Date: Wed Sep 11 12:43:13 2013 +0200 >> >> sched, idle: Fix the idle polling state logic >> >> Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") >> regressed several workloads and caused excessive reschedule >> interrupts. > > It fixes that, except for my Q6600 box. Too bad mwait_idle() went away, > beloved old box doesn't play hints game, so it continues to flog itself. Hi Mike, What C-states are available on your Q6600 box? Is it running HALT only? (if so, I would expect C1 MWAIT be about the same, if it still existed) But I would expect the box to have deeper C-states than just C1, so... Can you show me the output of dmesg | grep idle grep . /sys/devices/system/cpu/cpu0/cpuidle/*/* Please elaborate on exactly what symptom you see. Do you have a power meter? thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 16:01 ` Len Brown @ 2013-12-07 16:45 ` Len Brown 2013-12-07 19:17 ` Mike Galbraith 0 siblings, 1 reply; 96+ messages in thread From: Len Brown @ 2013-12-07 16:45 UTC (permalink / raw) To: Mike Galbraith Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 >> It fixes that, except for my Q6600 box. Too bad mwait_idle() went away, >> beloved old box doesn't play hints game, so it continues to flog itself. Thanks for pointing this out, Mike! A Q6600 is a Kentsfield. I dug one of those up. Indeed, the only idle capabilities it has are HALT and old style MWAIT, and the latter is much more effective. running 3.8 it idles at 75 watts. running 3.8 with idle=nomwait it idles at 100 watts, which is what it will do with 3.9 and later due to the patch below. commit 69fb3676df3329a7142803bb3502fa59dc0db2e3 Author: Len Brown <len.brown@intel.com> Date: Sun Feb 10 01:38:39 2013 -0500 x86 idle: remove mwait_idle() and "idle=mwait" cmdline param Kentsfield proves that patch was based on a fault assumption. Sweet box in its day, ECC memory and everything -- probably still a fair number of them running... Plus, I've found another machine that depends on having an idle=mwait idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in https://bugzilla.kernel.org/show_bug.cgi?id=60770) So it looks like I need to (also) restore the simple idle=mwait idle loop to make some machines happy. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 16:45 ` Len Brown @ 2013-12-07 19:17 ` Mike Galbraith 2013-12-10 11:41 ` Ingo Molnar 0 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-07 19:17 UTC (permalink / raw) To: Len Brown Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Sat, 2013-12-07 at 11:45 -0500, Len Brown wrote: > >> It fixes that, except for my Q6600 box. Too bad mwait_idle() went away, > >> beloved old box doesn't play hints game, so it continues to flog itself. > > > Thanks for pointing this out, Mike! > > A Q6600 is a Kentsfield. I dug one of those up. > Indeed, the only idle capabilities it has are HALT > and old style MWAIT, and the latter is much more effective. > running 3.8 it idles at 75 watts. > running 3.8 with idle=nomwait it idles at 100 watts, > which is what it will do with 3.9 and later due to the patch below. > > commit 69fb3676df3329a7142803bb3502fa59dc0db2e3 > Author: Len Brown <len.brown@intel.com> > Date: Sun Feb 10 01:38:39 2013 -0500 > > x86 idle: remove mwait_idle() and "idle=mwait" cmdline param > > Kentsfield proves that patch was based on a fault assumption. > Sweet box in its day, ECC memory and everything -- probably still > a fair number of them running... > > Plus, I've found another machine that depends on having an idle=mwait > idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in > https://bugzilla.kernel.org/show_bug.cgi?id=60770) > > So it looks like I need to (also) restore the simple idle=mwait idle loop > to make some machines happy. Cool, box will definitely be happier. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 19:17 ` Mike Galbraith @ 2013-12-10 11:41 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-10 11:41 UTC (permalink / raw) To: Mike Galbraith Cc: Len Brown, Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 * Mike Galbraith <bitbucket@online.de> wrote: > On Sat, 2013-12-07 at 11:45 -0500, Len Brown wrote: > > >> It fixes that, except for my Q6600 box. Too bad mwait_idle() went away, > > >> beloved old box doesn't play hints game, so it continues to flog itself. > > > > > > Thanks for pointing this out, Mike! > > > > A Q6600 is a Kentsfield. I dug one of those up. > > Indeed, the only idle capabilities it has are HALT > > and old style MWAIT, and the latter is much more effective. > > running 3.8 it idles at 75 watts. > > running 3.8 with idle=nomwait it idles at 100 watts, > > which is what it will do with 3.9 and later due to the patch below. > > > > commit 69fb3676df3329a7142803bb3502fa59dc0db2e3 > > Author: Len Brown <len.brown@intel.com> > > Date: Sun Feb 10 01:38:39 2013 -0500 > > > > x86 idle: remove mwait_idle() and "idle=mwait" cmdline param > > > > Kentsfield proves that patch was based on a fault assumption. > > Sweet box in its day, ECC memory and everything -- probably still > > a fair number of them running... > > > > Plus, I've found another machine that depends on having an idle=mwait > > idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in > > https://bugzilla.kernel.org/show_bug.cgi?id=60770) > > > > So it looks like I need to (also) restore the simple idle=mwait idle loop > > to make some machines happy. > > Cool, box will definitely be happier. I assume old-style MWAIT will be activated automatically on such boxes, there's no need to pass in idle=mwait on the boot command line, correct? Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown 2013-12-07 8:39 ` Mike Galbraith @ 2013-12-07 12:54 ` Thomas Gleixner 2013-12-08 4:57 ` Mike Galbraith 2013-12-18 21:44 ` Len Brown 3 siblings, 0 replies; 96+ messages in thread From: Thomas Gleixner @ 2013-12-07 12:54 UTC (permalink / raw) To: Len Brown; +Cc: Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 Len, On Sat, 7 Dec 2013, Len Brown wrote: > > How shall we proceed? Can you please gather a function trace so I can see what the system is doing? Preferrably with 3.13-rc3 so we have Peters fix included. Please send it offlist or put it somehwere for download. Thanks, tglx ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-07 8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown 2013-12-07 8:39 ` Mike Galbraith 2013-12-07 12:54 ` Thomas Gleixner @ 2013-12-08 4:57 ` Mike Galbraith 2013-12-08 20:40 ` Len Brown 2013-12-18 21:44 ` Len Brown 3 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-08 4:57 UTC (permalink / raw) To: Len Brown Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote: > Hello Thomas, > > An idle WSM-EX box (40 Xeon cores) runs 50 Watts hotter after this patch: > > commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5 > Author: Thomas Gleixner <tglx@linutronix.de> > Date: Thu Mar 21 22:50:03 2013 +0100 > > x86: Use generic idle loop > > ie. the commit before this patch (aba92c9e2cf3042bf6efc68fa2e4235ba01bf499) > runs at 50 watts less, as do Linux 3.7, 3.8 and 3.9. > > The difference is that the good kernels allow about 98% residence > in the package C6 state, while the bad kernel is so noisy that it > gets into pc6 0% of the time. > (indeed, even core C6 is reduced to about 50% from over 99%) FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per powertop, deepest state it has. It took a big hit from 69a37bea as well as 7d1a9417, but with revert of 69a37bea, and addition of ea811747, it's fairly close to 3.7 in fastpath weight, and greenery _seems_ fine. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-08 4:57 ` Mike Galbraith @ 2013-12-08 20:40 ` Len Brown 2013-12-09 3:16 ` Mike Galbraith 0 siblings, 1 reply; 96+ messages in thread From: Len Brown @ 2013-12-08 20:40 UTC (permalink / raw) To: Mike Galbraith Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 > FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per > powertop, deepest state it has. Please run turbostat, which always describes what the hardware does. Depending on the version of powertop and the hardware involved, it may be describing that the OS requested -- which is not always the same. The hardware reserves the right to demote any requested C-state to a shallower one. Also, it is the hardware decision whether to promote core c-states requested by the OS into power saving package C-states. cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-08 20:40 ` Len Brown @ 2013-12-09 3:16 ` Mike Galbraith 2013-12-10 5:17 ` Mike Galbraith 0 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-09 3:16 UTC (permalink / raw) To: Len Brown Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Sun, 2013-12-08 at 15:40 -0500, Len Brown wrote: > > FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per > > powertop, deepest state it has. > > Please run turbostat, which always describes what the hardware does. > Depending on the version of powertop and the hardware involved, > it may be describing that the OS requested -- which is not always the same. > > The hardware reserves the right to demote any requested C-state to a > shallower one. Also, it is the hardware decision whether to promote > core c-states requested by the OS into power saving package C-states. Hrm, turbostat says rtbox has a C6, and is using it. Ah, BIOS says ACPI C3 == Intel C6, so I suppose Intel C3 == ACPI C2. Whatever, both tools seem to agree that these particular boxen (x3550/E5620 and DL980/X7560) are snoozing peacefully. rtbox:~ # turbostat -i 60 cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 0.18 1.73 2.40 0 0.19 0.02 99.61 25 0.08 92.74 0 0 0.47 1.60 2.40 0 0.62 0.08 98.83 18 0.08 92.74 1 1 0.03 1.61 2.40 0 0.05 0.00 99.92 25 9 2 0.05 1.73 2.40 0 0.06 0.00 99.89 16 10 3 0.15 2.13 2.40 0 0.05 0.01 99.79 20 vogelweide:~/:[0]# turbostat -i 60 pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 0.06 2.03 2.26 0 0.76 99.18 0.00 49 94.97 0.00 0 0 0 0.40 1.57 2.26 3 25.28 74.32 0.00 49 64.24 0.00 0 1 1 0.08 2.16 2.26 3 19.28 80.64 0.00 48 0 2 2 0.09 2.20 2.26 3 0.00 99.91 0.00 49 0 3 3 0.09 2.20 2.26 3 0.00 99.91 0.00 49 0 8 4 0.08 2.21 2.26 3 0.00 99.91 0.00 45 0 9 5 0.08 2.22 2.26 3 0.00 99.92 0.00 46 0 10 6 0.08 2.23 2.26 3 0.00 99.92 0.00 45 0 11 7 0.08 2.23 2.26 3 0.00 99.92 0.00 45 1 0 8 0.10 1.94 2.26 3 0.01 99.89 0.00 38 99.88 0.00 1 1 9 0.08 2.11 2.26 3 0.00 99.92 0.00 39 1 2 10 0.08 2.12 2.26 3 0.00 99.92 0.00 41 1 3 11 0.07 2.14 2.26 3 0.00 99.92 0.00 41 1 8 12 0.08 2.15 2.26 3 0.00 99.92 0.00 42 1 9 13 0.07 2.17 2.26 3 0.00 99.92 0.00 43 1 10 14 0.07 2.19 2.26 3 0.00 99.92 0.00 41 1 11 15 0.07 2.20 2.26 3 0.00 99.93 0.00 41 2 0 16 0.09 1.93 2.26 3 0.01 99.90 0.00 41 99.89 0.00 2 1 17 0.07 2.09 2.26 3 0.00 99.92 0.00 40 2 2 18 0.07 2.11 2.26 3 0.00 99.93 0.00 41 2 3 19 0.07 2.12 2.26 3 0.00 99.93 0.00 42 2 8 20 0.07 2.14 2.26 3 0.00 99.93 0.00 41 2 9 21 0.07 2.16 2.26 3 0.00 99.93 0.00 41 2 10 22 0.07 2.17 2.26 3 0.00 99.93 0.00 37 2 11 23 0.07 2.20 2.26 3 0.00 99.93 0.00 39 3 0 24 0.08 1.91 2.26 3 0.01 99.91 0.00 41 99.88 0.00 3 1 25 0.06 2.07 2.26 3 0.00 99.93 0.00 42 3 2 26 0.07 2.09 2.26 3 0.02 99.92 0.00 41 3 3 27 0.06 2.11 2.26 3 0.00 99.94 0.00 42 3 8 28 0.06 2.12 2.26 3 0.00 99.94 0.00 47 3 9 29 0.06 2.15 2.26 3 0.00 99.94 0.00 46 3 10 30 0.06 2.18 2.26 3 0.00 99.94 0.00 44 3 11 31 0.06 2.19 2.26 3 0.02 99.92 0.00 46 4 0 32 0.07 1.92 2.26 3 0.01 99.92 0.00 36 96.25 0.00 4 1 33 0.06 2.07 2.26 3 0.00 99.94 0.00 36 4 2 34 0.06 2.08 2.26 3 0.00 99.94 0.00 39 4 3 35 0.05 2.11 2.26 3 0.00 99.94 0.00 39 4 8 36 0.05 2.12 2.26 3 3.67 96.28 0.00 40 4 9 37 0.05 2.15 2.26 3 0.00 99.94 0.00 42 4 10 38 0.05 2.18 2.26 3 0.00 99.95 0.00 39 4 11 39 0.05 2.20 2.26 3 0.00 99.95 0.00 40 5 0 40 0.07 1.73 2.26 3 0.02 99.91 0.00 35 99.90 0.00 5 1 41 0.05 2.00 2.26 3 0.00 99.95 0.00 36 5 2 42 0.05 2.04 2.26 3 0.00 99.95 0.00 39 5 3 43 0.04 2.04 2.26 3 0.00 99.95 0.00 38 5 8 44 0.05 2.09 2.26 3 0.00 99.95 0.00 40 5 9 45 0.04 2.10 2.26 3 0.00 99.95 0.00 41 5 10 46 0.04 2.13 2.26 3 0.00 99.96 0.00 37 5 11 47 0.04 2.17 2.26 3 0.00 99.96 0.00 37 6 0 48 0.08 1.61 2.26 3 0.02 99.90 0.00 35 99.87 0.00 6 1 49 0.05 1.79 2.26 3 0.01 99.94 0.00 36 6 2 50 0.04 1.98 2.26 3 0.00 99.96 0.00 36 6 3 51 0.04 2.00 2.26 3 0.00 99.96 0.00 37 6 8 52 0.04 2.01 2.26 3 0.00 99.96 0.00 38 6 9 53 0.03 2.07 2.26 3 0.00 99.96 0.00 36 6 10 54 0.03 2.08 2.26 3 0.00 99.96 0.00 36 6 11 55 0.03 2.14 2.26 3 0.00 99.96 0.00 36 7 0 56 0.05 1.69 2.26 3 0.01 99.94 0.00 41 99.89 0.00 7 1 57 0.03 1.87 2.26 3 0.00 99.97 0.00 41 7 2 58 0.03 1.89 2.26 3 0.00 99.97 0.00 44 7 3 59 0.03 1.93 2.26 3 0.00 99.97 0.00 44 7 8 60 0.03 1.97 2.26 3 0.05 99.92 0.00 41 7 9 61 0.03 2.02 2.26 3 0.00 99.97 0.00 38 7 10 62 0.02 2.06 2.26 3 0.00 99.97 0.00 39 7 11 63 0.03 2.12 2.26 3 0.00 99.97 0.00 36 ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-09 3:16 ` Mike Galbraith @ 2013-12-10 5:17 ` Mike Galbraith 2013-12-10 11:45 ` Ingo Molnar 2013-12-10 14:29 ` Thomas Gleixner 0 siblings, 2 replies; 96+ messages in thread From: Mike Galbraith @ 2013-12-10 5:17 UTC (permalink / raw) To: Len Brown Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 Hi Len, I'm unable to reproduce those DL980 results. I updated the kernel and config yesterday, and happened to run turbostat again.. and the box was nowhere near as quiet. I ended up spending all day futzing with the darn thing, checking various config/kernel combos, and none produced the previous result. ATM, running the same exact updated kernel as the x3550 is running (still happily) with only a couple needed drivers added: vogelweide:~/:[130]# turbostat -P -i 60 pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 0.02 2.12 2.26 0 43.40 56.57 0.00 53 33.81 0.00 0 0 0 0.23 2.10 2.26 5 65.47 34.30 0.00 52 10.69 0.00 1 0 8 0.04 2.02 2.26 5 63.10 36.86 0.00 41 13.31 0.00 2 0 16 0.04 1.90 2.26 5 35.70 64.25 0.00 43 37.88 0.00 3 0 24 0.03 2.08 2.26 5 39.78 60.19 0.00 42 29.00 0.00 4 0 32 0.03 1.95 2.26 5 14.64 85.33 0.00 37 65.00 0.00 5 0 40 0.03 1.95 2.26 5 15.96 84.02 0.00 36 74.34 0.00 6 0 48 0.02 1.99 2.26 5 36.97 63.01 0.00 37 40.20 0.00 7 0 56 0.02 2.08 2.26 5 57.54 42.44 0.00 44 0.08 0.00 pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 0.03 2.08 2.26 0 31.24 68.73 0.00 53 44.26 0.00 0 0 0 0.25 1.76 2.26 8 52.96 46.78 0.00 51 31.99 0.00 1 0 8 0.03 1.96 2.26 8 50.69 49.28 0.00 41 30.80 0.00 2 0 16 0.04 1.91 2.26 8 36.10 63.85 0.00 42 44.37 0.00 3 0 24 0.03 1.96 2.26 8 24.15 75.82 0.00 42 59.14 0.00 4 0 32 0.03 1.94 2.26 8 14.68 85.29 0.00 37 71.64 0.00 5 0 40 0.03 1.94 2.26 8 16.01 83.96 0.00 36 78.62 0.00 6 0 48 0.02 2.18 2.26 8 46.79 53.18 0.00 36 11.05 0.00 7 0 56 0.02 2.01 2.26 8 50.84 49.14 0.00 45 26.50 0.00 pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 0.03 2.10 2.26 0 30.88 69.09 0.00 51 48.09 0.00 0 0 0 0.23 1.84 2.26 8 51.98 47.79 0.00 50 31.06 0.00 1 0 8 0.04 2.05 2.26 8 59.44 40.52 0.00 41 21.86 0.00 2 0 16 0.04 1.73 2.26 8 22.87 77.09 0.00 41 62.26 0.00 3 0 24 0.03 1.93 2.26 8 9.83 90.14 0.00 41 77.33 0.00 4 0 32 0.03 1.90 2.26 8 8.66 91.31 0.00 37 89.16 0.00 5 0 40 0.03 2.05 2.26 8 27.30 72.67 0.00 35 50.62 0.00 6 0 48 0.02 2.07 2.26 8 44.81 55.17 0.00 36 18.97 0.00 7 0 56 0.02 2.00 2.26 8 48.06 51.91 0.00 42 33.44 0.00 That, vs solid >99% pc3 for old 3.0 enterprise kernel. What happened, dunno, kernel called itself master. Whatever, I don't like test bogons, and this is one, so I thought I should let you know. DL980 refuses to come close to the previous result. 'nuff of that, off to the slave pits before yet another day evaporates. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-10 5:17 ` Mike Galbraith @ 2013-12-10 11:45 ` Ingo Molnar 2013-12-10 14:29 ` Thomas Gleixner 1 sibling, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-10 11:45 UTC (permalink / raw) To: Mike Galbraith Cc: Len Brown, Thomas Gleixner, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 * Mike Galbraith <bitbucket@online.de> wrote: > Hi Len, > > I'm unable to reproduce those DL980 results. I updated the kernel and > config yesterday, and happened to run turbostat again.. and the box was > nowhere near as quiet. I ended up spending all day futzing with the > darn thing, checking various config/kernel combos, and none produced the > previous result. > > ATM, running the same exact updated kernel as the x3550 is running > (still happily) with only a couple needed drivers added: > > vogelweide:~/:[130]# turbostat -P -i 60 > pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 > 0.02 2.12 2.26 0 43.40 56.57 0.00 53 33.81 0.00 > 0 0 0 0.23 2.10 2.26 5 65.47 34.30 0.00 52 10.69 0.00 > 1 0 8 0.04 2.02 2.26 5 63.10 36.86 0.00 41 13.31 0.00 > 2 0 16 0.04 1.90 2.26 5 35.70 64.25 0.00 43 37.88 0.00 > 3 0 24 0.03 2.08 2.26 5 39.78 60.19 0.00 42 29.00 0.00 > 4 0 32 0.03 1.95 2.26 5 14.64 85.33 0.00 37 65.00 0.00 > 5 0 40 0.03 1.95 2.26 5 15.96 84.02 0.00 36 74.34 0.00 > 6 0 48 0.02 1.99 2.26 5 36.97 63.01 0.00 37 40.20 0.00 > 7 0 56 0.02 2.08 2.26 5 57.54 42.44 0.00 44 0.08 0.00 > pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 > 0.03 2.08 2.26 0 31.24 68.73 0.00 53 44.26 0.00 > 0 0 0 0.25 1.76 2.26 8 52.96 46.78 0.00 51 31.99 0.00 > 1 0 8 0.03 1.96 2.26 8 50.69 49.28 0.00 41 30.80 0.00 > 2 0 16 0.04 1.91 2.26 8 36.10 63.85 0.00 42 44.37 0.00 > 3 0 24 0.03 1.96 2.26 8 24.15 75.82 0.00 42 59.14 0.00 > 4 0 32 0.03 1.94 2.26 8 14.68 85.29 0.00 37 71.64 0.00 > 5 0 40 0.03 1.94 2.26 8 16.01 83.96 0.00 36 78.62 0.00 > 6 0 48 0.02 2.18 2.26 8 46.79 53.18 0.00 36 11.05 0.00 > 7 0 56 0.02 2.01 2.26 8 50.84 49.14 0.00 45 26.50 0.00 > pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 > 0.03 2.10 2.26 0 30.88 69.09 0.00 51 48.09 0.00 > 0 0 0 0.23 1.84 2.26 8 51.98 47.79 0.00 50 31.06 0.00 > 1 0 8 0.04 2.05 2.26 8 59.44 40.52 0.00 41 21.86 0.00 > 2 0 16 0.04 1.73 2.26 8 22.87 77.09 0.00 41 62.26 0.00 > 3 0 24 0.03 1.93 2.26 8 9.83 90.14 0.00 41 77.33 0.00 > 4 0 32 0.03 1.90 2.26 8 8.66 91.31 0.00 37 89.16 0.00 > 5 0 40 0.03 2.05 2.26 8 27.30 72.67 0.00 35 50.62 0.00 > 6 0 48 0.02 2.07 2.26 8 44.81 55.17 0.00 36 18.97 0.00 > 7 0 56 0.02 2.00 2.26 8 48.06 51.91 0.00 42 33.44 0.00 > > That, vs solid >99% pc3 for old 3.0 enterprise kernel. What > happened, dunno, kernel called itself master. Whatever, I don't > like test bogons, and this is one, so I thought I should let you > know. DL980 refuses to come close to the previous result. > > 'nuff of that, off to the slave pits before yet another day > evaporates. An ftrace function trace of the anomaly would be awfully useful. (or a trace generated of a dozen strategically placed trace_printk()s debug lines tracing the guts of cpuidle.) Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-10 5:17 ` Mike Galbraith 2013-12-10 11:45 ` Ingo Molnar @ 2013-12-10 14:29 ` Thomas Gleixner 2013-12-10 15:06 ` Ingo Molnar 2013-12-11 2:05 ` Thomas Gleixner 1 sibling, 2 replies; 96+ messages in thread From: Thomas Gleixner @ 2013-12-10 14:29 UTC (permalink / raw) To: Mike Galbraith Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Tue, 10 Dec 2013, Mike Galbraith wrote: > vogelweide:~/:[130]# turbostat -P -i 60 > pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 > 0.02 2.12 2.26 0 43.40 56.57 0.00 53 33.81 0.00 > 0 0 0 0.23 2.10 2.26 5 65.47 34.30 0.00 52 10.69 0.00 ... > That, vs solid >99% pc3 for old 3.0 enterprise kernel. What happened, > dunno, kernel called itself master. Whatever, I don't like test bogons, > and this is one, so I thought I should let you know. DL980 refuses to > come close to the previous result. Hmm, that's very similar to the issue Len is seing on his quad socket WM. But I can't see it on my dual socket SNB, neither does PeterZ on his dual socket WM machine. Heisenbugs .... Can you offline all cpus except core 0, enable all trace events (nop tracer), run "sleep 100" and stop and grab the trace? Please make sure that your tracebuffer is big enough to cover the full thing. Thanks, tglx ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-10 14:29 ` Thomas Gleixner @ 2013-12-10 15:06 ` Ingo Molnar 2013-12-11 2:05 ` Thomas Gleixner 1 sibling, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-10 15:06 UTC (permalink / raw) To: Thomas Gleixner Cc: Mike Galbraith, Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 * Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 10 Dec 2013, Mike Galbraith wrote: > > vogelweide:~/:[130]# turbostat -P -i 60 > > pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 > > 0.02 2.12 2.26 0 43.40 56.57 0.00 53 33.81 0.00 > > 0 0 0 0.23 2.10 2.26 5 65.47 34.30 0.00 52 10.69 0.00 > > ... > > > That, vs solid >99% pc3 for old 3.0 enterprise kernel. What happened, > > dunno, kernel called itself master. Whatever, I don't like test bogons, > > and this is one, so I thought I should let you know. DL980 refuses to > > come close to the previous result. > > Hmm, that's very similar to the issue Len is seing on his quad > socket WM. > > But I can't see it on my dual socket SNB, neither does PeterZ on his > dual socket WM machine. Heisenbugs .... > > Can you offline all cpus except core 0, enable all trace events (nop > tracer), run "sleep 100" and stop and grab the trace? Please make > sure that your tracebuffer is big enough to cover the full thing. I guess because we are facing a Heisenbug it might be prudent to check that the bug is still present with all those cpus offlined and with tracing enabled :-/ Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-10 14:29 ` Thomas Gleixner 2013-12-10 15:06 ` Ingo Molnar @ 2013-12-11 2:05 ` Thomas Gleixner 2013-12-11 3:21 ` Mike Galbraith 1 sibling, 1 reply; 96+ messages in thread From: Thomas Gleixner @ 2013-12-11 2:05 UTC (permalink / raw) To: Mike Galbraith Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86, Borislav Petkov On Tue, 10 Dec 2013, Thomas Gleixner wrote: > On Tue, 10 Dec 2013, Mike Galbraith wrote: > > vogelweide:~/:[130]# turbostat -P -i 60 > > pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 > > 0.02 2.12 2.26 0 43.40 56.57 0.00 53 33.81 0.00 > > 0 0 0 0.23 2.10 2.26 5 65.47 34.30 0.00 52 10.69 0.00 > > ... > > > That, vs solid >99% pc3 for old 3.0 enterprise kernel. What happened, > > dunno, kernel called itself master. Whatever, I don't like test bogons, > > and this is one, so I thought I should let you know. DL980 refuses to > > come close to the previous result. > > Hmm, that's very similar to the issue Len is seing on his quad socket WM. > > But I can't see it on my dual socket SNB, neither does PeterZ on his > dual socket WM machine. Heisenbugs .... > > Can you offline all cpus except core 0, enable all trace events (nop > tracer), run "sleep 100" and stop and grab the trace? Please make sure > that your tracebuffer is big enough to cover the full thing. So Boris found a quad socket WM which shows the issue and he also confirmed that dual socket ones cannot reproduce it... The trace Boris provided shows similar bogosity as you and Len are seing. Decoding it shows: Sleep time state=1 state=2 state=3 state=4 0 - 49 ms: 202 6 32 948 50 - 99 ms: 15 0 0 66 100 - 149 ms: 3 0 0 12 150 - 199 ms: 1 0 0 0 So we actually sleep for more than 150ms in C1!?! Lets look at the trace: <idle>-0 [000] d... 1977.728444: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 1977.728446: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1977.728448: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 1977.728450: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1977.728452: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 1977.728453: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1977.728455: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 1977.728456: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1977.728458: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 1977.728460: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1977.728462: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 1977.728463: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1977.728465: cpu_idle: state=1 cpu_id=0 <idle>-0 [000] d.h. 1977.891539: local_timer_entry: vector=239 So what happens is that SIX consecutive calls into the idle machinery with state=4 return right away. That means either need_resched() is set, which is nowhere indicated by the trace and would be causing a reschedule in between. And after that the idle governor decides to go into state=1 where we stay for a whopping 163ms. Now I wanted to analyze that a bit more and asked Boris to apply the debug patch below. Drumroll, then laugther and then banging the head against the wall: Sleep time state=1 state=2 state=3 state=4 0 - 49 ms: 8 3 112 2849 50 - 99 ms: 0 0 0 419 100 - 149 ms: 0 0 0 84 150 - 199 ms: 0 0 0 21 200 - 249 ms: 0 0 0 3 Also turbostat confirms that the machine is staying in its deepest power state as much as it can. I neither can observe similar massive bouncing in and out of a state as in the above trace. There are a few lonely ones, but they do not do any damage. Now what does this patch change? Nothing in the logic, it merily changes the timing of that idle code by adding trace_printks. And what did the x86 -> generic idle code change do ? Lets ignore the subtle SMP issues vs. the polling flag which are fixed in current mainline and completely irrelevant for an idle UP machine. And all the traces are from an UP machine (all cores offlined except 0) It changed the timing as well. So the trace addons making the issue go away AND the fact that nobody can reproduce the issue on a dual socket machine makes me wonder even more. Now there is an important data point: The problem is timing sensitive and the issue is only exposed on quad socket machines. So I looked at the code pathes once more and the only way we can bounce out of that is that the mwait in intel_idle() comes back prematurely. Now what makes mwait come back early? Especially in a situation where only a single core is involved in code execution? Definitely not a change in the idle code, which is relating to UP exacly ZERO, except the fact that we touch the cpu local nmi watchdog variable via touch_nmi_watchdog() inside the irq disabled region instead of the previous irq enabled region. So what makes mwait come back right away? Lets look at a snippet of the trace which was taken with the debug patch applied: <idle>-0 [000] d... 1747.667184: menu_select: expected_us: 39875 predicted_us: 22697 So we have according to the nohz code 39875 usec idle time ahead. The governer heuristics make that: 22697 predicted time. <idle>-0 [000] d... 1747.667187: menu_select: Trying idle state 1 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667188: menu_select: residency 6 <idle>-0 [000] d... 1747.667188: menu_select: exit_latency 3 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667189: menu_select: Trying idle state 2 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667190: menu_select: residency 20 <idle>-0 [000] d... 1747.667191: menu_select: exit_latency 10 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667191: menu_select: Trying idle state 3 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667192: menu_select: residency 80 <idle>-0 [000] d... 1747.667193: menu_select: exit_latency 20 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667194: menu_select: Trying idle state 4 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667194: menu_select: residency 800 <idle>-0 [000] d... 1747.667195: menu_select: exit_latency 200 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667195: cpu_idle: state=4 cpu_id=0 So the governor settles on state=4 which is the deepest one on this machine. <idle>-0 [000] .... 1747.667197: cpuidle_idle_call: coupled 0: entered state 4 And TWO micro seconds later we are out of that state, but need_resched() is NOT set and we did not get an interrupt and nothing could have fiddled with the thread_info->state cacheline simply because there is only ONE cpu online, i.e. CPU0 the very cpu which trace we are staring at. <idle>-0 [000] .... 1747.667198: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 1747.667199: menu_select: expected_us: 39875 predicted_us: 19860 So we still have an expected 39875 usec sleep time ahead, which is nonsense, but understandable because the nohz idle code did not go through an update. But of course the immediate return takes the residence of about 0 usec in state=4 into account and reduces the predicted_us to 19860 usec <idle>-0 [000] d... 1747.667201: menu_select: Trying idle state 1 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667201: menu_select: residency 6 <idle>-0 [000] d... 1747.667202: menu_select: exit_latency 3 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667203: menu_select: Trying idle state 2 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667204: menu_select: residency 20 <idle>-0 [000] d... 1747.667204: menu_select: exit_latency 10 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667205: menu_select: Trying idle state 3 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667206: menu_select: residency 80 <idle>-0 [000] d... 1747.667206: menu_select: exit_latency 20 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667207: menu_select: Trying idle state 4 s->dis 0 su->dis 0 <idle>-0 [000] d... 1747.667208: menu_select: residency 800 <idle>-0 [000] d... 1747.667208: menu_select: exit_latency 200 vs. 2000000000 multiplier 1 <idle>-0 [000] d... 1747.667209: cpu_idle: state=4 cpu_id=0 Now here we get lucky for whatever reasons and actually stay in state 4 until a real reason for leaving it happens. <idle>-0 [000] d.h. 1747.669851: irq_handler_entry: irq=120 name=eth0-TxRx-0 So now the subtle difference between the old x86 idle code and the generic one is that the new code fiddles with the polling state in thread_info->state more than necessary. We need to fix that, but nevertheless this whole mwait thing seems to be broken in some way or the other and ever worked by chance. So the difference boils down to: + modify(idle->thread_info->status) <some random code> mwait(idle->thread_info->status) So on an UP machine the only CPU which cares about that state is that very same idle thread which goes to sleep. Now there are a few questions: 1) Why does writing to that cacheline from the very same cpu just before going to sleep cause an issue? 2) Why does it come back less often if the timing is different, e.g.when the debug patch is active? 3) Why does adding a mb() instead of the trace printks make the thing NOT go away? Is it really just timing related? I'm going to ask Boris to replace the mb() with a plain udelay(10) tomorrow. 4) Why can't we observe that on dual socket machines? And for the record, I have analyzed traces before the switch to the generic idle code which show the same drop in/ drop out issue on an UP environment: <idle>-0 [000] d... 307.256288: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 307.256289: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 307.256291: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] d.h. 307.257337: irq_handler_entry: irq=21 name=ata_piix Yes, it does not cause the same issue as you can observe with the current code, but definitely there is something extremly fishy in an UP situation also there. So it boils down to the following: If something fiddles with the cache line which is monitored by mwait() close before the mwait() happens then mwait() is playing silly buggers. And that sillyness depends on the number of sockets. And what's even worse is that we get a regression report for probably exposing hardware sillyness. Something which should have been known already - but unfortunately never has been documented as a requirement for this mwait() stuff to work proper. Sorry to be blunt about that. If stuff like this happens: <idle>-0 [000] d... 307.256288: cpu_idle: state=4 cpu_id=0 <idle>-0 [000] .... 307.256289: cpu_idle: state=4294967295 cpu_id=0 <idle>-0 [000] d... 307.256291: cpu_idle: state=4 cpu_id=0 on a kernel prior to that change in an UP scenario, then someone should have investigated that years ago. But the preferred solution was obviously to ignore that because it did not do substantial damage. Now that the damage comes to light you cry murder for the completely wrong reason. The change exposes a shortcoming in mwait() rather than the more obvious performance issue which mwait() is supposed to avoid in the first place. Mike DID notice it and it got fixed, but of course it did not heal the now obvious shortcomings of mwait() Thanks, tglx --- diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..b6399af 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -118,7 +118,7 @@ int cpuidle_idle_call(void) struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); struct cpuidle_driver *drv; int next_state, entered_state; - bool broadcast; + bool broadcast, coupled = false; if (off || !initialized) return -ENODEV; @@ -147,15 +147,18 @@ int cpuidle_idle_call(void) if (broadcast) clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); - if (cpuidle_state_is_coupled(dev, drv, next_state)) + if (cpuidle_state_is_coupled(dev, drv, next_state)) { entered_state = cpuidle_enter_state_coupled(dev, drv, next_state); - else + coupled = true; + } else entered_state = cpuidle_enter_state(dev, drv, next_state); if (broadcast) clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); + trace_printk("coupled %d: entered state %d\n", coupled, entered_state); + trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); /* give the governor an opportunity to reflect on the outcome */ diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index cf7f2f0..9de7ee2 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -309,7 +309,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) data->expected_us = t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC; - data->bucket = which_bucket(data->expected_us); multiplier = performance_multiplier(); @@ -330,6 +329,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) data->correction_factor[data->bucket], RESOLUTION * DECAY); + trace_printk("expected_us: %d predicted_us: %d\n", data->expected_us, + data->predicted_us); + get_typical_interval(data); /* @@ -349,10 +351,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) struct cpuidle_state *s = &drv->states[i]; struct cpuidle_state_usage *su = &dev->states_usage[i]; + trace_printk("Trying idle state %d s->dis %d su->dis %d\n", i, + s->disabled, su->disable); if (s->disabled || su->disable) continue; + trace_printk("residency %d\n", s->target_residency); if (s->target_residency > data->predicted_us) continue; + trace_printk("exit_latency %d vs. %d multiplier %d\n", + s->exit_latency, latency_req, multiplier); if (s->exit_latency > latency_req) continue; if (s->exit_latency * multiplier > data->predicted_us) ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 2:05 ` Thomas Gleixner @ 2013-12-11 3:21 ` Mike Galbraith 2013-12-11 11:28 ` Thomas Gleixner 0 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-11 3:21 UTC (permalink / raw) To: Thomas Gleixner Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86, Borislav Petkov Alakazam.. pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 0.17 2.01 2.26 0 0.02 99.82 0.00 49 99.55 0.00 0 0 0 0.95 1.45 2.26 2 0.43 98.62 0.00 48 98.48 0.00 1 0 8 0.24 1.99 2.26 2 0.02 99.75 0.00 38 99.68 0.00 2 0 16 0.17 1.97 2.26 2 0.02 99.81 0.00 40 99.65 0.00 3 0 24 0.18 1.92 2.26 2 0.02 99.80 0.00 41 99.68 0.00 4 0 32 0.18 1.95 2.26 2 0.02 99.80 0.00 36 99.66 0.00 5 0 40 0.15 1.85 2.26 0 0.03 99.83 0.00 35 99.70 0.00 6 0 48 0.10 1.83 2.26 0 0.01 99.89 0.00 36 99.79 0.00 7 0 56 0.10 1.97 2.26 0 0.01 99.89 0.00 43 99.75 0.00 Yup, magical gremlin repellent works on 8 socket DL980 too. > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > index a55e68f..b6399af 100644 > --- a/drivers/cpuidle/cpuidle.c > +++ b/drivers/cpuidle/cpuidle.c > @@ -118,7 +118,7 @@ int cpuidle_idle_call(void) > struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); > struct cpuidle_driver *drv; > int next_state, entered_state; > - bool broadcast; > + bool broadcast, coupled = false; > > if (off || !initialized) > return -ENODEV; > @@ -147,15 +147,18 @@ int cpuidle_idle_call(void) > if (broadcast) > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); > > - if (cpuidle_state_is_coupled(dev, drv, next_state)) > + if (cpuidle_state_is_coupled(dev, drv, next_state)) { > entered_state = cpuidle_enter_state_coupled(dev, drv, > next_state); > - else > + coupled = true; > + } else > entered_state = cpuidle_enter_state(dev, drv, next_state); > > if (broadcast) > clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); > > + trace_printk("coupled %d: entered state %d\n", coupled, entered_state); > + > trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); > > /* give the governor an opportunity to reflect on the outcome */ > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c > index cf7f2f0..9de7ee2 100644 > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -309,7 +309,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > data->expected_us = > t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC; > > - > data->bucket = which_bucket(data->expected_us); > > multiplier = performance_multiplier(); > @@ -330,6 +329,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > data->correction_factor[data->bucket], > RESOLUTION * DECAY); > > + trace_printk("expected_us: %d predicted_us: %d\n", data->expected_us, > + data->predicted_us); > + > get_typical_interval(data); > > /* > @@ -349,10 +351,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > struct cpuidle_state *s = &drv->states[i]; > struct cpuidle_state_usage *su = &dev->states_usage[i]; > > + trace_printk("Trying idle state %d s->dis %d su->dis %d\n", i, > + s->disabled, su->disable); > if (s->disabled || su->disable) > continue; > + trace_printk("residency %d\n", s->target_residency); > if (s->target_residency > data->predicted_us) > continue; > + trace_printk("exit_latency %d vs. %d multiplier %d\n", > + s->exit_latency, latency_req, multiplier); > if (s->exit_latency > latency_req) > continue; > if (s->exit_latency * multiplier > data->predicted_us) > ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 3:21 ` Mike Galbraith @ 2013-12-11 11:28 ` Thomas Gleixner 2013-12-11 11:38 ` Borislav Petkov 0 siblings, 1 reply; 96+ messages in thread From: Thomas Gleixner @ 2013-12-11 11:28 UTC (permalink / raw) To: Mike Galbraith Cc: Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86, Borislav Petkov On Wed, 11 Dec 2013, Mike Galbraith wrote: > Alakazam.. > Yup, magical gremlin repellent works on 8 socket DL980 too. Now here is a less magical version of the gremlin repellent. And just for the amusement value: The erratum for the series 7400 says: AAI65. MONITOR/MWAIT May Have Excessive False Wakeups Problem: Normally, if MWAIT is used to enter a C-state that is C1 or higher, a store to the address range armed by the MONITOR instruction will cause the processor to exit MWAIT. Due to this erratum, false wakeups may occur when the monitored address range was recently written prior to executing the MONITOR instruction. Implication: Due to this erratum, performance and power savings may be impacted due to excessive false wakeups. Workaround: Execute a CLFLUSH Instruction immediately before every MONITOR instruction when the monitored location may have been recently written. Now that looks like the very same issue on these westmere EX machines. These false wakeups can be observed already before the idle changes and now they are just more prominent. Adding that clflush() unconditionally fixes the issue at least on Boris machine. Mike, can you retest on that 8 socket monstrum, please? So it looks like the idle power regression is actually a software change which exhibits a hardware "regression". So much for proper validated advertising which promises core power 0W at idle for these beasts :) Thanks, tglx --- diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 92d1206..50299ad 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -376,7 +376,7 @@ static int intel_idle(struct cpuidle_device *dev, clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); if (!current_set_polling_and_test()) { - + clflush(¤t_thread_info()->flags); __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); if (!need_resched()) ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 11:28 ` Thomas Gleixner @ 2013-12-11 11:38 ` Borislav Petkov 2013-12-11 11:52 ` Peter Zijlstra 0 siblings, 1 reply; 96+ messages in thread From: Borislav Petkov @ 2013-12-11 11:38 UTC (permalink / raw) To: Thomas Gleixner Cc: Mike Galbraith, Len Brown, Peter Zijlstra, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 12:28:36PM +0100, Thomas Gleixner wrote: > On Wed, 11 Dec 2013, Mike Galbraith wrote: > > Alakazam.. > > Yup, magical gremlin repellent works on 8 socket DL980 too. > > Now here is a less magical version of the gremlin repellent. > > And just for the amusement value: The erratum for the series 7400 > says: > > AAI65. MONITOR/MWAIT May Have Excessive False Wakeups > > Problem: Normally, if MWAIT is used to enter a C-state that is > C1 or higher, a store to the address range armed by > the MONITOR instruction will cause the processor to > exit MWAIT. Due to this erratum, false wakeups may > occur when the monitored address range was recently > written prior to executing the MONITOR instruction. > > Implication: Due to this erratum, performance and power savings may > be impacted due to excessive false wakeups. > > Workaround: Execute a CLFLUSH Instruction immediately before every > MONITOR instruction when the monitored location may > have been recently written. > > Now that looks like the very same issue on these westmere EX > machines. > > These false wakeups can be observed already before the idle changes > and now they are just more prominent. > > Adding that clflush() unconditionally fixes the issue at least on > Boris machine. > > Mike, can you retest on that 8 socket monstrum, please? > > So it looks like the idle power regression is actually a software > change which exhibits a hardware "regression". Right, if it turns out that this is really the case and that this erratum hasn't been fixed for models later than 29 - we'd need the additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 11:38 ` Borislav Petkov @ 2013-12-11 11:52 ` Peter Zijlstra 2013-12-11 12:29 ` Mike Galbraith 2013-12-11 21:43 ` Len Brown 0 siblings, 2 replies; 96+ messages in thread From: Peter Zijlstra @ 2013-12-11 11:52 UTC (permalink / raw) To: Borislav Petkov Cc: Thomas Gleixner, Mike Galbraith, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote: > Right, if it turns out that this is really the case and that this > erratum hasn't been fixed for models later than 29 - we'd need the > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly. You also need: https://lkml.org/lkml/2013/11/19/143 Because obviously not all mwait idle loops check that cpu bit. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 11:52 ` Peter Zijlstra @ 2013-12-11 12:29 ` Mike Galbraith 2013-12-11 12:43 ` Peter Zijlstra 2013-12-11 21:43 ` Len Brown 1 sibling, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-11 12:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote: > > Right, if it turns out that this is really the case and that this > > erratum hasn't been fixed for models later than 29 - we'd need the > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly. > > You also need: https://lkml.org/lkml/2013/11/19/143 > > Because obviously not all mwait idle loops check that cpu bit. I had tried that patch, to see if it would magically make the thing start working, nope. I had also tried... --- drivers/idle/intel_idle.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6/drivers/idle/intel_idle.c =================================================================== --- linux-2.6.orig/drivers/idle/intel_idle.c +++ linux-2.6/drivers/idle/intel_idle.c @@ -376,11 +376,14 @@ static int intel_idle(struct cpuidle_dev clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); if (!current_set_polling_and_test()) { - + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); if (!need_resched()) __mwait(eax, ecx); + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); } if (!(lapic_timer_reliable_states & (1 << (cstate)))) ..a cflush before _and_ after, among other (shazam!.. darn) guesses, but nogo. Turning that into the tglx one liner indeed did fix the thing, as did adding this to your patch. --- arch/x86/include/asm/mwait.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/arch/x86/include/asm/mwait.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/mwait.h +++ linux-2.6/arch/x86/include/asm/mwait.h @@ -43,7 +43,7 @@ static inline void __mwait(unsigned long static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { if (!current_set_polling_and_test()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) +// if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) clflush((void *)¤t_thread_info()->flags); __monitor((void *)¤t_thread_info()->flags, 0, 0); Grrr. flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 12:29 ` Mike Galbraith @ 2013-12-11 12:43 ` Peter Zijlstra 2013-12-11 13:10 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 96+ messages in thread From: Peter Zijlstra @ 2013-12-11 12:43 UTC (permalink / raw) To: Mike Galbraith Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote: > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote: > > > Right, if it turns out that this is really the case and that this > > > erratum hasn't been fixed for models later than 29 - we'd need the > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly. > > > > You also need: https://lkml.org/lkml/2013/11/19/143 > > > > Because obviously not all mwait idle loops check that cpu bit. > > I had tried that patch, to see if it would magically make the thing > start working, nope. I had also tried... > + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > + clflush((void *)¤t_thread_info()->flags); Yeah, you need a bit extra to enable that feature bit for your CPU as bpetkov said. Something like the below.. someone needs to double check and possibly add SNB/IVB EX parts if they're already available. diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index dc1ec0dff939..015642eed045 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -387,8 +387,15 @@ static void init_intel(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_PEBS); } - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) - set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); + if (c->x86 == 6 && cpu_has_clflush) { + switch (c->x86_model) { + case 29: /* Core2 EX */ + case 46: /* NHM EX */ + case 47: /* WSM EX */ + set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); + break; + } + } #ifdef CONFIG_X86_64 if (c->x86 == 15) ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 12:43 ` Peter Zijlstra @ 2013-12-11 13:10 ` Mike Galbraith 2013-12-11 13:40 ` Borislav Petkov 2013-12-11 14:42 ` Ingo Molnar 2 siblings, 0 replies; 96+ messages in thread From: Mike Galbraith @ 2013-12-11 13:10 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 2013-12-11 at 13:43 +0100, Peter Zijlstra wrote: > On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote: > > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: > > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote: > > > > Right, if it turns out that this is really the case and that this > > > > erratum hasn't been fixed for models later than 29 - we'd need the > > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly. > > > > > > You also need: https://lkml.org/lkml/2013/11/19/143 > > > > > > Because obviously not all mwait idle loops check that cpu bit. > > > > I had tried that patch, to see if it would magically make the thing > > start working, nope. I had also tried... > > > + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > > + clflush((void *)¤t_thread_info()->flags); > > Yeah, you need a bit extra to enable that feature bit for your CPU as > bpetkov said. Works for me, one more for the stable bucket. So as soon as Len resurrects mwait_idle for Q6600 (and other core2 when booted max_cstates=1 so tsc clocksource is used instead of pos hpet), all the (known) idle regressions should be history. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 12:43 ` Peter Zijlstra 2013-12-11 13:10 ` Mike Galbraith @ 2013-12-11 13:40 ` Borislav Petkov 2013-12-11 14:56 ` Ingo Molnar 2013-12-11 14:42 ` Ingo Molnar 2 siblings, 1 reply; 96+ messages in thread From: Borislav Petkov @ 2013-12-11 13:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote: > Something like the below.. someone needs to double check and possibly > add SNB/IVB EX parts if they're already available. Right, our friends at Intel would need to tell us which families/models does AAI65 span... if, this is actually the case. But now that Mike's box is fixed too, it very much looks like it. Oh, and AAIxx errata are Intel Xeon Processor 7400 Series ones, we would also need to know whether other incarnations are affected too. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 13:40 ` Borislav Petkov @ 2013-12-11 14:56 ` Ingo Molnar 2013-12-11 16:02 ` Borislav Petkov 2013-12-11 16:43 ` Peter Zijlstra 0 siblings, 2 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-11 14:56 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Borislav Petkov <bp@alien8.de> wrote: > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote: > > Something like the below.. someone needs to double check and possibly > > add SNB/IVB EX parts if they're already available. > > Right, our friends at Intel would need to tell us which families/models > does AAI65 span... if, this is actually the case. I think CLFLUSH should be pretty universally available, IIRC graphics drivers were using it rather heavily in combination with write-combining MTRRs, both on Linux and on Windows. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 14:56 ` Ingo Molnar @ 2013-12-11 16:02 ` Borislav Petkov 2013-12-11 16:43 ` Peter Zijlstra 1 sibling, 0 replies; 96+ messages in thread From: Borislav Petkov @ 2013-12-11 16:02 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote: > I think CLFLUSH should be pretty universally available, IIRC > graphics drivers were using it rather heavily in combination with > write-combining MTRRs, both on Linux and on Windows. ... and it is also very expensive. So I don't think it would be in Intel's best interest to do CLFLUSH unconditionally on all families but only on those which really need to. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 14:56 ` Ingo Molnar 2013-12-11 16:02 ` Borislav Petkov @ 2013-12-11 16:43 ` Peter Zijlstra 2013-12-11 17:50 ` Ingo Molnar 1 sibling, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-11 16:43 UTC (permalink / raw) To: Ingo Molnar Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote: > > * Borislav Petkov <bp@alien8.de> wrote: > > > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote: > > > Something like the below.. someone needs to double check and possibly > > > add SNB/IVB EX parts if they're already available. > > > > Right, our friends at Intel would need to tell us which families/models > > does AAI65 span... if, this is actually the case. > > I think CLFLUSH should be pretty universally available, IIRC graphics > drivers were using it rather heavily in combination with > write-combining MTRRs, both on Linux and on Windows. The availability isn't the problem; the cost is. We shouldn't issue one if its not required. Only 'broken' EX hardware needs it. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 16:43 ` Peter Zijlstra @ 2013-12-11 17:50 ` Ingo Molnar 2013-12-11 23:08 ` H. Peter Anvin 0 siblings, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-11 17:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote: > > > > * Borislav Petkov <bp@alien8.de> wrote: > > > > > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote: > > > > Something like the below.. someone needs to double check and possibly > > > > add SNB/IVB EX parts if they're already available. > > > > > > Right, our friends at Intel would need to tell us which > > > families/models does AAI65 span... if, this is actually the > > > case. > > > > I think CLFLUSH should be pretty universally available, IIRC > > graphics drivers were using it rather heavily in combination with > > write-combining MTRRs, both on Linux and on Windows. > > The availability isn't the problem; the cost is. We shouldn't issue > one if its not required. Only 'broken' EX hardware needs it. Well, availability could be a problem too, if some CPU (real or virtual) implements MWAIT but not CLFLUSH. In theory we could make mwait an alternatives variant and patch in the right combination of instructions? The CLFLUSH goes to the same address as on which the monitoring happens, so it could be considered one meta-instruction. Thansk, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 17:50 ` Ingo Molnar @ 2013-12-11 23:08 ` H. Peter Anvin 2013-12-11 23:14 ` Borislav Petkov 2013-12-12 8:51 ` Peter Zijlstra 0 siblings, 2 replies; 96+ messages in thread From: H. Peter Anvin @ 2013-12-11 23:08 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On 12/11/2013 09:50 AM, Ingo Molnar wrote: > > Well, availability could be a problem too, if some CPU (real or > virtual) implements MWAIT but not CLFLUSH. > > In theory we could make mwait an alternatives variant and patch in the > right combination of instructions? The CLFLUSH goes to the same > address as on which the monitoring happens, so it could be considered > one meta-instruction. > The first thing to do is probably to drop the use of thread_info as a wakeup doorbell. It seemed like a good idea at the time -- after all, there is one for each thread -- but it is extremely likely to be dirty in the cache, which is (presumably) what causes these kinds of bugs to be maximally likely. Even if we don't do the CLFLUSH it is likely that the hardware has to do something expensive behind the scenes. So I would like to propose that we switch to using a percpu variable which is a single cache line of nothing at all. It would only ever be touched by MONITOR and for explicit wakeup. Hopefully that will resolve this problem without the need for the CLFLUSH. -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 23:08 ` H. Peter Anvin @ 2013-12-11 23:14 ` Borislav Petkov 2013-12-12 0:52 ` H. Peter Anvin 2013-12-12 8:51 ` Peter Zijlstra 1 sibling, 1 reply; 96+ messages in thread From: Borislav Petkov @ 2013-12-11 23:14 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote: > So I would like to propose that we switch to using a percpu variable > which is a single cache line of nothing at all. It would only ever > be touched by MONITOR and for explicit wakeup. Hopefully that will > resolve this problem without the need for the CLFLUSH. Yep, makes a lot of sense to me to have an exclusive (overloaded meaning here :-)) cacheline only for that. And, if it works, we'll save us the penalty from the CLFLUSH too, cool. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 23:14 ` Borislav Petkov @ 2013-12-12 0:52 ` H. Peter Anvin 2013-12-12 4:25 ` Mike Galbraith 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-12 0:52 UTC (permalink / raw) To: Borislav Petkov Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 [-- Attachment #1: Type: text/plain, Size: 824 bytes --] On 12/11/2013 03:14 PM, Borislav Petkov wrote: > On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote: >> So I would like to propose that we switch to using a percpu variable >> which is a single cache line of nothing at all. It would only ever >> be touched by MONITOR and for explicit wakeup. Hopefully that will >> resolve this problem without the need for the CLFLUSH. > > Yep, makes a lot of sense to me to have an exclusive (overloaded meaning > here :-)) cacheline only for that. And, if it works, we'll save us the > penalty from the CLFLUSH too, cool. > Here is a POC patch... anyone willing to test it out? Two obvious things to watch out for: 1. I couldn't actually spot any obvious cases of a deliberate monitor trigger. 2. Should we do cpu_relax() for all users, not just powerclamp? -hpa [-- Attachment #2: 0001-x86-mwait-Use-a-dedicated-percpu-area-for-mwait-door.patch --] [-- Type: text/x-patch, Size: 9373 bytes --] >From 20a54d54ea06f050650ab8923b7d9ee1d21b5317 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" <hpa@linux.intel.com> Date: Wed, 11 Dec 2013 16:31:04 -0800 Subject: [PATCH] x86, mwait: Use a dedicated percpu area for mwait doorbell We have used the cache line that includes thread_info.flags as the mwait doorbell. However, this is liable to be dirty in the cache, which may trigger hardware errata, plus it might cause false wakeups. Instead, use a dedicated wakeup doorbell area. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> --- arch/x86/include/asm/cpufeature.h | 1 - arch/x86/include/asm/mwait.h | 46 ++++++++++++++++++++++++++++++++++++++ arch/x86/include/asm/processor.h | 23 ------------------- arch/x86/kernel/acpi/cstate.c | 6 +---- arch/x86/kernel/cpu/common.c | 17 ++++++++++++++ arch/x86/kernel/cpu/intel.c | 3 --- arch/x86/kernel/setup_percpu.c | 3 +++ arch/x86/kernel/smpboot.c | 19 +--------------- drivers/acpi/acpi_pad.c | 3 +-- drivers/idle/intel_idle.c | 4 +--- drivers/thermal/intel_powerclamp.c | 2 +- 11 files changed, 71 insertions(+), 56 deletions(-) diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index e099f95..cdc77f3 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -96,7 +96,6 @@ #define X86_FEATURE_XTOPOLOGY (3*32+22) /* cpu topology enum extensions */ #define X86_FEATURE_TSC_RELIABLE (3*32+23) /* TSC is known to be reliable */ #define X86_FEATURE_NONSTOP_TSC (3*32+24) /* TSC does not stop in C states */ -#define X86_FEATURE_CLFLUSH_MONITOR (3*32+25) /* "" clflush reqd with monitor */ #define X86_FEATURE_EXTD_APICID (3*32+26) /* has extended APICID (8 bits) */ #define X86_FEATURE_AMD_DCM (3*32+27) /* multi-node processor */ #define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */ diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h index 2f366d0..4c82863 100644 --- a/arch/x86/include/asm/mwait.h +++ b/arch/x86/include/asm/mwait.h @@ -13,4 +13,50 @@ #define MWAIT_ECX_INTERRUPT_BREAK 0x1 +#ifndef __ASSEMBLY__ + +#include <linux/compiler.h> +#include <linux/kernel.h> +#include <linux/percpu.h> + +static inline void __monitor(const void *eax, unsigned long ecx, + unsigned long edx) +{ + /* "monitor %eax, %ecx, %edx;" */ + asm volatile(".byte 0x0f, 0x01, 0xc8;" + :: "a" (eax), "c" (ecx), "d"(edx)); +} + +static inline void __mwait(unsigned long eax, unsigned long ecx) +{ + /* "mwait %eax, %ecx;" */ + asm volatile(".byte 0x0f, 0x01, 0xc9;" + :: "a" (eax), "c" (ecx)); +} + +static inline void __sti_mwait(unsigned long eax, unsigned long ecx) +{ + trace_hardirqs_on(); + /* "mwait %eax, %ecx;" */ + asm volatile("sti; .byte 0x0f, 0x01, 0xc9;" + :: "a" (eax), "c" (ecx)); +} + +extern char __percpu *mwait_doorbell; + +void __init setup_mwait_doorbell(void); + +static inline void x86_monitor_doorbell(unsigned long ecx, unsigned long edx) +{ + __monitor(__this_cpu_ptr(mwait_doorbell), ecx, edx); + mb(); +} + +static inline void x86_mwait_doorbell_bing_bong(int cpu) +{ + ACCESS_ONCE(*per_cpu_ptr(mwait_doorbell, cpu)) = 0; +} + +#endif /* !__ASSEMBLY__ */ + #endif /* _ASM_X86_MWAIT_H */ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index b7845a1..a51a79e 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -723,29 +723,6 @@ static inline void sync_core(void) #endif } -static inline void __monitor(const void *eax, unsigned long ecx, - unsigned long edx) -{ - /* "monitor %eax, %ecx, %edx;" */ - asm volatile(".byte 0x0f, 0x01, 0xc8;" - :: "a" (eax), "c" (ecx), "d"(edx)); -} - -static inline void __mwait(unsigned long eax, unsigned long ecx) -{ - /* "mwait %eax, %ecx;" */ - asm volatile(".byte 0x0f, 0x01, 0xc9;" - :: "a" (eax), "c" (ecx)); -} - -static inline void __sti_mwait(unsigned long eax, unsigned long ecx) -{ - trace_hardirqs_on(); - /* "mwait %eax, %ecx;" */ - asm volatile("sti; .byte 0x0f, 0x01, 0xc9;" - :: "a" (eax), "c" (ecx)); -} - extern void select_idle_routine(const struct cpuinfo_x86 *c); extern void init_amd_e400_c1e_mask(void); diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c index d2b7f27..aec26c5 100644 --- a/arch/x86/kernel/acpi/cstate.c +++ b/arch/x86/kernel/acpi/cstate.c @@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe); void mwait_idle_with_hints(unsigned long ax, unsigned long cx) { if (!need_resched()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) - clflush((void *)¤t_thread_info()->flags); - - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); + x86_monitor_doorbell(0, 0); if (!need_resched()) __mwait(ax, cx); } diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 6abc172..b6b19ab 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -24,6 +24,7 @@ #include <linux/cpumask.h> #include <asm/pgtable.h> #include <linux/atomic.h> +#include <asm/mwait.h> #include <asm/proto.h> #include <asm/setup.h> #include <asm/apic.h> @@ -63,6 +64,22 @@ void __init setup_cpu_local_masks(void) alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask); } +/* allocate percpu area for mwait doorbell */ +char __percpu *mwait_doorbell; + +void __init setup_mwait_doorbell(void) +{ + if (boot_cpu_has(X86_FEATURE_MWAIT)) { + mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size, + boot_cpu_data.clflush_size); + + if (!mwait_doorbell) { + /* This should never happen... */ + panic("Unable to allocate mwait doorbell!\n"); + } + } +} + static void default_init(struct cpuinfo_x86 *c) { #ifdef CONFIG_X86_64 diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index dc1ec0d..47b4e7b 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -387,9 +387,6 @@ static void init_intel(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_PEBS); } - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) - set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); - #ifdef CONFIG_X86_64 if (c->x86 == 15) c->x86_cache_alignment = c->x86_clflush_size * 2; diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c index 5cdff03..917e1af 100644 --- a/arch/x86/kernel/setup_percpu.c +++ b/arch/x86/kernel/setup_percpu.c @@ -284,4 +284,7 @@ void __init setup_per_cpu_areas(void) /* Setup cpu initialized, callin, callout masks */ setup_cpu_local_masks(); + + /* Setup MWAIT doorbell */ + setup_mwait_doorbell(); } diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 85dc05a..558e097 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1368,7 +1368,6 @@ static inline void mwait_play_dead(void) unsigned int eax, ebx, ecx, edx; unsigned int highest_cstate = 0; unsigned int highest_subcstate = 0; - void *mwait_ptr; int i; if (!this_cpu_has(X86_FEATURE_MWAIT)) @@ -1400,26 +1399,10 @@ static inline void mwait_play_dead(void) (highest_subcstate - 1); } - /* - * This should be a memory location in a cache line which is - * unlikely to be touched by other processors. The actual - * content is immaterial as it is not actually modified in any way. - */ - mwait_ptr = ¤t_thread_info()->flags; - wbinvd(); while (1) { - /* - * The CLFLUSH is a workaround for erratum AAI65 for - * the Xeon 7400 series. It's not clear it is actually - * needed, but it should be harmless in either case. - * The WBINVD is insufficient due to the spurious-wakeup - * case where we return around the loop. - */ - clflush(mwait_ptr); - __monitor(mwait_ptr, 0, 0); - mb(); + x86_monitor_doorbell(0, 0); __mwait(eax, 0); /* * If NMI wants to wake up CPU0, start CPU0. diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c index fc6008f..38aaa8b 100644 --- a/drivers/acpi/acpi_pad.c +++ b/drivers/acpi/acpi_pad.c @@ -193,8 +193,7 @@ static int power_saving_thread(void *data) CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); stop_critical_timings(); - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); + x86_monitor_doorbell(0, 0); if (!need_resched()) __mwait(power_saving_mwait_eax, 1); diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 92d1206..6fef6d9 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -376,9 +376,7 @@ static int intel_idle(struct cpuidle_device *dev, clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu); if (!current_set_polling_and_test()) { - - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); + x86_monitor_doorbell(0, 0); if (!need_resched()) __mwait(eax, ecx); } diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c index 8f181b3..15cf013 100644 --- a/drivers/thermal/intel_powerclamp.c +++ b/drivers/thermal/intel_powerclamp.c @@ -438,7 +438,7 @@ static int clamp_thread(void *arg) */ local_touch_nmi(); stop_critical_timings(); - __monitor((void *)¤t_thread_info()->flags, 0, 0); + x86_monitor_doorbell(0, 0); cpu_relax(); /* allow HT sibling to run */ __mwait(eax, ecx); start_critical_timings(); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 0:52 ` H. Peter Anvin @ 2013-12-12 4:25 ` Mike Galbraith 2013-12-12 4:49 ` H. Peter Anvin 0 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-12 4:25 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 2013-12-11 at 16:52 -0800, H. Peter Anvin wrote: > On 12/11/2013 03:14 PM, Borislav Petkov wrote: > > On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote: > >> So I would like to propose that we switch to using a percpu variable > >> which is a single cache line of nothing at all. It would only ever > >> be touched by MONITOR and for explicit wakeup. Hopefully that will > >> resolve this problem without the need for the CLFLUSH. > > > > Yep, makes a lot of sense to me to have an exclusive (overloaded meaning > > here :-)) cacheline only for that. And, if it works, we'll save us the > > penalty from the CLFLUSH too, cool. > > > > Here is a POC patch... anyone willing to test it out? Got it built, but it went boom on boot. Off to rummage. [ 0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:8 [ 0.000000] PERCPU: Embedded 26 pages/cpu @ffff88027ee00000 s75904 r8192 d22400 u131072 [ 0.000000] pcpu-alloc: s75904 r8192 d22400 u131072 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 [ 0.000000] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 [ 0.000000] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 [ 0.000000] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 [ 0.000000] BUG: unable to handle kernel paging request at 000000000000b8a0 [ 0.000000] IP: [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38 [ 0.000000] PGD 0 [ 0.000000] Oops: 0002 [#1] SMP [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0-master #185 [ 0.000000] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010 [ 0.000000] task: ffffffff81a10460 ti: ffffffff81a00000 task.ti: ffffffff81a00000 [ 0.000000] RIP: 0010:[<ffffffff81a9d072>] [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38 [ 0.000000] RSP: 0000:ffffffff81a01f28 EFLAGS: 00010002 [ 0.000000] RAX: 0000000000014880 RBX: 0000000000000040 RCX: 0000000000000000 [ 0.000000] RDX: 0000000000000040 RSI: 0000000000000040 RDI: ffffffff81a38e60 [ 0.000000] RBP: ffffffff81a01f28 R08: 0000000000000040 R09: 0000000000000000 [ 0.000000] R10: ffff88027f5f4880 R11: 0000000000000001 R12: 000000000000b850 [ 0.000000] R13: 000000000000b026 R14: 000000000000b024 R15: 000000000000b020 [ 0.000000] FS: 0000000000000000(0000) GS:ffff88027ee00000(0000) knlGS:0000000000000000 [ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.000000] CR2: 000000000000b8a0 CR3: 0000000001a0b000 CR4: 00000000000000b0 [ 0.000000] Stack: [ 0.000000] ffffffff81a01f78 ffffffff81aa3641 ffffffff81a01f98 000000000000cd48 [ 0.000000] ffff88027ee00000 0000000000000000 0000000000000000 0000000000000000 [ 0.000000] 0000000000000000 0000000000000000 ffffffff81a01fa8 ffffffff81a96d89 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff81aa3641>] setup_per_cpu_areas+0x233/0x242 [ 0.000000] [<ffffffff81a96d89>] start_kernel+0x84/0x370 [ 0.000000] [<ffffffff81a964cc>] x86_64_start_reservations+0x1b/0x35 [ 0.000000] [<ffffffff81a96614>] x86_64_start_kernel+0x12e/0x135 [ 0.000000] Code: 40 8f a7 81 e8 f6 fe ff ff c9 c3 55 48 8b 05 0a bf fd ff 48 89 e5 a8 08 75 02 c9 c3 0f b7 3d 84 bf fd ff 48 89 fe e8 fe dc 64 ff <48> 89 05 27 e8 56 7e 48 85 c0 75 e3 48 c7 c7 f0 83 78 81 e8 55 [ 0.000000] RIP [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38 [ 0.000000] RSP <ffffffff81a01f28> [ 0.000000] CR2: 000000000000b8a0 [ 0.000000] ---[ end trace f6e32c58e0729292 ]--- [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! Build delta. --- arch/x86/include/asm/mwait.h | 4 ++-- arch/x86/kernel/cpu/common.c | 7 ++++--- arch/x86/kernel/setup_percpu.c | 1 + 3 files changed, 7 insertions(+), 5 deletions(-) Index: linux-2.6/arch/x86/kernel/cpu/common.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/cpu/common.c +++ linux-2.6/arch/x86/kernel/cpu/common.c @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void) } /* allocate percpu area for mwait doorbell */ -char __percpu *mwait_doorbell; +DEFINE_PER_CPU(char *, mwait_doorbell); +EXPORT_PER_CPU_SYMBOL(mwait_doorbell); void __init setup_mwait_doorbell(void) { if (boot_cpu_has(X86_FEATURE_MWAIT)) { - mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size, - boot_cpu_data.clflush_size); + mwait_doorbell = __alloc_percpu(boot_cpu_data.x86_clflush_size, + boot_cpu_data.x86_clflush_size); if (!mwait_doorbell) { /* This should never happen... */ Index: linux-2.6/arch/x86/kernel/setup_percpu.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/setup_percpu.c +++ linux-2.6/arch/x86/kernel/setup_percpu.c @@ -20,6 +20,7 @@ #include <asm/cpumask.h> #include <asm/cpu.h> #include <asm/stackprotector.h> +#include <asm/mwait.h> DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number); EXPORT_PER_CPU_SYMBOL(cpu_number); Index: linux-2.6/arch/x86/include/asm/mwait.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/mwait.h +++ linux-2.6/arch/x86/include/asm/mwait.h @@ -42,9 +42,9 @@ static inline void __sti_mwait(unsigned :: "a" (eax), "c" (ecx)); } -extern char __percpu *mwait_doorbell; +DECLARE_PER_CPU(char *, mwait_doorbell); -void __init setup_mwait_doorbell(void); +extern void __init setup_mwait_doorbell(void); static inline void x86_monitor_doorbell(unsigned long ecx, unsigned long edx) { ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 4:25 ` Mike Galbraith @ 2013-12-12 4:49 ` H. Peter Anvin 2013-12-12 4:59 ` Mike Galbraith 2013-12-12 5:37 ` Mike Galbraith 0 siblings, 2 replies; 96+ messages in thread From: H. Peter Anvin @ 2013-12-12 4:49 UTC (permalink / raw) To: Mike Galbraith Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On 12/11/2013 08:25 PM, Mike Galbraith wrote: > arch/x86/include/asm/mwait.h | 4 ++-- > arch/x86/kernel/cpu/common.c | 7 ++++--- > arch/x86/kernel/setup_percpu.c | 1 + > 3 files changed, 7 insertions(+), 5 deletions(-) > > Index: linux-2.6/arch/x86/kernel/cpu/common.c > =================================================================== > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c > +++ linux-2.6/arch/x86/kernel/cpu/common.c > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void) > } > > /* allocate percpu area for mwait doorbell */ > -char __percpu *mwait_doorbell; > +DEFINE_PER_CPU(char *, mwait_doorbell); > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell); > Sorry, this is wrong. This is NOT a percpu variable, it is a pointer to a percpu allocation, but the variable itself is not a percpu variable. This explains your boom. > void __init setup_mwait_doorbell(void) > { > if (boot_cpu_has(X86_FEATURE_MWAIT)) { > - mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size, > - boot_cpu_data.clflush_size); > + mwait_doorbell = __alloc_percpu(boot_cpu_data.x86_clflush_size, > + boot_cpu_data.x86_clflush_size); > > if (!mwait_doorbell) { > /* This should never happen... */ -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 4:49 ` H. Peter Anvin @ 2013-12-12 4:59 ` Mike Galbraith 2013-12-12 5:37 ` Mike Galbraith 1 sibling, 0 replies; 96+ messages in thread From: Mike Galbraith @ 2013-12-12 4:59 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: > On 12/11/2013 08:25 PM, Mike Galbraith wrote: > > arch/x86/include/asm/mwait.h | 4 ++-- > > arch/x86/kernel/cpu/common.c | 7 ++++--- > > arch/x86/kernel/setup_percpu.c | 1 + > > 3 files changed, 7 insertions(+), 5 deletions(-) > > > > Index: linux-2.6/arch/x86/kernel/cpu/common.c > > =================================================================== > > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c > > +++ linux-2.6/arch/x86/kernel/cpu/common.c > > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void) > > } > > > > /* allocate percpu area for mwait doorbell */ > > -char __percpu *mwait_doorbell; > > +DEFINE_PER_CPU(char *, mwait_doorbell); > > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell); > > > > Sorry, this is wrong. This is NOT a percpu variable, it is a pointer to > a percpu allocation, but the variable itself is not a percpu variable. > This explains your boom. Yeah, I know, I already slapped myself upside the head. (what were you thinking mikie...la la la la la:) -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 4:49 ` H. Peter Anvin 2013-12-12 4:59 ` Mike Galbraith @ 2013-12-12 5:37 ` Mike Galbraith 2013-12-12 5:45 ` H. Peter Anvin 1 sibling, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-12 5:37 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: > On 12/11/2013 08:25 PM, Mike Galbraith wrote: > > arch/x86/include/asm/mwait.h | 4 ++-- > > arch/x86/kernel/cpu/common.c | 7 ++++--- > > arch/x86/kernel/setup_percpu.c | 1 + > > 3 files changed, 7 insertions(+), 5 deletions(-) > > > > Index: linux-2.6/arch/x86/kernel/cpu/common.c > > =================================================================== > > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c > > +++ linux-2.6/arch/x86/kernel/cpu/common.c > > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void) > > } > > > > /* allocate percpu area for mwait doorbell */ > > -char __percpu *mwait_doorbell; > > +DEFINE_PER_CPU(char *, mwait_doorbell); > > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell); > > > > Sorry, this is wrong. This is NOT a percpu variable, it is a pointer to > a percpu allocation, but the variable itself is not a percpu variable. > This explains your boom. With that fixed, it boots, but is not quite perfect. ... [ 258.560079] fbcon: radeondrmfb (fb0) is primary device [ 258.722483] Console: switching to colour frame buffer device 128x48 [ 258.847076] radeon 0000:01:03.0: fb0: radeondrmfb frame buffer device [ 258.911991] radeon 0000:01:03.0: registered panic notifier [ 258.968772] [drm] Initialized radeon 2.35.0 20080528 for 0000:01:03.0 on minor 0 ... [ 469.738604] netxen_nic 0000:04:00.3: using msi-x interrupts [ 469.739078] netxen_nic 0000:04:00.3: eth5: GbE port initialized [ 469.830512] ipmi_si 00:01: Found new BMC (man_id: 0x00000b, prod_id: 0x2000, dev_id: 0x13) [ 469.830524] ipmi_si 00:01: IPMI kcs interface initialized [ 473.729862] iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS ... [ 711.636741] fuse init (API version 7.22) ...zzzz ok box, doctor appointment is in an hour away. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 5:37 ` Mike Galbraith @ 2013-12-12 5:45 ` H. Peter Anvin 2013-12-12 5:57 ` Mike Galbraith 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-12 5:45 UTC (permalink / raw) To: Mike Galbraith Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 As in it hangs at that point? Mike Galbraith <bitbucket@online.de> wrote: >On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: >> On 12/11/2013 08:25 PM, Mike Galbraith wrote: >> > arch/x86/include/asm/mwait.h | 4 ++-- >> > arch/x86/kernel/cpu/common.c | 7 ++++--- >> > arch/x86/kernel/setup_percpu.c | 1 + >> > 3 files changed, 7 insertions(+), 5 deletions(-) >> > >> > Index: linux-2.6/arch/x86/kernel/cpu/common.c >> > =================================================================== >> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c >> > +++ linux-2.6/arch/x86/kernel/cpu/common.c >> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void) >> > } >> > >> > /* allocate percpu area for mwait doorbell */ >> > -char __percpu *mwait_doorbell; >> > +DEFINE_PER_CPU(char *, mwait_doorbell); >> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell); >> > >> >> Sorry, this is wrong. This is NOT a percpu variable, it is a pointer >to >> a percpu allocation, but the variable itself is not a percpu >variable. >> This explains your boom. > >With that fixed, it boots, but is not quite perfect. > >... >[ 258.560079] fbcon: radeondrmfb (fb0) is primary device >[ 258.722483] Console: switching to colour frame buffer device 128x48 >[ 258.847076] radeon 0000:01:03.0: fb0: radeondrmfb frame buffer >device >[ 258.911991] radeon 0000:01:03.0: registered panic notifier >[ 258.968772] [drm] Initialized radeon 2.35.0 20080528 for >0000:01:03.0 on minor 0 >... >[ 469.738604] netxen_nic 0000:04:00.3: using msi-x interrupts >[ 469.739078] netxen_nic 0000:04:00.3: eth5: GbE port initialized >[ 469.830512] ipmi_si 00:01: Found new BMC (man_id: 0x00000b, prod_id: >0x2000, dev_id: 0x13) >[ 469.830524] ipmi_si 00:01: IPMI kcs interface initialized >[ 473.729862] iTCO_wdt: unable to reset NO_REBOOT flag, device >disabled by hardware/BIOS >... >[ 711.636741] fuse init (API version 7.22) > >...zzzz ok box, doctor appointment is in an hour away. > >-Mike -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 5:45 ` H. Peter Anvin @ 2013-12-12 5:57 ` Mike Galbraith 2013-12-12 6:05 ` Mike Galbraith 0 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-12 5:57 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: > As in it hangs at that point? Nope, it's still going. [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz Funny, continents move faster :) Maybe missing a write or two. -Mike ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 5:57 ` Mike Galbraith @ 2013-12-12 6:05 ` Mike Galbraith 2013-12-12 7:57 ` H. Peter Anvin 0 siblings, 1 reply; 96+ messages in thread From: Mike Galbraith @ 2013-12-12 6:05 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Thu, 2013-12-12 at 06:57 +0100, Mike Galbraith wrote: > On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: > > As in it hangs at that point? > > Nope, it's still going. > > [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz > > Funny, continents move faster :) Maybe missing a write or two. When I get back it may be done booting. I'm gonna let it try for grins while I'm away, then take a peek, see if I can spot it. [ 1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz done Starting HAL daemon done Setting up (localfs) network interfaces: lo lo IP address: 127.0.0.1/8 IP address: 127.0.0.2/8 done eth0 device: Broadcom Corporation NetXtreme II BCM5709 Gig No configuration found for eth0 unused eth1 device: Broadcom Corporation NetXtreme II BCM5709 Gig No configuration found for eth1 unused eth2 device: NetXen Incorporated NX3031 Multifunction 1/10 [ 2457.114007] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready [ 2457.114455] netxen_nic: eth2 NIC Link is up [ 2457.223582] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready eth2 IP address: 0.0.0.0/32 ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 6:05 ` Mike Galbraith @ 2013-12-12 7:57 ` H. Peter Anvin 0 siblings, 0 replies; 96+ messages in thread From: H. Peter Anvin @ 2013-12-12 7:57 UTC (permalink / raw) To: Mike Galbraith Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 I'm guessing there is an implicit or explicit wake-up somewhere that I managed to miss. If it is implicit it could be hard to catch. Mike Galbraith <bitbucket@online.de> wrote: >On Thu, 2013-12-12 at 06:57 +0100, Mike Galbraith wrote: >> On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: >> > As in it hangs at that point? >> >> Nope, it's still going. >> >> [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency >limits: 1064 MHz, 2266 MHz >> >> Funny, continents move faster :) Maybe missing a write or two. > >When I get back it may be done booting. I'm gonna let it try for grins >while I'm away, then take a peek, see if I can spot it. > >[ 1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency >limits: 1064 MHz, 2266 MHz > done >Starting HAL daemon >done > > >Setting up (localfs) network interfaces: > lo > lo IP address: 127.0.0.1/8 > IP address: 127.0.0.2/8 done > eth0 device: Broadcom Corporation NetXtreme II BCM5709 Gig > No configuration found for eth0 unused > eth1 device: Broadcom Corporation NetXtreme II BCM5709 Gig > No configuration found for eth1 unused > eth2 device: NetXen Incorporated NX3031 Multifunction 1/10 >[ 2457.114007] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready >[ 2457.114455] netxen_nic: eth2 NIC Link is up >[ 2457.223582] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready > eth2 IP address: 0.0.0.0/32 -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 23:08 ` H. Peter Anvin 2013-12-11 23:14 ` Borislav Petkov @ 2013-12-12 8:51 ` Peter Zijlstra 2013-12-12 13:28 ` Ingo Molnar 1 sibling, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-12 8:51 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote: > On 12/11/2013 09:50 AM, Ingo Molnar wrote: > > > > Well, availability could be a problem too, if some CPU (real or > > virtual) implements MWAIT but not CLFLUSH. > > > > In theory we could make mwait an alternatives variant and patch in the > > right combination of instructions? The CLFLUSH goes to the same > > address as on which the monitoring happens, so it could be considered > > one meta-instruction. > > > > The first thing to do is probably to drop the use of thread_info as a > wakeup doorbell. It seemed like a good idea at the time -- after all, > there is one for each thread -- but it is extremely likely to be dirty > in the cache, which is (presumably) what causes these kinds of bugs to > be maximally likely. Even if we don't do the CLFLUSH it is likely that > the hardware has to do something expensive behind the scenes. > > So I would like to propose that we switch to using a percpu variable > which is a single cache line of nothing at all. It would only ever be > touched by MONITOR and for explicit wakeup. Hopefully that will resolve > this problem without the need for the CLFLUSH. The reason we use thread_info::flags is because we need to write TIF_NEED_RESCHED into it to wake up anyhow. Using another cacheline would mean the wakeup path would need to write a second cross cpu cacheline -- that is badness too. So no, I don't think we want to listen to another line. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 8:51 ` Peter Zijlstra @ 2013-12-12 13:28 ` Ingo Molnar 2013-12-12 15:06 ` H. Peter Anvin 0 siblings, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-12 13:28 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote: > > On 12/11/2013 09:50 AM, Ingo Molnar wrote: > > > > > > Well, availability could be a problem too, if some CPU (real or > > > virtual) implements MWAIT but not CLFLUSH. > > > > > > In theory we could make mwait an alternatives variant and patch in the > > > right combination of instructions? The CLFLUSH goes to the same > > > address as on which the monitoring happens, so it could be considered > > > one meta-instruction. > > > > > > > The first thing to do is probably to drop the use of thread_info as a > > wakeup doorbell. It seemed like a good idea at the time -- after all, > > there is one for each thread -- but it is extremely likely to be dirty > > in the cache, which is (presumably) what causes these kinds of bugs to > > be maximally likely. Even if we don't do the CLFLUSH it is likely that > > the hardware has to do something expensive behind the scenes. > > > > So I would like to propose that we switch to using a percpu variable > > which is a single cache line of nothing at all. It would only ever be > > touched by MONITOR and for explicit wakeup. Hopefully that will resolve > > this problem without the need for the CLFLUSH. > > The reason we use thread_info::flags is because we need to write > TIF_NEED_RESCHED into it to wake up anyhow. > > Using another cacheline would mean the wakeup path would need to write a > second cross cpu cacheline -- that is badness too. > > So no, I don't think we want to listen to another line. Seconded ... Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 13:28 ` Ingo Molnar @ 2013-12-12 15:06 ` H. Peter Anvin 2013-12-12 15:51 ` Peter Zijlstra 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-12 15:06 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On 12/12/2013 05:28 AM, Ingo Molnar wrote: >> >> The reason we use thread_info::flags is because we need to write >> TIF_NEED_RESCHED into it to wake up anyhow. >> >> Using another cacheline would mean the wakeup path would need to write a >> second cross cpu cacheline -- that is badness too. >> >> So no, I don't think we want to listen to another line. > Right, okay, so that's the implicit wakeup. However, I would think the CLFLUSH would hurt a lot more. -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-12 15:06 ` H. Peter Anvin @ 2013-12-12 15:51 ` Peter Zijlstra 0 siblings, 0 replies; 96+ messages in thread From: Peter Zijlstra @ 2013-12-12 15:51 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Thu, Dec 12, 2013 at 07:06:57AM -0800, H. Peter Anvin wrote: > On 12/12/2013 05:28 AM, Ingo Molnar wrote: > >> > >> The reason we use thread_info::flags is because we need to write > >> TIF_NEED_RESCHED into it to wake up anyhow. > >> > >> Using another cacheline would mean the wakeup path would need to write a > >> second cross cpu cacheline -- that is badness too. > >> > >> So no, I don't think we want to listen to another line. > > > > Right, okay, so that's the implicit wakeup. However, I would think the > CLFLUSH would hurt a lot more. Maybe, but still, who cares? Its only a few broken cpus that actually need the clflush, normal cpus do not. We should not optimize for the broken case. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 12:43 ` Peter Zijlstra 2013-12-11 13:10 ` Mike Galbraith 2013-12-11 13:40 ` Borislav Petkov @ 2013-12-11 14:42 ` Ingo Molnar 2013-12-11 15:02 ` Thomas Gleixner 2013-12-11 16:44 ` Peter Zijlstra 2 siblings, 2 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-11 14:42 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote: > > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: > > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote: > > > > Right, if it turns out that this is really the case and that this > > > > erratum hasn't been fixed for models later than 29 - we'd need the > > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly. > > > > > > You also need: https://lkml.org/lkml/2013/11/19/143 > > > > > > Because obviously not all mwait idle loops check that cpu bit. > > > > I had tried that patch, to see if it would magically make the thing > > start working, nope. I had also tried... > > > + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > > + clflush((void *)¤t_thread_info()->flags); > > Yeah, you need a bit extra to enable that feature bit for your CPU as > bpetkov said. > > Something like the below.. someone needs to double check and possibly > add SNB/IVB EX parts if they're already available. > > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c > index dc1ec0dff939..015642eed045 100644 > --- a/arch/x86/kernel/cpu/intel.c > +++ b/arch/x86/kernel/cpu/intel.c > @@ -387,8 +387,15 @@ static void init_intel(struct cpuinfo_x86 *c) > set_cpu_cap(c, X86_FEATURE_PEBS); > } > > - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) > - set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); > + if (c->x86 == 6 && cpu_has_clflush) { > + switch (c->x86_model) { > + case 29: /* Core2 EX */ > + case 46: /* NHM EX */ > + case 47: /* WSM EX */ > + set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); > + break; > + } > + } Another thing that is required I think is to issue a write barrier before CLFLUSH instruction. By my (possibly incorrect ...) reading of the documentation CLFLUSH does not appear to be ordered (at all), so it might execute before the modification to the affected memory? So something like: if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { smp_wmb(); /* order CLFLUSH */ clflush(¤t_thread_info()->flags); } possibly put behind some utility function, smp_clflush() or so, hiding the CPU feature bit check as well: smp_clflush(¤t_thread_info()->flags); or so? Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 14:42 ` Ingo Molnar @ 2013-12-11 15:02 ` Thomas Gleixner 2013-12-11 15:09 ` Ingo Molnar 2013-12-11 16:44 ` Peter Zijlstra 1 sibling, 1 reply; 96+ messages in thread From: Thomas Gleixner @ 2013-12-11 15:02 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Borislav Petkov, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 11 Dec 2013, Ingo Molnar wrote: > > Another thing that is required I think is to issue a write barrier > before CLFLUSH instruction. By my (possibly incorrect ...) reading of > the documentation CLFLUSH does not appear to be ordered (at all), so > it might execute before the modification to the affected memory? We already have a write barrier in the code which modifies current_thread_info()->flags. Thanks, tglx ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 15:02 ` Thomas Gleixner @ 2013-12-11 15:09 ` Ingo Molnar 2013-12-11 16:44 ` Peter Zijlstra 0 siblings, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-11 15:09 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Mike Galbraith, Borislav Petkov, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, 11 Dec 2013, Ingo Molnar wrote: > > > > Another thing that is required I think is to issue a write barrier > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of > > the documentation CLFLUSH does not appear to be ordered (at all), so > > it might execute before the modification to the affected memory? > > We already have a write barrier in the code which modifies > current_thread_info()->flags. Which code is that, could you please cite it? (I tried finding it but my grep-fu failed me.) Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 15:09 ` Ingo Molnar @ 2013-12-11 16:44 ` Peter Zijlstra 2013-12-11 17:48 ` Ingo Molnar 0 siblings, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-11 16:44 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, Mike Galbraith, Borislav Petkov, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 04:09:11PM +0100, Ingo Molnar wrote: > > * Thomas Gleixner <tglx@linutronix.de> wrote: > > > On Wed, 11 Dec 2013, Ingo Molnar wrote: > > > > > > Another thing that is required I think is to issue a write barrier > > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of > > > the documentation CLFLUSH does not appear to be ordered (at all), so > > > it might execute before the modification to the affected memory? > > > > We already have a write barrier in the code which modifies > > current_thread_info()->flags. > > Which code is that, could you please cite it? (I tried finding it but > my grep-fu failed me.) current_set_polling_and_test(). ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 16:44 ` Peter Zijlstra @ 2013-12-11 17:48 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-11 17:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Thomas Gleixner, Mike Galbraith, Borislav Petkov, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Dec 11, 2013 at 04:09:11PM +0100, Ingo Molnar wrote: > > > > * Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > On Wed, 11 Dec 2013, Ingo Molnar wrote: > > > > > > > > Another thing that is required I think is to issue a write barrier > > > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of > > > > the documentation CLFLUSH does not appear to be ordered (at all), so > > > > it might execute before the modification to the affected memory? > > > > > > We already have a write barrier in the code which modifies > > > current_thread_info()->flags. > > > > Which code is that, could you please cite it? (I tried finding it but > > my grep-fu failed me.) > > current_set_polling_and_test(). Thx, those variants have the right barrier indeed. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 14:42 ` Ingo Molnar 2013-12-11 15:02 ` Thomas Gleixner @ 2013-12-11 16:44 ` Peter Zijlstra 2013-12-11 17:47 ` Ingo Molnar 1 sibling, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-11 16:44 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, Dec 11, 2013 at 03:42:38PM +0100, Ingo Molnar wrote: > Another thing that is required I think is to issue a write barrier > before CLFLUSH instruction. By my (possibly incorrect ...) reading of > the documentation CLFLUSH does not appear to be ordered (at all), so > it might execute before the modification to the affected memory? > > > So something like: > > if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > smp_wmb(); /* order CLFLUSH */ > clflush(¤t_thread_info()->flags); > } smp_wmb() is a NO-OP on x86 remember :-) Also, a wmb doesn't actually need to flush the store buffers. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 16:44 ` Peter Zijlstra @ 2013-12-11 17:47 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-11 17:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list, linux-kernel, Jeremy Eder, x86 * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Dec 11, 2013 at 03:42:38PM +0100, Ingo Molnar wrote: > > Another thing that is required I think is to issue a write barrier > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of > > the documentation CLFLUSH does not appear to be ordered (at all), so > > it might execute before the modification to the affected memory? > > > > > > So something like: > > > > if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > > smp_wmb(); /* order CLFLUSH */ > > clflush(¤t_thread_info()->flags); > > } > > smp_wmb() is a NO-OP on x86 remember :-) Well, it's a compiler barrier but yes - I suspect a smp_mb() might be needed - at least according to the CLFLUSH documentation it has no implicit guaranteed ordering wrt. preceding writes. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 11:52 ` Peter Zijlstra 2013-12-11 12:29 ` Mike Galbraith @ 2013-12-11 21:43 ` Len Brown 2013-12-11 22:22 ` Thomas Gleixner 1 sibling, 1 reply; 96+ messages in thread From: Len Brown @ 2013-12-11 21:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Borislav Petkov, Thomas Gleixner, Mike Galbraith, Linux PM list, linux-kernel, Jeremy Eder, x86 For the record... intel_idle doesn't check that bit because it doesn't run on model 29 -- the Xeon 7400 was the "Dunnington" 4-socket generation based on Core2. Until now, i was not aware that this issue might apply to models other than that one. Checking w/ the HW guys... thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: 50 Watt idle power regression bisected to Linux-3.10 2013-12-11 21:43 ` Len Brown @ 2013-12-11 22:22 ` Thomas Gleixner 0 siblings, 0 replies; 96+ messages in thread From: Thomas Gleixner @ 2013-12-11 22:22 UTC (permalink / raw) To: Len Brown Cc: Peter Zijlstra, Borislav Petkov, Mike Galbraith, Linux PM list, linux-kernel, Jeremy Eder, x86 On Wed, 11 Dec 2013, Len Brown wrote: > For the record... > > intel_idle doesn't check that bit because it doesn't run on model 29 -- > the Xeon 7400 was the "Dunnington" 4-socket generation based on Core2. Right, we wondered about the restricted model check already. Interesting that this is a 4 socket issue as well. > Until now, i was not aware that this issue might apply to models other > than that one. > > Checking w/ the HW guys... I hope they are not equally surprised :) Thanks, tglx ^ permalink raw reply [flat|nested] 96+ messages in thread
* [PATCH] x86 idle: repair large-server 50-watt idle-power regression @ 2013-12-18 21:44 ` Len Brown 0 siblings, 0 replies; 96+ messages in thread From: Len Brown @ 2013-12-18 21:44 UTC (permalink / raw) To: x86; +Cc: linux-pm, linux-kernel, Len Brown, stable From: Len Brown <len.brown@intel.com> Linux 3.10 changed the timing of how thread_info->flags is touched: x86: Use generic idle loop (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) This caused Intel NHM-EX and WSM-EX servers to experience a large number of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote from deep C-states to shallow C-states, which caused these platforms to experience a significant increase in idle power. Note that this issue was already present before the commit above, however, it wasn't seen often enough to be noticed in power measurements. Here we extend an errata workaround from the Core2 EX "Dunnington" to extend to NHM-EX and WSM-EX, to prevent these immediate returns from MWAIT, reducing idle power on these platforms. While only acpi_idle ran on Dunnington, intel_idle may also run on these two newer systems. As of today, there are no other models that are known to need this tweak. ref: https://lkml.org/lkml/2013/12/7/22 Signed-off-by: Len Brown <len.brown@intel.com> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x --- arch/x86/kernel/cpu/intel.c | 3 ++- drivers/idle/intel_idle.c | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index dc1ec0d..ea04b34 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_PEBS); } - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) + if (c->x86 == 6 && cpu_has_clflush && + (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47)) set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); #ifdef CONFIG_X86_64 diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 92d1206..f80b700 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, if (!current_set_polling_and_test()) { + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); + __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); if (!need_resched()) -- 1.8.5.1.19.gdaad3aa ^ permalink raw reply related [flat|nested] 96+ messages in thread
* [PATCH] x86 idle: repair large-server 50-watt idle-power regression @ 2013-12-18 21:44 ` Len Brown 0 siblings, 0 replies; 96+ messages in thread From: Len Brown @ 2013-12-18 21:44 UTC (permalink / raw) To: x86; +Cc: linux-pm, linux-kernel, Len Brown, stable From: Len Brown <len.brown@intel.com> Linux 3.10 changed the timing of how thread_info->flags is touched: x86: Use generic idle loop (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) This caused Intel NHM-EX and WSM-EX servers to experience a large number of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote from deep C-states to shallow C-states, which caused these platforms to experience a significant increase in idle power. Note that this issue was already present before the commit above, however, it wasn't seen often enough to be noticed in power measurements. Here we extend an errata workaround from the Core2 EX "Dunnington" to extend to NHM-EX and WSM-EX, to prevent these immediate returns from MWAIT, reducing idle power on these platforms. While only acpi_idle ran on Dunnington, intel_idle may also run on these two newer systems. As of today, there are no other models that are known to need this tweak. ref: https://lkml.org/lkml/2013/12/7/22 Signed-off-by: Len Brown <len.brown@intel.com> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x --- arch/x86/kernel/cpu/intel.c | 3 ++- drivers/idle/intel_idle.c | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index dc1ec0d..ea04b34 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_PEBS); } - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) + if (c->x86 == 6 && cpu_has_clflush && + (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47)) set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); #ifdef CONFIG_X86_64 diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 92d1206..f80b700 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, if (!current_set_polling_and_test()) { + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); + __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); if (!need_resched()) -- 1.8.5.1.19.gdaad3aa ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-18 21:44 ` Len Brown (?) @ 2013-12-19 12:22 ` Ingo Molnar 2013-12-19 14:40 ` H. Peter Anvin 2013-12-19 15:55 ` H. Peter Anvin -1 siblings, 2 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 12:22 UTC (permalink / raw) To: Len Brown Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov * Len Brown <lenb@kernel.org> wrote: > From: Len Brown <len.brown@intel.com> > > Linux 3.10 changed the timing of how thread_info->flags is touched: > > x86: Use generic idle loop > (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) > > This caused Intel NHM-EX and WSM-EX servers to experience a large number > of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote > from deep C-states to shallow C-states, which caused these platforms > to experience a significant increase in idle power. > > Note that this issue was already present before the commit above, > however, it wasn't seen often enough to be noticed in power measurements. > > Here we extend an errata workaround from the Core2 EX "Dunnington" > to extend to NHM-EX and WSM-EX, to prevent these immediate > returns from MWAIT, reducing idle power on these platforms. > > While only acpi_idle ran on Dunnington, intel_idle > may also run on these two newer systems. > As of today, there are no other models that are known > to need this tweak. > > ref: https://lkml.org/lkml/2013/12/7/22 > Signed-off-by: Len Brown <len.brown@intel.com> > Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x > --- > arch/x86/kernel/cpu/intel.c | 3 ++- > drivers/idle/intel_idle.c | 3 +++ > 2 files changed, 5 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c > index dc1ec0d..ea04b34 100644 > --- a/arch/x86/kernel/cpu/intel.c > +++ b/arch/x86/kernel/cpu/intel.c > @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c) > set_cpu_cap(c, X86_FEATURE_PEBS); > } > > - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) > + if (c->x86 == 6 && cpu_has_clflush && > + (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47)) > set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); > > #ifdef CONFIG_X86_64 > diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c > index 92d1206..f80b700 100644 > --- a/drivers/idle/intel_idle.c > +++ b/drivers/idle/intel_idle.c > @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, > > if (!current_set_polling_and_test()) { > > + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > + clflush((void *)¤t_thread_info()->flags); > + > __monitor((void *)¤t_thread_info()->flags, 0, 0); I don't think either of these casts to '(void *)' is needed, both the clflush() and __monitor() will take pointers. Looks good to me otherwise - except that maybe the best way to represent this quirk would be for the CLFLUSH+MONITOR sequence to be a single 'instruction' which is patched in dynamically during bootup, using our usual alternatives framework. On non-affected CPUs a NOP would remain in place of the CLFLUSH, eliminating the branch above. So the whole thing could be thought of as a slightly more complex 'monitor' instruction - not exposing the quirk details to actual usage sites. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 12:22 ` Ingo Molnar @ 2013-12-19 14:40 ` H. Peter Anvin 2013-12-19 15:45 ` Borislav Petkov 2013-12-19 15:55 ` H. Peter Anvin 1 sibling, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 14:40 UTC (permalink / raw) To: Ingo Molnar, Len Brown Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov ... or just use static_cpu_has() maybe? Ingo Molnar <mingo@kernel.org> wrote: > >* Len Brown <lenb@kernel.org> wrote: > >> From: Len Brown <len.brown@intel.com> >> >> Linux 3.10 changed the timing of how thread_info->flags is touched: >> >> x86: Use generic idle loop >> (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) >> >> This caused Intel NHM-EX and WSM-EX servers to experience a large >number >> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to >demote >> from deep C-states to shallow C-states, which caused these platforms >> to experience a significant increase in idle power. >> >> Note that this issue was already present before the commit above, >> however, it wasn't seen often enough to be noticed in power >measurements. >> >> Here we extend an errata workaround from the Core2 EX "Dunnington" >> to extend to NHM-EX and WSM-EX, to prevent these immediate >> returns from MWAIT, reducing idle power on these platforms. >> >> While only acpi_idle ran on Dunnington, intel_idle >> may also run on these two newer systems. >> As of today, there are no other models that are known >> to need this tweak. >> >> ref: https://lkml.org/lkml/2013/12/7/22 >> Signed-off-by: Len Brown <len.brown@intel.com> >> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x >> --- >> arch/x86/kernel/cpu/intel.c | 3 ++- >> drivers/idle/intel_idle.c | 3 +++ >> 2 files changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/kernel/cpu/intel.c >b/arch/x86/kernel/cpu/intel.c >> index dc1ec0d..ea04b34 100644 >> --- a/arch/x86/kernel/cpu/intel.c >> +++ b/arch/x86/kernel/cpu/intel.c >> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c) >> set_cpu_cap(c, X86_FEATURE_PEBS); >> } >> >> - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) >> + if (c->x86 == 6 && cpu_has_clflush && >> + (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == >47)) >> set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); >> >> #ifdef CONFIG_X86_64 >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c >> index 92d1206..f80b700 100644 >> --- a/drivers/idle/intel_idle.c >> +++ b/drivers/idle/intel_idle.c >> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, >> >> if (!current_set_polling_and_test()) { >> >> + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) >> + clflush((void *)¤t_thread_info()->flags); >> + >> __monitor((void *)¤t_thread_info()->flags, 0, 0); > >I don't think either of these casts to '(void *)' is needed, both the >clflush() and __monitor() will take pointers. > >Looks good to me otherwise - except that maybe the best way to >represent this quirk would be for the CLFLUSH+MONITOR sequence to be a >single 'instruction' which is patched in dynamically during bootup, >using our usual alternatives framework. > >On non-affected CPUs a NOP would remain in place of the CLFLUSH, >eliminating the branch above. > >So the whole thing could be thought of as a slightly more complex >'monitor' instruction - not exposing the quirk details to actual usage >sites. > >Thanks, > > Ingo -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 14:40 ` H. Peter Anvin @ 2013-12-19 15:45 ` Borislav Petkov 0 siblings, 0 replies; 96+ messages in thread From: Borislav Petkov @ 2013-12-19 15:45 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith On Thu, Dec 19, 2013 at 06:40:41AM -0800, H. Peter Anvin wrote: > ... or just use static_cpu_has() maybe? Right, if we can get the compiler to generate the shortest CLFLUSH of 3 bytes with register indirect addressing for the operand, then stomping over those 3 bytes with the alternatives would be the fastest solution. If, of course, three NOPs which get discarded in the front end are cheaper than the near JMP from static_cpu_has. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 12:22 ` Ingo Molnar 2013-12-19 14:40 ` H. Peter Anvin @ 2013-12-19 15:55 ` H. Peter Anvin 2013-12-19 16:02 ` Ingo Molnar 1 sibling, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 15:55 UTC (permalink / raw) To: Ingo Molnar, Len Brown Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov On 12/19/2013 04:22 AM, Ingo Molnar wrote: >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c >> index 92d1206..f80b700 100644 >> --- a/drivers/idle/intel_idle.c >> +++ b/drivers/idle/intel_idle.c >> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, >> >> if (!current_set_polling_and_test()) { >> >> + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) >> + clflush((void *)¤t_thread_info()->flags); >> + >> __monitor((void *)¤t_thread_info()->flags, 0, 0); > > I don't think either of these casts to '(void *)' is needed, both the > clflush() and __monitor() will take pointers. __monitor() currently doesn't, which is idiotic. -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 15:55 ` H. Peter Anvin @ 2013-12-19 16:02 ` Ingo Molnar 2013-12-19 16:09 ` H. Peter Anvin 2013-12-19 16:13 ` H. Peter Anvin 0 siblings, 2 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 16:02 UTC (permalink / raw) To: H. Peter Anvin Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov * H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 04:22 AM, Ingo Molnar wrote: > >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c > >> index 92d1206..f80b700 100644 > >> --- a/drivers/idle/intel_idle.c > >> +++ b/drivers/idle/intel_idle.c > >> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, > >> > >> if (!current_set_polling_and_test()) { > >> > >> + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > >> + clflush((void *)¤t_thread_info()->flags); > >> + > >> __monitor((void *)¤t_thread_info()->flags, 0, 0); > > > > I don't think either of these casts to '(void *)' is needed, both the > > clflush() and __monitor() will take pointers. > > __monitor() currently doesn't, which is idiotic. Hm, __monitor() seems to take a void *: arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx, So writing: if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) clflush(¤t_thread_info()->flags); __monitor(¤t_thread_info()->flags, 0, 0); ought to work just fine. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 16:02 ` Ingo Molnar @ 2013-12-19 16:09 ` H. Peter Anvin 2013-12-19 16:13 ` H. Peter Anvin 1 sibling, 0 replies; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 16:09 UTC (permalink / raw) To: Ingo Molnar Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov On 12/19/2013 08:02 AM, Ingo Molnar wrote: >> >> __monitor() currently doesn't, which is idiotic. > > Hm, __monitor() seems to take a void *: > > arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx, > > So writing: > > if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > clflush(¤t_thread_info()->flags); > > __monitor(¤t_thread_info()->flags, 0, 0); > Yes, brain failure on my part. -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 16:02 ` Ingo Molnar 2013-12-19 16:09 ` H. Peter Anvin @ 2013-12-19 16:13 ` H. Peter Anvin 2013-12-19 16:21 ` Peter Zijlstra 1 sibling, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 16:13 UTC (permalink / raw) To: Ingo Molnar Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov [-- Attachment #1: Type: text/plain, Size: 189 bytes --] How does this look? Completely untested, of course. I do wonder if we need more memory barriers, though. An alternative would be to move everything into mwait_idle_with_hints(). -hpa [-- Attachment #2: diff --] [-- Type: text/plain, Size: 1626 bytes --] diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 7b034a4057f9..6dce588f94b4 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx) :: "a" (eax), "c" (ecx)); } +/* + * Issue a clflush in preparation for a monitor instruction if the CPU + * needs it. We force the address into the ax register to get a fixed + * length for the instruction, however, this is what the monitor instruction + * is going to need anyway, so it shouldn't add any additional code. + */ +static inline void clflush_monitor(const void *addr, unsigned long ecx, + unsigned long edx) +{ + alternative_input(ASM_NOP3, + "clflush (%0)", + X86_FEATURE_CLFLUSH_MONITOR, + "a" (addr)); + __monitor(addr, eax, edx); + smp_mb(); +} + extern void select_idle_routine(const struct cpuinfo_x86 *c); extern void init_amd_e400_c1e_mask(void); diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c index d2b7f27781bc..b14d02354134 100644 --- a/arch/x86/kernel/acpi/cstate.c +++ b/arch/x86/kernel/acpi/cstate.c @@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe); void mwait_idle_with_hints(unsigned long ax, unsigned long cx) { if (!need_resched()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) - clflush((void *)¤t_thread_info()->flags); - - __monitor((void *)¤t_thread_info()->flags, 0, 0); - smp_mb(); + clflush_monitor(¤t_thread_info()->flags, 0, 0); if (!need_resched()) __mwait(ax, cx); } ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 16:13 ` H. Peter Anvin @ 2013-12-19 16:21 ` Peter Zijlstra 2013-12-19 16:50 ` H. Peter Anvin 0 siblings, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-19 16:21 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On Thu, Dec 19, 2013 at 08:13:21AM -0800, H. Peter Anvin wrote: > How does this look? Completely untested, of course. > > I do wonder if we need more memory barriers, though. > > An alternative would be to move everything into mwait_idle_with_hints(). > > -hpa > > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h > index 7b034a4057f9..6dce588f94b4 100644 > --- a/arch/x86/include/asm/processor.h > +++ b/arch/x86/include/asm/processor.h > @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx) > :: "a" (eax), "c" (ecx)); > } > > +/* > + * Issue a clflush in preparation for a monitor instruction if the CPU > + * needs it. We force the address into the ax register to get a fixed > + * length for the instruction, however, this is what the monitor instruction > + * is going to need anyway, so it shouldn't add any additional code. > + */ > +static inline void clflush_monitor(const void *addr, unsigned long ecx, > + unsigned long edx) > +{ > + alternative_input(ASM_NOP3, > + "clflush (%0)", > + X86_FEATURE_CLFLUSH_MONITOR, > + "a" (addr)); > + __monitor(addr, eax, edx); > + smp_mb(); > +} What's that mb for? Also, can you please merge: http://marc.info/?l=linux-kernel&m=138685838420632 Thomas said he would pick up that series, but seems to have gone missing the past week or so. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 16:21 ` Peter Zijlstra @ 2013-12-19 16:50 ` H. Peter Anvin 2013-12-19 17:07 ` Ingo Molnar 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 16:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > What's that mb for? > It already exists in mwait_idle_with_hints(); I just moved it into this common function. It is a bit odd, I have to admit; it seems like it should be *before* the monitor (and possibly we should have one after the CLFLUSH as well?) > Also, can you please merge: > > http://marc.info/?l=linux-kernel&m=138685838420632 > > Thomas said he would pick up that series, but seems to have gone missing > the past week or so. > We need to get the immediate regression fixed. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 16:50 ` H. Peter Anvin @ 2013-12-19 17:07 ` Ingo Molnar 2013-12-19 17:25 ` Peter Zijlstra 2013-12-19 18:09 ` H. Peter Anvin 0 siblings, 2 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 17:07 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > > > What's that mb for? > > > > It already exists in mwait_idle_with_hints(); I just moved it into > this common function. It is a bit odd, I have to admit; it seems > like it should be *before* the monitor (and possibly we should have > one after the CLFLUSH as well?) Yes, I think we need a barrier before the CLFLUSH, because according to my reading of the Intel documentation CLFLUSH has no implicit ordering so it might get reordered with the store to ->flags in current_set_polling_and_test(), which might result in spurious wakeup problems again. (And CLFLUSH is a store in a sense, so special in that the regular ordering for stores does not apply.) Likewise, having a barrier before the MONITOR looks sensible as well. Having it _after_ monitor looks weird and is probably wrong. [It might have been the effects of someone seeing the spurious wakeup problems with realizing the true source, or so.] Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:07 ` Ingo Molnar @ 2013-12-19 17:25 ` Peter Zijlstra 2013-12-19 17:36 ` Peter Zijlstra ` (2 more replies) 2013-12-19 18:09 ` H. Peter Anvin 1 sibling, 3 replies; 96+ messages in thread From: Peter Zijlstra @ 2013-12-19 17:25 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > > > > > What's that mb for? > > > > > > > It already exists in mwait_idle_with_hints(); I just moved it into > > this common function. It is a bit odd, I have to admit; it seems > > like it should be *before* the monitor (and possibly we should have > > one after the CLFLUSH as well?) > > Yes, I think we need a barrier before the CLFLUSH, because according > to my reading of the Intel documentation CLFLUSH has no implicit > ordering so it might get reordered with the store to ->flags in > current_set_polling_and_test(), which might result in spurious wakeup > problems again. No it cannot; since current_set_polling_and_test() already has a barrier to prevent that. Also, the location patched by hpa doesn't actually call that at all. That said, I would find it very strange indeed if a CLFLUSH doesn't also flush the store buffer. > (And CLFLUSH is a store in a sense, so special in that the regular > ordering for stores does not apply.) > > Likewise, having a barrier before the MONITOR looks sensible as well. > Having it _after_ monitor looks weird and is probably wrong. [It might > have been the effects of someone seeing the spurious wakeup problems > with realizing the true source, or so.] I again have to disagree, one would expect monitor to flush all that is required to start the monitor -- and it actually does so. As is testified by this extra CLFLUSH being called a bug workaround. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:25 ` Peter Zijlstra @ 2013-12-19 17:36 ` Peter Zijlstra 2013-12-19 18:05 ` H. Peter Anvin 2013-12-19 17:50 ` Peter Zijlstra 2013-12-19 18:10 ` Ingo Molnar 2 siblings, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-19 17:36 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > That said, I would find it very strange indeed if a CLFLUSH doesn't also > flush the store buffer. OK, it explicitly states it does not do that and you indeed need an mfence before the clflush. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:36 ` Peter Zijlstra @ 2013-12-19 18:05 ` H. Peter Anvin 2013-12-19 18:14 ` Ingo Molnar 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 18:05 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On 12/19/2013 09:36 AM, Peter Zijlstra wrote: > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: >> That said, I would find it very strange indeed if a CLFLUSH doesn't also >> flush the store buffer. > > OK, it explicitly states it does not do that and you indeed need an > mfence before the clflush. > So, MONITOR is defined to be ordered as a load, which I think should be adequate, but I really wonder if we should have mfence on both sides of clflush. This now is up to 9 bytes, and perhaps pushing it a bit with how much we would be willing to patch out. On the other hand - the CLFLUSH seems to have worked well enough by itself, and this is all probabilistic anyway, so perhaps we should just leave the naked CLFLUSH in and not worry about it unless measurements say otherwise? -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 18:05 ` H. Peter Anvin @ 2013-12-19 18:14 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:14 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 09:36 AM, Peter Zijlstra wrote: > > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > >> That said, I would find it very strange indeed if a CLFLUSH doesn't also > >> flush the store buffer. > > > > OK, it explicitly states it does not do that and you indeed need > > an mfence before the clflush. > > So, MONITOR is defined to be ordered as a load, which I think should > be adequate, but I really wonder if we should have mfence on both > sides of clflush. This now is up to 9 bytes, and perhaps pushing it > a bit with how much we would be willing to patch out. > > On the other hand - the CLFLUSH seems to have worked well enough by > itself, and this is all probabilistic anyway, so perhaps we should > just leave the naked CLFLUSH in and not worry about it unless > measurements say otherwise? So I think the window of breakage was rather large here, and since it seems to trigger on rare types of hardware I think we'd be better off by erring on the side of robustness this time around ... This is the 'go to idle' path, which isn't as time-critical as the 'get out of idle' code path. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:25 ` Peter Zijlstra 2013-12-19 17:36 ` Peter Zijlstra @ 2013-12-19 17:50 ` Peter Zijlstra 2013-12-19 18:18 ` Ingo Molnar 2013-12-19 18:10 ` Ingo Molnar 2 siblings, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-19 17:50 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > Likewise, having a barrier before the MONITOR looks sensible as well. > > I again have to disagree, one would expect monitor to flush all that is > required to start the monitor -- and it actually does so. As is > testified by this extra CLFLUSH being called a bug workaround. SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot pass a previous STORE to the same address. That said; there's enough holes in there to swim a titanic through, seeing how MONITOR stares at an entire cacheline and LOAD/STORE order is specified on location, whatever that means. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:50 ` Peter Zijlstra @ 2013-12-19 18:18 ` Ingo Molnar 2013-12-19 21:05 ` H. Peter Anvin 0 siblings, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:18 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: > > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > > Likewise, having a barrier before the MONITOR looks sensible as well. > > > > I again have to disagree, one would expect monitor to flush all that is > > required to start the monitor -- and it actually does so. As is > > testified by this extra CLFLUSH being called a bug workaround. > > SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot > pass a previous STORE to the same address. Yes ... but you could argue that CLFLUSH is neither a load nor a store, it's a _cache sync_ operation, with its special ordering properties. > That said; there's enough holes in there to swim a titanic through, > seeing how MONITOR stares at an entire cacheline and LOAD/STORE > order is specified on location, whatever that means. I think assuming that MONITOR is ordered as a load or better is a pretty safe one (and in fact the Intel documentation seems to say so) - I'd say MONITOR is in micro-code and essentially snoops on cache events on that specific cache line, and loads the cache line on a snoop hit? Btw., what state is the cache line after a MONITOR instruction, is it loaded as shared, or as excusive? (exclusive would probably be better for performance.) Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 18:18 ` Ingo Molnar @ 2013-12-19 21:05 ` H. Peter Anvin 2013-12-19 21:17 ` Ingo Molnar 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 21:05 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE according to the SDM. Ingo Molnar <mingo@kernel.org> wrote: > >* Peter Zijlstra <peterz@infradead.org> wrote: > >> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote: >> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: >> > > Likewise, having a barrier before the MONITOR looks sensible as >well. >> > >> > I again have to disagree, one would expect monitor to flush all >that is >> > required to start the monitor -- and it actually does so. As is >> > testified by this extra CLFLUSH being called a bug workaround. >> >> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot >> pass a previous STORE to the same address. > >Yes ... but you could argue that CLFLUSH is neither a load nor a >store, it's a _cache sync_ operation, with its special ordering >properties. > >> That said; there's enough holes in there to swim a titanic through, >> seeing how MONITOR stares at an entire cacheline and LOAD/STORE >> order is specified on location, whatever that means. > >I think assuming that MONITOR is ordered as a load or better is a >pretty safe one (and in fact the Intel documentation seems to say so) >- I'd say MONITOR is in micro-code and essentially snoops on cache >events on that specific cache line, and loads the cache line on a >snoop hit? > >Btw., what state is the cache line after a MONITOR instruction, is it >loaded as shared, or as excusive? (exclusive would probably be better >for performance.) > >Thanks, > > Ingo -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 21:05 ` H. Peter Anvin @ 2013-12-19 21:17 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 21:17 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * H. Peter Anvin <hpa@zytor.com> wrote: > CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE > according to the SDM. Yes, I quoted the relevant text a couple of hours ago, CLFLUSH is pretty special, so it's not ordered against atomics or serializing instructions for example - only ordered against explict MFENCE calls, which on x86 is mb()/smp_mb(). Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:25 ` Peter Zijlstra 2013-12-19 17:36 ` Peter Zijlstra 2013-12-19 17:50 ` Peter Zijlstra @ 2013-12-19 18:10 ` Ingo Molnar 2 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:10 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote: > > > > * H. Peter Anvin <hpa@zytor.com> wrote: > > > > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote: > > > > > > > > What's that mb for? > > > > > > > > > > It already exists in mwait_idle_with_hints(); I just moved it into > > > this common function. It is a bit odd, I have to admit; it seems > > > like it should be *before* the monitor (and possibly we should have > > > one after the CLFLUSH as well?) > > > > Yes, I think we need a barrier before the CLFLUSH, because according > > to my reading of the Intel documentation CLFLUSH has no implicit > > ordering so it might get reordered with the store to ->flags in > > current_set_polling_and_test(), which might result in spurious wakeup > > problems again. > > No it cannot; since current_set_polling_and_test() already has a > barrier to prevent that. See below: > Also, the location patched by hpa doesn't actually call that at all. > > That said, I would find it very strange indeed if a CLFLUSH doesn't > also flush the store buffer. So, the Intel documentation says (sorry about the lazy-link): http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc31.htm "CLFLUSH is only ordered by the MFENCE instruction. It is not guaranteed to be ordered by any other fencing, serializing or other CLFLUSH instruction. For example, software can use an MFENCE instruction to insure that previous stores are included in the write-back." So a specific MFENCE barrier is needed. Also note that this wording excludes implicit serialization such as LOCK prefix or XCHG barriers. As it happens current_set_polling_and_test() uses smp_mb(), which happens to map to MFENCE on all CPUs that can do CLFLUSH, but that's really just an accident and in no way engineered. _At minimum_ we need a prominent comment at the clflush usage site that we rely on the MFENCE in current_set_polling_and_test() ... > > (And CLFLUSH is a store in a sense, so special in that the regular > > ordering for stores does not apply.) > > > > Likewise, having a barrier before the MONITOR looks sensible as > > well. Having it _after_ monitor looks weird and is probably wrong. > > [It might have been the effects of someone seeing the spurious > > wakeup problems with realizing the true source, or so.] > > I again have to disagree, one would expect monitor to flush all that > is required to start the monitor -- and it actually does so. As is > testified by this extra CLFLUSH being called a bug workaround. This assumption would be safer - although AFAICS the Intel MONITOR/MWAIT documentation is quiet about this aspect. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 17:07 ` Ingo Molnar 2013-12-19 17:25 ` Peter Zijlstra @ 2013-12-19 18:09 ` H. Peter Anvin 2013-12-19 18:19 ` H. Peter Anvin 2013-12-19 18:19 ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar 1 sibling, 2 replies; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 18:09 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On 12/19/2013 09:07 AM, Ingo Molnar wrote: > > Likewise, having a barrier before the MONITOR looks sensible as well. > Having it _after_ monitor looks weird and is probably wrong. [It might > have been the effects of someone seeing the spurious wakeup problems > with realizing the true source, or so.] > Does anyone know the history of this barrier after the monitor? I know Len is looking for a minimal patchset that can go into -stable, and it seems prudent to not preturb the code more than necessary, but going forward it would be nice to know... -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 18:09 ` H. Peter Anvin @ 2013-12-19 18:19 ` H. Peter Anvin 2013-12-19 18:23 ` Ingo Molnar 2013-12-19 18:19 ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar 1 sibling, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 18:19 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On 12/19/2013 10:09 AM, H. Peter Anvin wrote: > On 12/19/2013 09:07 AM, Ingo Molnar wrote: >> >> Likewise, having a barrier before the MONITOR looks sensible as well. >> Having it _after_ monitor looks weird and is probably wrong. [It might >> have been the effects of someone seeing the spurious wakeup problems >> with realizing the true source, or so.] >> > > Does anyone know the history of this barrier after the monitor? I know > Len is looking for a minimal patchset that can go into -stable, and it > seems prudent to not preturb the code more than necessary, but going > forward it would be nice to know... > Hmm... it *looks* like it is intended to be part of the construct: smp_mb(); if (!need_resched()) ... I found a note in the HLT variant of the function saying: /* * TS_POLLING-cleared state must be visible before we * test NEED_RESCHED: */ ... which presumably has been copied elsewhere. -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 18:19 ` H. Peter Anvin @ 2013-12-19 18:23 ` Ingo Molnar [not found] ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com> 0 siblings, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:23 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 10:09 AM, H. Peter Anvin wrote: > > On 12/19/2013 09:07 AM, Ingo Molnar wrote: > >> > >> Likewise, having a barrier before the MONITOR looks sensible as well. > >> Having it _after_ monitor looks weird and is probably wrong. [It might > >> have been the effects of someone seeing the spurious wakeup problems > >> with realizing the true source, or so.] > >> > > > > Does anyone know the history of this barrier after the monitor? I know > > Len is looking for a minimal patchset that can go into -stable, and it > > seems prudent to not preturb the code more than necessary, but going > > forward it would be nice to know... > > > > Hmm... it *looks* like it is intended to be part of the construct: > > smp_mb(); > if (!need_resched()) > ... > > I found a note in the HLT variant of the function saying: > > /* > * TS_POLLING-cleared state must be visible before we > * test NEED_RESCHED: > */ Yes, that makes sense: the need_resched test is a load, and MONITOR is a load as well. Can the two ever cross, or does the CPU guarantee that because it's the same address, the loads don't cross? Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
[parent not found: <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>]
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression [not found] ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com> @ 2013-12-19 18:43 ` Ingo Molnar 2013-12-19 20:09 ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin 2013-12-19 20:33 ` [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() tip-bot for H. Peter Anvin 2 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:43 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, linux-pm, x86, Peter Zijlstra, Len Brown, H. Peter Anvin, stable, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > The x86 memory rules are that two loads always execute in order (ie > rmb is a no-op). > > So I see no reason for a memory barrier after the monitor. [...] Yes, I'm leaning towards that interpretation as well, but the reason I'm a bit catious is the somewhat curious (to me!) wording of the MONITOR instruction: Sets up a linear address range to be monitored by hardware and activates the monitor. ... The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction can be used at all privilege levels and is subject to all permission checking and faults associated with a byte load. Like a load, the MONITOR instruction sets the A-bit but not the D-bit in page tables. Where apparently the 'range' means 'full cache line surrounding the memory address in question'. We have no other load instructions that operate on such a large 'range' of addresses, and I wanted to make sure it's a true (single byte) load for that specific address. The documentation does not appear to explicitly state that it's a load for that address - only that it's ordered as a load. The reason I'm asking is because 'flags' itself might not be at the beginning of the cache line, as it's in the middle of thread_info: struct thread_info { struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ __u32 flags; /* low level flags */ while 'MONITOR' appears to work on the cache line. So are all addresses within that cache line ordered? Only the specific address given to the instruction itself? Only the first word of the cacheline itself? The documentation is a bit vague, at least in my reading, and depending on which actual word the instruction reads (if it reads any word at all ... it's probably just setting up an address for MWAIT) from that cacheline, its ordering properties might be surprising. > [...] But both sides of clflush sounds sane, and as mentioned the > "go to sleep" side isn't as critical as the "wake up" side if the > monitor. Yeah. > Please let's just make that pre-monitor hack be a static key, and do > mfence explicitly around the clflush inside that conditional > section. Agreed. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression @ 2013-12-19 18:43 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:43 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown, linux-pm, x86, Peter Zijlstra, Len Brown, H. Peter Anvin, stable, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > The x86 memory rules are that two loads always execute in order (ie > rmb is a no-op). > > So I see no reason for a memory barrier after the monitor. [...] Yes, I'm leaning towards that interpretation as well, but the reason I'm a bit catious is the somewhat curious (to me!) wording of the MONITOR instruction: Sets up a linear address range to be monitored by hardware and activates the monitor. ... The MONITOR instruction is ordered as a load operation with respect to other memory trans�actions. The instruction can be used at all privilege levels and is subject to all permission checking and faults associated with a byte load. Like a load, the MONITOR instruction sets the A-bit but not the D-bit in page tables. Where apparently the 'range' means 'full cache line surrounding the memory address in question'. We have no other load instructions that operate on such a large 'range' of addresses, and I wanted to make sure it's a true (single byte) load for that specific address. The documentation does not appear to explicitly state that it's a load for that address - only that it's ordered as a load. The reason I'm asking is because 'flags' itself might not be at the beginning of the cache line, as it's in the middle of thread_info: struct thread_info { struct task_struct *task; /* main task structure */ struct exec_domain *exec_domain; /* execution domain */ __u32 flags; /* low level flags */ while 'MONITOR' appears to work on the cache line. So are all addresses within that cache line ordered? Only the specific address given to the instruction itself? Only the first word of the cacheline itself? The documentation is a bit vague, at least in my reading, and depending on which actual word the instruction reads (if it reads any word at all ... it's probably just setting up an address for MWAIT) from that cacheline, its ordering properties might be surprising. > [...] But both sides of clflush sounds sane, and as mentioned the > "go to sleep" side isn't as critical as the "wake up" side if the > monitor. Yeah. > Please let's just make that pre-monitor hack be a static key, and do > mfence explicitly around the clflush inside that conditional > section. Agreed. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers [not found] ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com> 2013-12-19 18:43 ` Ingo Molnar @ 2013-12-19 20:09 ` tip-bot for H. Peter Anvin 2013-12-19 20:40 ` Ingo Molnar 2013-12-19 20:33 ` [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() tip-bot for H. Peter Anvin 2 siblings, 1 reply; 96+ messages in thread From: tip-bot for H. Peter Anvin @ 2013-12-19 20:09 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, torvalds, peterz, tglx, hpa, len.brown Commit-ID: 7e98b71920464b8d15fa95c74366416cd3c88861 Gitweb: http://git.kernel.org/tip/7e98b71920464b8d15fa95c74366416cd3c88861 Author: H. Peter Anvin <hpa@linux.intel.com> AuthorDate: Thu, 19 Dec 2013 11:58:16 -0800 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Thu, 19 Dec 2013 11:58:16 -0800 x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers Use static_cpu_has() to conditionalize the CLFLUSH workaround, and add memory barriers around it since the documentation is explicit that CLFLUSH is only ordered with respect to MFENCE. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com --- arch/x86/include/asm/mwait.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h index 361b02e..19b71c4 100644 --- a/arch/x86/include/asm/mwait.h +++ b/arch/x86/include/asm/mwait.h @@ -43,8 +43,11 @@ static inline void __mwait(unsigned long eax, unsigned long ecx) static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { if (!current_set_polling_and_test()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { + mb(); clflush((void *)¤t_thread_info()->flags); + mb(); + } __monitor((void *)¤t_thread_info()->flags, 0, 0); if (!need_resched()) ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 2013-12-19 20:09 ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin @ 2013-12-19 20:40 ` Ingo Molnar 2013-12-19 20:46 ` Linus Torvalds 2013-12-19 21:55 ` Peter Zijlstra 0 siblings, 2 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 20:40 UTC (permalink / raw) To: hpa, linux-kernel, torvalds, peterz, tglx, hpa, len.brown Cc: linux-tip-commits * tip-bot for H. Peter Anvin <tipbot@zytor.com> wrote: > Commit-ID: 7e98b71920464b8d15fa95c74366416cd3c88861 > Gitweb: http://git.kernel.org/tip/7e98b71920464b8d15fa95c74366416cd3c88861 > Author: H. Peter Anvin <hpa@linux.intel.com> > AuthorDate: Thu, 19 Dec 2013 11:58:16 -0800 > Committer: H. Peter Anvin <hpa@linux.intel.com> > CommitDate: Thu, 19 Dec 2013 11:58:16 -0800 > > x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers > > Use static_cpu_has() to conditionalize the CLFLUSH workaround, and add > memory barriers around it since the documentation is explicit that > CLFLUSH is only ordered with respect to MFENCE. > > Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Len Brown <len.brown@intel.com> > Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com > --- > arch/x86/include/asm/mwait.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h > index 361b02e..19b71c4 100644 > --- a/arch/x86/include/asm/mwait.h > +++ b/arch/x86/include/asm/mwait.h > @@ -43,8 +43,11 @@ static inline void __mwait(unsigned long eax, unsigned long ecx) > static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) > { > if (!current_set_polling_and_test()) { > - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) > + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > + mb(); > clflush((void *)¤t_thread_info()->flags); > + mb(); > + } So since this is a v3.14 commit, lets get it right, ok? The first mb() looks superfluous, because see current_set_polling_and_test(): static inline bool __must_check current_set_polling_and_test(void) { __current_set_polling(); /* * Polling state must be visible before we test NEED_RESCHED, * paired by resched_task() */ smp_mb(); return unlikely(tif_need_resched()); } So it already has a (MFENCE) barrier after ->flags is modified. So using a comment - something like the patch below, should be enough as well, agreed? Thanks, Ingo Signed-off-by: Ingo Molnar <mingo@kernel.org> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h index 19b71c4..d4f5c9a 100644 --- a/arch/x86/include/asm/mwait.h +++ b/arch/x86/include/asm/mwait.h @@ -44,7 +44,10 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { if (!current_set_polling_and_test()) { if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { - mb(); + /* + * There's no need for an extra barrier here: current_set_polling_and_test() + * already has an smp_mb() after ->flags gets modified, see sched.h. + */ clflush((void *)¤t_thread_info()->flags); mb(); } ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 2013-12-19 20:40 ` Ingo Molnar @ 2013-12-19 20:46 ` Linus Torvalds 2013-12-19 21:14 ` Ingo Molnar 2013-12-19 21:55 ` Peter Zijlstra 1 sibling, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2013-12-19 20:46 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Peter Anvin, Len Brown, linux-tip-commits On Fri, Dec 20, 2013 at 5:40 AM, Ingo Molnar <mingo@kernel.org> wrote: > > The first mb() looks superfluous, because see > current_set_polling_and_test(): > > static inline bool __must_check current_set_polling_and_test(void) > { > __current_set_polling(); > > /* > * Polling state must be visible before we test NEED_RESCHED, > * paired by resched_task() > */ > smp_mb(); > > return unlikely(tif_need_resched()); > } > > So it already has a (MFENCE) barrier after ->flags is modified. So what? The mb() isn't necessarily against the *write*, it is also against the *read*. If the cflush is needed before the monitor, it's likely because the monitor instruction has some bug with already-existing cachelines in shared state or whatever. Which means that we want to make sure that the cflush is ordered wrt *reads* from that cacheline too. cflush has nothing specifically to do with writes. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 2013-12-19 20:46 ` Linus Torvalds @ 2013-12-19 21:14 ` Ingo Molnar 2013-12-19 21:25 ` Linus Torvalds 0 siblings, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 21:14 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Peter Anvin, Len Brown, linux-tip-commits * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Dec 20, 2013 at 5:40 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > The first mb() looks superfluous, because see > > current_set_polling_and_test(): > > > > static inline bool __must_check current_set_polling_and_test(void) > > { > > __current_set_polling(); > > > > /* > > * Polling state must be visible before we test NEED_RESCHED, > > * paired by resched_task() > > */ > > smp_mb(); > > > > return unlikely(tif_need_resched()); > > } > > > > So it already has a (MFENCE) barrier after ->flags is modified. > > So what? > > The mb() isn't necessarily against the *write*, it is also against > the *read*. > > If the cflush is needed before the monitor, it's likely because the > monitor instruction has some bug with already-existing cachelines in > shared state or whatever. So, my thinking was that maybe it's the other way around: the problem is with the write not being drained to cache yet. One possibility would be that the bug is that MONITOR will not see the previous write in the store buffer and when shortly afterwards the store hits the cache, it falsely 'wakes up' the MWAIT. (If so then the race window roughly depends on the the number of cycles the current_set_polling_and_test() modification retires and explains how small reorganizations of the code triggered the hw bug.) The CLFLUSH ensures that the modification is visible in the cache before MONITOR is run. If this guess is true then the need_resched() read is immaterial to the bug. On the flip side, if my guess is wrong then I'm leaving another subtle race window in the code... So yeah I agree that without more information from Intel we are better off with a more conservative approach, the bug took relatively long to find, Thomas did the original 7d1a941731fab back in March, 3 kernel releases ago ... If only there were Intel employees on Cc: who could tell us more about the background of the bug ;-) > Which means that we want to make sure that the cflush is ordered wrt > *reads* from that cacheline too. > > cflush has nothing specifically to do with writes. Yes. Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 2013-12-19 21:14 ` Ingo Molnar @ 2013-12-19 21:25 ` Linus Torvalds 0 siblings, 0 replies; 96+ messages in thread From: Linus Torvalds @ 2013-12-19 21:25 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Anvin, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Peter Anvin, Len Brown, linux-tip-commits On Fri, Dec 20, 2013 at 6:14 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > So, my thinking was that maybe it's the other way around: the problem > is with the write not being drained to cache yet. That is possible. I haven't seen the exact details of the monitor errata . If it's just about the write buffer and cache interaction, then I agree that the barrier after the write in current_set_polling_and_test() is sufficient. But in the absense of knowledge, I think we should have it after the read too. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 2013-12-19 20:40 ` Ingo Molnar 2013-12-19 20:46 ` Linus Torvalds @ 2013-12-19 21:55 ` Peter Zijlstra 2013-12-20 8:47 ` Ingo Molnar 1 sibling, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2013-12-19 21:55 UTC (permalink / raw) To: Ingo Molnar Cc: hpa, linux-kernel, torvalds, tglx, hpa, len.brown, linux-tip-commits On Thu, Dec 19, 2013 at 09:40:51PM +0100, Ingo Molnar wrote: > > + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > > + mb(); > > clflush((void *)¤t_thread_info()->flags); > > + mb(); > > + } > if (!current_set_polling_and_test()) { > if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > - mb(); > + /* > + * There's no need for an extra barrier here: current_set_polling_and_test() > + * already has an smp_mb() after ->flags gets modified, see sched.h. > + */ > clflush((void *)¤t_thread_info()->flags); > mb(); > } So almost, except that smp_mb() will disappear on an UP kernel whereas this errata might just always need the MFENCE. Then again, you'd be a pretty big nutter to run an UP kernel on an EX part. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers 2013-12-19 21:55 ` Peter Zijlstra @ 2013-12-20 8:47 ` Ingo Molnar 0 siblings, 0 replies; 96+ messages in thread From: Ingo Molnar @ 2013-12-20 8:47 UTC (permalink / raw) To: Peter Zijlstra Cc: hpa, linux-kernel, torvalds, tglx, hpa, len.brown, linux-tip-commits * Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Dec 19, 2013 at 09:40:51PM +0100, Ingo Molnar wrote: > > > + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > > > + mb(); > > > clflush((void *)¤t_thread_info()->flags); > > > + mb(); > > > + } > > > if (!current_set_polling_and_test()) { > > if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) { > > - mb(); > > + /* > > + * There's no need for an extra barrier here: current_set_polling_and_test() > > + * already has an smp_mb() after ->flags gets modified, see sched.h. > > + */ > > clflush((void *)¤t_thread_info()->flags); > > mb(); > > } > > So almost, except that smp_mb() will disappear on an UP kernel whereas > this errata might just always need the MFENCE. Yeah. Sigh, so I noticed that and even thought through the UP case and concluded: on UP the race cannot happen - which is bogus. Yesterday was so much fail, maybe today's a better day ;-) Thanks, ngo ^ permalink raw reply [flat|nested] 96+ messages in thread
* [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() [not found] ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com> 2013-12-19 18:43 ` Ingo Molnar 2013-12-19 20:09 ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin @ 2013-12-19 20:33 ` tip-bot for H. Peter Anvin 2 siblings, 0 replies; 96+ messages in thread From: tip-bot for H. Peter Anvin @ 2013-12-19 20:33 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, torvalds, peterz, tglx, hpa, len.brown Commit-ID: 7d590cca7cd2cce4ed7c47d221d6f90566653ba8 Gitweb: http://git.kernel.org/tip/7d590cca7cd2cce4ed7c47d221d6f90566653ba8 Author: H. Peter Anvin <hpa@linux.intel.com> AuthorDate: Thu, 19 Dec 2013 12:30:03 -0800 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Thu, 19 Dec 2013 12:30:03 -0800 x86, idle: Add memory barriers around clflush in mwait_play_dead() For consistency with mwait_idle_with_hints(). Not sure they help, but they really won't hurt... Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com --- arch/x86/kernel/smpboot.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 85dc05a..f5252c4 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1417,7 +1417,9 @@ static inline void mwait_play_dead(void) * The WBINVD is insufficient due to the spurious-wakeup * case where we return around the loop. */ + mb(); clflush(mwait_ptr); + mb(); __monitor(mwait_ptr, 0, 0); mb(); __mwait(eax, 0); ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 18:09 ` H. Peter Anvin 2013-12-19 18:19 ` H. Peter Anvin @ 2013-12-19 18:19 ` Ingo Molnar 2013-12-19 19:22 ` H. Peter Anvin 1 sibling, 1 reply; 96+ messages in thread From: Ingo Molnar @ 2013-12-19 18:19 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov * H. Peter Anvin <hpa@zytor.com> wrote: > On 12/19/2013 09:07 AM, Ingo Molnar wrote: > > > > Likewise, having a barrier before the MONITOR looks sensible as > > well. Having it _after_ monitor looks weird and is probably wrong. > > [It might have been the effects of someone seeing the spurious > > wakeup problems with realizing the true source, or so.] > > Does anyone know the history of this barrier after the monitor? I > know Len is looking for a minimal patchset that can go into -stable, > and it seems prudent to not preturb the code more than necessary, > but going forward it would be nice to know... For the minimal fix I don't think we should change it - but for v3.14 it looks like a speedup for the from-idle codepath, which is performance sensitive. ( It would also be nice to know whether MONITOR loads that cacheline into the CPUs cache, and in what state it loads it. ) Thanks, Ingo ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 18:19 ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar @ 2013-12-19 19:22 ` H. Peter Anvin 2013-12-19 19:27 ` Peter Zijlstra 0 siblings, 1 reply; 96+ messages in thread From: H. Peter Anvin @ 2013-12-19 19:22 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On 12/19/2013 10:19 AM, Ingo Molnar wrote: > > ( It would also be nice to know whether MONITOR loads that cacheline > into the CPUs cache, and in what state it loads it. ) > I would assume that is implementation-dependent. However, one plausible implementation is to load the cache line into the cache in shared state and monitor for evictions. -hpa ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression 2013-12-19 19:22 ` H. Peter Anvin @ 2013-12-19 19:27 ` Peter Zijlstra 0 siblings, 0 replies; 96+ messages in thread From: Peter Zijlstra @ 2013-12-19 19:27 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov On Thu, Dec 19, 2013 at 11:22:01AM -0800, H. Peter Anvin wrote: > On 12/19/2013 10:19 AM, Ingo Molnar wrote: > > > > ( It would also be nice to know whether MONITOR loads that cacheline > > into the CPUs cache, and in what state it loads it. ) > > > > I would assume that is implementation-dependent. However, one plausible > implementation is to load the cache line into the cache in shared state > and monitor for evictions. I suppose the monitor part is the important part because certain C states drop all cache, presumably including the cacheline we're actually monitoring. ^ permalink raw reply [flat|nested] 96+ messages in thread
* [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2013-12-18 21:44 ` Len Brown (?) (?) @ 2013-12-19 19:51 ` tip-bot for Len Brown 2014-03-18 0:20 ` Davidlohr Bueso -1 siblings, 1 reply; 96+ messages in thread From: tip-bot for Len Brown @ 2013-12-19 19:51 UTC (permalink / raw) To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, tglx, hpa, len.brown Commit-ID: 40e2d7f9b5dae048789c64672bf3027fbb663ffa Gitweb: http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa Author: Len Brown <len.brown@intel.com> AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Thu, 19 Dec 2013 11:47:39 -0800 x86 idle: Repair large-server 50-watt idle-power regression Linux 3.10 changed the timing of how thread_info->flags is touched: x86: Use generic idle loop (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) This caused Intel NHM-EX and WSM-EX servers to experience a large number of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote from deep C-states to shallow C-states, which caused these platforms to experience a significant increase in idle power. Note that this issue was already present before the commit above, however, it wasn't seen often enough to be noticed in power measurements. Here we extend an errata workaround from the Core2 EX "Dunnington" to extend to NHM-EX and WSM-EX, to prevent these immediate returns from MWAIT, reducing idle power on these platforms. While only acpi_idle ran on Dunnington, intel_idle may also run on these two newer systems. As of today, there are no other models that are known to need this tweak. Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> --- arch/x86/kernel/cpu/intel.c | 3 ++- drivers/idle/intel_idle.c | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index dc1ec0d..ea04b34 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_PEBS); } - if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush) + if (c->x86 == 6 && cpu_has_clflush && + (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47)) set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR); #ifdef CONFIG_X86_64 diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 92d1206..f80b700 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev, if (!current_set_polling_and_test()) { + if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + clflush((void *)¤t_thread_info()->flags); + __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); if (!need_resched()) ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2013-12-19 19:51 ` [tip:x86/urgent] x86 idle: Repair " tip-bot for Len Brown @ 2014-03-18 0:20 ` Davidlohr Bueso 2014-03-18 9:16 ` Peter Zijlstra 2014-04-08 21:43 ` Brown, Len 0 siblings, 2 replies; 96+ messages in thread From: Davidlohr Bueso @ 2014-03-18 0:20 UTC (permalink / raw) To: mingo, hpa, linux-kernel, tglx, hpa, len.brown, Peter Zijlstra Cc: linux-tip-commits On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote: > Commit-ID: 40e2d7f9b5dae048789c64672bf3027fbb663ffa > Gitweb: http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa > Author: Len Brown <len.brown@intel.com> > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500 > Committer: H. Peter Anvin <hpa@linux.intel.com> > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800 > > x86 idle: Repair large-server 50-watt idle-power regression FYI this commit can cause some non trivial performance regressions for larger core count systems. While not surprising because of the nature of the change, having intel_idle do more cacheline invalidations, I still wanted to let you guys know. For instance, on a 160 core Westmere system, aim7 throughput can go down in a number of tests, anywhere from -10% to -25%. I guess it comes down to one of those performance vs energy things. And sure, max_cstate can be set to overcome this, but it's still something that was previously taken for granted. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2014-03-18 0:20 ` Davidlohr Bueso @ 2014-03-18 9:16 ` Peter Zijlstra 2014-03-19 2:14 ` Jason Low 2014-04-08 21:43 ` Brown, Len 1 sibling, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2014-03-18 9:16 UTC (permalink / raw) To: Davidlohr Bueso Cc: mingo, hpa, linux-kernel, tglx, hpa, len.brown, linux-tip-commits On Mon, Mar 17, 2014 at 05:20:10PM -0700, Davidlohr Bueso wrote: > On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote: > > Commit-ID: 40e2d7f9b5dae048789c64672bf3027fbb663ffa > > Gitweb: http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa > > Author: Len Brown <len.brown@intel.com> > > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500 > > Committer: H. Peter Anvin <hpa@linux.intel.com> > > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800 > > > > x86 idle: Repair large-server 50-watt idle-power regression > > FYI this commit can cause some non trivial performance regressions for > larger core count systems. While not surprising because of the nature of > the change, having intel_idle do more cacheline invalidations, I still > wanted to let you guys know. For instance, on a 160 core Westmere > system, aim7 throughput can go down in a number of tests, anywhere from > -10% to -25%. > > I guess it comes down to one of those performance vs energy things. And > sure, max_cstate can be set to overcome this, but it's still something > that was previously taken for granted. -10% to -25% seems a lot for a single cacheline flush. Also I would expect the expected idle time to be very short while running aim7. So could it be the cacheflush is actually taking longer than the expected idle time? ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2014-03-18 9:16 ` Peter Zijlstra @ 2014-03-19 2:14 ` Jason Low 2014-03-19 6:42 ` Peter Zijlstra 0 siblings, 1 reply; 96+ messages in thread From: Jason Low @ 2014-03-19 2:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Davidlohr Bueso, Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List, Thomas Gleixner, hpa, len.brown, linux-tip-commits On Tue, Mar 18, 2014 at 2:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Mar 17, 2014 at 05:20:10PM -0700, Davidlohr Bueso wrote: >> On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote: >> > Commit-ID: 40e2d7f9b5dae048789c64672bf3027fbb663ffa >> > Gitweb: http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa >> > Author: Len Brown <len.brown@intel.com> >> > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500 >> > Committer: H. Peter Anvin <hpa@linux.intel.com> >> > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800 >> > >> > x86 idle: Repair large-server 50-watt idle-power regression >> >> FYI this commit can cause some non trivial performance regressions for >> larger core count systems. While not surprising because of the nature of >> the change, having intel_idle do more cacheline invalidations, I still >> wanted to let you guys know. For instance, on a 160 core Westmere >> system, aim7 throughput can go down in a number of tests, anywhere from >> -10% to -25%. >> >> I guess it comes down to one of those performance vs energy things. And >> sure, max_cstate can be set to overcome this, but it's still something >> that was previously taken for granted. > > -10% to -25% seems a lot for a single cacheline flush. Also I would > expect the expected idle time to be very short while running aim7. So > could it be the cacheflush is actually taking longer than the expected > idle time? Can we consider conditionally skipping the cacheline flush if the approximate average CPU idle time is very short, for instance, something along the lines of skipping if CPU average idle time < (sched migration cost or "cacheline_flush_penalty")? ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2014-03-19 2:14 ` Jason Low @ 2014-03-19 6:42 ` Peter Zijlstra 0 siblings, 0 replies; 96+ messages in thread From: Peter Zijlstra @ 2014-03-19 6:42 UTC (permalink / raw) To: Jason Low Cc: Davidlohr Bueso, Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List, Thomas Gleixner, hpa, len.brown, linux-tip-commits On Tue, Mar 18, 2014 at 07:14:04PM -0700, Jason Low wrote: > On Tue, Mar 18, 2014 at 2:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Mar 17, 2014 at 05:20:10PM -0700, Davidlohr Bueso wrote: > >> On Thu, 2013-12-19 at 11:51 -0800, tip-bot for Len Brown wrote: > >> > Commit-ID: 40e2d7f9b5dae048789c64672bf3027fbb663ffa > >> > Gitweb: http://git.kernel.org/tip/40e2d7f9b5dae048789c64672bf3027fbb663ffa > >> > Author: Len Brown <len.brown@intel.com> > >> > AuthorDate: Wed, 18 Dec 2013 16:44:57 -0500 > >> > Committer: H. Peter Anvin <hpa@linux.intel.com> > >> > CommitDate: Thu, 19 Dec 2013 11:47:39 -0800 > >> > > >> > x86 idle: Repair large-server 50-watt idle-power regression > >> > >> FYI this commit can cause some non trivial performance regressions for > >> larger core count systems. While not surprising because of the nature of > >> the change, having intel_idle do more cacheline invalidations, I still > >> wanted to let you guys know. For instance, on a 160 core Westmere > >> system, aim7 throughput can go down in a number of tests, anywhere from > >> -10% to -25%. > >> > >> I guess it comes down to one of those performance vs energy things. And > >> sure, max_cstate can be set to overcome this, but it's still something > >> that was previously taken for granted. > > > > -10% to -25% seems a lot for a single cacheline flush. Also I would > > expect the expected idle time to be very short while running aim7. So > > could it be the cacheflush is actually taking longer than the expected > > idle time? > > Can we consider conditionally skipping the cacheline flush if the > approximate average CPU idle time is very short, for instance, > something along the lines of skipping if CPU average idle time < > (sched migration cost or "cacheline_flush_penalty")? I'd say its too early for proposing solutions; first we need to find facts about what is happening. ^ permalink raw reply [flat|nested] 96+ messages in thread
* RE: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2014-03-18 0:20 ` Davidlohr Bueso 2014-03-18 9:16 ` Peter Zijlstra @ 2014-04-08 21:43 ` Brown, Len 2014-04-09 8:18 ` Peter Zijlstra 2014-04-15 3:27 ` Davidlohr Bueso 1 sibling, 2 replies; 96+ messages in thread From: Brown, Len @ 2014-04-08 21:43 UTC (permalink / raw) To: Davidlohr Bueso, mingo, hpa, linux-kernel, tglx, hpa, Peter Zijlstra Cc: linux-tip-commits [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1809 bytes --] Davidlohr, Thanks for the note. Ideally (on Linux in general, and on servers, in particular) we strive for the performance impact of power saving features to be small enough to be considered in "measurement noise". Your report for 160 core Westmere AIM numbers being hit at 10-25% shows 15% measurement noise? But even if true, this looks bad. Any chance you can re-run, with the following two tweaks, one at a time? I'd be curious if you can wrap the invocation in turbostat -v and capture that output to how what states we are seeing during the benchmark run. thanks, -Len #1: skip flush for C1 diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index f80b700..6027d06 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -377,7 +377,7 @@ static int intel_idle(struct cpuidle_device *dev, if (!current_set_polling_and_test()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + if ((eax > 0) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) clflush((void *)¤t_thread_info()->flags); __monitor((void *)¤t_thread_info()->flags, 0, 0); #2: skip flush for C1 and C1E diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index f80b700..6027d06 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -377,7 +377,7 @@ static int intel_idle(struct cpuidle_device *dev, if (!current_set_polling_and_test()) { - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) + if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) clflush((void *)¤t_thread_info()->flags); __monitor((void *)¤t_thread_info()->flags, 0, 0); ÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±þG«éÿ{ayº\x1dÊÚë,j\a¢f£¢·hïêÿêçz_è®\x03(éÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?¨èÚ&£ø§~á¶iOæ¬z·vØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?I¥ ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2014-04-08 21:43 ` Brown, Len @ 2014-04-09 8:18 ` Peter Zijlstra 2014-04-15 3:27 ` Davidlohr Bueso 1 sibling, 0 replies; 96+ messages in thread From: Peter Zijlstra @ 2014-04-09 8:18 UTC (permalink / raw) To: Brown, Len Cc: Davidlohr Bueso, mingo, hpa, linux-kernel, tglx, hpa, linux-tip-commits On Tue, Apr 08, 2014 at 09:43:54PM +0000, Brown, Len wrote: > #1: skip flush for C1 > #2: skip flush for C1 and C1E One stray thought I had: IFF the erratum is due to the cache flushing of the higher C states, these two patches should still be correct while also (possibly) avoiding the performance penalty, since C1* doesn't flush caches yet. Of course, I've no clue if this is indeed the root cause of this particular erratum and I'm not sure you guys are allowed to say even if you would know :/ ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [tip:x86/urgent] x86 idle: Repair large-server 50-watt idle-power regression 2014-04-08 21:43 ` Brown, Len 2014-04-09 8:18 ` Peter Zijlstra @ 2014-04-15 3:27 ` Davidlohr Bueso 1 sibling, 0 replies; 96+ messages in thread From: Davidlohr Bueso @ 2014-04-15 3:27 UTC (permalink / raw) To: Brown, Len Cc: mingo, hpa, linux-kernel, tglx, hpa, Peter Zijlstra, linux-tip-commits On Tue, 2014-04-08 at 21:43 +0000, Brown, Len wrote: > Davidlohr, > > Thanks for the note. > > Ideally (on Linux in general, and on servers, in particular) we strive > for the performance impact of power saving features to be small enough > to be considered in "measurement noise". > > Your report for 160 core Westmere AIM numbers being hit at 10-25% > shows 15% measurement noise? But even if true, this looks bad. > > Any chance you can re-run, with the following two tweaks, > one at a time? > > I'd be curious if you can wrap the invocation in turbostat -v > and capture that output to how what states we are seeing > during the benchmark run. Hi Len, apologies for the delay, I've been having to use my machine for other things and haven't gotten time to get around to this yet. I'll get you the requested information sometime this week. ^ permalink raw reply [flat|nested] 96+ messages in thread
end of thread, other threads:[~2014-04-15 3:27 UTC | newest] Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-12-07 8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown 2013-12-07 8:39 ` Mike Galbraith 2013-12-07 16:01 ` Len Brown 2013-12-07 16:45 ` Len Brown 2013-12-07 19:17 ` Mike Galbraith 2013-12-10 11:41 ` Ingo Molnar 2013-12-07 12:54 ` Thomas Gleixner 2013-12-08 4:57 ` Mike Galbraith 2013-12-08 20:40 ` Len Brown 2013-12-09 3:16 ` Mike Galbraith 2013-12-10 5:17 ` Mike Galbraith 2013-12-10 11:45 ` Ingo Molnar 2013-12-10 14:29 ` Thomas Gleixner 2013-12-10 15:06 ` Ingo Molnar 2013-12-11 2:05 ` Thomas Gleixner 2013-12-11 3:21 ` Mike Galbraith 2013-12-11 11:28 ` Thomas Gleixner 2013-12-11 11:38 ` Borislav Petkov 2013-12-11 11:52 ` Peter Zijlstra 2013-12-11 12:29 ` Mike Galbraith 2013-12-11 12:43 ` Peter Zijlstra 2013-12-11 13:10 ` Mike Galbraith 2013-12-11 13:40 ` Borislav Petkov 2013-12-11 14:56 ` Ingo Molnar 2013-12-11 16:02 ` Borislav Petkov 2013-12-11 16:43 ` Peter Zijlstra 2013-12-11 17:50 ` Ingo Molnar 2013-12-11 23:08 ` H. Peter Anvin 2013-12-11 23:14 ` Borislav Petkov 2013-12-12 0:52 ` H. Peter Anvin 2013-12-12 4:25 ` Mike Galbraith 2013-12-12 4:49 ` H. Peter Anvin 2013-12-12 4:59 ` Mike Galbraith 2013-12-12 5:37 ` Mike Galbraith 2013-12-12 5:45 ` H. Peter Anvin 2013-12-12 5:57 ` Mike Galbraith 2013-12-12 6:05 ` Mike Galbraith 2013-12-12 7:57 ` H. Peter Anvin 2013-12-12 8:51 ` Peter Zijlstra 2013-12-12 13:28 ` Ingo Molnar 2013-12-12 15:06 ` H. Peter Anvin 2013-12-12 15:51 ` Peter Zijlstra 2013-12-11 14:42 ` Ingo Molnar 2013-12-11 15:02 ` Thomas Gleixner 2013-12-11 15:09 ` Ingo Molnar 2013-12-11 16:44 ` Peter Zijlstra 2013-12-11 17:48 ` Ingo Molnar 2013-12-11 16:44 ` Peter Zijlstra 2013-12-11 17:47 ` Ingo Molnar 2013-12-11 21:43 ` Len Brown 2013-12-11 22:22 ` Thomas Gleixner 2013-12-18 21:44 ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Len Brown 2013-12-18 21:44 ` Len Brown 2013-12-19 12:22 ` Ingo Molnar 2013-12-19 14:40 ` H. Peter Anvin 2013-12-19 15:45 ` Borislav Petkov 2013-12-19 15:55 ` H. Peter Anvin 2013-12-19 16:02 ` Ingo Molnar 2013-12-19 16:09 ` H. Peter Anvin 2013-12-19 16:13 ` H. Peter Anvin 2013-12-19 16:21 ` Peter Zijlstra 2013-12-19 16:50 ` H. Peter Anvin 2013-12-19 17:07 ` Ingo Molnar 2013-12-19 17:25 ` Peter Zijlstra 2013-12-19 17:36 ` Peter Zijlstra 2013-12-19 18:05 ` H. Peter Anvin 2013-12-19 18:14 ` Ingo Molnar 2013-12-19 17:50 ` Peter Zijlstra 2013-12-19 18:18 ` Ingo Molnar 2013-12-19 21:05 ` H. Peter Anvin 2013-12-19 21:17 ` Ingo Molnar 2013-12-19 18:10 ` Ingo Molnar 2013-12-19 18:09 ` H. Peter Anvin 2013-12-19 18:19 ` H. Peter Anvin 2013-12-19 18:23 ` Ingo Molnar [not found] ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com> 2013-12-19 18:43 ` Ingo Molnar 2013-12-19 18:43 ` Ingo Molnar 2013-12-19 20:09 ` [tip:x86/idle] x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers tip-bot for H. Peter Anvin 2013-12-19 20:40 ` Ingo Molnar 2013-12-19 20:46 ` Linus Torvalds 2013-12-19 21:14 ` Ingo Molnar 2013-12-19 21:25 ` Linus Torvalds 2013-12-19 21:55 ` Peter Zijlstra 2013-12-20 8:47 ` Ingo Molnar 2013-12-19 20:33 ` [tip:x86/idle] x86, idle: Add memory barriers around clflush in mwait_play_dead() tip-bot for H. Peter Anvin 2013-12-19 18:19 ` [PATCH] x86 idle: repair large-server 50-watt idle-power regression Ingo Molnar 2013-12-19 19:22 ` H. Peter Anvin 2013-12-19 19:27 ` Peter Zijlstra 2013-12-19 19:51 ` [tip:x86/urgent] x86 idle: Repair " tip-bot for Len Brown 2014-03-18 0:20 ` Davidlohr Bueso 2014-03-18 9:16 ` Peter Zijlstra 2014-03-19 2:14 ` Jason Low 2014-03-19 6:42 ` Peter Zijlstra 2014-04-08 21:43 ` Brown, Len 2014-04-09 8:18 ` Peter Zijlstra 2014-04-15 3:27 ` Davidlohr Bueso
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.