* [PATCH] intel_idle: use static_key to optimize idle enter/exit paths
From: Jason Baron @ 2014-07-11 17:54 UTC (permalink / raw)
  To: lenb; +Cc: linux-pm, linux-kernel

If 'arat' is set in the cpuflags, we can avoid the checks for entering/exiting
the tick broadcast code entirely. It would seem that this is a hot enough code
path to make this worthwhile. I ran a few hackbench runs, and consistently see
reduced branches and cycles.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 drivers/idle/intel_idle.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 4d140bb..61e965c 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -80,6 +80,8 @@ static unsigned int mwait_substates;
 #define LAPIC_TIMER_ALWAYS_RELIABLE 0xFFFFFFFF
 /* Reliable LAPIC Timer States, bit 1 for C1 etc.  */
 static unsigned int lapic_timer_reliable_states = (1 << 1);	 /* Default to only C1 */
+/* if arat is set no sense in checking on each c-state transition */
+static struct static_key lapic_timer_unreliable __read_mostly;
 
 struct idle_cpu {
 	struct cpuidle_state *state_table;
@@ -507,12 +509,10 @@ static int intel_idle(struct cpuidle_device *dev,
 {
 	unsigned long ecx = 1; /* break on interrupt flag */
 	struct cpuidle_state *state = &drv->states[index];
-	unsigned long eax = flg2MWAIT(state->flags);
-	unsigned int cstate;
+	unsigned long uninitialized_var(eax);
+	unsigned int uninitialized_var(cstate);
 	int cpu = smp_processor_id();
 
-	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
-
 	/*
 	 * leave_mm() to avoid costly and often unnecessary wakeups
 	 * for flushing the user TLB's associated with the active mm.
@@ -520,13 +520,22 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
 		leave_mm(cpu);
 
-	if (!(lapic_timer_reliable_states & (1 << (cstate))))
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+	if (static_key_false(&lapic_timer_unreliable)) {
+		eax = flg2MWAIT(state->flags);
+		cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) &
+					MWAIT_CSTATE_MASK) + 1;
+		if (!(lapic_timer_reliable_states & (1 << (cstate))))
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
+					   &cpu);
+	}
 
 	mwait_idle_with_hints(eax, ecx);
 
-	if (!(lapic_timer_reliable_states & (1 << (cstate))))
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+	if (static_key_false(&lapic_timer_unreliable)) {
+		if (!(lapic_timer_reliable_states & (1 << (cstate))))
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT,
+					   &cpu);
+	}
 
 	return index;
 }
@@ -702,8 +711,10 @@ static int __init intel_idle_probe(void)
 
 	if (boot_cpu_has(X86_FEATURE_ARAT))	/* Always Reliable APIC Timer */
 		lapic_timer_reliable_states = LAPIC_TIMER_ALWAYS_RELIABLE;
-	else
+	else {
+		static_key_slow_inc(&lapic_timer_unreliable);
 		on_each_cpu(__setup_broadcast_timer, (void *)true, 1);
+	}
 
 	pr_debug(PREFIX "v" INTEL_IDLE_VERSION
 		" model 0x%X\n", boot_cpu_data.x86_model);
-- 
1.8.2.rc2



* Re: [PATCH] intel_idle: use static_key to optimize idle enter/exit paths
From: Len Brown @ 2014-07-28 20:38 UTC (permalink / raw)
  To: Jason Baron; +Cc: Linux PM list, linux-kernel

On Fri, Jul 11, 2014 at 1:54 PM, Jason Baron <jbaron@akamai.com> wrote:
> If 'arat' is set in the cpuflags, we can avoid the checks for entering/exiting
> the tick broadcast code entirely. It would seem that this is a hot enough code
> path to make this worthwhile. I ran a few hackbench runs, and consistently see
> reduced branches and cycles.

Hi Jason,

Your logic looks right -- though I've never used this
static_key_slow_inc() stuff.
I'm impressed that something in user-space could detect this change.

Can you share how to run the workload where you detected a difference,
and describe the hardware you measured?

thanks,
-Len Brown, Intel Open Source Technology Center


* Re: [PATCH] intel_idle: use static_key to optimize idle enter/exit paths
From: Jason Baron @ 2014-07-28 21:50 UTC (permalink / raw)
  To: Len Brown; +Cc: Linux PM list, linux-kernel

On 07/28/2014 04:38 PM, Len Brown wrote:
> On Fri, Jul 11, 2014 at 1:54 PM, Jason Baron <jbaron@akamai.com> wrote:
>> If 'arat' is set in the cpuflags, we can avoid the checks for entering/exiting
>> the tick broadcast code entirely. It would seem that this is a hot enough code
>> path to make this worthwhile. I ran a few hackbench runs, and consistently see
>> reduced branches and cycles.
> 
> Hi Jason,
> 
> Your logic looks right -- though I've never used this
> static_key_slow_inc() stuff.
> I'm impressed that something in user-space could detect this change.
> 
> Can you share how to run the workload where you detected a difference,
> and describe the hardware you measured?
> 
> thanks,
> -Len Brown, Intel Open Source Technology Center
> 


Hi Len,

So a hackbench-style workload ('perf bench sched messaging') shows the difference
(with CONFIG_JUMP_LABEL enabled).
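
The stats below are perf stat counters averaged over 200 runs; an invocation
along the lines of the following reproduces this kind of output:

  # Illustrative command, not necessarily the exact one used for the numbers below;
  # -r 200 repeats the benchmark and reports means with the "( +- x% )" deviations.
  perf stat -r 200 -- perf bench sched messaging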

Without the patch:

 Performance counter stats for 'perf bench sched messaging' (200 runs):

        641.113816 task-clock                #    8.020 CPUs utilized            ( +-  0.16% ) [100.00%]
             29020 context-switches          #    0.045 M/sec                    ( +-  1.66% ) [100.00%]
              2487 cpu-migrations            #    0.004 M/sec                    ( +-  0.89% ) [100.00%]
             10514 page-faults               #    0.016 M/sec                    ( +-  0.11% )
        2085813986 cycles                    #    3.253 GHz                      ( +-  0.16% ) [100.00%]
        1658381753 stalled-cycles-frontend   #   79.51% frontend cycles idle     ( +-  0.18% ) [100.00%]
   <not supported> stalled-cycles-backend  
        1221737228 instructions              #    0.59  insns per cycle        
                                             #    1.36  stalled cycles per insn  ( +-  0.12% ) [100.00%]
         211723499 branches                  #  330.243 M/sec                    ( +-  0.14% ) [100.00%]
            716846 branch-misses             #    0.34% of all branches          ( +-  0.66% )

       0.079936660 seconds time elapsed                                          ( +-  0.16% )


With the patch:

Performance counter stats for 'perf bench sched messaging' (200 runs):

        638.819963 task-clock                #    8.020 CPUs utilized            ( +-  0.15% ) [100.00%]
             27751 context-switches          #    0.043 M/sec                    ( +-  1.61% ) [100.00%]
              2502 cpu-migrations            #    0.004 M/sec                    ( +-  0.92% ) [100.00%]
             10503 page-faults               #    0.016 M/sec                    ( +-  0.09% )
        2078109565 cycles                    #    3.253 GHz                      ( +-  0.14% ) [100.00%]
        1653002141 stalled-cycles-frontend   #   79.54% frontend cycles idle     ( +-  0.17% ) [100.00%]
   <not supported> stalled-cycles-backend
        1218013520 instructions              #    0.59  insns per cycle
                                             #    1.36  stalled cycles per insn  ( +-  0.12% ) [100.00%]
         210943815 branches                  #  330.209 M/sec                    ( +-  0.14% ) [100.00%]
            697865 branch-misses             #    0.33% of all branches          ( +-  0.66% )

       0.079648462 seconds time elapsed                                          ( +-  0.15% )

So you can see that 'branches' is higher without the patch (211,723,499 vs.
210,943,815, roughly 0.4%), and 'cycles' drops by a similar amount. Yes, there is
some noise here, but the impact is measurable. It doesn't seem to make much sense
to me to check for the presence of a hardware feature every time through this kind
of code path when it's easily avoidable.
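
In case the mechanics are not obvious from the diff, here is the pattern boiled
down to a minimal sketch (shortened names, illustrative only -- not the actual
driver code):

/* Sketch of the jump-label pattern used by the patch; illustrative only. */
#include <linux/jump_label.h>
#include <linux/clockchips.h>
#include <linux/cache.h>
#include <linux/init.h>
#include <asm/cpufeature.h>

/* Zero-initialized key: the branch below defaults to not-taken. */
static struct static_key lapic_timer_unreliable __read_mostly;

static void example_idle_enter(int cpu)
{
	/*
	 * With the key disabled, static_key_false() is emitted as a
	 * straight-line no-op, so ARAT machines skip the broadcast
	 * bookkeeping entirely in the hot idle path.
	 */
	if (static_key_false(&lapic_timer_unreliable))
		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
}

static void __init example_probe(void)
{
	/*
	 * Only hardware without ARAT ever pays for the checks:
	 * static_key_slow_inc() patches the no-op into a jump at runtime.
	 */
	if (!boot_cpu_has(X86_FEATURE_ARAT))
		static_key_slow_inc(&lapic_timer_unreliable);
}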

Hardware is a 4-core Intel box:

model name	: Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz
stepping	: 9
microcode	: 0x12
cpu MHz		: 3501.000
cache size	: 8192 KB

Thanks,

-Jason


