Re: [PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states

From: Abhishek <huntbag@linux.vnet.ibm.com>
To: Nicholas Piggin <npiggin@gmail.com>,
	linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
Cc: daniel.lezcano@linaro.org, dja@axtens.net,
	ego@linux.vnet.ibm.com, mpe@ellerman.id.au, rjw@rjwysocki.net
Subject: Re: [PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states
Date: Wed, 19 Jun 2019 14:38:13 +0530	[thread overview]
Message-ID: <689a52a7-7bfc-7225-e563-ac07f7357e75@linux.vnet.ibm.com> (raw)
In-Reply-To: <1560917320.mk5nn6r8jw.astroid@bobo.none>

Hi Nick,

Thanks for the review. Some replies below.

On 06/19/2019 09:53 AM, Nicholas Piggin wrote:
> Abhishek Goel's on June 17, 2019 7:56 pm:
>> Currently, the cpuidle governors determine what idle state a idling CPU
>> should enter into based on heuristics that depend on the idle history on
>> that CPU. Given that no predictive heuristic is perfect, there are cases
>> where the governor predicts a shallow idle state, hoping that the CPU will
>> be busy soon. However, if no new workload is scheduled on that CPU in the
>> near future, the CPU may end up in the shallow state.
>>
>> This is problematic, when the predicted state in the aforementioned
>> scenario is a shallow stop state on a tickless system. As we might get
>> stuck into shallow states for hours, in absence of ticks or interrupts.
>>
>> To address this, We forcefully wakeup the cpu by setting the
>> decrementer. The decrementer is set to a value that corresponds with the
>> residency of the next available state. Thus firing up a timer that will
>> forcefully wakeup the cpu. Few such iterations will essentially train the
>> governor to select a deeper state for that cpu, as the timer here
>> corresponds to the next available cpuidle state residency. Thus, cpu will
>> eventually end up in the deepest possible state.
>>
>> Signed-off-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
>> ---
>>
>> Auto-promotion
>>   v1 : started as auto promotion logic for cpuidle states in generic
>> driver
>>   v2 : Removed timeout_needed and rebased the code to upstream kernel
>> Forced-wakeup
>>   v1 : New patch with name of forced wakeup started
>>   v2 : Extending the forced wakeup logic for all states. Setting the
>> decrementer instead of queuing up a hrtimer to implement the logic.
>>
>>   drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 84b1ebe212b3..bc9ca18ae7e3 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -46,6 +46,26 @@ static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
>>   static u64 default_snooze_timeout __read_mostly;
>>   static bool snooze_timeout_en __read_mostly;
>>   
>> +static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
>> +				 struct cpuidle_driver *drv,
>> +				 int index)
>> +{
>> +	int i;
>> +
>> +	for (i = index + 1; i < drv->state_count; i++) {
>> +		struct cpuidle_state *s = &drv->states[i];
>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>> +
>> +		if (s->disabled || su->disable)
>> +			continue;
>> +
>> +		return (s->target_residency + 2 * s->exit_latency) *
>> +			tb_ticks_per_usec;
>> +	}
>> +
>> +	return 0;
>> +}
> It would be nice to not have this kind of loop iteration in the
> idle fast path. Can we add a flag or something to the idle state?
Currently, we do not have any callback notification or some feedback that
notifies the driver everytime some state is enabled/disabled. So we have
to parse everytime to get the next enabled state. Are you suggesting to
add something like next_enabled_state in cpuidle state structure itself
which will be updated when a state is enabled or disabled?
>> +
>>   static u64 get_snooze_timeout(struct cpuidle_device *dev,
>>   			      struct cpuidle_driver *drv,
>>   			      int index)
>> @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev,
>>   		     struct cpuidle_driver *drv,
>>   		     int index)
>>   {
>> +	u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup;
>> +
>> +	dec = mfspr(SPRN_DEC);
>> +	timeout_tb = forced_wakeup_timeout(dev, drv, index);
>> +	forced_wakeup = 0;
>> +
>> +	if (timeout_tb && timeout_tb < dec) {
>> +		forced_wakeup = 1;
>> +		dec_expiry_tb = mftb() + dec;
>> +	}
> The compiler probably can't optimise away the SPR manipulations so try
> to avoid them if possible.
Are you suggesting something like set_dec_before_idle?(in line with
what you have suggested to do after idle, reset_dec_after_idle)
>
>> +
>> +	if (forced_wakeup)
>> +		mtspr(SPRN_DEC, timeout_tb);
> This should just be put in the above 'if'.
Fair point.
>
>> +
>>   	power9_idle_type(stop_psscr_table[index].val,
>>   			 stop_psscr_table[index].mask);
>> +
>> +	if (forced_wakeup)
>> +		mtspr(SPRN_DEC, dec_expiry_tb - mftb());
> This will sometimes go negative and result in another timer interrupt.
>
> It also breaks irq work (which can be set here by machine check I
> believe.
>
> May need to implement some timer code to do this for you.
>
> static void reset_dec_after_idle(void)
> {
> 	u64 now;
>          u64 *next_tb;
>
> 	if (test_irq_work_pending())
> 		return;
> 	now = mftb;
> 	next_tb = this_cpu_ptr(&decrementers_next_tb);
>
> 	if (now >= *next_tb)
> 		return;
> 	set_dec(*next_tb - now);
> 	if (test_irq_work_pending())
> 		set_dec(1);
> }
>
> Something vaguely like that. See timer_interrupt().
Ah, Okay. Will go through timer_interrupt().
> Thanks,
> Nick
Thanks,
Abhishek