Re: [PATCH v1 07/13] perf/core: add idle hooks

From: Peter Zijlstra <peterz@infradead.org>
To: Stephane Eranian <eranian@google.com>
Cc: linux-kernel@vger.kernel.org, acme@redhat.com, jolsa@redhat.com,
	kim.phillips@amd.com, namhyung@kernel.org, irogers@google.com
Subject: Re: [PATCH v1 07/13] perf/core: add idle hooks
Date: Thu, 9 Sep 2021 11:15:04 +0200	[thread overview]
Message-ID: <YTnQmPXXQwuMFuoJ@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20210909075700.4025355-8-eranian@google.com>

On Thu, Sep 09, 2021 at 12:56:54AM -0700, Stephane Eranian wrote:
> This patch adds a new set of hooks to connect perf_events with the
> idle task. On some PMU models, it may be necessary to flush or stop
> the PMU when going to low power. Upon return from low power, the opposite
> action, i.e., re-enable the PMU, may be necessary. The patch adds
> perf_pmu_register_lopwr_cb() to register a callback on entry or return
> from low power. The callback is invoked with a boolean arg. If true,
> then this is an entry. If false, this is a return.
> 
> The callback is invoked from the idle code with interrupts already
> disabled.

Urghh... we've had this before and always managed to refuse doing this
(in generic code).

Also, it's broken..

> +/*
> + * The perf_lopwr_cb() is invoked from the idle task. As such it
> + * cannot grab a mutex that may end up sleeping. The idle task cannot
> + * sleep by construction. Therefore we create a spinlock and a new
> + * list of PMUs to invoke on idle. The list is protected by a spinlock
> + * Normally we would use the pmus_lock and iterate over each PMUs. But
> + * mutex is not possible and we need to iterate only over the PMU which
> + * do require a idle callback.

That rationale is broken, pmus is also SRCU protected and SRCU iteration
would actually work from the idle context, unlike:

> + */
> +static DEFINE_SPINLOCK(lopwr_cb_lock);

which is not allowed from preempt disable context, if you really need
that lock, you need a raw_spinlock_t.

> +static LIST_HEAD(lopwr_cb_pmus);
> +static DEFINE_PER_CPU(int, lopwr_nr_active);
> +
> +void perf_lopwr_active_inc(void)
> +{
> +	__this_cpu_inc(lopwr_nr_active);
> +}
> +
> +void perf_lopwr_active_dec(void)
> +{
> +	__this_cpu_dec(lopwr_nr_active);
> +}
> +
> +/*
> + * lopwr_in = true means going to low power state
> + * lopwr_in = false means returning from low power state
> + */
> +void perf_lopwr_cb(bool lopwr_in)
> +{
> +	struct pmu *pmu;
> +	unsigned long flags;
> +
> +	if (!__this_cpu_read(lopwr_nr_active))
> +		return;

We have static_branch and static_call to deal with these things.

> +
> +	spin_lock_irqsave(&lopwr_cb_lock, flags);
> +
> +	list_for_each_entry(pmu, &lopwr_cb_pmus, lopwr_entry) {
> +		if (pmu->lopwr_cb)
> +			pmu->lopwr_cb(lopwr_in);
> +	}

I *really* do not like unbound iteration in the idle path.

> +
> +	spin_unlock_irqrestore(&lopwr_cb_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(perf_lopwr_cb);

Why?

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 912b47aa99d8..14ce130aee1b 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -179,6 +179,7 @@ static void cpuidle_idle_call(void)
>  	 */
>  	if (need_resched()) {
>  		local_irq_enable();
> +		perf_lopwr_cb(false);
>  		return;
>  	}
>  
> @@ -191,7 +192,14 @@ static void cpuidle_idle_call(void)
>  	if (cpuidle_not_available(drv, dev)) {
>  		tick_nohz_idle_stop_tick();
>  
> +		if (!cpu_idle_force_poll)
> +			perf_lopwr_cb(true);
> +
>  		default_idle_call();
> +
> +		if (!cpu_idle_force_poll)
> +			perf_lopwr_cb(false);
> +
>  		goto exit_idle;
>  	}
>  
> @@ -249,8 +257,10 @@ static void cpuidle_idle_call(void)
>  	/*
>  	 * It is up to the idle functions to reenable local interrupts
>  	 */
> -	if (WARN_ON_ONCE(irqs_disabled()))
> +	if (WARN_ON_ONCE(irqs_disabled())) {
>  		local_irq_enable();
> +		perf_lopwr_cb(false);
> +	}
>  }
>  
>  /*
> @@ -279,9 +289,12 @@ static void do_idle(void)
>  	__current_set_polling();
>  	tick_nohz_idle_enter();
>  
> +
>  	while (!need_resched()) {
>  		rmb();
>  
> +		perf_lopwr_cb(true);
> +
>  		local_irq_disable();
>  
>  		if (cpu_is_offline(cpu)) {

I so hate all of this...  but mostly, this doesn't seem right, you're
unconditionally calling this, even for high power idle states.

*IFF* you really have to do this, stuff it in an AMD idle state driver
that knows about the relevant idle states. But really, why can't we just
take the his while the event is running?