From: Marc Zyngier <maz@kernel.org>
To: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, Will Deacon <will@kernel.org>,
	Hector Martin <marcan@marcan.st>, Sven Peter <sven@svenpeter.dev>,
	Alyssa Rosenzweig <alyssa@rosenzweig.io>,
	Rob Herring <robh+dt@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Dougall <dougallj@gmail.com>,
	kernel-team@android.com
Subject: Re: [PATCH v2 8/8] drivers/perf: Add Apple icestorm/firestorm CPU PMU driver
Date: Thu, 02 Dec 2021 15:39:46 +0000	[thread overview]
Message-ID: <877dcnm2wt.wl-maz@kernel.org> (raw)
In-Reply-To: <YaepolizIKkzDQoV@FVFF77S0Q05N>

On Wed, 01 Dec 2021 16:58:10 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> On Wed, Dec 01, 2021 at 01:49:09PM +0000, Marc Zyngier wrote:
> > Add a new, weird and wonderful driver for the equally weird Apple
> > PMU HW. Although the PMU itself is functional, we don't know much
> > about the events yet, so this can be considered as yet another
> > random number generator...
> 
> It's really frustrating that Apple built this rather than the
> architected PMU, because we've generally pushed back on
> IMPLEMENTATION DEFINED junk in this area, and supporting this makes
> it harder to push back on other vendors going the same route, which
> I'm not keen on. That, and the usual state of IMP-DEF stuff making
> this stupidly painful to reason about.

As much as I agree with you on the stinking aspect of an IMPDEF PMU,
this doesn't contradict the architecture. To avoid the spread of this
madness, forbidding an IMPDEF implementation in the architecture would
be the right thing to do.

> 
> I can see that we can get this working bare-metal with DT, but I
> really don't want to try to support this in other cases (e.g. in a
> VM, potentially with ACPI), or this IMP-DEFness is going to spread
> more throughout the arm_pmu code.

Well, an alternative would be to sidestep the arm_pmu framework
altogether.  Which would probably suck even more.

> How does this interact with PMU emulation for a KVM guest?

It doesn't. No non-architected PMU will get exposed to a KVM guest,
and the usual "inject an UNDEF exception on IMPDEF access" applies. As
far as I am concerned, KVM is purely architectural and doesn't need to
be encumbered with this.

> 
> I have a bunch of comments and questions below.
> 
> [...]
> 
> > +#define ANY_BUT_0_1			GENMASK(9, 2)
> > +#define ONLY_2_TO_7			GENMASK(7, 2)
> > +#define ONLY_2_4_6			(BIT(2) | BIT(4) | BIT(6))
> > +#define ONLY_5_6_7			GENMASK(7, 5)
> 
> For clarity/consistency it might be better to use separate BIT()s for
> ONLY_5_6_7 too.

Sure.
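
i.e. something along these lines (strictly equivalent to GENMASK(7, 5),
just spelled out bit by bit):

#define ONLY_5_6_7			(BIT(5) | BIT(6) | BIT(7))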

[...]

> > +static const u16 m1_pmu_event_affinity[M1_PMU_PERFCTR_LAST + 1] = {
> > +	[0 ... M1_PMU_PERFCTR_LAST]	= ANY_BUT_0_1,
> > +	[M1_PMU_PERFCTR_UNKNOWN_01]	= BIT(7),
> > +	[M1_PMU_PERFCTR_CPU_CYCLES]	= ANY_BUT_0_1 | BIT(0),
> > +	[M1_PMU_PERFCTR_INSTRUCTIONS]	= BIT(7) | BIT(1),
> > +	[M1_PMU_PERFCTR_UNKNOWN_8d]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_8e]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_8f]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_90]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_93]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_94]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_95]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_96]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_97]	= BIT(7),
> > +	[M1_PMU_PERFCTR_UNKNOWN_98]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_99]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_9a]	= BIT(7),
> > +	[M1_PMU_PERFCTR_UNKNOWN_9b]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_9c]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_9f]	= BIT(7),
> > +	[M1_PMU_PERFCTR_UNKNOWN_bf]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_c0]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_c1]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_c4]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_c5]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_c6]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_c8]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_ca]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_cb]	= ONLY_5_6_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_f5]	= ONLY_2_4_6,
> > +	[M1_PMU_PERFCTR_UNKNOWN_f6]	= ONLY_2_4_6,
> > +	[M1_PMU_PERFCTR_UNKNOWN_f7]	= ONLY_2_4_6,
> > +	[M1_PMU_PERFCTR_UNKNOWN_f8]	= ONLY_2_TO_7,
> > +	[M1_PMU_PERFCTR_UNKNOWN_fd]	= ONLY_2_4_6,
> > +};
> 
> I don't entirely follow what's going on here. Is this a matrix
> scheme like what QC had in their IMP-DEF Krait PMUs? See:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/perf_event_v7.c?h=v5.16-rc3#n1286

It is nowhere near as complicated as that.

> I'm a bit worried about this, since if this is of that shape, there
> are potential constraints on which counters and/or events you can
> use concurrently, and if you violate those they can conflict. If so,
> we need to be *very* careful about the abstraction we provide to
> userspace.

The HW does have placement constraints (this is what this per-event
bitmap is expressing), but the counting seems completely independent
as long as you can find a suitable counter to place the event on. Which
means that if you try to count (for example) 4 events that would only fit
in {5,6,7}, we'll say NO to the fourth one.

As I say somewhere in a comment, we could do a better job if we had a
global view of the events to be counted, and split them into batches
that the perf core would then schedule.
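
To make that a bit more concrete, the placement logic boils down to
something like this (a rough sketch of the get_event_idx callback;
M1_PMU_NR_COUNTERS and the exact shape are approximations, not the
code as posted):

static int m1_pmu_get_event_idx(struct pmu_hw_events *cpuc,
				struct perf_event *event)
{
	unsigned long evtype = event->hw.config_base & M1_PMU_CFG_EVENT;
	unsigned long affinity = m1_pmu_event_affinity[evtype];
	int idx;

	/*
	 * Walk the counters this event can live on and grab the first
	 * free one. If they are all busy, reject the event; that's
	 * the "NO to the fourth one" above.
	 */
	for_each_set_bit(idx, &affinity, M1_PMU_NR_COUNTERS)
		if (!test_and_set_bit(idx, cpuc->used_mask))
			return idx;

	return -EAGAIN;
}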

If you think any of this somehow breaks the userspace ABI, please let
me know (my understanding of perf is pretty limited...).

> 
> [...]
> 
> > +/* Low level accessors. No synchronisation. */
> > +#define PMU_READ_COUNTER(_idx)						\
> > +	case _idx:	return read_sysreg_s(SYS_IMP_APL_PMC## _idx ##_EL1)
> > +
> > +#define PMU_WRITE_COUNTER(_val, _idx)					\
> > +	case _idx:							\
> > +		write_sysreg_s(_val, SYS_IMP_APL_PMC## _idx ##_EL1);	\
> > +		return
> > +
> > +static u64 m1_pmu_read_hw_counter(unsigned int index)
> > +{
> > +	switch (index) {
> > +		PMU_READ_COUNTER(0);
> > +		PMU_READ_COUNTER(1);
> > +		PMU_READ_COUNTER(2);
> > +		PMU_READ_COUNTER(3);
> > +		PMU_READ_COUNTER(4);
> > +		PMU_READ_COUNTER(5);
> > +		PMU_READ_COUNTER(6);
> > +		PMU_READ_COUNTER(7);
> > +		PMU_READ_COUNTER(8);
> > +		PMU_READ_COUNTER(9);
> > +	}
> > +
> > +	BUG();
> > +}
> > +
> > +static void m1_pmu_write_hw_counter(u64 val, unsigned int index)
> > +{
> > +	switch (index) {
> > +		PMU_WRITE_COUNTER(val, 0);
> > +		PMU_WRITE_COUNTER(val, 1);
> > +		PMU_WRITE_COUNTER(val, 2);
> > +		PMU_WRITE_COUNTER(val, 3);
> > +		PMU_WRITE_COUNTER(val, 4);
> > +		PMU_WRITE_COUNTER(val, 5);
> > +		PMU_WRITE_COUNTER(val, 6);
> > +		PMU_WRITE_COUNTER(val, 7);
> > +		PMU_WRITE_COUNTER(val, 8);
> > +		PMU_WRITE_COUNTER(val, 9);
> > +	}
> > +
> > +	BUG();
> > +}
> 
> As an aside, since this pattern has cropped up in a few places, maybe we want
> to look into scripting the generation of sysreg bank accessors like this.
> 
> [...]
> 
> > +static void m1_pmu_configure_counter(unsigned int index, u8 event,
> > +				     bool user, bool kernel)
> > +{
> > +	u64 val, user_bit, kernel_bit;
> > +	int shift;
> > +
> > +	switch (index) {
> > +	case 0 ... 7:
> > +		user_bit = BIT(get_bit_offset(index, PMCR1_COUNT_A64_EL0_0_7));
> > +		kernel_bit = BIT(get_bit_offset(index, PMCR1_COUNT_A64_EL1_0_7));
> > +		break;
> > +	case 8 ... 9:
> > +		user_bit = BIT(get_bit_offset(index - 8, PMCR1_COUNT_A64_EL0_8_9));
> > +		kernel_bit = BIT(get_bit_offset(index - 8, PMCR1_COUNT_A64_EL1_8_9));
> > +		break;
> 
> When this says 'EL1', presumably that's counting at EL2 in VHE?

It does.

> Are there separate EL1 / EL2 controls, or anything of that sort we need to be
> aware of?

No, there is a single, per-counter control for EL0 and EL2. I couldn't
get the counters to report anything useful while a guest was running,
but that doesn't mean such a control doesn't exist.

> 
> [...]
> 
> > +/* arm_pmu backend */
> > +static void m1_pmu_enable_event(struct perf_event *event)
> > +{
> > +	struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
> > +	struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
> > +	unsigned long flags;
> > +	bool user, kernel;
> > +	u8 evt;
> > +
> > +	evt = event->hw.config_base & M1_PMU_CFG_EVENT;
> > +	user = event->hw.config_base & M1_PMU_CFG_COUNT_USER;
> > +	kernel = event->hw.config_base & M1_PMU_CFG_COUNT_KERNEL;
> > +
> > +	raw_spin_lock_irqsave(&cpuc->pmu_lock, flags);
> 
> You shouldn't need this locking. The perf core always calls into the HW access
> functions with IRQs disabled and we don't do cross-cpu state modification.
> Likewise elsewhere in this file.
> 
> We pulled similar out of the architectural PMU driver in commit:
> 
> 2a0e2a02e4b71917 ("arm64: perf: Remove PMU locking")
> 
> ... though that says non-preemptible when it should say non-interruptible.

Ah, nice. I'll get rid of it.
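
The enable path then becomes something like this (a sketch of the
lock-free version only; the *_counter helpers below stand in for
whatever actually pokes the enable/IRQ bits):

static void m1_pmu_enable_event(struct perf_event *event)
{
	bool user, kernel;
	u8 evt;

	evt = event->hw.config_base & M1_PMU_CFG_EVENT;
	user = event->hw.config_base & M1_PMU_CFG_COUNT_USER;
	kernel = event->hw.config_base & M1_PMU_CFG_COUNT_KERNEL;

	/*
	 * No lock needed: the perf core calls us with interrupts
	 * disabled, and we only ever touch this CPU's own registers.
	 */
	m1_pmu_configure_counter(event->hw.idx, evt, user, kernel);
	m1_pmu_enable_counter(event->hw.idx);		/* placeholder */
	m1_pmu_enable_counter_interrupt(event->hw.idx);	/* placeholder */
	isb();
}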

[...]

> > +	state = read_sysreg_s(SYS_IMP_APL_PMCR0_EL1);
> > +	overflow = read_sysreg_s(SYS_IMP_APL_PMSR_EL1);
> 
> I assume the overflow behaviour is free-running rather than stopping?

Configurable, apparently. At the moment, I set it to stop on overflow.
Happy to change the behaviour though.

> > +	if (!overflow) {
> > +		ret = IRQ_NONE;
> > +		goto out;
> > +	}
> > +
> > +	regs = get_irq_regs();
> > +
> > +	for (idx = 0; idx < cpu_pmu->num_events; idx++) {
> > +		struct perf_event *event = cpuc->events[idx];
> > +		struct perf_sample_data data;
> > +
> > +		if (!event)
> > +			continue;
> > +
> > +		armpmu_event_update(event);
> > +		perf_sample_data_init(&data, 0, event->hw.last_period);
> > +		if (!armpmu_event_set_period(event))
> > +			continue;
> > +
> > +		if (perf_event_overflow(event, &data, regs))
> > +			__m1_pmu_disable_event(event);
> > +	}
> > +
> > +out:
> > +	state &= ~PMCR0_IACT;
> > +	write_sysreg_s(state, SYS_IMP_APL_PMCR0_EL1);
> > +	isb();
> > +
> > +	raw_spin_unlock_irqrestore(&cpuc->pmu_lock, flags);
> > +
> > +	return ret;
> > +}
> 
> [...]
> 
> > +static int m1_pmu_device_probe(struct platform_device *pdev)
> > +{
> > +	int ret;
> > +
> > +	ret = arm_pmu_device_probe(pdev, m1_pmu_of_device_ids, NULL);
> > +	if (!ret) {
> > +		/*
> > +		 * If probe succeeds, taint the kernel as this is all
> > +		 * undocumented, implementation defined black magic.
> > +		 */
> > +		add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
> > +	}
> > +
> > +	return ret;
> > +}
> 
> Hmmm... that means we're always going to TAINT on this HW with an appropriate
> DT, which could mask other reasons TAINT_CPU_OUT_OF_SPEC would be set, even
> where the user isn't using the PMU.
> 
> Maybe we should have a cmdline option to opt-in to using the IMP-DEF PMU (and
> only tainting in that case)?

I'd rather taint on first use. Requiring a command-line argument for
this seems a bit over the top...
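
Something along the lines of the below, i.e. moving the add_taint()
call out of probe and into the event setup path (only a sketch, with
the actual mapping logic elided):

static int m1_pmu_map_event(struct perf_event *event)
{
	/*
	 * Taint when someone actually opens a counter, not merely
	 * because the DT describes the PMU.
	 */
	add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);

	/* ... existing event mapping ... */
	return 0;
}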

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
