Re: [V3] powerpc/perf: Enable PMU counters post partition migration if PMU is active

From: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Lynch <nathanl@linux.ibm.com>,
	maddy@linux.vnet.ibm.com, rnsastry@linux.ibm.com,
	kjain@linux.ibm.com, Nicholas Piggin <npiggin@gmail.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [V3] powerpc/perf: Enable PMU counters post partition migration if PMU is active
Date: Tue, 2 Nov 2021 16:30:33 +0530	[thread overview]
Message-ID: <552CB828-5F44-4310-A46A-81E27DED14A2@linux.vnet.ibm.com> (raw)
In-Reply-To: <87sfwk7z0m.fsf@mpe.ellerman.id.au>

[-- Attachment #1: Type: text/plain, Size: 9755 bytes --]

> On 29-Oct-2021, at 6:45 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> 
> Nicholas Piggin <npiggin@gmail.com <mailto:npiggin@gmail.com>> writes:
>> Excerpts from Athira Rajeev's message of October 29, 2021 1:05 pm:
>>> During Live Partition Migration (LPM), it is observed that perf
>>> counter values reports zero post migration completion. However
>>> 'perf stat' with workload continues to show counts post migration
>>> since PMU gets disabled/enabled during sched switches. But incase
>>> of system/cpu wide monitoring, zero counts were reported with 'perf
>>> stat' after migration completion.
>>> 
>>> Example:
>>> ./perf stat -e r1001e -I 1000
>>>           time             counts unit events
>>>     1.001010437         22,137,414      r1001e
>>>     2.002495447         15,455,821      r1001e
>>> <<>> As seen in next below logs, the counter values shows zero
>>>        after migration is completed.
>>> <<>>
>>>    86.142535370    129,392,333,440      r1001e
>>>    87.144714617                  0      r1001e
>>>    88.146526636                  0      r1001e
>>>    89.148085029                  0      r1001e
>> 
>> This is the output without the patch? After the patch it keeps counting 
>> I suppose? And does the very large count go away too?
>> 
>>> 
>>> Here PMU is enabled during start of perf session and counter
>>> values are read at intervals. Counters are only disabled at the
>>> end of session. The powerpc mobility code presently does not handle
>>> disabling and enabling back of PMU counters during partition
>>> migration. Also since the PMU register values are not saved/restored
>>> during migration, PMU registers like Monitor Mode Control Register 0
>>> (MMCR0), Monitor Mode Control Register 1 (MMCR1) will not contain
>>> the value it was programmed with. Hence PMU counters will not be
>>> enabled correctly post migration.
>>> 
>>> Fix this in mobility code by handling disabling and enabling of
>>> PMU in all cpu's before and after migration. Patch introduces two
>>> functions 'mobility_pmu_disable' and 'mobility_pmu_enable'.
>>> mobility_pmu_disable() is called before the processor threads goes
>>> to suspend state so as to disable the PMU counters. And disable is
>>> done only if there are any active events running on that cpu.
>>> mobility_pmu_enable() is called after the migrate is done to enable
>>> back the PMU counters.
>>> 
>>> Since the performance Monitor counters ( PMCs) are not
>>> saved/restored during LPM, results in PMC value being zero and the
>>> 'event->hw.prev_count' being non-zero value. This causes problem
>>> during updation of event->count since we always accumulate
>>> (event->hw.prev_count - PMC value) in event->count.  If
>>> event->hw.prev_count is greater PMC value, event->count becomes
>>> negative. To fix this, 'prev_count' also needs to be re-initialised
>>> for all events while enabling back the events. Hence read the
>>> existing events and clear the PMC index (stored in event->hw.idx)
>>> for all events im mobility_pmu_disable. By this way, event count
>>> settings will get re-initialised correctly in power_pmu_enable.
>>> 
>>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>>> [ Fixed compilation error reported by kernel test robot ]
>>> Reported-by: kernel test robot <lkp@intel.com>
>>> ---
>>> Changelog:
>>> Change from v2 -> v3:
>>> Addressed review comments from Nicholas Piggin.
>>> - Removed the "migrate" field which was added in initial
>>>   patch to address updation of event count settings correctly
>>>   in power_pmu_enable. Instead read off existing events in
>>>   mobility_pmu_disable before power_pmu_enable.
>>> - Moved the mobility_pmu_disable/enable declaration from
>>>   rtas.h to perf event header file.
>>> 
>>> Addressed review comments from Nathan.
>>> - Moved the mobility function calls from stop_machine
>>>   context out to pseries_migrate_partition. Also now this
>>>   is a per cpu invocation.
>>> 
>>> Change from v1 -> v2:
>>> - Moved the mobility_pmu_enable and mobility_pmu_disable
>>>   declarations under CONFIG_PPC_PERF_CTRS in rtas.h.
>>>   Also included 'asm/rtas.h' in core-book3s to fix the
>>>   compilation warning reported by kernel test robot.
>>> 
>>> arch/powerpc/include/asm/perf_event.h     |  8 +++++
>>> arch/powerpc/perf/core-book3s.c           | 39 +++++++++++++++++++++++
>>> arch/powerpc/platforms/pseries/mobility.c |  7 ++++
>>> 3 files changed, 54 insertions(+)
>>> 
>>> diff --git a/arch/powerpc/include/asm/perf_event.h b/arch/powerpc/include/asm/perf_event.h
>>> index 164e910bf654..88aab6cf840c 100644
>>> --- a/arch/powerpc/include/asm/perf_event.h
>>> +++ b/arch/powerpc/include/asm/perf_event.h
>>> @@ -17,6 +17,14 @@ static inline bool is_sier_available(void) { return false; }
>>> static inline unsigned long get_pmcs_ext_regs(int idx) { return 0; }
>>> #endif
>>> 
>>> +#ifdef CONFIG_PPC_PERF_CTRS
>>> +void mobility_pmu_disable(void *unused);
>>> +void mobility_pmu_enable(void *unused);
>>> +#else
>>> +static inline void mobility_pmu_disable(void *unused) { }
>>> +static inline void mobility_pmu_enable(void *unused) { }
>>> +#endif
>>> +
>>> #ifdef CONFIG_FSL_EMB_PERF_EVENT
>>> #include <asm/perf_event_fsl_emb.h>
>>> #endif
>>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>>> index 73e62e9b179b..2e8c4c668fa3 100644
>>> --- a/arch/powerpc/perf/core-book3s.c
>>> +++ b/arch/powerpc/perf/core-book3s.c
>>> @@ -1343,6 +1343,33 @@ static void power_pmu_disable(struct pmu *pmu)
>>> 	local_irq_restore(flags);
>>> }
>>> 
>>> +/*
>>> + * Called from pseries_migrate_partition() function
>>> + * before migration, from powerpc/mobility code.
>>> + */
> 
> These are only needed if pseries is built, so should be inside a PSERIES
> ifdef.

Sure mpe, will address this change in next version
> 
> This function should handle iterating over CPUs, that shouldn't be left
> up to the mobility.c code.
> 
> And the names should be something like pmu_start_migration(),
> pmu_finish_migration().

Ok, will change
> 
>>> +void mobility_pmu_disable(void *unused)
>>> +{
>>> +	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
>>> +	struct perf_event *event;
>>> +
>>> +	if (cpuhw->n_events != 0) {
>>> +		int i;
>>> +
>>> +		power_pmu_disable(NULL);
>>> +		/*
>>> +		 * Read off any pre-existing events because the register
>>> +		 * values may not be migrated.
>>> +		 */
>>> +		for (i = 0; i < cpuhw->n_events; ++i) {
>>> +			event = cpuhw->event[i];
>>> +			if (event->hw.idx) {
>>> +				power_pmu_read(event);
>>> +				event->hw.idx = 0;
>>> +			}
>>> +		}
>>> +	}
>>> +}
>>> +
>>> /*
>>>  * Re-enable all events if disable == 0.
>>>  * If we were previously disabled and events were added, then
>>> @@ -1515,6 +1542,18 @@ static void power_pmu_enable(struct pmu *pmu)
>>> 	local_irq_restore(flags);
>>> }
>>> 
>>> +/*
>>> + * Called from pseries_migrate_partition() function
>>> + * after migration, from powerpc/mobility code.
>>> + */
>>> +void mobility_pmu_enable(void *unused)
>>> +{
>>> +	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
>>> +
>>> +	cpuhw->n_added = cpuhw->n_events;
>>> +	power_pmu_enable(NULL);
>>> +}
>>> +
>>> static int collect_events(struct perf_event *group, int max_count,
>>> 			  struct perf_event *ctrs[], u64 *events,
>>> 			  unsigned int *flags)
>>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>>> index e83e0891272d..3e96485ccba4 100644
>>> --- a/arch/powerpc/platforms/pseries/mobility.c
>>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>>> @@ -22,6 +22,7 @@
>>> #include <linux/delay.h>
>>> #include <linux/slab.h>
>>> #include <linux/stringify.h>
>>> +#include <linux/perf_event.h>
>>> 
>>> #include <asm/machdep.h>
>>> #include <asm/rtas.h>
>>> @@ -631,12 +632,18 @@ static int pseries_migrate_partition(u64 handle)
>>> 	if (ret)
>>> 		return ret;
>>> 
>>> +	/* Disable PMU before suspend */
>>> +	on_each_cpu(&mobility_pmu_disable, NULL, 0);
>> 
>> Why was this moved out of stop machine and to an IPI?
>> 
>> My concern would be, what are the other CPUs doing at this time? Is it 
>> possible they could take interrupts and schedule? Could that mess up the
>> perf state here?
> 
> pseries_migrate_partition() is called directly from migration_store(),
> which is the sysfs store function, which can be called concurrently by
> different CPUs.
> 
> It's also potentially called from rtas_syscall_dispatch_ibm_suspend_me(),
> from sys_rtas(), again with no locking.
> 
> So we could have two CPUs calling into here at the same time, which
> might not crash, but is unlikely to work well.
> 
> I think the lack of locking might have been OK in the past because only
> one CPU will successfully get the other CPUs to call do_join() in
> pseries_suspend(). But I could be wrong.
> 
> Anyway, now that we're mutating the PMU state before suspending we need
> to be more careful. So I think we need a lock around the whole sequence.

Thanks for the review comments.

The PMU callback invocations were from inside stop machine ( inside “do_join” ) in initial version.
Moved these out to pseries_migrate_partition as an IPI so as to minimise calls to other sub-system from “do_join” context, as pointed by Nathan.
But if it is not safe to have per-cpu invocation from pseries_migrate_partition, should we handle this under the interrupt disable path in do_join itself ( as in the initial version ) ? 

Please share your thoughts/suggestions.

Thanks
Athira

> 
> cheers

[-- Attachment #2: Type: text/html, Size: 35735 bytes --]