Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

From: Pierre Gondois <pierre.gondois@arm.com>
To: Yang Shi <yang@os.amperecomputing.com>,
	Ionela Voinescu <Ionela.Voinescu@arm.com>
Cc: viresh.kumar@linaro.org, scott@os.amperecomputing.com,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	"Rafael J. Wysocki" <rafael@kernel.org>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error
Date: Thu, 20 Apr 2023 18:01:15 +0200	[thread overview]
Message-ID: <cc32f950-ea78-87cb-e708-6d42d1e58cc8@arm.com> (raw)
In-Reply-To: <1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com>

>>
>> You say that the cause of this is a congestion in the interconnect. I
>> don't
>> see a way to check that right now.
>> However your trace is on the CPU0, so maybe all the other cores were
>> shutdown
>> in your test. If this is the case, do you think a congestion could
>> happen with
>> only one CPU ?
> 
> No, other CPUs were not shut down in my test. I just ran "yes" on all
> cores except CPU 0, then ran the reading freq script. Since all other
> cores are busy, so the script should be always running on CPU 0.
> 
> Since the counters, memory and other devices are on the interconnect, so
> the congestion may be caused by plenty of factors IIUC.

+Ionela

Ionela pointed me to the following patch-set, which seems realated:
https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/

One thing that we didn't check I believe is and that Ionela pointed out
is that we don't know whether we are accessing the present CPU or a remote
CPU'AMUs. In the latter case there would be IPIs and possible delays in
waking up/accessing the remote CPU).

> 
>>
>> Just 2 other comments:
>> a-
>> It might be interesting to change the order in which cpc registers are
>> read
>> just to see if it has an impact, but even if it has, I m not sure how
>> this
>> could be exploitable.
>> Just in case, I mean doing that, but I think that b. might be better
>> to try.
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..479b55006020 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>                  }
>>          }
>>
>> -       cpc_read(cpunum, delivered_reg, &delivered);
>>          cpc_read(cpunum, reference_reg, &reference);
>> +       cpc_read(cpunum, delivered_reg, &delivered);
>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>>          /*
>>
>> b-
>> In the trace that you shared, the cpc_read() calls in the fist
>> cppc_get_perf_ctrs() calls seem to always take a bit more time than in
>> the
>> second cppc_get_perf_ctrs() call.
>> Would it be possible to collect traces similar as above with 3 or 4
>> calls to
>> cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in
>> the first
>> call, accessing the cpc registers takes more time than in the
>> following calls,
>> due to cache misses or other reasons.
> 
> Cache miss? The counters should be not cached and reading the counters
> should not hit cache IIUC.

Yes you are correct, what I said is copmletely wrong.

> 
>> Ideally statistics on the result would be the best, or if you have a
>> trace.dat
>> to share containing a trace with multiple cppc_cpufreq_get_rate() calls.
> 
> Tried option b, I managed to get histogram:
> 
> @hist_first_ns[cat]:
> [4K, 8K)          112321
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)            212
> |                                                    |
> [16K, 32K)            25
> |                                                    |
> [32K, 64K)            59
> |                                                    |
> [64K, 128K)            6
> |                                                    |
> [128K, 256K)           9
> |                                                    |
> 
> @hist_second_ns[cat]:
> [2K, 4K)          112590
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)               4
> |                                                    |
> [8K, 16K)              0
> |                                                    |
> [16K, 32K)            15
> |                                                    |
> [32K, 64K)            18
> |                                                    |
> [64K, 128K)            1
> |                                                    |
> [128K, 256K)           4
> |                                                    |
> 
> The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
> script can't tell the second, the third and the fourth, so all them are
> shown as "second". Anyway it seems fine. We can tell the first read took
> longer than the later ones for the most time.
> 
> And a typical func_graph trace shows:
> 
> # tracer: function_graph
> #
> #     TIME        CPU  DURATION                  FUNCTION CALLS
> #      |          |     |   |                     |   |   | |
>    4447.171333 |     0)               |  cppc_cpufreq_get_rate
> [cppc_cpufreq]() {
>    4447.171334 |     0)               |    cpufreq_cpu_get() {
>    4447.171334 |     0)   1.060 us    | _raw_read_lock_irqsave();
>    4447.171336 |     0)   0.560 us    | _raw_read_unlock_irqrestore();
>    4447.171337 |     0)   3.480 us    | }
>    4447.171338 |     0)   0.400 us    | cpufreq_cpu_put();
>    4447.171338 |     0)               |    cppc_get_perf_ctrs() {
>    4447.171339 |     0)   0.720 us    | cpc_read.isra.0();
>    4447.171341 |     0)   0.700 us    | cpc_read.isra.0();
>    4447.171342 |     0)   0.380 us    | cpc_read.isra.0();
>    4447.171342 |     0)   0.600 us    | cpc_read.isra.0();
>    4447.171343 |     0)   4.900 us    | }
>    4447.171344 |     0)               |    __delay() {
>    4447.171344 |     0)   0.540 us    | arch_timer_evtstrm_available();
>    4447.171346 |     0)   2.420 us    | }
>    4447.171347 |     0)               |    cppc_get_perf_ctrs() {
>    4447.171347 |     0)   0.540 us    | cpc_read.isra.0();
>    4447.171348 |     0)   0.520 us    | cpc_read.isra.0();
>    4447.171349 |     0)   0.400 us    | cpc_read.isra.0();
>    4447.171350 |     0)   0.440 us    | cpc_read.isra.0();
>    4447.171350 |     0)   3.660 us    | }
>    4447.171351 |     0)               |    __delay() {
>    4447.171351 |     0)   0.400 us    | arch_timer_evtstrm_available();
>    4447.171353 |     0)   2.400 us    | }
>    4447.171353 |     0)               |    cppc_get_perf_ctrs() {
>    4447.171354 |     0)   0.540 us    | cpc_read.isra.0();
>    4447.171355 |     0)   0.540 us    | cpc_read.isra.0();
>    4447.171356 |     0)   0.380 us    | cpc_read.isra.0();
>    4447.171356 |     0)   0.420 us    | cpc_read.isra.0();
>    4447.171357 |     0)   3.640 us    | }
>    4447.171357 |     0)               |    __delay() {
>    4447.171358 |     0)   0.380 us    | arch_timer_evtstrm_available();
>    4447.171360 |     0)   2.380 us    |    }
>    4447.171360 |     0)               |    cppc_get_perf_ctrs() {
>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>    4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
>    4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
>    4447.171364 |     0)   3.640 us    |    }
>    4447.171364 |     0)   0.520 us    |    cppc_cpufreq_perf_to_khz
> [cppc_cpufreq]();
>    4447.171365 |     0) + 34.240 us   |  }
> 
> It also shows the first reading typically took longer than the later
> ones. The second, the third and the fourth actually took almost same time.
> 
> I also tried to read perf_fb_ctrs_t0 twice (so total 3 reads, 2 for t0,
> 1 for t1, 2us delay between each read), but I didn't see noticeable
> improvement. 4 reads (2 for t0, 2 for t1) does show some noticeable
> improvement.
> 

Thanks for the new data.

>>
>> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>> b/drivers/cpufreq/cppc_cpufreq.c
>> index 022e3555407c..6370f2f0bdad 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -853,6 +853,20 @@ static unsigned int
>> cppc_cpufreq_get_rate(unsigned int cpu)
>>
>>          udelay(2); /* 2usec delay between sampling */
>>
>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> +       if (ret)
>> +               return ret;
>> +
>> +       udelay(2); /* 2usec delay between sampling */
>> +
>> +       /* Do a third call. */
>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> +       if (ret)
>> +               return ret;
>> +
>> +       udelay(2); /* 2usec delay between sampling */
>> +
>> +       /* Do a fourth call. */
>>          ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>          if (ret)
>>                  return ret;
>>
>>>
>>> The above trace data shows some cpc reading took a little bit longer
>>> than usual. I suspected it was caused by interconnect congestion.
>>>
>>>
>>> So it looks like IRQ is the major contributing factor of high error
>>> (4xxxxxx KHz), interconnect congestion is the major contributing factor
>>> of low error (3xxxxxx KHz).
>>>
>>> So I did the below test:
>>> 1. Disable IRQ: The high errors were gone (> 3700000KHz), but low errors
>>> were still seen.
>>> 2.10us delay: The high errors were still seen.
>>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>>
>>> I think the test result also supports the tracing data.
>>>
>>>
>>> I also got some confusion about calling cppc_cpufreq_get_rate() with irq
>>> disabled. Rafael thought 10ms delay is too long because the function may
>>> be called with irq disabled. But a deeper look at the function shows it
>>> should *NOT* be called with irq disabled at all.
>>>
>>> First, if pcc channel is used, cpc reading may take over 100ms, it is
>>> way larger the proposed 10ms delay.
>>> Second, reading from cpc channel needs to take a semaphore, so it may
>>> sleep. But sleep with IRQ disabled is not allowed.
>>
>> Yes right, however the semaphore is not taken in between the sequence of
>> cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
>> should
>> be acceptable:
> 
> Yeah, we should be able to find a smaller irq disable section.
> 
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..105a7e2ffffa 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>          struct cppc_pcc_data *pcc_ss_data = NULL;
>>          u64 delivered, reference, ref_perf, ctr_wrap_time;
>>          int ret = 0, regs_in_pcc = 0;
>> +       unsigned long flags;
>>
>>          if (!cpc_desc) {
>>                  pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>                  }
>>          }
>>
>> +       local_irq_save(flags);
>> +
>>          cpc_read(cpunum, delivered_reg, &delivered);
>>          cpc_read(cpunum, reference_reg, &reference);
>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>> +       local_irq_restore(flags);
>> +
> 
> cpc_read_ffh() would return -EPERM if irq is disabled.
> 
> So, the irq disabling must happen for mmio only in cpc_read(), for example:

I thought the issue was that irqs could happen in between cpc_read() functions,
the patch below would not cover it. If the frequency is more accurate
with this patch, I think I don't understand something.

(asking for more information)
Just to check, the core/perf counters are AMUs and the other CPPC registers
are mmio right ? Is it possible to know the CPPC registers that are
implemented on your platform ?
Also is it possible which platform you are using ?

> 
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index c51d3ccb4cca..f3c92d844074 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
> cpc_register_resource *reg_res, u64 *val)
>           void __iomem *vaddr = NULL;
>           int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
>           struct cpc_reg *reg = &reg_res->cpc_entry.reg;
> +       unsigned long flags;
> 
>           if (reg_res->type == ACPI_TYPE_INTEGER) {
>                   *val = reg_res->cpc_entry.int_value;
> @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
> cpc_register_resource *reg_res, u64 *val)
>                   return
> acpi_os_read_memory((acpi_physical_address)reg->address,
>                                   val, reg->bit_width);
> 
> +       local_irq_save(flags);
>           switch (reg->bit_width) {
>           case 8:
>                   *val = readb_relaxed(vaddr);
> @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
> cpc_register_resource *reg_res, u64 *val)
>                   *val = readq_relaxed(vaddr);
>                   break;
>           default:
> +               local_irq_restore(flags);
>                   pr_debug("Error: Cannot read %u bit width from PCC for
> ss: %d\n",
>                            reg->bit_width, pcc_ss_id);
>                   return -EFAULT;
>           }
> +       local_irq_restore(flags);
> 
>           return 0;
>    }
> 
>>          /*
>>           * Per spec, if ctr_wrap_time optional register is
>> unsupported, then the
>>           * performance counters are assumed to never wrap during the
>> lifetime of
>>
>>> Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
>>> send IPI so it requires IRQ enabled.
>>
>> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
>> disabled,
>> it should not prevent it to send IPIs no ?
> 
> It can't work with irq disabled. The comment in counters_read_on_cpu()
> says "Abort call on counterless CPU or when interrupts are disabled -
> can lead to deadlock in smp sync call."
> 
> 
> And it just returns -EPERM and raise a warning if irq is disabled.

Ok right,

Regards,
Piere