From: Jonathan Schwender <schwenderjonathan@gmail.com>
To: "Ahmed S. Darwish" <a.darwish@linutronix.de>
Cc: linux-rt-users@vger.kernel.org
Subject: Re: rt-tests: cyclictest: Add option to specify main pid affinity
Date: Mon, 29 Mar 2021 16:37:45 +0200
Message-ID: <f3cfb8ce-1d06-e1f1-a9e9-129595bbe3d2@gmail.com>
In-Reply-To: <YFsHN/w+2mDHr1W8@lx-t490>

Hi Ahmed,


On 3/24/21 10:32 AM, Ahmed S. Darwish wrote:
> Hi Jonathan,
>
>
> Since I'm doing some CAT-related stuff on RT tasks vs. GPU workloads,
> I'm curious, how much was the benefit of CAT ON/OFF?

Since you mention CAT, I'm assuming you're testing iGPU workloads rather 
than a dedicated GPU. Or is there any benefit to using CAT with a 
dedicated GPU?


> In your benchmarks you show that the combination of --mainaffinity, CPU
> isolation, and CAT, improves worst case latency by 2 micro seconds. If
> you keep everything as-is, but disable only CAT, how much change happens
> in the results?

First I'd like to mention that my test system had an inclusive cache 
architecture. I'd guess that the difference between CAT and no CAT is 
smaller for exclusive or non-inclusive caches (assuming cyclictest is 
running on an isolated CPU).

So the results will depend on the number of isolated CPUs and on how 
much of the shared L3 cache the load on the housekeeping CPUs uses.
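
(As a side note: which CPUs actually share an L3 slice can be checked 
via the sysfs cache topology; index3 is usually the L3 on x86, but the 
index number can differ:)

```
# CPUs sharing the L3 cache with CPU 11 (index3 is the L3 here,
# the index number may differ on other systems)
cat /sys/devices/system/cpu/cpu11/cache/index3/shared_cpu_list
```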

Rendered Markdown: 
https://gist.github.com/jschwe/3502dbf1e56c85e9bf1a340041885b33

# Isolation capabilities without CAT

## Test 2021-01-31 - Isolate all CPUs on NUMA node 1

The figure below shows a worst-case latency of 4 microseconds
measured by cyclictest on the isolated CPUs on NUMA node 1.

cmdline: `nosmt 
isolcpus=domain,managed_irq,wq,rcu,misc,kthread,1,3,5,7,9,11 
rcu_nocbs=1,3,5,7,9,11 irqaffinity=0,2,4 maxcpus=12 rcu_nocb_poll 
nowatchdog tsc=nowatchdog processor.max_cstate=1 intel_idle.max_cstate=0`

Test parameters: `sudo taskset -c 0-11 rteval --duration=24h 
--loads-cpulist=0,2,4,6,8,10 --measurement-cpulist=0-11`

![Figure: Latency of completely isolated node vs housekeeping 
node](https://gist.githubusercontent.com/jschwe/3502dbf1e56c85e9bf1a340041885b33/raw/962244e4e5309507feb0b4ec0627efbabe064c85/2021-01-31.png)


## Test 2021-02-01 - Isolate only CPU 11

The figure below shows a worst-case latency of 11 microseconds for the 
isolated CPU 11.
Interestingly, the worst-case latencies of the housekeeping CPUs also 
increased compared to the previous test.
This is consistent with my other tests though: the worst-case latency 
of the housekeeping CPUs goes down if I isolate all (or all but one) of 
the CPUs on node 1.

cmdline: `nosmt isolcpus=domain,managed_irq,wq,rcu,misc,kthread,11 
rcu_nocbs=11 irqaffinity=0,2,4 maxcpus=12 rcu_nocb_poll nowatchdog 
tsc=nowatchdog processor.max_cstate=1 intel_idle.max_cstate=0`

Test parameters: `sudo taskset -c 0-11 rteval --duration=24h 
--loads-cpulist=0-10 --measurement-cpulist=0-11`

![Figure: CPU 11 latency with load on neighboring 
CPUs](https://gist.githubusercontent.com/jschwe/3502dbf1e56c85e9bf1a340041885b33/raw/962244e4e5309507feb0b4ec0627efbabe064c85/2021-02-01.png)

Note: The error bars show the unbiased standard error of the mean.

> Also, how many classes of service (CLOS) your CPU has? How was the cache
> bitmask divided vis-a-vis the available CLOSes? And did you assign
> isolated CPUs to one CLOS, and non-isolated CPUs to a different CLOS? Or
> was the division more granular?

I don't have access to the system anymore, but I think it had 8 CLOS 
available (according to resctrl).
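
(For anyone who wants to check on their own system: once resctrl is 
mounted, the number of CLOSes is exposed under its info directory, e.g.:)

```
mount -t resctrl resctrl /sys/fs/resctrl
cat /sys/fs/resctrl/info/L3/num_closids
```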

I always used exclusive bitmasks. I mostly used one CLOS for the 
isolated CPUs plus the default CLOS, and sometimes an additional CLOS 
for tid-based CAT. Due to the "exclusive" setting in resctrl I had to 
take away one way of the node 0 cache even for CLOSes that were only 
intended for node 1, which is a bit unfortunate.
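
To illustrate, here is a rough sketch of the kind of group setup I mean 
(the group name and way masks are just placeholders assuming a 12-way 
L3, not the exact values from my tests):

```
# Shrink the default group so the reserved ways do not overlap with it
echo "L3:0=0x7fe;1=0x00f" > /sys/fs/resctrl/schemata
# Group for the isolated CPUs on NUMA node 1
mkdir /sys/fs/resctrl/isolated
echo 1,3,5,7,9,11 > /sys/fs/resctrl/isolated/cpus_list
# Reserve the remaining ways exclusively; note that even the node 0
# cache (id 0) has to give up at least one way, although this group
# is only meant for node 1
echo "L3:0=0x001;1=0xff0" > /sys/fs/resctrl/isolated/schemata
echo exclusive > /sys/fs/resctrl/isolated/mode
```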

I also tested tid-based vs. CPU-based CAT on isolated CPUs, and the 
take-away was that it doesn't matter too much:

tid-based CAT visibly (negatively) impacts the best-case latencies (the 
1 microsecond bin). However, the differences in the worst-case latencies 
were minor.
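
The setup only differs in where the assignment happens, roughly (the 
group name is again just a placeholder):

```
# CPU-based CAT: everything that runs on these CPUs uses the group's CLOS
echo 11 > /sys/fs/resctrl/rtgroup/cpus_list

# tid-based CAT: only the listed threads use the group's CLOS,
# one tid per write (e.g. the cyclictest measurement threads)
for tid in $(ps -C cyclictest -L -o tid=); do
    echo "$tid" > /sys/fs/resctrl/rtgroup/tasks
done
```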

In one test, I used CDP to reserve 4 ways (4 MiB) each for code and data 
(so 8 ways in total) for one cyclictest instance (with 3 measurement 
threads). With CPU-based CAT the utilization oscillated between 0.98 MB 
and 1.11 MB. With tid-based CAT, the utilization oscillated between 
98 kB and 163 kB.
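
For the CDP test the main difference is the mount option and the split 
code/data schemata, roughly like this (way masks again just placeholders):

```
# Remount resctrl with L3 code/data prioritization enabled
umount /sys/fs/resctrl
mount -t resctrl -o cdp resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/cyclictest
# 4 ways each for code and data on cache id 1 (8 ways in total)
echo "L3CODE:0=0x001;1=0x0f0" > /sys/fs/resctrl/cyclictest/schemata
echo "L3DATA:0=0x001;1=0xf00" > /sys/fs/resctrl/cyclictest/schemata
```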

In the next test I used plain CAT (no CDP) to reserve 2 ways (2 MiB) 
shared between code and data, also for one cyclictest instance with 3 
measurement threads. In this case the CPU-based approach utilized 
between 0.45 MB and 0.85 MB of the reserved L3 cache, but the latencies 
measured by cyclictest were basically unchanged. The tid-based approach 
actually had a utilization of 0. I'm assuming that's because more L3 was 
available to the default CLOS and the relevant cache lines were never 
evicted from that part of the L3, so the reservation never even came 
into play there.
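
(For reference, resctrl exposes the per-group L3 occupancy in bytes when 
the CPU supports cache monitoring (CMT), which is how such utilization 
numbers can be obtained; group name again a placeholder:)

```
# Per-group L3 occupancy in bytes; mon_L3_01 is cache id 1 (node 1) here
cat /sys/fs/resctrl/cyclictest/mon_data/mon_L3_01/llc_occupancy
```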


> Kind regards,
>
> --
> Ahmed S. Darwish
> Linutronix GmbH

Best regards


Jonathan Schwender

