From: Dario Faggioli <dario.faggioli@citrix.com>
To: Stefano Stabellini <sstabellini@kernel.org>, xen-devel@lists.xen.org
Cc: george.dunlap@eu.citrix.com, edgar.iglesias@xilinx.com,
	julien.grall@arm.com
Subject: Re: Xen on ARM IRQ latency and scheduler overhead
Date: Fri, 17 Feb 2017 19:40:45 +0100
Message-ID: <1487356845.6732.100.camel@citrix.com>
In-Reply-To: <alpine.DEB.2.10.1702091603240.20549@sstabellini-ThinkPad-X260>



On Thu, 2017-02-09 at 16:54 -0800, Stefano Stabellini wrote:
> These are the results, in nanosec:
> 
>                         AVG     MIN     MAX     WARM MAX
> 
> NODEBUG no WFI          1890    1800    3170    2070
> NODEBUG WFI             4850    4810    7030    4980
> NODEBUG no WFI credit2  2217    2090    3420    2650
> NODEBUG WFI credit2     8080    7890    10320   8300
> 
> DEBUG no WFI            2252    2080    3320    2650
> DEBUG WFI               6500    6140    8520    8130
> DEBUG WFI, credit2      8050    7870    10680   8450
> 
> As you can see, depending on whether the guest issues a WFI or not
> while
> waiting for interrupts, the results change significantly.
> Interestingly,
> credit2 does worse than credit1 in this area.
> 
I did some measuring myself, on x86, with different tools. I used
cyclictest, which is basically something very similar to Stefano's
app.

I've run it both within Dom0 and inside a guest. I also ran a Xen
build (in this case, only inside the guest).
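
Just for clarity, the core of what cyclictest (and, I believe,
Stefano's app) measures is something like the sketch below. This is a
minimal illustration I'm putting together here, not the actual
cyclictest code: sleep until an absolute deadline with
clock_nanosleep(), then check how late the wakeup actually was.

/* Minimal cyclictest-style latency loop (illustration only, NOT the
 * actual cyclictest code). Build with: gcc -O2 lat.c -o lat */
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

int main(void)
{
    const long long interval_ns = 1000000; /* 1ms, as in "cyclictest 1ms" */
    struct timespec next, now;
    long long lat, sum = 0, max = 0;
    int i, loops = 10000;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (i = 0; i < loops; i++) {
        /* advance the absolute deadline by one interval */
        next.tv_nsec += interval_ns;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        /* wakeup latency = actual wakeup time - programmed deadline */
        lat = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
            + (now.tv_nsec - next.tv_nsec);
        sum += lat;
        if (lat > max)
            max = lat;
    }
    printf("avg %lld ns, max %lld ns\n", sum / loops, max);
    return 0;
}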

> We are down to 2000-3000ns. Then, I started investigating the
> scheduler.
> I measured how long it takes to run "vcpu_unblock": 1050ns, which is
> significant. I don't know what is causing the remaining 1000-2000ns,
> but
> I bet on another scheduler function. Do you have any suggestions on
> which one?
> 
So, vcpu_unblock() calls vcpu_wake(), which then invokes the
scheduler's wakeup-related functions.

If you time vcpu_unblock(), from beginning to end of the function, you
actually capture quite a few things. E.g., the scheduler lock is taken
inside vcpu_wake(), so you're basically including the time spent
waiting on the lock in the estimate.

That is probably ok (as in, lock contention definitely is something
relevant to latency), but then it is to be expected that things look
rather different between Credit1 and Credit2.
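
To see that effect in isolation, here's a toy userspace demo (all the
names are hypothetical, nothing Xen-specific) showing how end-to-end
timing of a function silently picks up the time spent waiting on a
lock taken inside it:

/* Toy demo: timing timed_op() end-to-end includes the wait on the
 * lock it takes internally, analogous to timing vcpu_unblock() while
 * vcpu_wake() contends on the scheduler lock.
 * Build with: gcc -O2 -pthread lockdemo.c -o lockdemo */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Stand-in for vcpu_unblock(): all it does is take/release the lock. */
static void timed_op(void)
{
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
}

static void *holder(void *arg)
{
    pthread_mutex_lock(&lock);
    usleep(2000);               /* hold the lock for ~2ms */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    uint64_t t0, t1;

    /* Uncontended: a handful of nanoseconds. */
    t0 = now_ns();
    timed_op();
    t1 = now_ns();
    printf("uncontended: %llu ns\n", (unsigned long long)(t1 - t0));

    /* Contended: the measurement now includes the lock wait
     * (assuming the holder managed to grab the lock first). */
    pthread_create(&t, NULL, holder, NULL);
    usleep(500);
    t0 = now_ns();
    timed_op();
    t1 = now_ns();
    printf("contended:   %llu ns\n", (unsigned long long)(t1 - t0));

    pthread_join(&t, NULL);
    return 0;
}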

I've, OTOH, tried to time SCHED_OP(wake) and SCHED_OP(do_schedule),
and here are the results. Numbers are in cycles (I've used RDTSC) and,
to make sure I obtain consistent and comparable numbers, I've set the
frequency scaling governor to performance.

Dom0, [performance]

              cyclictest 1us    cyclictest 1ms    cyclictest 100ms
(cycles)      Credit1  Credit2  Credit1  Credit2  Credit1  Credit2
wakeup-avg       2429     2035     1980     1633     2535     1979
wakeup-max      14577   113682    15153   203136    12285   115164
sched-avg        1716     1860     2527     1651     2286     1670
sched-max       16059    15000    12297   101760    15831    13122

VM, [performance]

              cyclictest 1us    cyclictest 1ms    cyclictest 100ms  make -j xen
(cycles)      Credit1  Credit2  Credit1  Credit2  Credit1  Credit2  Credit1  Credit2
wakeup-avg       2213     2128     1944     2342     2374     2213     2429     1618
wakeup-max       9990    10104    11262     9927    10290    10218    14430    15108
sched-avg        2437     2472     1620     1594     2498     1759     2449     1809
sched-max       14100    14634    10071     9984    10878     8748    16476    14220
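
For reference, the probes look more or less like this. This is a
userspace illustration of the RDTSC technique only, not the actual
instrumentation I have in Xen (there, what gets wrapped is
SCHED_OP(wake) and SCHED_OP(do_schedule); op_under_test() here is
just a hypothetical stand-in):

/* Userspace sketch of RDTSC-based timing (illustration only).
 * Build with: gcc -O2 tsc.c -o tsc (x86, GCC) */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc(), __rdtscp() */

static uint64_t t_sum, t_max;
static unsigned long samples;

static void record(uint64_t cycles)
{
    t_sum += cycles;
    if (cycles > t_max)
        t_max = cycles;
    samples++;
}

/* stand-in for the code under test (e.g. the scheduler's wake hook) */
static void op_under_test(void)
{
    asm volatile("" ::: "memory");
}

int main(void)
{
    unsigned int aux;

    for (int i = 0; i < 100000; i++) {
        uint64_t t0 = __rdtsc();
        op_under_test();
        uint64_t t1 = __rdtscp(&aux); /* rdtscp waits for prior insns */
        record(t1 - t0);
    }
    printf("avg %llu cycles, max %llu cycles\n",
           (unsigned long long)(t_sum / samples),
           (unsigned long long)t_max);
    return 0;
}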

Actually, the TSC on this box should be stable and invariant, so I
guess I can try with the default governor. Will do that on Monday.
Does ARM have frequency scaling? (I do remember something on
xen-devel, but I am not sure whether it landed upstream.)
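
(FWIW, the invariant TSC bit can be double-checked from userspace:
it's CPUID.80000007H:EDX[8]. Something like this quick, GCC-specific
sketch:)

/* Check the invariant-TSC flag (CPUID leaf 0x80000007, EDX bit 8);
 * if set, the TSC ticks at a constant rate across P-/C-state changes,
 * so the governor shouldn't affect cycle measurements.
 * Build with: gcc -O2 invtsc.c -o invtsc */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    else
        printf("CPUID leaf 0x80000007 not available\n");
    return 0;
}

(On Linux, the constant_tsc / nonstop_tsc flags in /proc/cpuinfo tell
the same story.)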

But anyway. You're seeing big differences between Credit1 and Credit2,
while I, at least as far as the actual schedulers' code is concerned,
don't.

Credit2 shows higher wakeup-max values, but only in the cases where
the workload runs in dom0. It also shows better (lower) averages, for
both kinds of workload considered, in both the dom0 and VM cases.

I therefore wonder what is actually responsible for the huge
differences between the two schedulers that you are seeing... it
could be lock contention, but with only 4 pCPUs and 2 active vCPUs, I
honestly doubt it...

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
