On Fri, 17 Feb 2017, Dario Faggioli wrote:
> On Thu, 2017-02-09 at 16:54 -0800, Stefano Stabellini wrote:
> > These are the results, in nanosec:
> >
> >                         AVG     MIN     MAX     WARM MAX
> >
> > NODEBUG no WFI          1890    1800    3170    2070
> > NODEBUG WFI             4850    4810    7030    4980
> > NODEBUG no WFI credit2  2217    2090    3420    2650
> > NODEBUG WFI credit2     8080    7890    10320   8300
> >
> > DEBUG no WFI            2252    2080    3320    2650
> > DEBUG WFI               6500    6140    8520    8130
> > DEBUG WFI, credit2      8050    7870    10680   8450
> >
> > As you can see, depending on whether the guest issues a WFI or not
> > while waiting for interrupts, the results change significantly.
> > Interestingly, credit2 does worse than credit1 in this area.
> >
> I did some measuring myself, on x86, with different tools. So,
> cyclictest is basically something very very similar to Stefano's app.
>
> I've run it both within Dom0 and inside a guest. I also ran a Xen
> build (in this case, only inside of the guest).
>
> > We are down to 2000-3000ns. Then, I started investigating the
> > scheduler. I measured how long it takes to run "vcpu_unblock":
> > 1050ns, which is significant. I don't know what is causing the
> > remaining 1000-2000ns, but I bet on another scheduler function. Do
> > you have any suggestions on which one?
> >
> So, vcpu_unblock() calls vcpu_wake(), which then invokes the
> scheduler's wakeup related functions.
>
> If you time vcpu_unblock(), from beginning to end of the function, you
> actually capture quite a few things. E.g., the scheduler lock is taken
> inside vcpu_wake(), so you're basically including time spent waiting
> on the lock in the estimation.
>
> That is probably ok (as in, lock contention definitely is something
> relevant to latency), but it is expected for things to be rather
> different between Credit1 and Credit2.
>
> I've, OTOH, tried to time SCHED_OP(wake) and SCHED_OP(do_schedule),
> and here's the result. Numbers are in cycles (I've used RDTSC) and,
> for making sure to obtain consistent and comparable numbers, I've set
> the frequency scaling governor to performance.
>
> Dom0, [performance]
>                 cyclictest 1us     cyclictest 1ms     cyclictest 100ms
> (cycles)        Credit1  Credit2   Credit1  Credit2   Credit1  Credit2
> wakeup-avg      2429     2035      1980     1633      2535     1979
> wakeup-max      14577    113682    15153    203136    12285    115164

I am not that familiar with the x86 side of things, but the 113682 and
203136 look worrisome, especially considering that credit1 doesn't have
them.

> sched-avg       1716     1860      2527     1651      2286     1670
> sched-max       16059    15000     12297    101760    15831    13122
>
> VM, [performance]
>                 cyclictest 1us     cyclictest 1ms     cyclictest 100ms   make -j xen
> (cycles)        Credit1  Credit2   Credit1  Credit2   Credit1  Credit2   Credit1  Credit2
> wakeup-avg      2213     2128      1944     2342      2374     2213      2429     1618
> wakeup-max      9990     10104     11262    9927      10290    10218     14430    15108
> sched-avg       2437     2472      1620     1594      2498     1759      2449     1809
> sched-max       14100    14634     10071    9984      10878    8748      16476    14220
>

These are the corresponding numbers I have in ns:

                                AVG     MAX     WARM MAX
credit2 sched_op do_schedule    638     2410    2290
credit2 sched_op wake           603     2920    670
credit1 sched_op do_schedule    508     980     980
credit1 sched_op wake           792     2080    930

I would also like to see the nop scheduler as a comparison. It looks
like credit2 has higher max values. I am attaching the raw numbers
because I think they are interesting (also in ns): credit2 has a higher
initial variability.
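
For reference, the RDTSC-based timing pattern being discussed looks
roughly like the self-contained userspace sketch below. This is only an
illustration, not the actual instrumentation in either tree: in the
hypervisor the two timestamp reads would simply bracket SCHED_OP(wake)
and SCHED_OP(do_schedule) (or NOW() would be used to get ns directly),
and tsc_now() / work_under_test() are made-up names here.

    /* Sketch: serialized TSC reads around the call being timed. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>           /* __rdtscp(), _mm_lfence() */

    static uint64_t tsc_now(void)
    {
        unsigned int aux;
        _mm_lfence();                 /* keep earlier work out of the sample */
        return __rdtscp(&aux);        /* TSC read that waits for prior insns */
    }

    static void work_under_test(void)
    {
        /* stand-in for the wakeup/schedule hook being measured */
        asm volatile("" ::: "memory");
    }

    int main(void)
    {
        uint64_t min = UINT64_MAX, max = 0, sum = 0;
        const int iters = 100000;

        for (int i = 0; i < iters; i++) {
            uint64_t t0 = tsc_now();
            work_under_test();
            uint64_t t1 = tsc_now();
            uint64_t d = t1 - t0;

            sum += d;
            if (d < min) min = d;
            if (d > max) max = d;
        }

        printf("cycles: avg %lu min %lu max %lu\n",
               (unsigned long)(sum / iters),
               (unsigned long)min, (unsigned long)max);
        return 0;
    }

The serialized read matters: without it, out-of-order execution can
shift work into or out of the measured window. Keeping the frequency
scaling governor on performance, as done above, is what makes the raw
cycle counts comparable across runs.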
FYI the scenario is still the same: domU vcpu pinned to a pcpu, dom0 running elsewhere.
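
For completeness, the pinning referred to above is the usual xl kind,
i.e. something along the lines of the following (the domain name "domU"
and the CPU numbers are just placeholders for this example):

    xl vcpu-pin domU 0 3            # pin the guest's vCPU0 to pCPU3
    xl vcpu-pin Domain-0 all 0-1    # keep dom0's vCPUs on other pCPUs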