On Fri, 17 Feb 2017, Dario Faggioli wrote:
> On Thu, 2017-02-09 at 16:54 -0800, Stefano Stabellini wrote:
> > These are the results, in nanosec:
> >
> >                         AVG     MIN     MAX     WARM MAX
> >
> > NODEBUG no WFI          1890    1800    3170    2070
> > NODEBUG WFI             4850    4810    7030    4980
> > NODEBUG no WFI credit2  2217    2090    3420    2650
> > NODEBUG WFI credit2     8080    7890    10320   8300
> >
> > DEBUG no WFI            2252    2080    3320    2650
> > DEBUG WFI               6500    6140    8520    8130
> > DEBUG WFI, credit2      8050    7870    10680   8450
> >
> > As you can see, depending on whether the guest issues a WFI or not
> > while waiting for interrupts, the results change significantly.
> > Interestingly, credit2 does worse than credit1 in this area.
> >
> I did some measuring myself, on x86, with different tools. So,
> cyclictest is basically something very very similar to Stefano's app.
>
> I've run it both within Dom0 and inside a guest. I also ran a Xen
> build (in this case, only inside of the guest).
>
> > We are down to 2000-3000ns. Then, I started investigating the
> > scheduler. I measured how long it takes to run "vcpu_unblock":
> > 1050ns, which is significant. I don't know what is causing the
> > remaining 1000-2000ns, but I bet on another scheduler function. Do
> > you have any suggestions on which one?
> >
> So, vcpu_unblock() calls vcpu_wake(), which then invokes the
> scheduler's wakeup related functions.
>
> If you time vcpu_unblock(), from beginning to end of the function, you
> actually capture quite a few things. E.g., the scheduler lock is taken
> inside vcpu_wake(), so you're basically including time spent waiting
> on the lock in the estimation.
>
> That is probably ok (as in, lock contention definitely is something
> relevant to latency), but it is expected for things to be rather
> different between Credit1 and Credit2.
>
> I've, OTOH, tried to time SCHED_OP(wake) and SCHED_OP(do_schedule),
> and here's the result. Numbers are in cycles (I've used RDTSC) and,
> for making sure to obtain consistent and comparable numbers, I've set
> the frequency scaling governor to performance.
>
> Dom0, [performance]
>                 cyclictest 1us     cyclictest 1ms     cyclictest 100ms
> (cycles)        Credit1  Credit2   Credit1  Credit2   Credit1  Credit2
> wakeup-avg      2429     2035      1980     1633      2535     1979
> wakeup-max      14577    113682    15153    203136    12285    115164

I am not that familiar with the x86 side of things, but the 113682 and
203136 look worrisome, especially considering that credit1 doesn't have
them.

> sched-avg       1716     1860      2527     1651      2286     1670
> sched-max       16059    15000     12297    101760    15831    13122
>
> VM, [performance]
>                 cyclictest 1us     cyclictest 1ms     cyclictest 100ms   make -j xen
> (cycles)        Credit1  Credit2   Credit1  Credit2   Credit1  Credit2   Credit1  Credit2
> wakeup-avg      2213     2128      1944     2342      2374     2213      2429     1618
> wakeup-max      9990     10104     11262    9927      10290    10218     14430    15108
> sched-avg       2437     2472      1620     1594      2498     1759      2449     1809
> sched-max       14100    14634     10071    9984      10878    8748      16476    14220
>

These are the corresponding numbers I have in ns:

                                AVG     MAX     WARM MAX
credit2 sched_op do_schedule    638     2410    2290
credit2 sched_op wake           603     2920    670
credit1 sched_op do_schedule    508     980     980
credit1 sched_op wake           792     2080    930

I would also like to see the nop scheduler as a comparison. It looks
like credit2 has higher max values. I am attaching the raw numbers
because I think they are interesting (also in ns): credit2 has a higher
initial variability.
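
For reference, the RDTSC-based timing pattern being discussed looks
roughly like the self-contained userspace sketch below. This is only an
illustration, not the actual instrumentation in either tree: in the
hypervisor the two timestamp reads would simply bracket SCHED_OP(wake)
and SCHED_OP(do_schedule) (or NOW() would be used to get ns directly),
and tsc_now() / work_under_test() are made-up names here.

    /* Sketch: serialized TSC reads around the call being timed. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>           /* __rdtscp(), _mm_lfence() */

    static uint64_t tsc_now(void)
    {
        unsigned int aux;
        _mm_lfence();                 /* keep earlier work out of the sample */
        return __rdtscp(&aux);        /* TSC read that waits for prior insns */
    }

    static void work_under_test(void)
    {
        /* stand-in for the wakeup/schedule hook being measured */
        asm volatile("" ::: "memory");
    }

    int main(void)
    {
        uint64_t min = UINT64_MAX, max = 0, sum = 0;
        const int iters = 100000;

        for (int i = 0; i < iters; i++) {
            uint64_t t0 = tsc_now();
            work_under_test();
            uint64_t t1 = tsc_now();
            uint64_t d = t1 - t0;

            sum += d;
            if (d < min) min = d;
            if (d > max) max = d;
        }

        printf("cycles: avg %lu min %lu max %lu\n",
               (unsigned long)(sum / iters),
               (unsigned long)min, (unsigned long)max);
        return 0;
    }

The serialized read matters: without it, out-of-order execution can
shift work into or out of the measured window. Keeping the frequency
scaling governor on performance, as done above, is what makes the raw
cycle counts comparable across runs.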
FYI the scenario is still the same: domU vcpu pinned to a pcpu, dom0 running elsewhere.
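
For completeness, the pinning referred to above is the usual xl kind,
i.e. something along the lines of the following (the domain name "domU"
and the CPU numbers are just placeholders for this example):

    xl vcpu-pin domU 0 3            # pin the guest's vCPU0 to pCPU3
    xl vcpu-pin Domain-0 all 0-1    # keep dom0's vCPUs on other pCPUs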