Re: Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling

From: "Wu, Feng" <feng.wu@intel.com>
To: George Dunlap <george.dunlap@citrix.com>,
	Jan Beulich <JBeulich@suse.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>
Cc: "Tian, Kevin" <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Dario Faggioli <dario.faggioli@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	"Wu, Feng" <feng.wu@intel.com>
Subject: Re: Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling
Date: Wed, 9 Mar 2016 05:22:24 +0000
Message-ID: <E959C4978C3B6342920538CF579893F00C369BFF@SHSMSX104.ccr.corp.intel.com>
In-Reply-To: <56DF066B.3090106@citrix.com>

> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Wednesday, March 9, 2016 1:06 AM
> To: Jan Beulich <JBeulich@suse.com>; George Dunlap
> <George.Dunlap@eu.citrix.com>; Wu, Feng <feng.wu@intel.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Dario Faggioli
> <dario.faggioli@citrix.com>; Tian, Kevin <kevin.tian@intel.com>; xen-
> devel@lists.xen.org; Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>; Keir
> Fraser <keir@xen.org>
> Subject: Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt
> core logic handling
> 
> On 08/03/16 15:42, Jan Beulich wrote:
> >>>> On 08.03.16 at 15:42, <George.Dunlap@eu.citrix.com> wrote:
> >> On Tue, Mar 8, 2016 at 1:10 PM, Wu, Feng <feng.wu@intel.com> wrote:
> >>>> -----Original Message-----
> >>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
> >> [snip]
> >>>> It seems like there are a couple of ways we could approach this:
> >>>>
> >>>> 1. Try to optimize the reverse look-up code so that it's not a linear
> >>>> linked list (getting rid of the theoretical fear)
> >>>
> >>> Good point.
> >>>
> >>>>
> >>>> 2. Try to test engineered situations where we expect this to be a
> >>>> problem, to see how big of a problem it is (proving the theory to be
> >>>> accurate or inaccurate in this case)
> >>>
> >>> Maybe we can run a SMP guest with all the vcpus pinned to a dedicated
> >>> pCPU, we can run some benchmark in the guest with VT-d PI and without
> >>> VT-d PI, then see the performance difference between these two scenarios.
> >>
> >> This would give us an idea what the worst-case scenario would be.
> >
> > How would a single VM ever give us an idea about the worst
> > case? Something getting close to worst case is a ton of single
> > vCPU guests all temporarily pinned to one and the same pCPU
> > (could be multi-vCPU ones, but the more vCPU-s the more
> > artificial this pinning would become) right before they go into
> > blocked state (i.e. through one of the two callers of
> > arch_vcpu_block()), the pinning removed while blocked, and
> > then all getting woken at once.
> 
> Why would removing the pinning be important?
> 
> And I guess it's actually the case that it doesn't need all VMs to
> actually be *receiving* interrupts; it just requires them to be
> *capable* of receiving interrupts, for there to be a long chain all
> blocked on the same physical cpu.
> 
> >
> >>  But
> >> pinning all vcpus to a single pcpu isn't really a sensible use case we
> >> want to support -- if you have to do something stupid to get a
> >> performance regression, then as far as I'm concerned it's not a
> >> problem.
> >>
> >> Or to put it a different way: If we pin 10 vcpus to a single pcpu and
> >> then pound them all with posted interrupts, and there is *no*
> >> significant performance regression, then that will conclusively prove
> >> that the theoretical performance regression is of no concern, and we
> >> can enable PI by default.
> >
> > The point isn't the pinning. The point is what pCPU they're on when
> > going to sleep. And that could involve quite a few more than just
> > 10 vCPU-s, provided they all sleep long enough.
> >
> > And the "theoretical performance regression is of no concern" is
> > also not a proper way of looking at it, I would say: Even if such
> > a situation would happen extremely rarely, if it can happen at all,
> > it would still be a security issue.
> 
> What I'm trying to get at is -- exactly what situation?  What actually
> constitutes a problematic interrupt latency / interrupt processing
> workload, how many vcpus must be sleeping on the same pcpu to actually
> risk triggering that latency / workload, and how feasible is it that
> such a situation would arise in a reasonable scenario?
> 
> If 200us is too long, and it only takes 3 sleeping vcpus to get there,
> then yes, there is a genuine problem we need to try to address before we
> turn it on by default.  If we say that up to 500us is tolerable, and it
> takes 100 sleeping vcpus to reach that latency, then this is something I
> don't really think we need to worry about.
> 
> "I think something bad may happen" is really difficult to work with.
> "I want to make sure that even a high number of blocked cpus won't cause
> the interrupt latency to exceed 500us; and I want it to be basically
> impossible for the interrupt latency to exceed 5ms under any
> circumstances" is a concrete target someone can either demonstrate that
> they meet, or aim for when trying to improve the situation.
> 
> Feng: It should be pretty easy for you to:

George, thanks a lot for pointing out a possible way to move forward.

> * Implement a modified version of Xen where
>  - *All* vcpus get put on the waitqueue

So this means all the vcpus are blocked, and hence waiting on the
blocking list, right?

>  - Measure how long it took to run the loop in pi_wakeup_interrupt
> * Have one VM receiving posted interrupts on a regular basis.
> * Slowly increase the number of vcpus blocked on a single cpu (e.g., by
> creating more guests), stopping when you either reach 500us or 500
> vcpus. :-)

This may depend on the environment. I was using a 10G NIC for the
test; if we increase the number of guests, I will need more NICs to assign
to them. I will see if I can get them.

Thanks,
Feng

> 
> To report the measurements, you could either create a Xen trace record
> and use xentrace_format or xenalyze to plot the results; or you could
> create some software performance counters for different "buckets" --
> less than 100us, 100-200us, 200-300us, 300-400us, 400-500us, and more
> than 500us.
> 
> Or you could printk the min / average / max every 5000 interrupts or so. :-)
> 
> To test, it seems like using a network benchmark with short packet
> lengths should be able to trigger large numbers of interrupts; and it
> also can let you know if / when there's a performance impact of adding
> more vcpus.
> 
> Or alternately, you could try to come up with a quicker reverse-lookup
> algorithm. :-)
> 
>  -George
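
A minimal sketch of the bucket counters and the min / average / max
reporting suggested above, written as plain user-space C so it can be
tried outside the hypervisor. All pi_lat_* names are hypothetical and are
not existing Xen interfaces. In Xen the duration would presumably be taken
by reading NOW() before and after the loop in pi_wakeup_interrupt, and
printf would become printk; both of those are assumptions about how the
measurement would be wired in, not part of the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names throughout; none of this is an existing Xen API. */
#define PI_NR_BUCKETS   6     /* <100us, 100-200, 200-300, 300-400, 400-500, >=500us */
#define PI_BUCKET_US    100   /* width of each bucket in microseconds */
#define PI_REPORT_EVERY 5000  /* print a summary every this many samples */

struct pi_lat_stats {
    uint64_t bucket[PI_NR_BUCKETS];
    uint64_t count;
    uint64_t sum_us;
    uint64_t min_us;
    uint64_t max_us;
};

static void pi_lat_init(struct pi_lat_stats *s)
{
    assert(s != NULL);
    /* Zero everything, but start min at the largest value so any sample lowers it. */
    *s = (struct pi_lat_stats){ .min_us = UINT64_MAX };
}

/* Map a duration to its bucket; everything at or above 500us shares the last one. */
static unsigned int pi_lat_bucket(uint64_t us)
{
    uint64_t idx = us / PI_BUCKET_US;

    return idx < PI_NR_BUCKETS - 1 ? (unsigned int)idx : PI_NR_BUCKETS - 1;
}

/*
 * Record one measured loop duration. Statistics are cumulative (they are
 * not reset after a report). Returns 1 if a summary line was printed.
 */
static int pi_lat_record(struct pi_lat_stats *s, uint64_t us)
{
    assert(s != NULL);

    s->bucket[pi_lat_bucket(us)]++;
    s->count++;
    s->sum_us += us;
    if (us < s->min_us)
        s->min_us = us;
    if (us > s->max_us)
        s->max_us = us;

    if (s->count % PI_REPORT_EVERY)
        return 0;

    printf("pi wakeup loop: n=%" PRIu64 " min=%" PRIu64 "us avg=%" PRIu64
           "us max=%" PRIu64 "us\n",
           s->count, s->min_us, s->sum_us / s->count, s->max_us);
    return 1;
}
```

With something like this, each run with a larger number of blocked vcpus
would call pi_lat_record() once per interrupt and the bucket counts could
be compared across runs. A real version would most likely keep one
pi_lat_stats per pCPU, so that the counters need no extra locking beyond
what the blocking list already uses.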
