Re: IRQ latency measurements in hypervisor

From: Stefano Stabellini <sstabellini@kernel.org>
To: Julien Grall <julien@xen.org>
Cc: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>,
	 Stefano Stabellini <stefano.stabellini@xilinx.com>,
	 "xen-devel@lists.xenproject.org"
	<xen-devel@lists.xenproject.org>,
	 Julien Grall <jgrall@amazon.com>,
	Dario Faggioli <dario.faggioli@suse.com>,
	 "Bertrand.Marquis@arm.com" <Bertrand.Marquis@arm.com>,
	 "andrew.cooper3@citrix.com" <andrew.cooper3@citrix.com>
Subject: Re: IRQ latency measurements in hypervisor
Date: Fri, 15 Jan 2021 15:41:58 -0800 (PST)
Message-ID: <alpine.DEB.2.21.2101151459280.31265@sstabellini-ThinkPad-T480s>
In-Reply-To: <187995c9-78f4-0a1c-d912-ca5100d07321@xen.org>

On Fri, 15 Jan 2021, Julien Grall wrote:
> On 15/01/2021 15:45, Volodymyr Babchuk wrote:
> > 
> > Hi Julien,
> > 
> > Julien Grall writes:
> > 
> > > Hi Volodymyr, Stefano,
> > > 
> > > On 14/01/2021 23:33, Stefano Stabellini wrote:
> > > > + Bertrand, Andrew (see comment on alloc_heap_pages())
> > > 
> > > Long running hypercalls are usually considered security issues.
> > > 
> > > In this case, only the control domain can issue a large memory
> > > allocation (2GB at a time). A guest would only be able to allocate 2MB
> > > at a time, so from the numbers below, it would only take 1ms max.
> > > 
> > > So I think we are fine here. Next time you find a large loop, please
> > > provide an explanation of why it is not a security issue (e.g. cannot
> > > be used by guests) or send an email to the Security Team if in doubt.
> > 
> > Sure. In this case I took into account that only the control domain can
> > issue this call, I just didn't state this explicitly. Next time I will
> > do so.
> 
> I am afraid that's not correct. The guest can request to populate a region.
> This is used for instance in the ballooning case.
> 
> The main difference is that a non-privileged guest will not be able to do
> allocations larger than 2MB.
> 
> [...]
> 
> > > > This is very interesting too. Did you get any spikes with the
> > > > period set to 100us? It would be fantastic if there were none.
> > > > 
> > > > > 3. Huge latency spike during domain creation. I conducted some
> > > > >      additional tests, including use of PV drivers, but this didn't
> > > > >      affect the latency in my "real time" domain. But an attempt to
> > > > >      create another domain with a relatively large memory size of
> > > > >      2GB led to a huge spike in latency. Debugging led to this
> > > > >      call path:
> > > > > 
> > > > >      XENMEM_populate_physmap -> populate_physmap() ->
> > > > >      alloc_domheap_pages() -> alloc_heap_pages() -> huge
> > > > >      "for ( i = 0; i < (1 << order); i++ )" loop.
> > > 
> > > There are two for loops in alloc_heap_pages() using this syntax. Which
> > > one are you referring to?
> > 
> > I did some tracing with Lauterbach. It pointed to the first loop and
> > especially to the flush_page_to_ram() call, if I remember correctly.
> 
> Thanks, I am not entirely surprised because we are cleaning and invalidating
> the region line by line and across all the CPUs.
> 
> If we are assuming a 128-byte cache line, we will need to issue 32 cache
> instructions per page. This is going to involve quite a bit of traffic on the
> system.

I think Julien is most likely right. It would be good to verify this
with an experiment. For instance, you could remove the
flush_page_to_ram() call for one test and see if you see any latency
problems.
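
To put rough numbers on this, here is a back-of-the-envelope sketch. It
is not Xen code: the helper names are made up for illustration, and it
assumes 4K pages together with the 128-byte cache line mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only: 4 KiB pages, 128-byte cache lines. */
#define PAGE_SIZE_BYTES   4096u
#define CACHE_LINE_BYTES  128u

/* Number of per-line clean & invalidate operations needed to cover one page. */
static inline uint64_t cache_ops_per_page(uint32_t page_bytes,
                                          uint32_t line_bytes)
{
    assert(line_bytes != 0 && page_bytes % line_bytes == 0);
    return page_bytes / line_bytes;
}

/* Total per-line operations for an allocation of 2^order pages. */
static inline uint64_t cache_ops_for_order(unsigned int order,
                                           uint32_t page_bytes,
                                           uint32_t line_bytes)
{
    assert(order < 32);
    return ((uint64_t)1 << order) * cache_ops_per_page(page_bytes, line_bytes);
}
```

With these assumptions, one page needs 32 operations, and a loop over
262144 pages (order 18) would issue 262144 * 32 = 8388608 of them. How
long that takes in practice is exactly what the experiment should tell us.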

> One possibility would be to defer the cache flush when the domain is created
> and use the hypercall XEN_DOMCTL_cacheflush to issue the flush.
> 
> Note that XEN_DOMCTL_cacheflush would need some modification to be
> preemptible. But at least, it will work on a GFN which is easier to track.

This looks like a solid suggestion. XEN_DOMCTL_cacheflush is already
used by the toolstack in a few places. 
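
Just to illustrate what "preemptible" could look like for a flush over a
GFN range, here is a minimal standalone sketch. Everything in it is
hypothetical (struct flush_req, flush_one_gfn(), the budget parameter);
it does not reflect the actual XEN_DOMCTL_cacheflush interface or any
existing Xen helper. The idea is only that the progress cursor lives in
the request, so a reissued call resumes where the previous one stopped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request: the cursor is part of the request itself. */
struct flush_req {
    uint64_t start_gfn;   /* next GFN to flush; advanced as work completes */
    uint64_t nr_gfns;     /* number of GFNs still to flush */
};

/* Stand-in for the per-page clean & invalidate; here it only counts calls. */
static uint64_t pages_flushed;

static void flush_one_gfn(uint64_t gfn)
{
    (void)gfn;
    pages_flushed++;
}

/*
 * Flush at most 'budget' pages, then return.
 * Returns true once the whole range is done, false if the caller should
 * reissue the call with the same (updated) request.
 */
static bool flush_range_preemptible(struct flush_req *req, uint64_t budget)
{
    while (req->nr_gfns > 0 && budget > 0) {
        flush_one_gfn(req->start_gfn);
        req->start_gfn++;
        req->nr_gfns--;
        budget--;
    }
    return req->nr_gfns == 0;
}
```

A caller would loop until the function returns true, and each return is a
point where pending interrupts can be handled.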

I am also wondering if we can get away with fewer flush_page_to_ram()
calls from alloc_heap_pages() for memory allocations done at boot time
soon after global boot memory scrubbing.

> > > > > I managed to overcome issue #3 by commenting out all calls to
> > > > > populate_one_size() except populate_one_size(PFN_4K_SHIFT) in
> > > > > xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
> > > > > didn't experience such big latency issues. Apparently all other
> > > > > hypercalls which are used during domain creation are either fast or
> > > > > preemptible. No doubt my hack leads to page table inflation and an
> > > > > overall performance drop.
> > > > I think we need to follow this up and fix this. Maybe just by adding
> > > > a hypercall continuation to the loop.
> > > 
> > > When I read "hypercall continuation", I read we will return to the
> > > guest context so it can process interrupts and potentially switch to
> > > another task.
> > > 
> > > This means that the guest could issue a second populate_physmap() from
> > > the vCPU. Therefore any restart information should be part of the
> > > hypercall parameters. So far, I don't see how this would be possible.
> > > 
> > > Even if we overcome that part, this can be easily abused by a guest as
> > > the memory is not yet accounted to the domain. Imagine a guest that
> > > never requests the continuation of the populate_physmap(). So we would
> > > need to block the vCPU until the allocation is finished.
> > 
> > Moreover, most of alloc_heap_pages() sits under a spinlock, so the first
> > step would be to split this function into smaller atomic parts.
> 
> Do you have any suggestion how to split it?
> 
> > 
> > > I think the first step is we need to figure out which part of the
> > > allocation is slow (see my question above). From there, we can figure
> > > out if there is a way to reduce the impact.
> > 
> > I'll do more tracing and will return with more accurate numbers. But as far
> > as I can see, any loop on 262144 pages will take some time...
> 
> It really depends on the content of the loop. On any modern processor, you
> are very likely not going to notice a loop that updates just a flag.
> 
> However, you are likely going to see an impact if your loop is going to
> clean & invalidate the cache for each page.