From: Christoffer Dall
Subject: Re: issues with emulated PCI MMIO backed by host memory under KVM
Date: Tue, 28 Jun 2016 14:20:43 +0200
Message-ID: <20160628122043.GO26498@cbox>
In-Reply-To: <9fbfb578-2235-2f2a-4502-a285e9ba22e6@redhat.com>
References: <20160627091619.GB26498@cbox> <20160627103421.GC26498@cbox> <20160627133508.GI26498@cbox> <20160628100405.GK26498@cbox> <9fbfb578-2235-2f2a-4502-a285e9ba22e6@redhat.com>
To: Laszlo Ersek
Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, "kvmarm@lists.cs.columbia.edu"
List-Id: kvmarm@lists.cs.columbia.edu

On Tue, Jun 28, 2016 at 01:06:36PM +0200, Laszlo Ersek wrote:
> On 06/28/16 12:04, Christoffer Dall wrote:
> > On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
> >> On 27 June 2016 at 15:35, Christoffer Dall wrote:
> >>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
> >>>> On 27 June 2016 at 12:34, Christoffer Dall wrote:
> >>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
> >>>>>> On 27 June 2016 at 11:16, Christoffer Dall wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm going to ask some stupid questions here...
> >>>>>>>
> >>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> This old subject came up again in a discussion related to PCIe support for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO regions as cacheable is preventing us from reusing a significant slice of the PCIe support infrastructure, and so I'd like to bring this up again, perhaps just to reiterate why we're simply out of luck.
> >>>>>>>>
> >>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions for emulated devices may be backed by memory that is mapped cacheable by the host. Note that this has nothing to do with the device being DMA coherent or not: in this case, we are dealing with regions that are not memory from the POV of the guest, and it is reasonable for the guest to assume that accesses to such a region are not visible to the device before they hit the actual PCI MMIO window and are translated into cycles on the PCI bus.
> >>>>>>>
> >>>>>>> For the sake of completeness, why is this reasonable?
> >>>>>>>
> >>>>>>
> >>>>>> Because the whole point of accessing these regions is to communicate with the device. It is common to use write combining mappings for things like framebuffers to group writes before they hit the PCI bus, but any caching just makes it more difficult for the driver state and device state to remain synchronized.
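
[For illustration only: a minimal sketch of the write-combining convention referred to above, i.e. how a native framebuffer driver would typically map its VRAM BAR on bare metal. The driver name and BAR number are made up; ioremap_wc() and the pci_resource_*() helpers are the standard Linux APIs.]

    /* Illustrative sketch: map a prefetchable framebuffer BAR
     * write-combining rather than cacheable. "demofb" and BAR 0 are
     * hypothetical; the helpers are real kernel interfaces.
     */
    #include <linux/pci.h>
    #include <linux/io.h>

    static int demofb_map_vram(struct pci_dev *pdev, void __iomem **vram)
    {
    	resource_size_t start = pci_resource_start(pdev, 0); /* BAR 0: VRAM */
    	resource_size_t len   = pci_resource_len(pdev, 0);

    	/* Write-combining: stores may be merged before they reach the
    	 * PCI bus, but they are never cached, so the device's view and
    	 * the driver's view cannot silently diverge the way they can
    	 * with a cacheable mapping. */
    	*vram = ioremap_wc(start, len);
    	return *vram ? 0 : -ENOMEM;
    }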
> >>>>>>
> >>>>>>> Is this how any real ARM system implementing PCI would actually work?
> >>>>>>>
> >>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>>> That means that mapping such a region cacheable is a strange thing to do, in fact, and it is unlikely that patches implementing this against the generic PCI stack in Tianocore will be accepted by the maintainers.
> >>>>>>>>
> >>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it also affects emulated USB host controllers (perhaps Alex can remind us which one exactly?) and likely other emulated generic PCI devices as well.
> >>>>>>>>
> >>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO regions are backed by host memory, is there any way we can already distinguish such memslots from ordinary ones? If we can, is there anything we could do to treat these specially? Perhaps something like using read-only memslots so we can at least trap guest writes instead of having main memory going out of sync with the caches unnoticed? I am just brainstorming here ...
> >>>>>>>
> >>>>>>> I think the only sensible solution is to make sure that the guest and emulation mappings use the same memory type, either cached or non-cached, and we 'simply' have to find the best way to implement this.
> >>>>>>>
> >>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is one way.
> >>>>>>>
> >>>>>>> The other way is to use something like what you once wrote that rewrites stage-1 mappings to be cacheable; does that apply here?
> >>>>>>>
> >>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
> >>>>>>>
> >>>>>>
> >>>>>> So first of all, let me reiterate that I could only find a single instance in QEMU where a PCI MMIO region is backed by host memory, which is vga-pci.c. I wonder if there are any other occurrences, but if there aren't any, it makes much more sense to prohibit PCI BARs backed by host memory rather than spend a lot of effort working around it.
> >>>>>
> >>>>> Right, ok. So Marc's point during his KVM Forum talk was basically: don't use the legacy VGA adapter on ARM, use virtio graphics instead, right?
> >>>>>
> >>>>
> >>>> Yes. But nothing is currently preventing you from using that, and I think we should prefer crappy performance but correct operation over the current situation. So in general, we should either disallow PCI BARs backed by host memory, or emulate them, but never back them by a RAM memslot when running under ARM/KVM.
> >>>
> >>> Agreed, I just think that emulating accesses by trapping them is not just slow, it's not really possible in practice, and even if it is, it's probably *unusably* slow.
> >>>
> >>
> >> Well, it would probably involve a lot of effort to implement emulation of instructions with multiple output registers, such as ldp/stp and register writeback. And indeed, trapping on each store instruction to the framebuffer is going to be sloooooowwwww.
> >>
> >> So let's disregard that option for now ...
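
[For reference: a minimal userspace sketch of the read-only memslot idea brainstormed above, assuming the region is registered through the standard KVM API with KVM_MEM_READONLY so that guest writes exit to the emulator as MMIO accesses instead of landing in host memory behind the caches. The vm_fd, slot number, guest address and backing pointer are placeholders.]

    /* Sketch only: register a host-memory-backed region as a read-only
     * memslot. Guest reads are served from 'backing'; every guest write
     * traps to userspace as an MMIO exit, which is correct but slow.
     */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    static int map_bar_readonly(int vm_fd, void *backing, __u64 guest_addr,
                                __u64 size)
    {
    	struct kvm_userspace_memory_region region = {
    		.slot            = 1,                 /* any free slot id */
    		.flags           = KVM_MEM_READONLY,  /* writes -> KVM_EXIT_MMIO */
    		.guest_phys_addr = guest_addr,        /* where the BAR is mapped */
    		.memory_size     = size,
    		.userspace_addr  = (__u64)(unsigned long)backing,
    	};

    	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }

As the thread notes, taking an exit on every store makes this approach correct but far too slow for something like a framebuffer.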
> >>
> >>>>
> >>>>> What is the proposed solution for someone shipping an ARM server and wishing to provide a graphical output for that server?
> >>>>>
> >>>>
> >>>> The problem does not exist on bare metal. It is an implementation detail of KVM on ARM that guest PCI BAR mappings are incoherent with the view of the emulator in QEMU.
> >>>>
> >>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs if that's not a supported real hardware case. However, I don't see what would prevent someone from plugging a VGA adapter into the PCI slot on an ARM server, and people selling ARM servers probably want this to happen, I'm guessing.
> >>>>>
> >>>>
> >>>> As I said, the problem does not exist on bare metal.
> >>>>
> >>>>>>
> >>>>>> If we do decide to fix this, the best way would be to use uncached attributes for the QEMU userland mapping, and force it uncached in the guest via a stage 2 override (as Drew suggests). The only problem I see here is that the host's kernel direct mapping has a cached alias that we need to get rid of.
> >>>>>
> >>>>> Do we have a way to accomplish that?
> >>>>>
> >>>>> Will we run into a bunch of other problems if we begin punching holes in the direct mapping for regular RAM?
> >>>>>
> >>>>
> >>>> I think the policy up until now has been not to remap regions in the kernel direct mapping for the purposes of DMA, and I think by the same reasoning, it is not preferable for KVM either.
> >>>
> >>> I guess the difference is that from the (host) kernel's point of view this is not DMA memory, but just regular RAM. I just don't know enough about the kernel's VM mappings to know what's involved here, but we should find out somehow...
> >>>
> >>
> >> Whether it is DMA memory or not does not make a difference. The point is simply that arm64 maps all RAM owned by the kernel as cacheable, and remapping arbitrary ranges with different attributes is problematic, since it is also likely to involve splitting of regions, which is cumbersome with a mapping that is always live.
> >>
> >> So instead, we'd have to reserve some system memory early on and remove it from the linear mapping, the complexity of which is more than we are probably prepared to put up with.
> >
> > Don't we have any existing frameworks for such things, like ion or other things like that? Not sure if these systems export anything to userspace or even serve the purpose we want, but I thought I'd throw it out there.
> >
> >>
> >> So if vga-pci.c is the only problematic device, for which a reasonable alternative exists (virtio-gpu), I think the only feasible solution is to educate QEMU not to allow RAM memslots to be exposed via PCI BARs when running under KVM/ARM.
> >
> > It would be good if we could support vga-pci under KVM/ARM, but if there's no other way than rewriting the arm64 kernel's memory mappings completely, then we're probably stuck there, unfortunately.
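
[To make concrete what "RAM memslots being exposed via PCI BARs" means in QEMU terms, a simplified sketch of the pattern follows; it is not the actual vga-pci.c code. The device name is invented, while memory_region_init_ram() and pci_register_bar() are the real QEMU interfaces.]

    /* Simplified sketch of the problematic pattern: a PCI BAR backed
     * directly by host RAM. KVM installs this MemoryRegion as an ordinary
     * memslot, so the guest's (device-type) mapping and QEMU's (cacheable)
     * mapping of the same pages can become incoherent on ARM.
     */
    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"
    #include "qapi/error.h"

    typedef struct DemoFbState {
        PCIDevice parent_obj;
        MemoryRegion vram;          /* plain host RAM backing the BAR */
    } DemoFbState;

    static void demofb_realize(PCIDevice *dev, Error **errp)
    {
        DemoFbState *s = (DemoFbState *)dev;

        /* Host RAM, mapped cacheable in QEMU's own address space ... */
        memory_region_init_ram(&s->vram, OBJECT(dev), "demofb.vram",
                               16 * 1024 * 1024, &error_fatal);

        /* ... yet exposed to the guest as a prefetchable MMIO BAR. */
        pci_register_bar(dev, 0,
                         PCI_BASE_ADDRESS_SPACE_MEMORY |
                         PCI_BASE_ADDRESS_MEM_PREFETCH,
                         &s->vram);
    }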
>
> It's been mentioned earlier that the specific combination of S1 and S2 mappings on aarch64 is actually an *architecture bug*. If we accept that qualification, then we should realize that our efforts here target finding a *workaround*.
>
> In your blog post, you mention VHE ("Virtualization Host Extensions"). That's clearly a sign of the architecture adapting to virt software needs.
>
> Do you see any chance that the S1-S2 combinations too can be fixed in a new revision of the architecture?
>

I really can't speculate about this. I assume there are reasons why the architecture is defined in this particular way, but I haven't investigated this aspect in any depth.

-Christoffer