* issues with emulated PCI MMIO backed by host memory under KVM
@ 2016-06-24 14:04 Ard Biesheuvel
  2016-06-24 14:57 ` Andrew Jones
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-24 14:04 UTC (permalink / raw)
  To: Christoffer Dall, Peter Maydell, Marc Zyngier, Andrew Jones,
	Laszlo Ersek, kvmarm, Alexander Graf
  Cc: Catalin Marinas

Hi all,

This old subject came up again in a discussion related to PCIe support
for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
regions as cacheable is preventing us from reusing a significant slice
of the PCIe support infrastructure, and so I'd like to bring this up
again, perhaps just to reiterate why we're simply out of luck.

To refresh your memories, the issue is that on ARM, PCI MMIO regions
for emulated devices may be backed by memory that is mapped cacheable
by the host. Note that this has nothing to do with the device being
DMA coherent or not: in this case, we are dealing with regions that
are not memory from the POV of the guest, and it is reasonable for the
guest to assume that accesses to such a region are not visible to the
device before they hit the actual PCI MMIO window and are translated
into cycles on the PCI bus. That means that mapping such a region
cacheable is a strange thing to do, in fact, and it is unlikely that
patches implementing this against the generic PCI stack in Tianocore
will be accepted by the maintainers.

Note that this issue not only affects framebuffers on PCI cards, it
also affects emulated USB host controllers (perhaps Alex can remind us
which one exactly?) and likely other emulated generic PCI devices as
well.

Since the issue exists only for emulated PCI devices whose MMIO
regions are backed by host memory, is there any way we can already
distinguish such memslots from ordinary ones? If we can, is there
anything we could do to treat these specially? Perhaps something like
using read-only memslots so we can at least trap guest writes instead
of having main memory going out of sync with the caches unnoticed? I
am just brainstorming here ...
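
(For reference, and purely as a sketch: a read-only memslot is an ordinary
memslot registered with the KVM_MEM_READONLY flag, so guest reads are served
from the backing memory while guest writes exit to userspace as MMIO. This
assumes the host advertises KVM_CAP_READONLY_MEM.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Back 'gpa' with 'hva', read-only: reads are handled in place, writes
 * fault and are delivered to userspace as MMIO exits. */
static int register_readonly_slot(int vm_fd, uint32_t slot, uint64_t gpa,
                                  uint64_t size, void *hva)
{
        struct kvm_userspace_memory_region region = {
                .slot            = slot,
                .flags           = KVM_MEM_READONLY,
                .guest_phys_addr = gpa,
                .memory_size     = size,
                .userspace_addr  = (uint64_t)hva,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}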

In any case, it would be good to put this to bed one way or the other
(assuming it hasn't been put to bed already)

Thanks,
Ard.


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-24 14:04 issues with emulated PCI MMIO backed by host memory under KVM Ard Biesheuvel
@ 2016-06-24 14:57 ` Andrew Jones
  2016-06-27  8:17   ` Marc Zyngier
  2016-06-24 18:16 ` Ard Biesheuvel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Andrew Jones @ 2016-06-24 14:57 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm


Hi Ard,

Thanks for bringing this back up again (I think :-)

On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> Hi all,
> 
> This old subject came up again in a discussion related to PCIe support
> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> regions as cacheable is preventing us from reusing a significant slice
> of the PCIe support infrastructure, and so I'd like to bring this up
> again, perhaps just to reiterate why we're simply out of luck.
> 
> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> for emulated devices may be backed by memory that is mapped cacheable
> by the host. Note that this has nothing to do with the device being
> DMA coherent or not: in this case, we are dealing with regions that
> are not memory from the POV of the guest, and it is reasonable for the
> guest to assume that accesses to such a region are not visible to the
> device before they hit the actual PCI MMIO window and are translated
> into cycles on the PCI bus. That means that mapping such a region
> cacheable is a strange thing to do, in fact, and it is unlikely that
> patches implementing this against the generic PCI stack in Tianocore
> will be accepted by the maintainers.
> 
> Note that this issue not only affects framebuffers on PCI cards, it
> also affects emulated USB host controllers (perhaps Alex can remind us
> which one exactly?) and likely other emulated generic PCI devices as
> well.
> 
> Since the issue exists only for emulated PCI devices whose MMIO
> regions are backed by host memory, is there any way we can already
> distinguish such memslots from ordinary ones? If we can, is there

When I was looking at this I didn't see any way to identify these
memslots. I wrote some patches to add a new flag, KVM_MEM_NONCACHEABLE,
allowing userspace to point them out. That was the easy part (although
I didn't like that userspace developers would have to go around finding
all memory regions that needed to be flagged, and new devices would
likely not be flagged when developed on non-arm architectures, so we'd
always be chasing it...) However what really slowed/stopped me was
trying to figure out what to do with those identified memslots.

My last idea, which had implementation issues (probably because I was
getting in over my head), was

 1) introduce PAGE_S2_NORMAL_NC and use it when mapping the guest's pages
 2) flush the userspace pages and update all PTEs to be NC

The reasoning was that, while we can't force a guest to use cacheable
memory, we can take advantage of the noncacheable precedence of the
architecture, forcing the memory accesses to be noncached by way of
S2 attributes. And of course userspace mappings also need to become NC
to finally have coherency.
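
To make step 1) above concrete, the idea was patterned on the existing
PAGE_S2 and PAGE_S2_DEVICE definitions on arm64; the names and the exact
value below are my assumptions, not merged code. At stage 2,
MemAttr[3:0] = 0b0101 encodes Normal, Inner/Outer Non-cacheable:

/* Hypothetical stage-2 memory type: Normal, Inner/Outer Non-cacheable. */
#define MT_S2_NORMAL_NC         0x5     /* MemAttr[3:0] = 0b0101 */

/* Hypothetical pgprot with the same shape as the existing PAGE_S2, but
 * using the NC attribute, so the stage-2 attribute dominates whatever
 * cacheable type the guest picks at stage 1. */
#define PAGE_S2_NORMAL_NC       __pgprot(PROT_DEFAULT | \
                                         PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | \
                                         PTE_S2_RDONLY)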

> anything we could do to treat these specially? Perhaps something like
> using read-only memslots so we can at least trap guest writes instead
> of having main memory going out of sync with the caches unnoticed? I
> am just brainstorming here ...
> 
> In any case, it would be good to put this to bed one way or the other
> (assuming it hasn't been put to bed already)

I'm willing to work on this again (because it's fun), but I'm a bit
overloaded right now, and last time I touched it, it sucked me into a
time hole...

drew


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-24 14:04 issues with emulated PCI MMIO backed by host memory under KVM Ard Biesheuvel
  2016-06-24 14:57 ` Andrew Jones
@ 2016-06-24 18:16 ` Ard Biesheuvel
  2016-06-25  7:15   ` Alexander Graf
  2016-06-25  7:19 ` Alexander Graf
  2016-06-27  9:16 ` Christoffer Dall
  3 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-24 18:16 UTC (permalink / raw)
  To: Christoffer Dall, Peter Maydell, Marc Zyngier, Andrew Jones,
	Laszlo Ersek, kvmarm, Alexander Graf
  Cc: Catalin Marinas

On 24 June 2016 at 16:04, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
[...]
> Note that this issue not only affects framebuffers on PCI cards, it
> also affects emulated USB host controllers (perhaps Alex can remind us
> which one exactly?)

Actually, looking at the QEMU source code, I am not able to spot the
USB hcd emulation code that backs a PCI MMIO BAR using host memory,
and in fact, the only instance I *can* find is vga-pci.c

@Alex: could you please explain which exact issue with USB emulation
is suspected to be caused by this?

@team-RH: are there any other examples beyond VGA PCI where this is a problem?

Thanks,
Ard.


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-24 18:16 ` Ard Biesheuvel
@ 2016-06-25  7:15   ` Alexander Graf
  0 siblings, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2016-06-25  7:15 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm



> Am 24.06.2016 um 20:16 schrieb Ard Biesheuvel <ard.biesheuvel@linaro.org>:
> 
>> On 24 June 2016 at 16:04, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> [...]
>> Note that this issue not only affects framebuffers on PCI cards, it
>> also affects emulated USB host controllers (perhaps Alex can remind us
>> which one exactly?)
> 
> Actually, looking at the QEMU source code, I am not able to spot the
> USB hcd emulation code that backs a PCI MMIO BAR using host memory,
> and in fact, the only instance I *can* find is vga-pci.c
> 
> @Alex: could you please explain which exact issue with USB emulation
> is suspected to be caused by this?

IIRC Linux put the USB rings into guest memory and mapped them as NC inside the guest, so the host will see stale data from the cache.


Alex


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-24 14:04 issues with emulated PCI MMIO backed by host memory under KVM Ard Biesheuvel
  2016-06-24 14:57 ` Andrew Jones
  2016-06-24 18:16 ` Ard Biesheuvel
@ 2016-06-25  7:19 ` Alexander Graf
  2016-06-27  8:11   ` Marc Zyngier
  2016-06-27  9:16 ` Christoffer Dall
  3 siblings, 1 reply; 32+ messages in thread
From: Alexander Graf @ 2016-06-25  7:19 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm



> Am 24.06.2016 um 16:04 schrieb Ard Biesheuvel <ard.biesheuvel@linaro.org>:
> 
> Hi all,
> 
> This old subject came up again in a discussion related to PCIe support
> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> regions as cacheable is preventing us from reusing a significant slice
> of the PCIe support infrastructure, and so I'd like to bring this up
> again, perhaps just to reiterate why we're simply out of luck.
> 
> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> for emulated devices may be backed by memory that is mapped cacheable
> by the host. Note that this has nothing to do with the device being
> DMA coherent or not: in this case, we are dealing with regions that
> are not memory from the POV of the guest, and it is reasonable for the
> guest to assume that accesses to such a region are not visible to the
> device before they hit the actual PCI MMIO window and are translated
> into cycles on the PCI bus. That means that mapping such a region
> cacheable is a strange thing to do, in fact, and it is unlikely that
> patches implementing this against the generic PCI stack in Tianocore
> will be accepted by the maintainers.
> 
> Note that this issue not only affects framebuffers on PCI cards, it
> also affects emulated USB host controllers (perhaps Alex can remind us
> which one exactly?) and likely other emulated generic PCI devices as
> well.
> 
> Since the issue exists only for emulated PCI devices whose MMIO
> regions are backed by host memory, is there any way we can already
> distinguish such memslots from ordinary ones? If we can, is there
> anything we could do to treat these specially? Perhaps something like
> using read-only memslots so we can at least trap guest writes instead
> of having main memory going out of sync with the caches unnoticed? I
> am just brainstorming here ...

The "easiest" first step would be to simply not map host memory into the guest when we're on arm. Unfortunately that would mean we trap on everything as mmio accesses, including user space access from Xorg for example. That in turn means we'd need to mmio emulate neon instructions and all other sorts of things that can trigger mmio exits without being emulated today.

Also, even with that working and maybe even coalesced mmio implemented, I'd guess it'd still be too slow for real world usage...


Alex


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-25  7:19 ` Alexander Graf
@ 2016-06-27  8:11   ` Marc Zyngier
  0 siblings, 0 replies; 32+ messages in thread
From: Marc Zyngier @ 2016-06-27  8:11 UTC (permalink / raw)
  To: Alexander Graf, Ard Biesheuvel; +Cc: Catalin Marinas, Laszlo Ersek, kvmarm

On 25/06/16 08:19, Alexander Graf wrote:
> 
> 
>> Am 24.06.2016 um 16:04 schrieb Ard Biesheuvel <ard.biesheuvel@linaro.org>:
>>
>> Hi all,
>>
>> This old subject came up again in a discussion related to PCIe support
>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>> regions as cacheable is preventing us from reusing a significant slice
>> of the PCIe support infrastructure, and so I'd like to bring this up
>> again, perhaps just to reiterate why we're simply out of luck.
>>
>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>> for emulated devices may be backed by memory that is mapped cacheable
>> by the host. Note that this has nothing to do with the device being
>> DMA coherent or not: in this case, we are dealing with regions that
>> are not memory from the POV of the guest, and it is reasonable for the
>> guest to assume that accesses to such a region are not visible to the
>> device before they hit the actual PCI MMIO window and are translated
>> into cycles on the PCI bus. That means that mapping such a region
>> cacheable is a strange thing to do, in fact, and it is unlikely that
>> patches implementing this against the generic PCI stack in Tianocore
>> will be accepted by the maintainers.
>>
>> Note that this issue not only affects framebuffers on PCI cards, it
>> also affects emulated USB host controllers (perhaps Alex can remind us
>> which one exactly?) and likely other emulated generic PCI devices as
>> well.
>>
>> Since the issue exists only for emulated PCI devices whose MMIO
>> regions are backed by host memory, is there any way we can already
>> distinguish such memslots from ordinary ones? If we can, is there
>> anything we could do to treat these specially? Perhaps something like
>> using read-only memslots so we can at least trap guest writes instead
>> of having main memory going out of sync with the caches unnoticed? I
>> am just brainstorming here ...
> 
> The "easiest" first step would be to simply not map host memory into
> the guest when we're on arm. Unfortunately that would mean we trap on
> everything as mmio accesses, including user space access from Xorg
> for example. That in turn means we'd need to mmio emulate neon
> instructions and all other sorts of things that can trigger mmio
> exits without being emulated today.

It is not possible to emulate these instructions (load/store multiple,
whether they are GP or FP registers) other than with a "stop the world"
approach (in order to close the race where you read the instruction from
memory while another vcpu changes the page tables).

> Also, even with that working and maybe even coalesced mmio
> implemented, I'd guess it'd still be too slow for real world
> usage...

And probably even slower than you think. There is no way around using
the architecture as it should be used. Either the guest is using
cacheable memory, or userspace is using non-cacheable memory. Everything
else is bound to fail one way or another.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-24 14:57 ` Andrew Jones
@ 2016-06-27  8:17   ` Marc Zyngier
  0 siblings, 0 replies; 32+ messages in thread
From: Marc Zyngier @ 2016-06-27  8:17 UTC (permalink / raw)
  To: Andrew Jones, Ard Biesheuvel; +Cc: Catalin Marinas, Laszlo Ersek, kvmarm

On 24/06/16 15:57, Andrew Jones wrote:
> 
> Hi Ard,
> 
> Thanks for bringing this back up again (I think :-)
> 
> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>> Hi all,
>>
>> This old subject came up again in a discussion related to PCIe support
>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>> regions as cacheable is preventing us from reusing a significant slice
>> of the PCIe support infrastructure, and so I'd like to bring this up
>> again, perhaps just to reiterate why we're simply out of luck.
>>
>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>> for emulated devices may be backed by memory that is mapped cacheable
>> by the host. Note that this has nothing to do with the device being
>> DMA coherent or not: in this case, we are dealing with regions that
>> are not memory from the POV of the guest, and it is reasonable for the
>> guest to assume that accesses to such a region are not visible to the
>> device before they hit the actual PCI MMIO window and are translated
>> into cycles on the PCI bus. That means that mapping such a region
>> cacheable is a strange thing to do, in fact, and it is unlikely that
>> patches implementing this against the generic PCI stack in Tianocore
>> will be accepted by the maintainers.
>>
>> Note that this issue not only affects framebuffers on PCI cards, it
>> also affects emulated USB host controllers (perhaps Alex can remind us
>> which one exactly?) and likely other emulated generic PCI devices as
>> well.
>>
>> Since the issue exists only for emulated PCI devices whose MMIO
>> regions are backed by host memory, is there any way we can already
>> distinguish such memslots from ordinary ones? If we can, is there
> 
> When I was looking at this I didn't see any way to identify these
> memslots. I wrote some patches to add a new flag, KVM_MEM_NONCACHEABLE,
> allowing userspace to point them out. That was the easy part (although
> I didn't like that userspace developers would have to go around finding
> all memory regions that needed to be flagged, and new devices would
> likely not be flagged when developed on non-arm architectures, so we'd
> always be chasing it...) However what really slowed/stopped me was
> trying to figure out what to do with those identified memslots.
> 
> My last idea, which had implementation issues (probably because I was
> getting in over my head), was
> 
>  1) introduce PAGE_S2_NORMAL_NC and use it when mapping the guest's pages
>  2) flush the userspace pages and update all PTEs to be NC
> 
> The reasoning was that, while we can't force a guest to use cacheable
> memory, we can take advantage of the noncacheable precedence of the
> architecture, forcing the memory accesses to be noncached by way of
> S2 attributes. And of course userspace mappings also need to become NC
> to finally have coherency.

I think this is a sensible course of action, as long as you can identify
a specific memblock on which to apply this. You may not even have to
"repaint" the PTEs, but instead obtain a non-cacheable mapping from the
kernel (at a different address).

I'm more worried if we end up having both cacheable and non-cacheable
pages inside the same VMA (and Alex seems to point at USB having weird
requirements around this).

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-24 14:04 issues with emulated PCI MMIO backed by host memory under KVM Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2016-06-25  7:19 ` Alexander Graf
@ 2016-06-27  9:16 ` Christoffer Dall
  2016-06-27  9:47   ` Ard Biesheuvel
  3 siblings, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-06-27  9:16 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

Hi,

I'm going to ask some stupid questions here...

On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> Hi all,
> 
> This old subject came up again in a discussion related to PCIe support
> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> regions as cacheable is preventing us from reusing a significant slice
> of the PCIe support infrastructure, and so I'd like to bring this up
> again, perhaps just to reiterate why we're simply out of luck.
> 
> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> for emulated devices may be backed by memory that is mapped cacheable
> by the host. Note that this has nothing to do with the device being
> DMA coherent or not: in this case, we are dealing with regions that
> are not memory from the POV of the guest, and it is reasonable for the
> guest to assume that accesses to such a region are not visible to the
> device before they hit the actual PCI MMIO window and are translated
> into cycles on the PCI bus. 

For the sake of completeness, why is this reasonable?

Is this how any real ARM system implementing PCI would actually work?

> That means that mapping such a region
> cacheable is a strange thing to do, in fact, and it is unlikely that
> patches implementing this against the generic PCI stack in Tianocore
> will be accepted by the maintainers.
> 
> Note that this issue not only affects framebuffers on PCI cards, it
> also affects emulated USB host controllers (perhaps Alex can remind us
> which one exactly?) and likely other emulated generic PCI devices as
> well.
> 
> Since the issue exists only for emulated PCI devices whose MMIO
> regions are backed by host memory, is there any way we can already
> distinguish such memslots from ordinary ones? If we can, is there
> anything we could do to treat these specially? Perhaps something like
> using read-only memslots so we can at least trap guest writes instead
> of having main memory going out of sync with the caches unnoticed? I
> am just brainstorming here ...

I think the only sensible solution is to make sure that the guest and
emulation mappings use the same memory type, either cached or
non-cached, and we 'simply' have to find the best way to implement this.

As Drew suggested, forcing some S2 mappings to be non-cacheable is the
one way.

The other way is to use something like what you once wrote that rewrites
stage-1 mappings to be cacheable, does that apply here ?

Do we have a clear picture of why we'd prefer one way over the other?

> 
> In any case, it would be good to put this to bed one way or the other
> (assuming it hasn't been put to bed already)
> 

Agreed.

Thanks for the mail!

-Christoffer


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27  9:16 ` Christoffer Dall
@ 2016-06-27  9:47   ` Ard Biesheuvel
  2016-06-27 10:34     ` Christoffer Dall
  2016-06-27 13:15     ` Peter Maydell
  0 siblings, 2 replies; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-27  9:47 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> Hi,
>
> I'm going to ask some stupid questions here...
>
> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>> Hi all,
>>
>> This old subject came up again in a discussion related to PCIe support
>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>> regions as cacheable is preventing us from reusing a significant slice
>> of the PCIe support infrastructure, and so I'd like to bring this up
>> again, perhaps just to reiterate why we're simply out of luck.
>>
>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>> for emulated devices may be backed by memory that is mapped cacheable
>> by the host. Note that this has nothing to do with the device being
>> DMA coherent or not: in this case, we are dealing with regions that
>> are not memory from the POV of the guest, and it is reasonable for the
>> guest to assume that accesses to such a region are not visible to the
>> device before they hit the actual PCI MMIO window and are translated
>> into cycles on the PCI bus.
>
> For the sake of completeness, why is this reasonable?
>

Because the whole point of accessing these regions is to communicate
with the device. It is common to use write combining mappings for
things like framebuffers to group writes before they hit the PCI bus,
but any caching just makes it more difficult for the driver state and
device state to remain synchronized.
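
As an illustration, this is roughly what a native Linux framebuffer driver
does (a minimal sketch; the BAR index is arbitrary): the BAR gets a
write-combining mapping, never a cacheable one.

#include <linux/pci.h>
#include <linux/io.h>

/* Map a framebuffer BAR write-combining: writes may be merged or buffered,
 * but they never linger in the CPU caches. */
static void __iomem *map_fb_bar(struct pci_dev *pdev)
{
        resource_size_t start = pci_resource_start(pdev, 0);
        resource_size_t len   = pci_resource_len(pdev, 0);

        return ioremap_wc(start, len);
}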

> Is this how any real ARM system implementing PCI would actually work?
>

Yes.

>> That means that mapping such a region
>> cacheable is a strange thing to do, in fact, and it is unlikely that
>> patches implementing this against the generic PCI stack in Tianocore
>> will be accepted by the maintainers.
>>
>> Note that this issue not only affects framebuffers on PCI cards, it
>> also affects emulated USB host controllers (perhaps Alex can remind us
>> which one exactly?) and likely other emulated generic PCI devices as
>> well.
>>
>> Since the issue exists only for emulated PCI devices whose MMIO
>> regions are backed by host memory, is there any way we can already
>> distinguish such memslots from ordinary ones? If we can, is there
>> anything we could do to treat these specially? Perhaps something like
>> using read-only memslots so we can at least trap guest writes instead
>> of having main memory going out of sync with the caches unnoticed? I
>> am just brainstorming here ...
>
> I think the only sensible solution is to make sure that the guest and
> emulation mappings use the same memory type, either cached or
> non-cached, and we 'simply' have to find the best way to implement this.
>
> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
> one way.
>
> The other way is to use something like what you once wrote that rewrites
> stage-1 mappings to be cacheable, does that apply here ?
>
> Do we have a clear picture of why we'd prefer one way over the other?
>

So first of all, let me reiterate that I could only find a single
instance in QEMU where a PCI MMIO region is backed by host memory,
which is vga-pci.c. I wonder if there are any other occurrences, but
if there aren't any, it makes much more sense to prohibit PCI BARs
backed by host memory rather than spend a lot of effort working around
it.

If we do decide to fix this, the best way would be to use uncached
attributes for the QEMU userland mapping, and force it uncached in the
guest via a stage 2 override (as Drew suggests). The only problem I
see here is that the host's kernel direct mapping has a cached alias
that we need to get rid of. The MAIR hack is just that, a hack, since
there are corner cases that cannot be handled (but please refer to the
old thread for the details)

As for the USB case, I can't really figure out what is going on here,
but I am fairly certain it is a different issue. If this is related to
DMA, I wonder if adding the 'dma-coherent' property to the PCIe root
complex node fixes anything.

-- 
Ard.


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27  9:47   ` Ard Biesheuvel
@ 2016-06-27 10:34     ` Christoffer Dall
  2016-06-27 12:30       ` Ard Biesheuvel
                         ` (2 more replies)
  2016-06-27 13:15     ` Peter Maydell
  1 sibling, 3 replies; 32+ messages in thread
From: Christoffer Dall @ 2016-06-27 10:34 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > Hi,
> >
> > I'm going to ask some stupid questions here...
> >
> > On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> >> Hi all,
> >>
> >> This old subject came up again in a discussion related to PCIe support
> >> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> >> regions as cacheable is preventing us from reusing a significant slice
> >> of the PCIe support infrastructure, and so I'd like to bring this up
> >> again, perhaps just to reiterate why we're simply out of luck.
> >>
> >> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> >> for emulated devices may be backed by memory that is mapped cacheable
> >> by the host. Note that this has nothing to do with the device being
> >> DMA coherent or not: in this case, we are dealing with regions that
> >> are not memory from the POV of the guest, and it is reasonable for the
> >> guest to assume that accesses to such a region are not visible to the
> >> device before they hit the actual PCI MMIO window and are translated
> >> into cycles on the PCI bus.
> >
> > For the sake of completeness, why is this reasonable?
> >
> 
> Because the whole point of accessing these regions is to communicate
> with the device. It is common to use write combining mappings for
> things like framebuffers to group writes before they hit the PCI bus,
> but any caching just makes it more difficult for the driver state and
> device state to remain synchronized.
> 
> > Is this how any real ARM system implementing PCI would actually work?
> >
> 
> Yes.
> 
> >> That means that mapping such a region
> >> cacheable is a strange thing to do, in fact, and it is unlikely that
> >> patches implementing this against the generic PCI stack in Tianocore
> >> will be accepted by the maintainers.
> >>
> >> Note that this issue not only affects framebuffers on PCI cards, it
> >> also affects emulated USB host controllers (perhaps Alex can remind us
> >> which one exactly?) and likely other emulated generic PCI devices as
> >> well.
> >>
> >> Since the issue exists only for emulated PCI devices whose MMIO
> >> regions are backed by host memory, is there any way we can already
> >> distinguish such memslots from ordinary ones? If we can, is there
> >> anything we could do to treat these specially? Perhaps something like
> >> using read-only memslots so we can at least trap guest writes instead
> >> of having main memory going out of sync with the caches unnoticed? I
> >> am just brainstorming here ...
> >
> > I think the only sensible solution is to make sure that the guest and
> > emulation mappings use the same memory type, either cached or
> > non-cached, and we 'simply' have to find the best way to implement this.
> >
> > As Drew suggested, forcing some S2 mappings to be non-cacheable is the
> > one way.
> >
> > The other way is to use something like what you once wrote that rewrites
> > stage-1 mappings to be cacheable, does that apply here ?
> >
> > Do we have a clear picture of why we'd prefer one way over the other?
> >
> 
> So first of all, let me reiterate that I could only find a single
> instance in QEMU where a PCI MMIO region is backed by host memory,
> which is vga-pci.c. I wonder if there are any other occurrences, but
> if there aren't any, it makes much more sense to prohibit PCI BARs
> backed by host memory rather than spend a lot of effort working around
> it.

Right, ok.  So Marc's point during his KVM Forum talk was basically,
don't use the legacy VGA adapter on ARM and use virtio graphics, right?

What is the proposed solution for someone shipping an ARM server and
wishing to provide a graphical output for that server?

It feels strange to work around supporting PCI VGA adapters in ARM VMs,
if that's not a supported real hardware case.  However, I don't see what
would prevent someone from plugging a VGA adapter into the PCI slot on
an ARM server, and people selling ARM servers probably want this to
happen, I'm guessing.

> 
> If we do decide to fix this, the best way would be to use uncached
> attributes for the QEMU userland mapping, and force it uncached in the
> guest via a stage 2 override (as Drew suggests). The only problem I
> see here is that the host's kernel direct mapping has a cached alias
> that we need to get rid of. 

Do we have a way to accomplish that?

Will we run into a bunch of other problems if we begin punching holes in
the direct mapping for regular RAM?

Thanks,
-Christoffer


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 10:34     ` Christoffer Dall
@ 2016-06-27 12:30       ` Ard Biesheuvel
  2016-06-27 13:35         ` Christoffer Dall
  2016-06-27 14:24       ` Alexander Graf
  2016-06-28 10:55       ` Laszlo Ersek
  2 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-27 12:30 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> > Hi,
>> >
>> > I'm going to ask some stupid questions here...
>> >
>> > On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>> >> Hi all,
>> >>
>> >> This old subject came up again in a discussion related to PCIe support
>> >> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>> >> regions as cacheable is preventing us from reusing a significant slice
>> >> of the PCIe support infrastructure, and so I'd like to bring this up
>> >> again, perhaps just to reiterate why we're simply out of luck.
>> >>
>> >> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>> >> for emulated devices may be backed by memory that is mapped cacheable
>> >> by the host. Note that this has nothing to do with the device being
>> >> DMA coherent or not: in this case, we are dealing with regions that
>> >> are not memory from the POV of the guest, and it is reasonable for the
>> >> guest to assume that accesses to such a region are not visible to the
>> >> device before they hit the actual PCI MMIO window and are translated
>> >> into cycles on the PCI bus.
>> >
>> > For the sake of completeness, why is this reasonable?
>> >
>>
>> Because the whole point of accessing these regions is to communicate
>> with the device. It is common to use write combining mappings for
>> things like framebuffers to group writes before they hit the PCI bus,
>> but any caching just makes it more difficult for the driver state and
>> device state to remain synchronized.
>>
>> > Is this how any real ARM system implementing PCI would actually work?
>> >
>>
>> Yes.
>>
>> >> That means that mapping such a region
>> >> cacheable is a strange thing to do, in fact, and it is unlikely that
>> >> patches implementing this against the generic PCI stack in Tianocore
>> >> will be accepted by the maintainers.
>> >>
>> >> Note that this issue not only affects framebuffers on PCI cards, it
>> >> also affects emulated USB host controllers (perhaps Alex can remind us
>> >> which one exactly?) and likely other emulated generic PCI devices as
>> >> well.
>> >>
>> >> Since the issue exists only for emulated PCI devices whose MMIO
>> >> regions are backed by host memory, is there any way we can already
>> >> distinguish such memslots from ordinary ones? If we can, is there
>> >> anything we could do to treat these specially? Perhaps something like
>> >> using read-only memslots so we can at least trap guest writes instead
>> >> of having main memory going out of sync with the caches unnoticed? I
>> >> am just brainstorming here ...
>> >
>> > I think the only sensible solution is to make sure that the guest and
>> > emulation mappings use the same memory type, either cached or
>> > non-cached, and we 'simply' have to find the best way to implement this.
>> >
>> > As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>> > one way.
>> >
>> > The other way is to use something like what you once wrote that rewrites
>> > stage-1 mappings to be cacheable, does that apply here ?
>> >
>> > Do we have a clear picture of why we'd prefer one way over the other?
>> >
>>
>> So first of all, let me reiterate that I could only find a single
>> instance in QEMU where a PCI MMIO region is backed by host memory,
>> which is vga-pci.c. I wonder if there are any other occurrences, but
>> if there aren't any, it makes much more sense to prohibit PCI BARs
>> backed by host memory rather than spend a lot of effort working around
>> it.
>
> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>

Yes. But nothing is preventing you currently from using that, and I
think we should prefer crappy performance but correct operation over
the current situation. So in general, we should either disallow PCI
BARs backed by host memory, or emulate them, but never back them by a
RAM memslot when running under ARM/KVM.

> What is the proposed solution for someone shipping an ARM server and
> wishing to provide a graphical output for that server?
>

The problem does not exist on bare metal. It is an implementation
detail of KVM on ARM that guest PCI BAR mappings are incoherent with
the view of the emulator in QEMU.

> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
> if that's not a supported real hardware case.  However, I don't see what
> would prevent someone from plugging a VGA adapter into the PCI slot on
> an ARM server, and people selling ARM servers probably want this to
> happen, I'm guessing.
>

As I said, the problem does not exist on bare metal.

>>
>> If we do decide to fix this, the best way would be to use uncached
>> attributes for the QEMU userland mapping, and force it uncached in the
>> guest via a stage 2 override (as Drew suggests). The only problem I
>> see here is that the host's kernel direct mapping has a cached alias
>> that we need to get rid of.
>
> Do we have a way to accomplish that?
>
> Will we run into a bunch of other problems if we begin punching holes in
> the direct mapping for regular RAM?
>

I think the policy up until now has been not to remap regions in the
kernel direct mapping for the purposes of DMA, and I think by the same
reasoning, it is not preferable for KVM either


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27  9:47   ` Ard Biesheuvel
  2016-06-27 10:34     ` Christoffer Dall
@ 2016-06-27 13:15     ` Peter Maydell
  2016-06-27 13:49       ` Mark Rutland
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Maydell @ 2016-06-27 13:15 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On 27 June 2016 at 10:47, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> As for the USB case, I can't really figure out what is going on here,
> but I am fairly certain it is a different issue. If this is related to
> DMA, I wonder if adding the 'dma-coherent' property to the PCIe root
> complex node fixes anything.

I get the impression dma-coherent is the right thing to advertise
anyway. Do you have the documentation to hand that specifies what
"dma-coherent" means? The Documentation/devicetree docs in the
kernel tree seem to rather unhelpfully define it as "Present if
dma operations are coherent", which doesn't really clarify anything
to me...

thanks
-- PMM


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 12:30       ` Ard Biesheuvel
@ 2016-06-27 13:35         ` Christoffer Dall
  2016-06-27 13:57           ` Ard Biesheuvel
  0 siblings, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-06-27 13:35 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
> >> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >> > Hi,
> >> >
> >> > I'm going to ask some stupid questions here...
> >> >
> >> > On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> >> >> Hi all,
> >> >>
> >> >> This old subject came up again in a discussion related to PCIe support
> >> >> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> >> >> regions as cacheable is preventing us from reusing a significant slice
> >> >> of the PCIe support infrastructure, and so I'd like to bring this up
> >> >> again, perhaps just to reiterate why we're simply out of luck.
> >> >>
> >> >> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> >> >> for emulated devices may be backed by memory that is mapped cacheable
> >> >> by the host. Note that this has nothing to do with the device being
> >> >> DMA coherent or not: in this case, we are dealing with regions that
> >> >> are not memory from the POV of the guest, and it is reasonable for the
> >> >> guest to assume that accesses to such a region are not visible to the
> >> >> device before they hit the actual PCI MMIO window and are translated
> >> >> into cycles on the PCI bus.
> >> >
> >> > For the sake of completeness, why is this reasonable?
> >> >
> >>
> >> Because the whole point of accessing these regions is to communicate
> >> with the device. It is common to use write combining mappings for
> >> things like framebuffers to group writes before they hit the PCI bus,
> >> but any caching just makes it more difficult for the driver state and
> >> device state to remain synchronized.
> >>
> >> > Is this how any real ARM system implementing PCI would actually work?
> >> >
> >>
> >> Yes.
> >>
> >> >> That means that mapping such a region
> >> >> cacheable is a strange thing to do, in fact, and it is unlikely that
> >> >> patches implementing this against the generic PCI stack in Tianocore
> >> >> will be accepted by the maintainers.
> >> >>
> >> >> Note that this issue not only affects framebuffers on PCI cards, it
> >> >> also affects emulated USB host controllers (perhaps Alex can remind us
> >> >> which one exactly?) and likely other emulated generic PCI devices as
> >> >> well.
> >> >>
> >> >> Since the issue exists only for emulated PCI devices whose MMIO
> >> >> regions are backed by host memory, is there any way we can already
> >> >> distinguish such memslots from ordinary ones? If we can, is there
> >> >> anything we could do to treat these specially? Perhaps something like
> >> >> using read-only memslots so we can at least trap guest writes instead
> >> >> of having main memory going out of sync with the caches unnoticed? I
> >> >> am just brainstorming here ...
> >> >
> >> > I think the only sensible solution is to make sure that the guest and
> >> > emulation mappings use the same memory type, either cached or
> >> > non-cached, and we 'simply' have to find the best way to implement this.
> >> >
> >> > As Drew suggested, forcing some S2 mappings to be non-cacheable is the
> >> > one way.
> >> >
> >> > The other way is to use something like what you once wrote that rewrites
> >> > stage-1 mappings to be cacheable, does that apply here ?
> >> >
> >> > Do we have a clear picture of why we'd prefer one way over the other?
> >> >
> >>
> >> So first of all, let me reiterate that I could only find a single
> >> instance in QEMU where a PCI MMIO region is backed by host memory,
> >> which is vga-pci.c. I wonder if there are any other occurrences, but
> >> if there aren't any, it makes much more sense to prohibit PCI BARs
> >> backed by host memory rather than spend a lot of effort working around
> >> it.
> >
> > Right, ok.  So Marc's point during his KVM Forum talk was basically,
> > don't use the legacy VGA adapter on ARM and use virtio graphics, right?
> >
> 
> Yes. But nothing is preventing you currently from using that, and I
> think we should prefer crappy performance but correct operation over
> the current situation. So in general, we should either disallow PCI
> BARs backed by host memory, or emulate them, but never back them by a
> RAM memslot when running under ARM/KVM.

agreed, I just think that emulating accesses by trapping them is not
just slow, it's not really possible in practice and even if it is, it's
probably *unusably* slow.

> 
> > What is the proposed solution for someone shipping an ARM server and
> > wishing to provide a graphical output for that server?
> >
> 
> The problem does not exist on bare metal. It is an implementation
> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
> the view of the emulator in QEMU.
> 
> > It feels strange to work around supporting PCI VGA adapters in ARM VMs,
> > if that's not a supported real hardware case.  However, I don't see what
> > would prevent someone from plugging a VGA adapter into the PCI slot on
> > an ARM server, and people selling ARM servers probably want this to
> > happen, I'm guessing.
> >
> 
> As I said, the problem does not exist on bare metal.
> 
> >>
> >> If we do decide to fix this, the best way would be to use uncached
> >> attributes for the QEMU userland mapping, and force it uncached in the
> >> guest via a stage 2 override (as Drew suggests). The only problem I
> >> see here is that the host's kernel direct mapping has a cached alias
> >> that we need to get rid of.
> >
> > Do we have a way to accomplish that?
> >
> > Will we run into a bunch of other problems if we begin punching holes in
> > the direct mapping for regular RAM?
> >
> 
> I think the policy up until now has been not to remap regions in the
> kernel direct mapping for the purposes of DMA, and I think by the same
> reasoning, it is not preferable for KVM either

I guess the difference is that from the (host) kernel's point of view
this is not DMA memory, but just regular RAM.  I just don't know enough
about the kernel's VM mappings to know what's involved here, but we
should find out somehow...


Thanks,
-Christoffer


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 13:15     ` Peter Maydell
@ 2016-06-27 13:49       ` Mark Rutland
  2016-06-27 14:10         ` Peter Maydell
  0 siblings, 1 reply; 32+ messages in thread
From: Mark Rutland @ 2016-06-27 13:49 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On Mon, Jun 27, 2016 at 02:15:29PM +0100, Peter Maydell wrote:
> On 27 June 2016 at 10:47, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > As for the USB case, I can't really figure out what is going on here,
> > but I am fairly certain it is a different issue. If this is related to
> > DMA, I wonder if adding the 'dma-coherent' property to the PCIe root
> > complex node fixes anything.
> 
> I get the impression dma-coherent is the right thing to advertise
> anyway. Do you have the documentation to hand that specifies what
> "dma-coherent" means? The Documentation/devicetree docs in the
> kernel tree seem to rather unhelpfully define it as "Present if
> dma operations are coherent", which doesn't really clarify anything
> to me...

It's ill-defined today, and the precise definition is an open question.
See replies to [1], which seems to have stalled as of [2].

My view is that for arm/arm64 this should mean the device makes accesses
which are coherent with Inner Shareable Normal Inner-WB Outer-WB
attributes, as this is the functional de-facto semantics today, and
anything short of that is not well-defined or usable.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/433626.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/434143.html


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 13:35         ` Christoffer Dall
@ 2016-06-27 13:57           ` Ard Biesheuvel
  2016-06-27 14:29             ` Alexander Graf
  2016-06-28 10:04             ` Christoffer Dall
  0 siblings, 2 replies; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-27 13:57 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> > On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>> >> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm going to ask some stupid questions here...
>> >> >
>> >> > On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>> >> >> Hi all,
>> >> >>
>> >> >> This old subject came up again in a discussion related to PCIe support
>> >> >> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>> >> >> regions as cacheable is preventing us from reusing a significant slice
>> >> >> of the PCIe support infrastructure, and so I'd like to bring this up
>> >> >> again, perhaps just to reiterate why we're simply out of luck.
>> >> >>
>> >> >> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>> >> >> for emulated devices may be backed by memory that is mapped cacheable
>> >> >> by the host. Note that this has nothing to do with the device being
>> >> >> DMA coherent or not: in this case, we are dealing with regions that
>> >> >> are not memory from the POV of the guest, and it is reasonable for the
>> >> >> guest to assume that accesses to such a region are not visible to the
>> >> >> device before they hit the actual PCI MMIO window and are translated
>> >> >> into cycles on the PCI bus.
>> >> >
>> >> > For the sake of completeness, why is this reasonable?
>> >> >
>> >>
>> >> Because the whole point of accessing these regions is to communicate
>> >> with the device. It is common to use write combining mappings for
>> >> things like framebuffers to group writes before they hit the PCI bus,
>> >> but any caching just makes it more difficult for the driver state and
>> >> device state to remain synchronized.
>> >>
>> >> > Is this how any real ARM system implementing PCI would actually work?
>> >> >
>> >>
>> >> Yes.
>> >>
>> >> >> That means that mapping such a region
>> >> >> cacheable is a strange thing to do, in fact, and it is unlikely that
>> >> >> patches implementing this against the generic PCI stack in Tianocore
>> >> >> will be accepted by the maintainers.
>> >> >>
>> >> >> Note that this issue not only affects framebuffers on PCI cards, it
>> >> >> also affects emulated USB host controllers (perhaps Alex can remind us
>> >> >> which one exactly?) and likely other emulated generic PCI devices as
>> >> >> well.
>> >> >>
>> >> >> Since the issue exists only for emulated PCI devices whose MMIO
>> >> >> regions are backed by host memory, is there any way we can already
>> >> >> distinguish such memslots from ordinary ones? If we can, is there
>> >> >> anything we could do to treat these specially? Perhaps something like
>> >> >> using read-only memslots so we can at least trap guest writes instead
>> >> >> of having main memory going out of sync with the caches unnoticed? I
>> >> >> am just brainstorming here ...
>> >> >
>> >> > I think the only sensible solution is to make sure that the guest and
>> >> > emulation mappings use the same memory type, either cached or
>> >> > non-cached, and we 'simply' have to find the best way to implement this.
>> >> >
>> >> > As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>> >> > one way.
>> >> >
>> >> > The other way is to use something like what you once wrote that rewrites
>> >> > stage-1 mappings to be cacheable, does that apply here ?
>> >> >
>> >> > Do we have a clear picture of why we'd prefer one way over the other?
>> >> >
>> >>
>> >> So first of all, let me reiterate that I could only find a single
>> >> instance in QEMU where a PCI MMIO region is backed by host memory,
>> >> which is vga-pci.c. I wonder if there are any other occurrences, but
>> >> if there aren't any, it makes much more sense to prohibit PCI BARs
>> >> backed by host memory rather than spend a lot of effort working around
>> >> it.
>> >
>> > Right, ok.  So Marc's point during his KVM Forum talk was basically,
>> > don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>> >
>>
>> Yes. But nothing is preventing you currently from using that, and I
>> think we should prefer crappy performance but correct operation over
>> the current situation. So in general, we should either disallow PCI
>> BARs backed by host memory, or emulate them, but never back them by a
>> RAM memslot when running under ARM/KVM.
>
> agreed, I just think that emulating accesses by trapping them is not
> just slow, it's not really possible in practice and even if it is, it's
> probably *unusably* slow.
>

Well, it would probably involve a lot of effort to implement emulation
of instructions with multiple output registers, such as ldp/stp and
register writeback. And indeed, trapping on each store instruction to
the framebuffer is going to be sloooooowwwww.

So let's disregard that option for now ...

>>
>> > What is the proposed solution for someone shipping an ARM server and
>> > wishing to provide a graphical output for that server?
>> >
>>
>> The problem does not exist on bare metal. It is an implementation
>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
>> the view of the emulator in QEMU.
>>
>> > It feels strange to work around supporting PCI VGA adapters in ARM VMs,
>> > if that's not a supported real hardware case.  However, I don't see what
>> > would prevent someone from plugging a VGA adapter into the PCI slot on
>> > an ARM server, and people selling ARM servers probably want this to
>> > happen, I'm guessing.
>> >
>>
>> As I said, the problem does not exist on bare metal.
>>
>> >>
>> >> If we do decide to fix this, the best way would be to use uncached
>> >> attributes for the QEMU userland mapping, and force it uncached in the
>> >> guest via a stage 2 override (as Drew suggests). The only problem I
>> >> see here is that the host's kernel direct mapping has a cached alias
>> >> that we need to get rid of.
>> >
>> > Do we have a way to accomplish that?
>> >
>> > Will we run into a bunch of other problems if we begin punching holes in
>> > the direct mapping for regular RAM?
>> >
>>
>> I think the policy up until now has been not to remap regions in the
>> kernel direct mapping for the purposes of DMA, and I think by the same
>> reasoning, it is not preferable for KVM either
>
> I guess the difference is that from the (host) kernel's point of view
> this is not DMA memory, but just regular RAM.  I just don't know enough
> about the kernel's VM mappings to know what's involved here, but we
> should find out somehow...
>

Whether it is DMA memory or not does not make a difference. The point
is simply that arm64 maps all RAM owned by the kernel as cacheable,
and remapping arbitrary ranges with different attributes is
problematic, since it is also likely to involve splitting of regions,
which is cumbersome with a mapping that is always live.

So instead, we'd have to reserve some system memory early on and
remove it from the linear mapping, the complexity of which is more
than we are probably prepared to put up with.

So if vga-pci.c is the only problematic device, for which a reasonable
alternative exists (virtio-gpu), I think the only feasible solution is
to educate QEMU not to allow RAM memslots being exposed via PCI BARs
when running under KVM/ARM.
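
Roughly speaking, and only as a sketch of where such a guard could live
(the helper name and the hard failure below are my invention; the QEMU
accessors used are existing ones):

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/memory.h"
#include "sysemu/kvm.h"

/* Sketch: refuse to expose a RAM-backed region as a PCI BAR when running
 * under KVM on an ARM host, where we cannot keep it coherent. */
static void assert_bar_not_ram_backed(MemoryRegion *mr)
{
#if defined(__aarch64__) || defined(__arm__)
    if (kvm_enabled() && memory_region_is_ram(mr)) {
        error_report("PCI BAR '%s' is backed by host RAM and cannot be kept "
                     "coherent under KVM/ARM", memory_region_name(mr));
        exit(1);
    }
#endif
}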


* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 13:49       ` Mark Rutland
@ 2016-06-27 14:10         ` Peter Maydell
  2016-06-28 10:05           ` Christoffer Dall
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Maydell @ 2016-06-27 14:10 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On 27 June 2016 at 14:49, Mark Rutland <mark.rutland@arm.com> wrote:
> On Mon, Jun 27, 2016 at 02:15:29PM +0100, Peter Maydell wrote:
>> I get the impression dma-coherent is the right thing to advertise
>> anyway. Do you have the documentation to hand that specifies what
>> "dma-coherent" means? The Documentation/devicetree docs in the
>> kernel tree seem to rather unhelpfully define it as "Present if
>> dma operations are coherent", which doesn't really clarify anything
>> to me...
>
> It's ill-defined today, and the precise definition is an open question.
> See replies to [1], which seems to have stalled as of [2].
>
> My view is that for arm/arm64 this should mean the device makes accesses
> which are coherent with Inner Shareable Normal Inner-WB Outer-WB
> attributes, as this is the functional de-facto semantics today, and
> anything short of that is not well-defined or usable.

OK, so for any emulated device in QEMU we should specify
dma-coherent by those rules. I think our only DMA devices
in the virt board are the emulated PCI devices; dma-coherent
here is a property of the pci-controller and applies to any
device on it, right? Presumably this means that if the host
pci-controller doesn't advertise itself as dma-coherent then
we cannot do any PCI passthrough of host hardware?
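
In board-code terms that is presumably just a matter of setting the
empty property on the generated pcie node. A sketch, using the fdt
helper QEMU already has ('nodename' stands for the pcie node path):

  /* mark the emulated host bridge, and thus every device behind it,
   * as DMA coherent from the guest's point of view */
  qemu_fdt_setprop(fdt, nodename, "dma-coherent", NULL, 0);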

thanks
-- PMM

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 10:34     ` Christoffer Dall
  2016-06-27 12:30       ` Ard Biesheuvel
@ 2016-06-27 14:24       ` Alexander Graf
  2016-06-28 10:55       ` Laszlo Ersek
  2 siblings, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2016-06-27 14:24 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm



> Am 27.06.2016 um 12:34 schrieb Christoffer Dall <christoffer.dall@linaro.org>:
> 
>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> Hi,
>>> 
>>> I'm going to ask some stupid questions here...
>>> 
>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>>>> Hi all,
>>>> 
>>>> This old subject came up again in a discussion related to PCIe support
>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>>>> regions as cacheable is preventing us from reusing a significant slice
>>>> of the PCIe support infrastructure, and so I'd like to bring this up
>>>> again, perhaps just to reiterate why we're simply out of luck.
>>>> 
>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>>>> for emulated devices may be backed by memory that is mapped cacheable
>>>> by the host. Note that this has nothing to do with the device being
>>>> DMA coherent or not: in this case, we are dealing with regions that
>>>> are not memory from the POV of the guest, and it is reasonable for the
>>>> guest to assume that accesses to such a region are not visible to the
>>>> device before they hit the actual PCI MMIO window and are translated
>>>> into cycles on the PCI bus.
>>> 
>>> For the sake of completeness, why is this reasonable?
>> 
>> Because the whole point of accessing these regions is to communicate
>> with the device. It is common to use write combining mappings for
>> things like framebuffers to group writes before they hit the PCI bus,
>> but any caching just makes it more difficult for the driver state and
>> device state to remain synchronized.
>> 
>>> Is this how any real ARM system implementing PCI would actually work?
>> 
>> Yes.
>> 
>>>> That means that mapping such a region
>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
>>>> patches implementing this against the generic PCI stack in Tianocore
>>>> will be accepted by the maintainers.
>>>> 
>>>> Note that this issue not only affects framebuffers on PCI cards, it
>>>> also affects emulated USB host controllers (perhaps Alex can remind us
>>>> which one exactly?) and likely other emulated generic PCI devices as
>>>> well.
>>>> 
>>>> Since the issue exists only for emulated PCI devices whose MMIO
>>>> regions are backed by host memory, is there any way we can already
>>>> distinguish such memslots from ordinary ones? If we can, is there
>>>> anything we could do to treat these specially? Perhaps something like
>>>> using read-only memslots so we can at least trap guest writes instead
>>>> of having main memory going out of sync with the caches unnoticed? I
>>>> am just brainstorming here ...
>>> 
>>> I think the only sensible solution is to make sure that the guest and
>>> emulation mappings use the same memory type, either cached or
>>> non-cached, and we 'simply' have to find the best way to implement this.
>>> 
>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>> one way.
>>> 
>>> The other way is to use something like what you once wrote that rewrites
>>> stage-1 mappings to be cacheable, does that apply here ?
>>> 
>>> Do we have a clear picture of why we'd prefer one way over the other?
>> 
>> So first of all, let me reiterate that I could only find a single
>> instance in QEMU where a PCI MMIO region is backed by host memory,
>> which is vga-pci.c. I wonder of there are any other occurrences, but
>> if there aren't any, it makes much more sense to prohibit PCI BARs
>> backed by host memory rather than spend a lot of effort working around
>> it.
> 
> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
> 
> What is the proposed solution for someone shipping an ARM server and
> wishing to provide a graphical output for that server?

Well, there is at least one server that I know of that has PCI VGA built in ;).

I think he was more concerned about VMs than about real hardware.

> 
> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
> if that's not a supported real hardware case.  However, I don't see what
> would prevent someone from plugging a VGA adapter into the PCI slot on
> an ARM server, and people selling ARM servers probably want this to
> happen, I'm guessing.
> 
>> 
>> If we do decide to fix this, the best way would be to use uncached
>> attributes for the QEMU userland mapping, and force it uncached in the
>> guest via a stage 2 override (as Drews suggests). The only problem I
>> see here is that the host's kernel direct mapping has a cached alias
>> that we need to get rid of.
> 
> Do we have a way to accomplish that?
> 
> Will we run into a bunch of other problems if we begin punching holes in
> the direct mapping for regular RAM?

Yeah, and how do you deal with aliases on that memory? You'd also need to stop KSM from running on it, for example.
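
The KSM part at least has a simple knob. A sketch, assuming the region
is already mapped at 'buf':

  #include <stdio.h>
  #include <sys/mman.h>

  /* opt the range out of KSM so the kernel never merges or COWs these
   * pages behind the guest's back (reverses an earlier MADV_MERGEABLE) */
  if (madvise(buf, len, MADV_UNMERGEABLE) < 0)
      perror("madvise(MADV_UNMERGEABLE)");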

Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 13:57           ` Ard Biesheuvel
@ 2016-06-27 14:29             ` Alexander Graf
  2016-06-28 11:02               ` Laszlo Ersek
  2016-06-28 10:04             ` Christoffer Dall
  1 sibling, 1 reply; 32+ messages in thread
From: Alexander Graf @ 2016-06-27 14:29 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm



> Am 27.06.2016 um 15:57 schrieb Ard Biesheuvel <ard.biesheuvel@linaro.org>:
> 
>> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
>>>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>>>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I'm going to ask some stupid questions here...
>>>>>> 
>>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> This old subject came up again in a discussion related to PCIe support
>>>>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>>>>>>> regions as cacheable is preventing us from reusing a significant slice
>>>>>>> of the PCIe support infrastructure, and so I'd like to bring this up
>>>>>>> again, perhaps just to reiterate why we're simply out of luck.
>>>>>>> 
>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>>>>>>> for emulated devices may be backed by memory that is mapped cacheable
>>>>>>> by the host. Note that this has nothing to do with the device being
>>>>>>> DMA coherent or not: in this case, we are dealing with regions that
>>>>>>> are not memory from the POV of the guest, and it is reasonable for the
>>>>>>> guest to assume that accesses to such a region are not visible to the
>>>>>>> device before they hit the actual PCI MMIO window and are translated
>>>>>>> into cycles on the PCI bus.
>>>>>> 
>>>>>> For the sake of completeness, why is this reasonable?
>>>>> 
>>>>> Because the whole point of accessing these regions is to communicate
>>>>> with the device. It is common to use write combining mappings for
>>>>> things like framebuffers to group writes before they hit the PCI bus,
>>>>> but any caching just makes it more difficult for the driver state and
>>>>> device state to remain synchronized.
>>>>> 
>>>>>> Is this how any real ARM system implementing PCI would actually work?
>>>>> 
>>>>> Yes.
>>>>> 
>>>>>>> That means that mapping such a region
>>>>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
>>>>>>> patches implementing this against the generic PCI stack in Tianocore
>>>>>>> will be accepted by the maintainers.
>>>>>>> 
>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it
>>>>>>> also affects emulated USB host controllers (perhaps Alex can remind us
>>>>>>> which one exactly?) and likely other emulated generic PCI devices as
>>>>>>> well.
>>>>>>> 
>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO
>>>>>>> regions are backed by host memory, is there any way we can already
>>>>>>> distinguish such memslots from ordinary ones? If we can, is there
>>>>>>> anything we could do to treat these specially? Perhaps something like
>>>>>>> using read-only memslots so we can at least trap guest writes instead
>>>>>>> of having main memory going out of sync with the caches unnoticed? I
>>>>>>> am just brainstorming here ...
>>>>>> 
>>>>>> I think the only sensible solution is to make sure that the guest and
>>>>>> emulation mappings use the same memory type, either cached or
>>>>>> non-cached, and we 'simply' have to find the best way to implement this.
>>>>>> 
>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>>>>> one way.
>>>>>> 
>>>>>> The other way is to use something like what you once wrote that rewrites
>>>>>> stage-1 mappings to be cacheable, does that apply here ?
>>>>>> 
>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
>>>>> 
>>>>> So first of all, let me reiterate that I could only find a single
>>>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>>>> which is vga-pci.c. I wonder of there are any other occurrences, but
>>>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>>>> backed by host memory rather than spend a lot of effort working around
>>>>> it.
>>>> 
>>>> Right, ok.  So Marc's point during his KVM Forum talk was basically,
>>>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>>> 
>>> Yes. But nothing is preventing you currently from using that, and I
>>> think we should prefer crappy performance but correct operation over
>>> the current situation. So in general, we should either disallow PCI
>>> BARs backed by host memory, or emulate them, but never back them by a
>>> RAM memslot when running under ARM/KVM.
>> 
>> agreed, I just think that emulating accesses by trapping them is not
>> just slow, it's not really possible in practice and even if it is, it's
>> probably *unusably* slow.
> 
> Well, it would probably involve a lot of effort to implement emulation
> of instructions with multiple output registers, such as ldp/stp and
> register writeback. And indeed, trapping on each store instruction to
> the framebuffer is going to be sloooooowwwww.
> 
> So let's disregard that option for now ...
> 
>>> 
>>>> What is the proposed solution for someone shipping an ARM server and
>>>> wishing to provide a graphical output for that server?
>>> 
>>> The problem does not exist on bare metal. It is an implementation
>>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
>>> the view of the emulator in QEMU.
>>> 
>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
>>>> if that's not a supported real hardware case.  However, I don't see what
>>>> would prevent someone from plugging a VGA adapter into the PCI slot on
>>>> an ARM server, and people selling ARM servers probably want this to
>>>> happen, I'm guessing.
>>> 
>>> As I said, the problem does not exist on bare metal.
>>> 
>>>>> 
>>>>> If we do decide to fix this, the best way would be to use uncached
>>>>> attributes for the QEMU userland mapping, and force it uncached in the
>>>>> guest via a stage 2 override (as Drews suggests). The only problem I
>>>>> see here is that the host's kernel direct mapping has a cached alias
>>>>> that we need to get rid of.
>>>> 
>>>> Do we have a way to accomplish that?
>>>> 
>>>> Will we run into a bunch of other problems if we begin punching holes in
>>>> the direct mapping for regular RAM?
>>> 
>>> I think the policy up until now has been not to remap regions in the
>>> kernel direct mapping for the purposes of DMA, and I think by the same
>>> reasoning, it is not preferable for KVM either
>> 
>> I guess the difference is that from the (host) kernel's point of view
>> this is not DMA memory, but just regular RAM.  I just don't know enough
>> about the kernel's VM mappings to know what's involved here, but we
>> should find out somehow...
> 
> Whether it is DMA memory or not does not make a difference. The point
> is simply that arm64 maps all RAM owned by the kernel as cacheable,
> and remapping arbitrary ranges with different attributes is
> problematic, since it is also likely to involve splitting of regions,
> which is cumbersome with a mapping that is always live.
> 
> So instead, we'd have to reserve some system memory early on and
> remove it from the linear mapping, the complexity of which is more
> than we are probably prepared to put up with.
> 
> So if vga-pci.c is the only problematic device, for which a reasonable
> alternative exists (virtio-gpu), I think the only feasible solution is
> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> when running under KVM/ARM.

That's ok, if there is a viable alternative. So if we had working virtio-gpu support in OVMF, we could just disable the legacy vga device with kvm on arm altogether - it'd either crash your guest (unhandled opcode in mmio emulation) or give you broken graphics.

But first, someone would need to sit down and make virtio-gpu work in OVMF.


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 13:57           ` Ard Biesheuvel
  2016-06-27 14:29             ` Alexander Graf
@ 2016-06-28 10:04             ` Christoffer Dall
  2016-06-28 11:06               ` Laszlo Ersek
  1 sibling, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-06-28 10:04 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
> >> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >> > On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
> >> >> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I'm going to ask some stupid questions here...
> >> >> >
> >> >> > On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> >> >> >> Hi all,
> >> >> >>
> >> >> >> This old subject came up again in a discussion related to PCIe support
> >> >> >> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> >> >> >> regions as cacheable is preventing us from reusing a significant slice
> >> >> >> of the PCIe support infrastructure, and so I'd like to bring this up
> >> >> >> again, perhaps just to reiterate why we're simply out of luck.
> >> >> >>
> >> >> >> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> >> >> >> for emulated devices may be backed by memory that is mapped cacheable
> >> >> >> by the host. Note that this has nothing to do with the device being
> >> >> >> DMA coherent or not: in this case, we are dealing with regions that
> >> >> >> are not memory from the POV of the guest, and it is reasonable for the
> >> >> >> guest to assume that accesses to such a region are not visible to the
> >> >> >> device before they hit the actual PCI MMIO window and are translated
> >> >> >> into cycles on the PCI bus.
> >> >> >
> >> >> > For the sake of completeness, why is this reasonable?
> >> >> >
> >> >>
> >> >> Because the whole point of accessing these regions is to communicate
> >> >> with the device. It is common to use write combining mappings for
> >> >> things like framebuffers to group writes before they hit the PCI bus,
> >> >> but any caching just makes it more difficult for the driver state and
> >> >> device state to remain synchronized.
> >> >>
> >> >> > Is this how any real ARM system implementing PCI would actually work?
> >> >> >
> >> >>
> >> >> Yes.
> >> >>
> >> >> >> That means that mapping such a region
> >> >> >> cacheable is a strange thing to do, in fact, and it is unlikely that
> >> >> >> patches implementing this against the generic PCI stack in Tianocore
> >> >> >> will be accepted by the maintainers.
> >> >> >>
> >> >> >> Note that this issue not only affects framebuffers on PCI cards, it
> >> >> >> also affects emulated USB host controllers (perhaps Alex can remind us
> >> >> >> which one exactly?) and likely other emulated generic PCI devices as
> >> >> >> well.
> >> >> >>
> >> >> >> Since the issue exists only for emulated PCI devices whose MMIO
> >> >> >> regions are backed by host memory, is there any way we can already
> >> >> >> distinguish such memslots from ordinary ones? If we can, is there
> >> >> >> anything we could do to treat these specially? Perhaps something like
> >> >> >> using read-only memslots so we can at least trap guest writes instead
> >> >> >> of having main memory going out of sync with the caches unnoticed? I
> >> >> >> am just brainstorming here ...
> >> >> >
> >> >> > I think the only sensible solution is to make sure that the guest and
> >> >> > emulation mappings use the same memory type, either cached or
> >> >> > non-cached, and we 'simply' have to find the best way to implement this.
> >> >> >
> >> >> > As Drew suggested, forcing some S2 mappings to be non-cacheable is the
> >> >> > one way.
> >> >> >
> >> >> > The other way is to use something like what you once wrote that rewrites
> >> >> > stage-1 mappings to be cacheable, does that apply here ?
> >> >> >
> >> >> > Do we have a clear picture of why we'd prefer one way over the other?
> >> >> >
> >> >>
> >> >> So first of all, let me reiterate that I could only find a single
> >> >> instance in QEMU where a PCI MMIO region is backed by host memory,
> >> >> which is vga-pci.c. I wonder of there are any other occurrences, but
> >> >> if there aren't any, it makes much more sense to prohibit PCI BARs
> >> >> backed by host memory rather than spend a lot of effort working around
> >> >> it.
> >> >
> >> > Right, ok.  So Marc's point during his KVM Forum talk was basically,
> >> > don't use the legacy VGA adapter on ARM and use virtio graphics, right?
> >> >
> >>
> >> Yes. But nothing is preventing you currently from using that, and I
> >> think we should prefer crappy performance but correct operation over
> >> the current situation. So in general, we should either disallow PCI
> >> BARs backed by host memory, or emulate them, but never back them by a
> >> RAM memslot when running under ARM/KVM.
> >
> > agreed, I just think that emulating accesses by trapping them is not
> > just slow, it's not really possible in practice and even if it is, it's
> > probably *unusably* slow.
> >
> 
> Well, it would probably involve a lot of effort to implement emulation
> of instructions with multiple output registers, such as ldp/stp and
> register writeback. And indeed, trapping on each store instruction to
> the framebuffer is going to be sloooooowwwww.
> 
> So let's disregard that option for now ...
> 
> >>
> >> > What is the proposed solution for someone shipping an ARM server and
> >> > wishing to provide a graphical output for that server?
> >> >
> >>
> >> The problem does not exist on bare metal. It is an implementation
> >> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
> >> the view of the emulator in QEMU.
> >>
> >> > It feels strange to work around supporting PCI VGA adapters in ARM VMs,
> >> > if that's not a supported real hardware case.  However, I don't see what
> >> > would prevent someone from plugging a VGA adapter into the PCI slot on
> >> > an ARM server, and people selling ARM servers probably want this to
> >> > happen, I'm guessing.
> >> >
> >>
> >> As I said, the problem does not exist on bare metal.
> >>
> >> >>
> >> >> If we do decide to fix this, the best way would be to use uncached
> >> >> attributes for the QEMU userland mapping, and force it uncached in the
> >> >> guest via a stage 2 override (as Drews suggests). The only problem I
> >> >> see here is that the host's kernel direct mapping has a cached alias
> >> >> that we need to get rid of.
> >> >
> >> > Do we have a way to accomplish that?
> >> >
> >> > Will we run into a bunch of other problems if we begin punching holes in
> >> > the direct mapping for regular RAM?
> >> >
> >>
> >> I think the policy up until now has been not to remap regions in the
> >> kernel direct mapping for the purposes of DMA, and I think by the same
> >> reasoning, it is not preferable for KVM either
> >
> > I guess the difference is that from the (host) kernel's point of view
> > this is not DMA memory, but just regular RAM.  I just don't know enough
> > about the kernel's VM mappings to know what's involved here, but we
> > should find out somehow...
> >
> 
> Whether it is DMA memory or not does not make a difference. The point
> is simply that arm64 maps all RAM owned by the kernel as cacheable,
> and remapping arbitrary ranges with different attributes is
> problematic, since it is also likely to involve splitting of regions,
> which is cumbersome with a mapping that is always live.
> 
> So instead, we'd have to reserve some system memory early on and
> remove it from the linear mapping, the complexity of which is more
> than we are probably prepared to put up with.

Don't we have any existing frameworks for such things, like ion or
other things like that?  Not sure if these systems export anything to
userspace or even serve the purpose we want, but thought I'd throw it
out there.

> 
> So if vga-pci.c is the only problematic device, for which a reasonable
> alternative exists (virtio-gpu), I think the only feasible solution is
> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> when running under KVM/ARM.

It would be good if we could support vga-pci under KVM/ARM, but if
there's no other way than rewriting the arm64 kernel's memory mappings
completely, then probably we're stuck there, unfortunately.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 14:10         ` Peter Maydell
@ 2016-06-28 10:05           ` Christoffer Dall
  0 siblings, 0 replies; 32+ messages in thread
From: Christoffer Dall @ 2016-06-28 10:05 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, Laszlo Ersek, kvmarm

On Mon, Jun 27, 2016 at 03:10:20PM +0100, Peter Maydell wrote:
> On 27 June 2016 at 14:49, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Mon, Jun 27, 2016 at 02:15:29PM +0100, Peter Maydell wrote:
> >> I get the impression dma-coherent is the right thing to advertise
> >> anyway. Do you have the documentation to hand that specifies what
> >> "dma-coherent" means? The Documentation/devicetree docs in the
> >> kernel tree seem to rather unhelpfully define it as "Present if
> >> dma operations are coherent", which doesn't really clarify anything
> >> to me...
> >
> > It's ill-defined today, and the precise definition is an open question.
> > See replies to [1], which seems to have stalled as of [2].
> >
> > My view is that for arm/arm64 this should mean the device makes accesses
> > which are coherent with Inner Shareable Normal Inner-WB Outer-WB
> > attributes, as this is the functional de-facto semantics today, and
> > anything short of that is not well-defined or usable.
> 
> OK, so for any emulated device in QEMU we should specify
> dma-coherent by those rules. I think our only DMA devices
> in the virt board are the emulated PCI devices; dma-coherent
> here is a property of the pci-controller and applies to any
> device on it, right? Presumably this means that if the host
> pci-controller doesn't advertise itself as dma-coherent then
> we cannot do any PCI passthrough of host hardware?
> 
Someone suggested a while back that we could have a second PCI
controller matching the host properties for this purpose...

-Christoffer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 10:34     ` Christoffer Dall
  2016-06-27 12:30       ` Ard Biesheuvel
  2016-06-27 14:24       ` Alexander Graf
@ 2016-06-28 10:55       ` Laszlo Ersek
  2016-06-28 13:14         ` Ard Biesheuvel
  2016-06-28 15:23         ` Alexander Graf
  2 siblings, 2 replies; 32+ messages in thread
From: Laszlo Ersek @ 2016-06-28 10:55 UTC (permalink / raw)
  To: Christoffer Dall, Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, kvmarm

On 06/27/16 12:34, Christoffer Dall wrote:
> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:

>> So first of all, let me reiterate that I could only find a single
>> instance in QEMU where a PCI MMIO region is backed by host memory,
>> which is vga-pci.c. I wonder of there are any other occurrences, but
>> if there aren't any, it makes much more sense to prohibit PCI BARs
>> backed by host memory rather than spend a lot of effort working around
>> it.
> 
> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> don't use the legacy VGA adapter on ARM and use virtio graphics, right?

The EFI GOP (Graphics Output Protocol) abstraction provides two ways for
UEFI applications to access the display, and one way for a runtime OS to
inherit the display hardware from the firmware (without OS native drivers).

(a) For UEFI apps:
- direct framebuffer access
- Blt() (block transfer) member function

(b) For runtime OS:
- direct framebuffer access ("efifb" in Linux)

Virtio-gpu lacks a linear framebuffer by design. Therefore the above
methods are reduced to the following:

(c) UEFI apps can access virtio-gpu with:
- GOP.Blt() member function only

(d) The runtime guest OS can access the virtio-gpu device as-inherited
from the firmware (i.e., without native drivers) with:
- n/a.

Given that we expect all aarch64 OSes to include native virtio-gpu
drivers on their install media, (d) is actually not a problem. Whenever
the OS kernel runs, we expect to have no need for "efifb", ever. So
that's good.

The problem is (c). UEFI boot loaders would have to be taught to call
GOP.Blt() manually, whenever they need to display something. I'm not
sure about grub2's current status, but it is free software, so in theory
it should be doable. However, UEFI Windows boot loaders are proprietary
*and* they require direct framebuffer access (on x86 at least); they
don't work with Blt()-only. (I found some Microsoft presentations about
this earlier.)

So, virtio-gpu is an almost universal solution for the problem, but not
entirely. For any given GOP, offering Blt() *only* (i.e., not exposing a
linear framebuffer) conforms to the UEFI spec, but some boot loaders are
known to present further requirements (on x86 anyway).
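
To make the two methods concrete, this is roughly what they look like
from a UEFI application's point of view. A sketch: types and the Blt()
prototype are as in the UEFI spec, error handling is omitted, and the
direct path assumes a 32bpp BGRX pixel format:

  #include <Uefi.h>
  #include <Protocol/GraphicsOutput.h>

  VOID
  FillRect (
    IN EFI_GRAPHICS_OUTPUT_PROTOCOL   *Gop,
    IN EFI_GRAPHICS_OUTPUT_BLT_PIXEL  Color,
    IN UINTN                          X,
    IN UINTN                          Y,
    IN UINTN                          Width,
    IN UINTN                          Height
    )
  {
    UINT32  *FrameBuffer;
    UINTN   Stride, Row, Col;

    if (Gop->Mode->Info->PixelFormat != PixelBltOnly) {
      //
      // Direct framebuffer path: what "efifb" and some boot loaders
      // require; only possible if the GOP exposes a linear framebuffer.
      //
      FrameBuffer = (UINT32 *)(UINTN)Gop->Mode->FrameBufferBase;
      Stride      = Gop->Mode->Info->PixelsPerScanLine;
      for (Row = Y; Row < Y + Height; Row++) {
        for (Col = X; Col < X + Width; Col++) {
          FrameBuffer[Row * Stride + Col] = *(UINT32 *)&Color;
        }
      }
    } else {
      //
      // Blt() path: the only option with a Blt()-only GOP, such as a
      // prospective virtio-gpu one.
      //
      Gop->Blt (Gop, &Color, EfiBltVideoFill, 0, 0, X, Y, Width, Height, 0);
    }
  }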

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-27 14:29             ` Alexander Graf
@ 2016-06-28 11:02               ` Laszlo Ersek
  0 siblings, 0 replies; 32+ messages in thread
From: Laszlo Ersek @ 2016-06-28 11:02 UTC (permalink / raw)
  To: Alexander Graf, Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, kvmarm

On 06/27/16 16:29, Alexander Graf wrote:
> 
> 
>> Am 27.06.2016 um 15:57 schrieb Ard Biesheuvel <ard.biesheuvel@linaro.org>:
>>
>>> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
>>>>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>>>>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm going to ask some stupid questions here...
>>>>>>>
>>>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> This old subject came up again in a discussion related to PCIe support
>>>>>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>>>>>>>> regions as cacheable is preventing us from reusing a significant slice
>>>>>>>> of the PCIe support infrastructure, and so I'd like to bring this up
>>>>>>>> again, perhaps just to reiterate why we're simply out of luck.
>>>>>>>>
>>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>>>>>>>> for emulated devices may be backed by memory that is mapped cacheable
>>>>>>>> by the host. Note that this has nothing to do with the device being
>>>>>>>> DMA coherent or not: in this case, we are dealing with regions that
>>>>>>>> are not memory from the POV of the guest, and it is reasonable for the
>>>>>>>> guest to assume that accesses to such a region are not visible to the
>>>>>>>> device before they hit the actual PCI MMIO window and are translated
>>>>>>>> into cycles on the PCI bus.
>>>>>>>
>>>>>>> For the sake of completeness, why is this reasonable?
>>>>>>
>>>>>> Because the whole point of accessing these regions is to communicate
>>>>>> with the device. It is common to use write combining mappings for
>>>>>> things like framebuffers to group writes before they hit the PCI bus,
>>>>>> but any caching just makes it more difficult for the driver state and
>>>>>> device state to remain synchronized.
>>>>>>
>>>>>>> Is this how any real ARM system implementing PCI would actually work?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>> That means that mapping such a region
>>>>>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
>>>>>>>> patches implementing this against the generic PCI stack in Tianocore
>>>>>>>> will be accepted by the maintainers.
>>>>>>>>
>>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it
>>>>>>>> also affects emulated USB host controllers (perhaps Alex can remind us
>>>>>>>> which one exactly?) and likely other emulated generic PCI devices as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO
>>>>>>>> regions are backed by host memory, is there any way we can already
>>>>>>>> distinguish such memslots from ordinary ones? If we can, is there
>>>>>>>> anything we could do to treat these specially? Perhaps something like
>>>>>>>> using read-only memslots so we can at least trap guest writes instead
>>>>>>>> of having main memory going out of sync with the caches unnoticed? I
>>>>>>>> am just brainstorming here ...
>>>>>>>
>>>>>>> I think the only sensible solution is to make sure that the guest and
>>>>>>> emulation mappings use the same memory type, either cached or
>>>>>>> non-cached, and we 'simply' have to find the best way to implement this.
>>>>>>>
>>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>>>>>> one way.
>>>>>>>
>>>>>>> The other way is to use something like what you once wrote that rewrites
>>>>>>> stage-1 mappings to be cacheable, does that apply here ?
>>>>>>>
>>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
>>>>>>
>>>>>> So first of all, let me reiterate that I could only find a single
>>>>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>>>>> which is vga-pci.c. I wonder of there are any other occurrences, but
>>>>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>>>>> backed by host memory rather than spend a lot of effort working around
>>>>>> it.
>>>>>
>>>>> Right, ok.  So Marc's point during his KVM Forum talk was basically,
>>>>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>>>>
>>>> Yes. But nothing is preventing you currently from using that, and I
>>>> think we should prefer crappy performance but correct operation over
>>>> the current situation. So in general, we should either disallow PCI
>>>> BARs backed by host memory, or emulate them, but never back them by a
>>>> RAM memslot when running under ARM/KVM.
>>>
>>> agreed, I just think that emulating accesses by trapping them is not
>>> just slow, it's not really possible in practice and even if it is, it's
>>> probably *unusably* slow.
>>
>> Well, it would probably involve a lot of effort to implement emulation
>> of instructions with multiple output registers, such as ldp/stp and
>> register writeback. And indeed, trapping on each store instruction to
>> the framebuffer is going to be sloooooowwwww.
>>
>> So let's disregard that option for now ...
>>
>>>>
>>>>> What is the proposed solution for someone shipping an ARM server and
>>>>> wishing to provide a graphical output for that server?
>>>>
>>>> The problem does not exist on bare metal. It is an implementation
>>>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
>>>> the view of the emulator in QEMU.
>>>>
>>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
>>>>> if that's not a supported real hardware case.  However, I don't see what
>>>>> would prevent someone from plugging a VGA adapter into the PCI slot on
>>>>> an ARM server, and people selling ARM servers probably want this to
>>>>> happen, I'm guessing.
>>>>
>>>> As I said, the problem does not exist on bare metal.
>>>>
>>>>>>
>>>>>> If we do decide to fix this, the best way would be to use uncached
>>>>>> attributes for the QEMU userland mapping, and force it uncached in the
>>>>>> guest via a stage 2 override (as Drews suggests). The only problem I
>>>>>> see here is that the host's kernel direct mapping has a cached alias
>>>>>> that we need to get rid of.
>>>>>
>>>>> Do we have a way to accomplish that?
>>>>>
>>>>> Will we run into a bunch of other problems if we begin punching holes in
>>>>> the direct mapping for regular RAM?
>>>>
>>>> I think the policy up until now has been not to remap regions in the
>>>> kernel direct mapping for the purposes of DMA, and I think by the same
>>>> reasoning, it is not preferable for KVM either
>>>
>>> I guess the difference is that from the (host) kernel's point of view
>>> this is not DMA memory, but just regular RAM.  I just don't know enough
>>> about the kernel's VM mappings to know what's involved here, but we
>>> should find out somehow...
>>
>> Whether it is DMA memory or not does not make a difference. The point
>> is simply that arm64 maps all RAM owned by the kernel as cacheable,
>> and remapping arbitrary ranges with different attributes is
>> problematic, since it is also likely to involve splitting of regions,
>> which is cumbersome with a mapping that is always live.
>>
>> So instead, we'd have to reserve some system memory early on and
>> remove it from the linear mapping, the complexity of which is more
>> than we are probably prepared to put up with.
>>
>> So if vga-pci.c is the only problematic device, for which a reasonable
>> alternative exists (virtio-gpu), I think the only feasible solution is
>> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
>> when running under KVM/ARM.
> 
> That's ok, if there is a viable alternative. So if we had working virtio-gpu support in OVMF, we could just disable the legacy vga device with kvm on arm altogether - it'd either crash your guest (unhandled opcode in mmio emulation) or give you broken graphics.
> 
> But first, someone would need to sit down and make virtio-gpu work in OVMF.

I've offered to (attempt to) implement a GOP driver for virtio-gpu, to
be used by OvmfPkg and ArmVirtPkg, once the virtio-gpu bits become part
of the official virtio specification.

However, as I mentioned elsewhere in this thread, a GOP driver for
virtio-gpu could provide the Blt() kind of display output *only*.
Virtio-gpu (the device model) lacks a linear framebuffer by design,
hence no GOP can expose it. The GOP can only offer the Blt() member
function, which would internally turn the block transfer requests into
virtio-gpu requests.

Unfortunately, offering Blt() *only* is not good enough. Some UEFI boot
loaders (on x86 at least) depend on direct framebuffer access.

Let me put it this way: the UEFI spec describes a possibility for the
GOP implementor to expose a linear framebuffer for the display device if
there is one. If there is one, great; if there isn't, that's fine too,
the GOP specification allows it as well. It is the UEFI boot loaders
(well, some of them) that ultimately depend on the linear framebuffer,
which is what virtio-gpu (the device model) lacks.
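
In GOP terms, such a driver would report something like the following
in its mode structures (a sketch; the variable names are mine):

  #include <Uefi.h>
  #include <Protocol/GraphicsOutput.h>

  STATIC EFI_GRAPHICS_OUTPUT_MODE_INFORMATION  mGopModeInfo = {
    0,                      // Version
    1024,                   // HorizontalResolution
    768,                    // VerticalResolution
    PixelBltOnly,           // PixelFormat: no framebuffer to map
    { 0 },                  // PixelInformation (unused with PixelBltOnly)
    1024                    // PixelsPerScanLine
  };

  STATIC EFI_GRAPHICS_OUTPUT_PROTOCOL_MODE  mGopMode = {
    1,                      // MaxMode
    0,                      // Mode
    &mGopModeInfo,          // Info
    sizeof (mGopModeInfo),  // SizeOfInfo
    0,                      // FrameBufferBase: nothing to expose
    0                       // FrameBufferSize
  };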

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 10:04             ` Christoffer Dall
@ 2016-06-28 11:06               ` Laszlo Ersek
  2016-06-28 12:20                 ` Christoffer Dall
  0 siblings, 1 reply; 32+ messages in thread
From: Laszlo Ersek @ 2016-06-28 11:06 UTC (permalink / raw)
  To: Christoffer Dall, Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, kvmarm

On 06/28/16 12:04, Christoffer Dall wrote:
> On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
>> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
>>>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>>>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm going to ask some stupid questions here...
>>>>>>>
>>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> This old subject came up again in a discussion related to PCIe support
>>>>>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>>>>>>>> regions as cacheable is preventing us from reusing a significant slice
>>>>>>>> of the PCIe support infrastructure, and so I'd like to bring this up
>>>>>>>> again, perhaps just to reiterate why we're simply out of luck.
>>>>>>>>
>>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>>>>>>>> for emulated devices may be backed by memory that is mapped cacheable
>>>>>>>> by the host. Note that this has nothing to do with the device being
>>>>>>>> DMA coherent or not: in this case, we are dealing with regions that
>>>>>>>> are not memory from the POV of the guest, and it is reasonable for the
>>>>>>>> guest to assume that accesses to such a region are not visible to the
>>>>>>>> device before they hit the actual PCI MMIO window and are translated
>>>>>>>> into cycles on the PCI bus.
>>>>>>>
>>>>>>> For the sake of completeness, why is this reasonable?
>>>>>>>
>>>>>>
>>>>>> Because the whole point of accessing these regions is to communicate
>>>>>> with the device. It is common to use write combining mappings for
>>>>>> things like framebuffers to group writes before they hit the PCI bus,
>>>>>> but any caching just makes it more difficult for the driver state and
>>>>>> device state to remain synchronized.
>>>>>>
>>>>>>> Is this how any real ARM system implementing PCI would actually work?
>>>>>>>
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>> That means that mapping such a region
>>>>>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
>>>>>>>> patches implementing this against the generic PCI stack in Tianocore
>>>>>>>> will be accepted by the maintainers.
>>>>>>>>
>>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it
>>>>>>>> also affects emulated USB host controllers (perhaps Alex can remind us
>>>>>>>> which one exactly?) and likely other emulated generic PCI devices as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO
>>>>>>>> regions are backed by host memory, is there any way we can already
>>>>>>>> distinguish such memslots from ordinary ones? If we can, is there
>>>>>>>> anything we could do to treat these specially? Perhaps something like
>>>>>>>> using read-only memslots so we can at least trap guest writes instead
>>>>>>>> of having main memory going out of sync with the caches unnoticed? I
>>>>>>>> am just brainstorming here ...
>>>>>>>
>>>>>>> I think the only sensible solution is to make sure that the guest and
>>>>>>> emulation mappings use the same memory type, either cached or
>>>>>>> non-cached, and we 'simply' have to find the best way to implement this.
>>>>>>>
>>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>>>>>> one way.
>>>>>>>
>>>>>>> The other way is to use something like what you once wrote that rewrites
>>>>>>> stage-1 mappings to be cacheable, does that apply here ?
>>>>>>>
>>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
>>>>>>>
>>>>>>
>>>>>> So first of all, let me reiterate that I could only find a single
>>>>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>>>>> which is vga-pci.c. I wonder of there are any other occurrences, but
>>>>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>>>>> backed by host memory rather than spend a lot of effort working around
>>>>>> it.
>>>>>
>>>>> Right, ok.  So Marc's point during his KVM Forum talk was basically,
>>>>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>>>>>
>>>>
>>>> Yes. But nothing is preventing you currently from using that, and I
>>>> think we should prefer crappy performance but correct operation over
>>>> the current situation. So in general, we should either disallow PCI
>>>> BARs backed by host memory, or emulate them, but never back them by a
>>>> RAM memslot when running under ARM/KVM.
>>>
>>> agreed, I just think that emulating accesses by trapping them is not
>>> just slow, it's not really possible in practice and even if it is, it's
>>> probably *unusably* slow.
>>>
>>
>> Well, it would probably involve a lot of effort to implement emulation
>> of instructions with multiple output registers, such as ldp/stp and
>> register writeback. And indeed, trapping on each store instruction to
>> the framebuffer is going to be sloooooowwwww.
>>
>> So let's disregard that option for now ...
>>
>>>>
>>>>> What is the proposed solution for someone shipping an ARM server and
>>>>> wishing to provide a graphical output for that server?
>>>>>
>>>>
>>>> The problem does not exist on bare metal. It is an implementation
>>>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
>>>> the view of the emulator in QEMU.
>>>>
>>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
>>>>> if that's not a supported real hardware case.  However, I don't see what
>>>>> would prevent someone from plugging a VGA adapter into the PCI slot on
>>>>> an ARM server, and people selling ARM servers probably want this to
>>>>> happen, I'm guessing.
>>>>>
>>>>
>>>> As I said, the problem does not exist on bare metal.
>>>>
>>>>>>
>>>>>> If we do decide to fix this, the best way would be to use uncached
>>>>>> attributes for the QEMU userland mapping, and force it uncached in the
>>>>>> guest via a stage 2 override (as Drews suggests). The only problem I
>>>>>> see here is that the host's kernel direct mapping has a cached alias
>>>>>> that we need to get rid of.
>>>>>
>>>>> Do we have a way to accomplish that?
>>>>>
>>>>> Will we run into a bunch of other problems if we begin punching holes in
>>>>> the direct mapping for regular RAM?
>>>>>
>>>>
>>>> I think the policy up until now has been not to remap regions in the
>>>> kernel direct mapping for the purposes of DMA, and I think by the same
>>>> reasoning, it is not preferable for KVM either
>>>
>>> I guess the difference is that from the (host) kernel's point of view
>>> this is not DMA memory, but just regular RAM.  I just don't know enough
>>> about the kernel's VM mappings to know what's involved here, but we
>>> should find out somehow...
>>>
>>
>> Whether it is DMA memory or not does not make a difference. The point
>> is simply that arm64 maps all RAM owned by the kernel as cacheable,
>> and remapping arbitrary ranges with different attributes is
>> problematic, since it is also likely to involve splitting of regions,
>> which is cumbersome with a mapping that is always live.
>>
>> So instead, we'd have to reserve some system memory early on and
>> remove it from the linear mapping, the complexity of which is more
>> than we are probably prepared to put up with.
> 
> Don't we have any existing frameworks for such things, like ion or
> other things like that?  Not sure if these systems export anything to
> userspace or even serve the purpose we want, but thought I'd throw it
> out there.
> 
>>
>> So if vga-pci.c is the only problematic device, for which a reasonable
>> alternative exists (virtio-gpu), I think the only feasible solution is
>> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
>> when running under KVM/ARM.
> 
> It would be good if we could support vga-pci under KVM/ARM, but if
> there's no other way than rewriting the arm64 kernel's memory mappings
> completely, then probably we're stuck there, unfortunately.

It's been mentioned earlier that the specific combination of S1 and S2
mappings on aarch64 is actually an *architecture bug*. If we accept that
qualification, then we should realize our efforts here target finding a
*workaround*.

In your blog post
<http://www.linaro.org/blog/core-dump/on-the-performance-of-arm-virtualization/>,
you mention VHE ("Virtualization Host Extensions"). That's clearly a
sign of the architecture adapting to virt software needs.

Do you see any chance that the S1-S2 combinations too can be fixed in a
new revision of the architecture?

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 11:06               ` Laszlo Ersek
@ 2016-06-28 12:20                 ` Christoffer Dall
  2016-06-28 13:10                   ` Catalin Marinas
  0 siblings, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-06-28 12:20 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, kvmarm

On Tue, Jun 28, 2016 at 01:06:36PM +0200, Laszlo Ersek wrote:
> On 06/28/16 12:04, Christoffer Dall wrote:
> > On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
> >> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
> >>>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
> >>>>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm going to ask some stupid questions here...
> >>>>>>>
> >>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> This old subject came up again in a discussion related to PCIe support
> >>>>>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
> >>>>>>>> regions as cacheable is preventing us from reusing a significant slice
> >>>>>>>> of the PCIe support infrastructure, and so I'd like to bring this up
> >>>>>>>> again, perhaps just to reiterate why we're simply out of luck.
> >>>>>>>>
> >>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
> >>>>>>>> for emulated devices may be backed by memory that is mapped cacheable
> >>>>>>>> by the host. Note that this has nothing to do with the device being
> >>>>>>>> DMA coherent or not: in this case, we are dealing with regions that
> >>>>>>>> are not memory from the POV of the guest, and it is reasonable for the
> >>>>>>>> guest to assume that accesses to such a region are not visible to the
> >>>>>>>> device before they hit the actual PCI MMIO window and are translated
> >>>>>>>> into cycles on the PCI bus.
> >>>>>>>
> >>>>>>> For the sake of completeness, why is this reasonable?
> >>>>>>>
> >>>>>>
> >>>>>> Because the whole point of accessing these regions is to communicate
> >>>>>> with the device. It is common to use write combining mappings for
> >>>>>> things like framebuffers to group writes before they hit the PCI bus,
> >>>>>> but any caching just makes it more difficult for the driver state and
> >>>>>> device state to remain synchronized.
> >>>>>>
> >>>>>>> Is this how any real ARM system implementing PCI would actually work?
> >>>>>>>
> >>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>>> That means that mapping such a region
> >>>>>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
> >>>>>>>> patches implementing this against the generic PCI stack in Tianocore
> >>>>>>>> will be accepted by the maintainers.
> >>>>>>>>
> >>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it
> >>>>>>>> also affects emulated USB host controllers (perhaps Alex can remind us
> >>>>>>>> which one exactly?) and likely other emulated generic PCI devices as
> >>>>>>>> well.
> >>>>>>>>
> >>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO
> >>>>>>>> regions are backed by host memory, is there any way we can already
> >>>>>>>> distinguish such memslots from ordinary ones? If we can, is there
> >>>>>>>> anything we could do to treat these specially? Perhaps something like
> >>>>>>>> using read-only memslots so we can at least trap guest writes instead
> >>>>>>>> of having main memory going out of sync with the caches unnoticed? I
> >>>>>>>> am just brainstorming here ...
> >>>>>>>
> >>>>>>> I think the only sensible solution is to make sure that the guest and
> >>>>>>> emulation mappings use the same memory type, either cached or
> >>>>>>> non-cached, and we 'simply' have to find the best way to implement this.
> >>>>>>>
> >>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
> >>>>>>> one way.
> >>>>>>>
> >>>>>>> The other way is to use something like what you once wrote that rewrites
> >>>>>>> stage-1 mappings to be cacheable, does that apply here ?
> >>>>>>>
> >>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
> >>>>>>>
> >>>>>>
> >>>>>> So first of all, let me reiterate that I could only find a single
> >>>>>> instance in QEMU where a PCI MMIO region is backed by host memory,
> >>>>>> which is vga-pci.c. I wonder of there are any other occurrences, but
> >>>>>> if there aren't any, it makes much more sense to prohibit PCI BARs
> >>>>>> backed by host memory rather than spend a lot of effort working around
> >>>>>> it.
> >>>>>
> >>>>> Right, ok.  So Marc's point during his KVM Forum talk was basically,
> >>>>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
> >>>>>
> >>>>
> >>>> Yes. But nothing is preventing you currently from using that, and I
> >>>> think we should prefer crappy performance but correct operation over
> >>>> the current situation. So in general, we should either disallow PCI
> >>>> BARs backed by host memory, or emulate them, but never back them by a
> >>>> RAM memslot when running under ARM/KVM.
> >>>
> >>> agreed, I just think that emulating accesses by trapping them is not
> >>> just slow, it's not really possible in practice and even if it is, it's
> >>> probably *unusably* slow.
> >>>
> >>
> >> Well, it would probably involve a lot of effort to implement emulation
> >> of instructions with multiple output registers, such as ldp/stp and
> >> register writeback. And indeed, trapping on each store instruction to
> >> the framebuffer is going to be sloooooowwwww.
> >>
> >> So let's disregard that option for now ...
> >>
> >>>>
> >>>>> What is the proposed solution for someone shipping an ARM server and
> >>>>> wishing to provide a graphical output for that server?
> >>>>>
> >>>>
> >>>> The problem does not exist on bare metal. It is an implementation
> >>>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
> >>>> the view of the emulator in QEMU.
> >>>>
> >>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
> >>>>> if that's not a supported real hardware case.  However, I don't see what
> >>>>> would prevent someone from plugging a VGA adapter into the PCI slot on
> >>>>> an ARM server, and people selling ARM servers probably want this to
> >>>>> happen, I'm guessing.
> >>>>>
> >>>>
> >>>> As I said, the problem does not exist on bare metal.
> >>>>
> >>>>>>
> >>>>>> If we do decide to fix this, the best way would be to use uncached
> >>>>>> attributes for the QEMU userland mapping, and force it uncached in the
> >>>>>> guest via a stage 2 override (as Drews suggests). The only problem I
> >>>>>> see here is that the host's kernel direct mapping has a cached alias
> >>>>>> that we need to get rid of.
> >>>>>
> >>>>> Do we have a way to accomplish that?
> >>>>>
> >>>>> Will we run into a bunch of other problems if we begin punching holes in
> >>>>> the direct mapping for regular RAM?
> >>>>>
> >>>>
> >>>> I think the policy up until now has been not to remap regions in the
> >>>> kernel direct mapping for the purposes of DMA, and I think by the same
> >>>> reasoning, it is not preferable for KVM either
> >>>
> >>> I guess the difference is that from the (host) kernel's point of view
> >>> this is not DMA memory, but just regular RAM.  I just don't know enough
> >>> about the kernel's VM mappings to know what's involved here, but we
> >>> should find out somehow...
> >>>
> >>
> >> Whether it is DMA memory or not does not make a difference. The point
> >> is simply that arm64 maps all RAM owned by the kernel as cacheable,
> >> and remapping arbitrary ranges with different attributes is
> >> problematic, since it is also likely to involve splitting of regions,
> >> which is cumbersome with a mapping that is always live.
> >>
> >> So instead, we'd have to reserve some system memory early on and
> >> remove it from the linear mapping, the complexity of which is more
> >> than we are probably prepared to put up with.
> > 
> > Don't we have any existing frameworks for such things, like ion or
> > other things like that?  Not sure if these systems export anything to
> > userspace or even serve the purpose we want, but thought I'd throw it
> > out there.
> > 
> >>
> >> So if vga-pci.c is the only problematic device, for which a reasonable
> >> alternative exists (virtio-gpu), I think the only feasible solution is
> >> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> >> when running under KVM/ARM.
> > 
> > It would be good if we could support vga-pci under KVM/ARM, but if
> > there's no other way than rewriting the arm64 kernel's memory mappings
> > completely, then probably we're stuck there, unfortunately.
> 
> It's been mentioned earlier that the specific combination of S1 and S2
> mappings on aarch64 is actually an *architecture bug*. If we accept that
> qualification, then we should realize our efforts here target finding a
> *workaround*.
> 
> In your blog post
> <http://www.linaro.org/blog/core-dump/on-the-performance-of-arm-virtualization/>,
> you mention VHE ("Virtualization Host Extensions"). That's clearly a
> sign of the architecture adapting to virt software needs.
> 
> Do you see any chance that the S1-S2 combinations too can be fixed in a
> new revision of the architecture?
> 
I really can't speculate about this, I assume there are reasons for why
the architecture is defined in this particular way, but I haven't
investigated this aspect in any depth.

-Christoffer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 12:20                 ` Christoffer Dall
@ 2016-06-28 13:10                   ` Catalin Marinas
  2016-06-28 13:19                     ` Ard Biesheuvel
  0 siblings, 1 reply; 32+ messages in thread
From: Catalin Marinas @ 2016-06-28 13:10 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Ard Biesheuvel, Marc Zyngier, Laszlo Ersek, kvmarm

On Tue, Jun 28, 2016 at 02:20:43PM +0200, Christoffer Dall wrote:
> On Tue, Jun 28, 2016 at 01:06:36PM +0200, Laszlo Ersek wrote:
> > On 06/28/16 12:04, Christoffer Dall wrote:
> > > On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
> > >> So if vga-pci.c is the only problematic device, for which a reasonable
> > >> alternative exists (virtio-gpu), I think the only feasible solution is
> > >> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> > >> when running under KVM/ARM.
> > > 
> > > It would be good if we could support vga-pci under KVM/ARM, but if
> > > there's no other way than rewriting the arm64 kernel's memory mappings
> > > completely, then probably we're stuck there, unfortunately.

Just to be clear, the behaviour of mismatched memory attributes is
defined in the ARM ARM and so far Linux worked fine with such cacheable
vs non-cacheable (as long as only one of them is accessed *or* cache
maintenance is performed accordingly). I don't think the arm64 kernel
memory map needs to be rewritten.
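
As a concrete illustration of the latter, here is a minimal arm64
userspace sketch of "cache maintenance performed accordingly" (a
sketch only: it assumes Linux has enabled EL0 access to DC CIVAC via
SCTLR_EL1.UCI and hard-codes a 64-byte cache line; real code should
read CTR_EL0 instead):

  #include <stdint.h>
  #include <stddef.h>

  /* Clean+invalidate the range by VA so that data written through a
   * cacheable alias becomes visible to a non-cacheable observer. */
  static void clean_inval_range(void *addr, size_t len)
  {
      const uintptr_t line = 64;          /* assumed line size */
      uintptr_t p = (uintptr_t)addr & ~(line - 1);
      uintptr_t end = (uintptr_t)addr + len;

      for (; p < end; p += line)
          __asm__ volatile("dc civac, %0" : : "r"(p) : "memory");
      __asm__ volatile("dsb sy" : : : "memory");
  }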

> > It's been mentioned earlier that the specific combination of S1 and S2
> > mappings on aarch64 is actually an *architecture bug*. If we accept that
> > qualification, then we should realize our efforts here target finding a
> > *workaround*.

I haven't read this thread in detail but I doubt it's an architecture
bug. You might call it a missing feature.

> > In your blog post
> > <http://www.linaro.org/blog/core-dump/on-the-performance-of-arm-virtualization/>,
> > you mention VHE ("Virtualization Host Extensions"). That's clearly a
> > sign of the architecture adapting to virt software needs.
> > 
> > Do you see any chance that the S1-S2 combinations too can be fixed in a
> > new revision of the architecture?
> 
> I really can't speculate about this; I assume there are reasons why
> the architecture is defined in this particular way, but I haven't
> investigated this aspect in any depth.

In general, there are software issues with forcing cacheability at S2
when S1 requires non-cacheable transactions, with all the coherency
assumptions. The problem becomes even more complicated when memory
types, not just cacheability, are "upgraded". E.g. forcing S1 Device to
S2 Normal with consequences on memory ordering that the guest is not
aware of.

While there are potential, specific, hardware solutions, they can't be
"back-ported" to existing CPU implementation, so we need a solution in
software. *If* the only software solution has severe performance
implications and it is on a critical path, the architecture might be
improved in the future (like we did with VHE). But I don't think that's
the case here.

-- 
Catalin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 10:55       ` Laszlo Ersek
@ 2016-06-28 13:14         ` Ard Biesheuvel
  2016-06-28 13:32           ` Laszlo Ersek
  2016-06-28 15:23         ` Alexander Graf
  1 sibling, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-28 13:14 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Marc Zyngier, Catalin Marinas, kvmarm

On 28 June 2016 at 12:55, Laszlo Ersek <lersek@redhat.com> wrote:
> On 06/27/16 12:34, Christoffer Dall wrote:
>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>
>>> So first of all, let me reiterate that I could only find a single
>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>> which is vga-pci.c. I wonder if there are any other occurrences, but
>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>> backed by host memory rather than spend a lot of effort working around
>>> it.
>>
>> Right, ok.  So Marc's point during his KVM Forum talk was basically,
>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>
> The EFI GOP (Graphics Output Protocol) abstraction provides two ways for
> UEFI applications to access the display, and one way for a runtime OS to
> inherit the display hardware from the firmware (without OS native drivers).
>
> (a) For UEFI apps:
> - direct framebuffer access
> - Blt() (block transfer) member function
>
> (b) For runtime OS:
> - direct framebuffer access ("efifb" in Linux)
>
> Virtio-gpu lacks a linear framebuffer by design. Therefore the above
> methods are reduced to the following:
>
> (c) UEFI apps can access virtio-gpu with:
> - GOP.Blt() member function only
>
> (d) The runtime guest OS can access the virtio-gpu device as-inherited
> from the firmware (i.e., without native drivers) with:
> - n/a.
>
> Given that we expect all aarch64 OSes to include native virtio-gpu
> drivers on their install media, (d) is actually not a problem. Whenever
> the OS kernel runs, we expect to have no need for "efifb", ever. So
> that's good.
>
> The problem is (c). UEFI boot loaders would have to be taught to call
> GOP.Blt() manually, whenever they need to display something. I'm not
> sure about grub2's current status, but it is free software, so in theory
> it should be doable. However, UEFI Windows boot loaders are proprietary
> *and* they require direct framebuffer access (on x86 at least); they
> don't work with Blt()-only. (I found some Microsoft presentations about
> this earlier.)
>
> So, virtio-gpu is an almost universal solution for the problem, but not
> entirely. For any given GOP, offering Blt() *only* (i.e., not exposing a
> linear framebuffer) conforms to the UEFI spec, but some boot loaders are
> known to present further requirements (on x86 anyway).
>

Even if virtio-gpu did expose a linear framebuffer, it would likely
expose it as a PCI BAR, and we would be in the exact same situation.

The only way we can work around this is to emulate a DMA coherent
device that uses a framebuffer in system RAM. I looked at the PL111,
which is already supported both in EDK2 and the Linux kernel, and
would only require minor changes to support DMA coherent devices.
Unfortunately, we would not be able to advertise its presence when
running under ACPI, since it is not a PCI device.

In any case, reconciling software that requires a framebuffer with a
GPU emulation that does not expose one by design is going to be
problematic even without this issue. How is this supposed to work on
x86?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 13:10                   ` Catalin Marinas
@ 2016-06-28 13:19                     ` Ard Biesheuvel
  2016-06-28 13:25                       ` Catalin Marinas
  0 siblings, 1 reply; 32+ messages in thread
From: Ard Biesheuvel @ 2016-06-28 13:19 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Marc Zyngier, Laszlo Ersek, kvmarm

On 28 June 2016 at 15:10, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Tue, Jun 28, 2016 at 02:20:43PM +0200, Christoffer Dall wrote:
>> On Tue, Jun 28, 2016 at 01:06:36PM +0200, Laszlo Ersek wrote:
>> > On 06/28/16 12:04, Christoffer Dall wrote:
>> > > On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
>> > >> So if vga-pci.c is the only problematic device, for which a reasonable
>> > >> alternative exists (virtio-gpu), I think the only feasible solution is
>> > >> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
>> > >> when running under KVM/ARM.
>> > >
>> > > It would be good if we could support vga-pci under KVM/ARM, but if
>> > > there's no other way than rewriting the arm64 kernel's memory mappings
>> > > completely, then probably we're stuck there, unfortunately.
>
> Just to be clear, the behaviour of mismatched memory attributes is
> defined in the ARM ARM and so far Linux worked fine with such cacheable
> vs non-cacheable (as long as only one of them is accessed *or* cache
> maintenance is performed accordingly). I don't think the arm64 kernel
> memory map needs to be rewritten.
>

That would suggest that having an uncached userland mapping in QEMU
and an uncached kernel mapping in the guest would be ok as long as we
don't access the host kernel's cacheable alias?
In that case, Drew's approach would be feasible, and the
pci_register_bar() function in QEMU could be modified to force the
userland mapping and the stage2 mapping to 'device' [when running
under KVM/ARM] if it refers to a memslot that is backed by host
memory.
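
A rough sketch of where that check could live (illustration only:
pci_register_bar(), kvm_enabled() and memory_region_is_ram() are
existing QEMU interfaces, but kvm_arm_request_device_memslot() is a
hypothetical helper that would need to be added together with the
corresponding KVM/ARM support):

  /* sketch against hw/pci/pci.c */
  #include "hw/pci/pci.h"
  #include "exec/memory.h"
  #include "sysemu/kvm.h"

  void pci_register_bar(PCIDevice *pci_dev, int region_num,
                        uint8_t type, MemoryRegion *memory)
  {
      if (kvm_enabled() && memory_region_is_ram(memory)) {
          /* Hypothetical: request a Device (or Normal-NC) stage 2
           * mapping for this memslot and remap QEMU's own userland
           * mapping of the backing RAM as uncached as well. */
          kvm_arm_request_device_memslot(memory);
      }
      /* ... existing BAR registration continues unchanged ... */
  }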

-- 
Ard.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 13:19                     ` Ard Biesheuvel
@ 2016-06-28 13:25                       ` Catalin Marinas
  2016-06-28 14:02                         ` Andrew Jones
  0 siblings, 1 reply; 32+ messages in thread
From: Catalin Marinas @ 2016-06-28 13:25 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Laszlo Ersek, kvmarm

On Tue, Jun 28, 2016 at 03:19:14PM +0200, Ard Biesheuvel wrote:
> On 28 June 2016 at 15:10, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, Jun 28, 2016 at 02:20:43PM +0200, Christoffer Dall wrote:
> >> On Tue, Jun 28, 2016 at 01:06:36PM +0200, Laszlo Ersek wrote:
> >> > On 06/28/16 12:04, Christoffer Dall wrote:
> >> > > On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
> >> > >> So if vga-pci.c is the only problematic device, for which a reasonable
> >> > >> alternative exists (virtio-gpu), I think the only feasible solution is
> >> > >> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> >> > >> when running under KVM/ARM.
> >> > >
> >> > > It would be good if we could support vga-pci under KVM/ARM, but if
> >> > > there's no other way than rewriting the arm64 kernel's memory mappings
> >> > > completely, then probably we're stuck there, unfortunately.
> >
> > Just to be clear, the behaviour of mismatched memory attributes is
> > defined in the ARM ARM and so far Linux worked fine with such cacheable
> > vs non-cacheable (as long as only one of them is accessed *or* cache
> > maintenance is performed accordingly). I don't think the arm64 kernel
> > memory map needs to be rewritten.
> 
> That would suggest that having an uncached userland mapping in QEMU
> and an uncached kernel mapping in the guest would be ok as long as we
> don't access the host kernel's cacheable alias?

Yes, from an architecture perspective. Many framebuffer drivers already
work in a similar way and map the framebuffer memory in userspace as
non-cacheable. Of course, one difference is that the other agent
accessing the memory is a DMA device rather than the CPU.

> In that case, Drew's approach would be feasible, and the
> pci_register_bar() function in QEMU could be modified to force the
> userland mapping and the stage2 mapping to 'device' [when running
> under KVM/ARM] if it refers to a memslot that is backed by host
> memory.

Device or normal non-cacheable (depending on the unaligned access
requirements).

Since such memory is allocated by Qemu (rather than a kernel driver),
KVM would need to mark the pages as reserved so that they are not moved
around by the host kernel, especially since it would use the cacheable
alias.

Another issue is taking care of the host kernel merging adjacent vmas
since we only want to apply the attributes to a single region.

-- 
Catalin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 13:14         ` Ard Biesheuvel
@ 2016-06-28 13:32           ` Laszlo Ersek
  2016-06-29  7:12             ` Gerd Hoffmann
  0 siblings, 1 reply; 32+ messages in thread
From: Laszlo Ersek @ 2016-06-28 13:32 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Marc Zyngier, Catalin Marinas, Gerd Hoffmann, kvmarm

(adding Gerd)

On 06/28/16 15:14, Ard Biesheuvel wrote:

> In any case, reconciling software that requires a framebuffer with a
> GPU emulation that does not expose one by design is going to be
> problematic even without this issue. How is this supposed to work on
> x86?

AFAIK:

"virtio-gpu-pci" is the device model without the framebuffer. It is good
for secondary displays (i.e. those that you don't boot with, only use
after the guest kernel starts up).

"virtio-vga" is the same, but it also has the legacy VGA framebuffer,
hence it can be used for accommodating boot loaders. (Except it won't
work for aarch64 KVM guests, because of $SUBJECT.)

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 13:25                       ` Catalin Marinas
@ 2016-06-28 14:02                         ` Andrew Jones
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Jones @ 2016-06-28 14:02 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ard Biesheuvel, Marc Zyngier, Laszlo Ersek, kvmarm

On Tue, Jun 28, 2016 at 02:25:19PM +0100, Catalin Marinas wrote:
> On Tue, Jun 28, 2016 at 03:19:14PM +0200, Ard Biesheuvel wrote:
> > On 28 June 2016 at 15:10, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, Jun 28, 2016 at 02:20:43PM +0200, Christoffer Dall wrote:
> > >> On Tue, Jun 28, 2016 at 01:06:36PM +0200, Laszlo Ersek wrote:
> > >> > On 06/28/16 12:04, Christoffer Dall wrote:
> > >> > > On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
> > >> > >> So if vga-pci.c is the only problematic device, for which a reasonable
> > >> > >> alternative exists (virtio-gpu), I think the only feasible solution is
> > >> > >> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> > >> > >> when running under KVM/ARM.
> > >> > >
> > >> > > It would be good if we could support vga-pci under KVM/ARM, but if
> > >> > > there's no other way than rewriting the arm64 kernel's memory mappings
> > >> > > completely, then probably we're stuck there, unfortunately.
> > >
> > > Just to be clear, the behaviour of mismatched memory attributes is
> > > defined in the ARM ARM and so far Linux worked fine with such cacheable
> > > vs non-cacheable (as long as only one of them is accessed *or* cache
> > > maintenance is performed accordingly). I don't think the arm64 kernel
> > > memory map needs to be rewritten.
> > 
> > That would suggest that having an uncached userland mapping in QEMU
> > and an uncached kernel mapping in the guest would be ok as long as we
> > don't access the host kernel's cacheable alias?
> 
> Yes, from an architecture perspective. Many framebuffer drivers already
> work in a similar way and map the framebuffer memory in userspace as
> non-cacheable. Of course, one difference is that the other agent
> accessing the memory is a DMA device rather than the CPU.
> 
> > In that case, Drew's approach would be feasible, and the
> > pci_register_bar() function in QEMU could be modified to force the
> > userland mapping and the stage2 mapping to 'device' [when running
> > under KVM/ARM] if it refers to a memslot that is backed by host
> > memory.
> 
> Device or normal non-cacheable (depending on the unaligned access
> requirements).
> 
> Since such memory is allocated by Qemu (rather than a kernel driver),
> KVM would need to mark the pages as reserved so that they are not moved
> around by the host kernel, especially since it would use the cacheable
> alias.
> 
> Another issue is taking care of the host kernel merging adjacent vmas
> since we only want to apply the attributes to a single region.

I also experimented with dropping the KVM memslot flag in favor of an
madvise flag, allowing us to avoid vma merging problems. I forget if
I dropped that for any other reason than I thought it would generate
too much hate mail... Or maybe it was because I ended up needing to
add a new mprotect flag too, which I was quite sure would generate hate
mail, even though I found precedent for it:

  ef3d3246a0d0 powerpc/mm: Add Strong Access Ordering support

I have experimental patches (somewhere, they don't seem to be on this
laptop) for that stuff, but I can't recall what was still broken with
them in the end. I just recall it still didn't work, which is why I
never posted even a crazy RFC.
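
For concreteness, the userspace-facing side of that idea might look
roughly like the following (purely illustrative: PROT_NONCACHED and
MADV_NONCACHED do not exist in any kernel, and the calls would fail
as written; they stand in for whatever new flag we would end up
defining, by analogy with the powerpc PROT_SAO precedent above):

  #include <stddef.h>
  #include <sys/mman.h>

  #define PROT_NONCACHED  0x10    /* hypothetical, not a real flag */
  #define MADV_NONCACHED  0x40    /* hypothetical, not a real flag */

  /* QEMU would tag only the VMA backing the emulated BAR, so the
   * attribute is neither lost nor spread by VMA merging. */
  static int mark_bar_backing_noncached(void *buf, size_t len)
  {
      if (madvise(buf, len, MADV_NONCACHED) == 0)
          return 0;
      /* or, alternatively, a new mprotect() flag */
      return mprotect(buf, len, PROT_READ | PROT_WRITE | PROT_NONCACHED);
  }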

Thanks,
drew

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 10:55       ` Laszlo Ersek
  2016-06-28 13:14         ` Ard Biesheuvel
@ 2016-06-28 15:23         ` Alexander Graf
  1 sibling, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2016-06-28 15:23 UTC (permalink / raw)
  To: Laszlo Ersek, Christoffer Dall, Ard Biesheuvel
  Cc: Marc Zyngier, Catalin Marinas, kvmarm



On 06/28/2016 12:55 PM, Laszlo Ersek wrote:
> On 06/27/16 12:34, Christoffer Dall wrote:
>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>> So first of all, let me reiterate that I could only find a single
>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>> which is vga-pci.c. I wonder if there are any other occurrences, but
>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>> backed by host memory rather than spend a lot of effort working around
>>> it.
>> Right, ok.  So Marc's point during his KVM Forum talk was basically,
>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
> The EFI GOP (Graphics Output Protocol) abstraction provides two ways for
> UEFI applications to access the display, and one way for a runtime OS to
> inherit the display hardware from the firmware (without OS native drivers).
>
> (a) For UEFI apps:
> - direct framebuffer access
> - Blt() (block transfer) member function
>
> (b) For runtime OS:
> - direct framebuffer access ("efifb" in Linux)
>
> Virtio-gpu lacks a linear framebuffer by design. Therefore the above
> methods are reduced to the following:
>
> (c) UEFI apps can access virtio-gpu with:
> - GOP.Blt() member function only
>
> (d) The runtime guest OS can access the virtio-gpu device as-inherited
> from the firmware (i.e., without native drivers) with:
> - n/a.
>
> Given that we expect all aarch64 OSes to include native virtio-gpu
> drivers on their install media, (d) is actually not a problem. Whenever
> the OS kernel runs, we expect to have no need for "efifb", ever. So
> that's good.
>
> The problem is (c). UEFI boot loaders would have to be taught to call
> GOP.Blt() manually, whenever they need to display something. I'm not
> sure about grub2's current status, but it is free software, so in theory
> it should be doable. However, UEFI Windows boot loaders are proprietary

Yes, grub2 already ignores the frame buffer target address and instead 
uses Blt() operations only.
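
(For reference, "Blt() operations only" means the loader hands its
rendered pixels to the GOP member function instead of writing to
Mode->FrameBufferBase; a minimal sketch using the standard EDK2 GOP
definitions:)

  #include <Uefi.h>
  #include <Protocol/GraphicsOutput.h>

  //
  // Copy a client-rendered rectangle to the screen without touching
  // any linear framebuffer; works on virtio-gpu's Blt()-only GOP.
  //
  EFI_STATUS
  DrawRect (
    IN EFI_GRAPHICS_OUTPUT_PROTOCOL  *Gop,
    IN EFI_GRAPHICS_OUTPUT_BLT_PIXEL *Pixels,  // Width * Height pixels
    IN UINTN                         DestX,
    IN UINTN                         DestY,
    IN UINTN                         Width,
    IN UINTN                         Height
    )
  {
    return Gop->Blt (Gop, Pixels, EfiBltBufferToVideo,
                     0, 0, DestX, DestY, Width, Height, 0);
  }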

> *and* they require direct framebuffer access (on x86 at least); they
> don't work with Blt()-only. (I found some Microsoft presentations about
> this earlier.)
>
> So, virtio-gpu is an almost universal solution for the problem, but not
> entirely. For any given GOP, offering Blt() *only* (i.e., not exposing a
> linear framebuffer) conforms to the UEFI spec, but some boot loaders are
> known to present further requirements (on x86 anyway).

Well, I'm perfectly happy to ignore Windows on KVM for now, if that
gets us working, smooth Linux guest support :).


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: issues with emulated PCI MMIO backed by host memory under KVM
  2016-06-28 13:32           ` Laszlo Ersek
@ 2016-06-29  7:12             ` Gerd Hoffmann
  0 siblings, 0 replies; 32+ messages in thread
From: Gerd Hoffmann @ 2016-06-29  7:12 UTC (permalink / raw)
  To: Laszlo Ersek; +Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, kvmarm

On Di, 2016-06-28 at 15:32 +0200, Laszlo Ersek wrote:
> (adding Gerd)
> 
> On 06/28/16 15:14, Ard Biesheuvel wrote:
> 
> > In any case, reconciling software that requires a framebuffer with a
> > GPU emulation that does not expose one by design is going to be
> > problematic even without this issue. How is this supposed to work on
> > x86?
> 
> AFAIK:
> 
> "virtio-gpu-pci" is the device model without the framebuffer. It is good
> for secondary displays (i.e. those that you don't boot with, only use
> after the guest kernel starts up).
> 
> "virtio-vga" is the same, but it also has the legacy VGA framebuffer,
> hence it can be used for accommodating boot loaders. (Except it won't
> work for aarch64 KVM guests, because of $SUBJECT.)

Exactly.  virtio-vga is basically virtio-gpu-pci + stdvga combined.
Power-on default is vga mode.  It switches into virtio mode when the
guest configures an output using virtio commands.  It switches back to
vga mode on reset.

You can get a simple framebuffer by using the stdvga part of the device.
QemuVideoDxe does exactly that.  The Linux kernel switches from vga mode
(efifb) to virtio mode when the virtio-gpu KMS driver loads.

Of course virtio-vga in vga mode has exactly the same cache coherency
issues as stdvga on ARM.  So, once the Linux kernel with the virtio-gpu
driver is up and running everything is fine, but how to handle early
bootloader display isn't solved yet.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-06-29  7:07 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-24 14:04 issues with emulated PCI MMIO backed by host memory under KVM Ard Biesheuvel
2016-06-24 14:57 ` Andrew Jones
2016-06-27  8:17   ` Marc Zyngier
2016-06-24 18:16 ` Ard Biesheuvel
2016-06-25  7:15   ` Alexander Graf
2016-06-25  7:19 ` Alexander Graf
2016-06-27  8:11   ` Marc Zyngier
2016-06-27  9:16 ` Christoffer Dall
2016-06-27  9:47   ` Ard Biesheuvel
2016-06-27 10:34     ` Christoffer Dall
2016-06-27 12:30       ` Ard Biesheuvel
2016-06-27 13:35         ` Christoffer Dall
2016-06-27 13:57           ` Ard Biesheuvel
2016-06-27 14:29             ` Alexander Graf
2016-06-28 11:02               ` Laszlo Ersek
2016-06-28 10:04             ` Christoffer Dall
2016-06-28 11:06               ` Laszlo Ersek
2016-06-28 12:20                 ` Christoffer Dall
2016-06-28 13:10                   ` Catalin Marinas
2016-06-28 13:19                     ` Ard Biesheuvel
2016-06-28 13:25                       ` Catalin Marinas
2016-06-28 14:02                         ` Andrew Jones
2016-06-27 14:24       ` Alexander Graf
2016-06-28 10:55       ` Laszlo Ersek
2016-06-28 13:14         ` Ard Biesheuvel
2016-06-28 13:32           ` Laszlo Ersek
2016-06-29  7:12             ` Gerd Hoffmann
2016-06-28 15:23         ` Alexander Graf
2016-06-27 13:15     ` Peter Maydell
2016-06-27 13:49       ` Mark Rutland
2016-06-27 14:10         ` Peter Maydell
2016-06-28 10:05           ` Christoffer Dall
