linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* Memory regions and VMAs across architectures
@ 2019-11-08 11:19 Christoffer Dall
  2019-11-08 13:59 ` Ard Biesheuvel
  2019-11-20  3:44 ` Sean Christopherson
  0 siblings, 2 replies; 6+ messages in thread
From: Christoffer Dall @ 2019-11-08 11:19 UTC (permalink / raw)
  To: kvm
  Cc: Marc Zyngier, sean.j.christopherson, borntraeger, Ard Biesheuvel,
	Paolo Bonzini, kvmarm, linux-arm-kernel

Hi,

I had a look at our relatively complicated logic in
kvm_arch_prepare_memory_region(), and was wondering if there was room to
unify some of this handling between architectures.

(If you haven't seen our implementation, you can find it in
virt/kvm/arm/mmu.c, and it has lovely ASCII art!)

I then had a look at the x86 code, but that doesn't actually do anything
when creating memory regions, which makes me wonder why the arhitectures
differ in this aspect.

The reason we added the logic that we have for arm/arm64 is that we
don't really want to take faults for I/O accesses.  I'm not actually
sure if this is a corretness thing, or an optimization effort, and the
original commit message doesn't really explain.  Ard, you wrote that
code, do you recall the details?

In any case, what we do is to check for each VMA backing a memslot, we
check if the memslot flags and vma flags are a reasonable match, and we
try to detect I/O mappings by looking for the VM_PFNMAP flag on the VMA
and pre-populate stage 2 page tables (our equivalent of EPT/NPT/...).
However, there are some things which are not clear to me:

First, what prevents user space from messing around with the VMAs after
kvm_arch_prepare_memory_region() completes?  If nothing, then what is
the value of the cheks we perform wrt. to VMAs?

Second, why would arm/arm64 need special handling for I/O mappings
compared to other architectures, and how is this dealt with for
x86/s390/power/... ?


Thanks,

    Christoffer

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Memory regions and VMAs across architectures
  2019-11-08 11:19 Memory regions and VMAs across architectures Christoffer Dall
@ 2019-11-08 13:59 ` Ard Biesheuvel
  2019-11-20  3:44 ` Sean Christopherson
  1 sibling, 0 replies; 6+ messages in thread
From: Ard Biesheuvel @ 2019-11-08 13:59 UTC (permalink / raw)
  To: Christoffer Dall, kvm
  Cc: Marc Zyngier, sean.j.christopherson, borntraeger, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On 11/8/19 12:19 PM, Christoffer Dall wrote:
> Hi,
>
> I had a look at our relatively complicated logic in
> kvm_arch_prepare_memory_region(), and was wondering if there was room to
> unify some of this handling between architectures.
>
> (If you haven't seen our implementation, you can find it in
> virt/kvm/arm/mmu.c, and it has lovely ASCII art!)
>
> I then had a look at the x86 code, but that doesn't actually do anything
> when creating memory regions, which makes me wonder why the arhitectures
> differ in this aspect.
>
> The reason we added the logic that we have for arm/arm64 is that we
> don't really want to take faults for I/O accesses.  I'm not actually
> sure if this is a corretness thing, or an optimization effort, and the
> original commit message doesn't really explain.  Ard, you wrote that
> code, do you recall the details?
>

I have a vague recollection of implementing execution from read-only
guest memory in order to support execute-in-place from emulated NOR
flash in UEFI, and going down a rabbit hole debugging random, seemingly
unrelated crashes in the host which turned out to be caused by the zero
page getting corrupted because it was mapped read-write in the guest to
back uninitialized regions of the NOR flash.

That doesn't quite answer your question, though - I think it was just an
optimization ...

> In any case, what we do is to check for each VMA backing a memslot, we
> check if the memslot flags and vma flags are a reasonable match, and we
> try to detect I/O mappings by looking for the VM_PFNMAP flag on the VMA
> and pre-populate stage 2 page tables (our equivalent of EPT/NPT/...).
> However, there are some things which are not clear to me:
>
> First, what prevents user space from messing around with the VMAs after
> kvm_arch_prepare_memory_region() completes?  If nothing, then what is
> the value of the cheks we perform wrt. to VMAs?
>
> Second, why would arm/arm64 need special handling for I/O mappings
> compared to other architectures, and how is this dealt with for
> x86/s390/power/... ?
>
>
> Thanks,
>
>      Christoffer
>

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Memory regions and VMAs across architectures
  2019-11-08 11:19 Memory regions and VMAs across architectures Christoffer Dall
  2019-11-08 13:59 ` Ard Biesheuvel
@ 2019-11-20  3:44 ` Sean Christopherson
  2019-11-20 11:52   ` Christoffer Dall
  1 sibling, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2019-11-20  3:44 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: kvm, Marc Zyngier, borntraeger, Ard Biesheuvel, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On Fri, Nov 08, 2019 at 12:19:20PM +0100, Christoffer Dall wrote:
> Hi,
> 
> I had a look at our relatively complicated logic in
> kvm_arch_prepare_memory_region(), and was wondering if there was room to
> unify some of this handling between architectures.
> 
> (If you haven't seen our implementation, you can find it in
> virt/kvm/arm/mmu.c, and it has lovely ASCII art!)
> 
> I then had a look at the x86 code, but that doesn't actually do anything
> when creating memory regions, which makes me wonder why the arhitectures
> differ in this aspect.
> 
> The reason we added the logic that we have for arm/arm64 is that we
> don't really want to take faults for I/O accesses.  I'm not actually
> sure if this is a corretness thing, or an optimization effort, and the
> original commit message doesn't really explain.  Ard, you wrote that
> code, do you recall the details?
> 
> In any case, what we do is to check for each VMA backing a memslot, we
> check if the memslot flags and vma flags are a reasonable match, and we
> try to detect I/O mappings by looking for the VM_PFNMAP flag on the VMA
> and pre-populate stage 2 page tables (our equivalent of EPT/NPT/...).
> However, there are some things which are not clear to me:
> 
> First, what prevents user space from messing around with the VMAs after
> kvm_arch_prepare_memory_region() completes?  If nothing, then what is
> the value of the cheks we perform wrt. to VMAs?

Arm's prepare_memory_region() holds mmap_sem and mmu_lock while processing
the VMAs and populating the stage 2 page tables.  Holding mmap_sem prevents
the VMAs from being invalidated while the stage 2 tables are populated,
e.g. prevents racing with the mmu notifier.  The VMAs could be modified
after prepare_memory_region(), but the mmu notifier will ensure they are
unmapped from stage2 prior the the host change taking effect.  So I think
you're safe (famous last words).

> Second, why would arm/arm64 need special handling for I/O mappings
> compared to other architectures, and how is this dealt with for
> x86/s390/power/... ?

As Ard mentioned, it looks like an optimization.  The "passthrough"
part from the changelog implies that VM_PFNMAP memory regions are exclusive
to the guest.  Mapping the entire thing would be a nice boot optimization
as it would save taking page faults on every page of the MMIO region.

As for how this is different from other archs... at least on x86, VM_PFNMAP
isn't guaranteed to be passthrough or even MMIO, e.g. prefaulting the
pages may actually trigger allocation, and remapping the addresses could be
flat out wrong.


  commit 8eef91239e57d2e932e7470879c9a504d5494ebb
  Author: Ard Biesheuvel <ard.biesheuvel@linaro.org>
  Date:   Fri Oct 10 17:00:32 2014 +0200

    arm/arm64: KVM: map MMIO regions at creation time

    There is really no point in faulting in memory regions page by page
    if they are not backed by demand paged system RAM but by a linear
    passthrough mapping of a host MMIO region. So instead, detect such
    regions at setup time and install the mappings for the backing all
    at once.



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Memory regions and VMAs across architectures
  2019-11-20  3:44 ` Sean Christopherson
@ 2019-11-20 11:52   ` Christoffer Dall
  2019-11-20 15:28     ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: Christoffer Dall @ 2019-11-20 11:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Marc Zyngier, borntraeger, Ard Biesheuvel, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On Tue, Nov 19, 2019 at 07:44:48PM -0800, Sean Christopherson wrote:
> On Fri, Nov 08, 2019 at 12:19:20PM +0100, Christoffer Dall wrote:
> > Hi,
> > 
> > I had a look at our relatively complicated logic in
> > kvm_arch_prepare_memory_region(), and was wondering if there was room to
> > unify some of this handling between architectures.
> > 
> > (If you haven't seen our implementation, you can find it in
> > virt/kvm/arm/mmu.c, and it has lovely ASCII art!)
> > 
> > I then had a look at the x86 code, but that doesn't actually do anything
> > when creating memory regions, which makes me wonder why the arhitectures
> > differ in this aspect.
> > 
> > The reason we added the logic that we have for arm/arm64 is that we
> > don't really want to take faults for I/O accesses.  I'm not actually
> > sure if this is a corretness thing, or an optimization effort, and the
> > original commit message doesn't really explain.  Ard, you wrote that
> > code, do you recall the details?
> > 
> > In any case, what we do is to check for each VMA backing a memslot, we
> > check if the memslot flags and vma flags are a reasonable match, and we
> > try to detect I/O mappings by looking for the VM_PFNMAP flag on the VMA
> > and pre-populate stage 2 page tables (our equivalent of EPT/NPT/...).
> > However, there are some things which are not clear to me:
> > 
> > First, what prevents user space from messing around with the VMAs after
> > kvm_arch_prepare_memory_region() completes?  If nothing, then what is
> > the value of the cheks we perform wrt. to VMAs?
> 
> Arm's prepare_memory_region() holds mmap_sem and mmu_lock while processing
> the VMAs and populating the stage 2 page tables.  Holding mmap_sem prevents
> the VMAs from being invalidated while the stage 2 tables are populated,
> e.g. prevents racing with the mmu notifier.  The VMAs could be modified
> after prepare_memory_region(), but the mmu notifier will ensure they are
> unmapped from stage2 prior the the host change taking effect.  So I think
> you're safe (famous last words).
> 

So we for example check:

	writeable = !(memslot->falgs & KVM_MEM_READONLY);
	if (writeable && !(vma->vm_flags & VM_WRITE))
		return -EPERM;

And yes, user space can then unmap the VMAs and MMU notifiers will
unmap the stage 2 entries, but user space can then create a new
read-only VMA covering the area of the memslot and the fault-handling
path will have to deal with this same check later.  Only, the fault
handling path, via gfn_to_pfn_prot(), returns an address based on an
entirely different set of mechanics, than our prepare_memory_region,
which I think indicates we are doing something wrong somewhere, and we
should have a common path for faulting things in, for I/O, both if we do
this up-front or if we do this at fault time.


> > Second, why would arm/arm64 need special handling for I/O mappings
> > compared to other architectures, and how is this dealt with for
> > x86/s390/power/... ?
> 
> As Ard mentioned, it looks like an optimization.  The "passthrough"
> part from the changelog implies that VM_PFNMAP memory regions are exclusive
> to the guest.  Mapping the entire thing would be a nice boot optimization
> as it would save taking page faults on every page of the MMIO region.
> 
> As for how this is different from other archs... at least on x86, VM_PFNMAP
> isn't guaranteed to be passthrough or even MMIO, e.g. prefaulting the
> pages may actually trigger allocation, and remapping the addresses could be
> flat out wrong.

What does VM_PFNMAP mean on x86?  I didn't think we were relying on
anything architecture specific in their meaning in the arm code, and I
thought the VM_PFNMAP was a generic mm flag with generic mm meaning,
but I could be wrong here?

Is there any valid semantics for creating a memslot backed by a
VM_PFNMAP on x86, and if so, what are those?

Similarly, if you do map a device region straight to the guest on x86,
how is that handled?  (A pointer to the right place in the myriad of EPT
and shadow code in x86 would be much appreciated.)


Thanks!

    Christoffer

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Memory regions and VMAs across architectures
  2019-11-20 11:52   ` Christoffer Dall
@ 2019-11-20 15:28     ` Sean Christopherson
  2019-11-21  9:40       ` Christoffer Dall
  0 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2019-11-20 15:28 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: kvm, Marc Zyngier, borntraeger, Ard Biesheuvel, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On Wed, Nov 20, 2019 at 12:52:16PM +0100, Christoffer Dall wrote:
> On Tue, Nov 19, 2019 at 07:44:48PM -0800, Sean Christopherson wrote:
> > On Fri, Nov 08, 2019 at 12:19:20PM +0100, Christoffer Dall wrote:
> > > First, what prevents user space from messing around with the VMAs after
> > > kvm_arch_prepare_memory_region() completes?  If nothing, then what is
> > > the value of the cheks we perform wrt. to VMAs?
> > 
> > Arm's prepare_memory_region() holds mmap_sem and mmu_lock while processing
> > the VMAs and populating the stage 2 page tables.  Holding mmap_sem prevents
> > the VMAs from being invalidated while the stage 2 tables are populated,
> > e.g. prevents racing with the mmu notifier.  The VMAs could be modified
> > after prepare_memory_region(), but the mmu notifier will ensure they are
> > unmapped from stage2 prior the the host change taking effect.  So I think
> > you're safe (famous last words).
> > 
> 
> So we for example check:
> 
> 	writeable = !(memslot->falgs & KVM_MEM_READONLY);
> 	if (writeable && !(vma->vm_flags & VM_WRITE))
> 		return -EPERM;
> 
> And yes, user space can then unmap the VMAs and MMU notifiers will
> unmap the stage 2 entries, but user space can then create a new
> read-only VMA covering the area of the memslot and the fault-handling
> path will have to deal with this same check later.Only, the fault
> handling path, via gfn_to_pfn_prot(), returns an address based on an
> entirely different set of mechanics, than our prepare_memory_region,
> which I think indicates we are doing something wrong somewhere, and we
> should have a common path for faulting things in, for I/O, both if we do
> this up-front or if we do this at fault time.

Unconditionally interpreting vm_pgoff as a physical address does not seem
correct.  There are cases where that might be correct, e.g. if the backing
(virtual) file is a flat representation of the address space, which appears
to be the case on some architectures, e.g. for PCI handling.  But even then
there should be some confirmation that the VMA is actually associated with
such a file, otherwise KVM is at the mercy of userspace to do the right
thing (unless there are other guarantees on arm I am unaware of).

> > > Second, why would arm/arm64 need special handling for I/O mappings
> > > compared to other architectures, and how is this dealt with for
> > > x86/s390/power/... ?
> > 
> > As Ard mentioned, it looks like an optimization.  The "passthrough"
> > part from the changelog implies that VM_PFNMAP memory regions are exclusive
> > to the guest.  Mapping the entire thing would be a nice boot optimization
> > as it would save taking page faults on every page of the MMIO region.
> > 
> > As for how this is different from other archs... at least on x86, VM_PFNMAP
> > isn't guaranteed to be passthrough or even MMIO, e.g. prefaulting the
> > pages may actually trigger allocation, and remapping the addresses could be
> > flat out wrong.
> 
> What does VM_PFNMAP mean on x86?  I didn't think we were relying on
> anything architecture specific in their meaning in the arm code, and I
> thought the VM_PFNMAP was a generic mm flag with generic mm meaning,
> but I could be wrong here?

No, you're correct, VM_PFNMAP is a generic flag that state the VMA doesn't
have an associated struct page and is being managed directly by something
other than the core mmu.

But not having a struct page doesn't guarantee that the PFN is backed by
MMIO, or that it is exclusive to the guest (although in practice this is
probably the case 99.9999% of the time).  E.g. x86 supports having guest
memory backed by regular ram that is hidden from the host kernel via
'mem=', which will show up as VM_PFNMAP.

> Is there any valid semantics for creating a memslot backed by a
> VM_PFNMAP on x86, and if so, what are those?
> 
> Similarly, if you do map a device region straight to the guest on x86,
> how is that handled?  (A pointer to the right place in the myriad of EPT
> and shadow code in x86 would be much appreciated.)

There is no special handling in x86 for VM_PFNMAP memory, we rely on KVM's
generic __gfn_to_pfn_memslot() to retrieve the PFN on demand, and use
mmu_notifier_seq to ensure the stale PFNs (invalidated in the host) aren't
inserted into the guest page tables.  Effectively the same thing arm does,
sans the prepare_memory_region() shenanigans.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Memory regions and VMAs across architectures
  2019-11-20 15:28     ` Sean Christopherson
@ 2019-11-21  9:40       ` Christoffer Dall
  0 siblings, 0 replies; 6+ messages in thread
From: Christoffer Dall @ 2019-11-21  9:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Marc Zyngier, borntraeger, Ard Biesheuvel, Paolo Bonzini,
	kvmarm, linux-arm-kernel

On Wed, Nov 20, 2019 at 07:28:07AM -0800, Sean Christopherson wrote:
> On Wed, Nov 20, 2019 at 12:52:16PM +0100, Christoffer Dall wrote:
> > On Tue, Nov 19, 2019 at 07:44:48PM -0800, Sean Christopherson wrote:
> > > On Fri, Nov 08, 2019 at 12:19:20PM +0100, Christoffer Dall wrote:
> > > > First, what prevents user space from messing around with the VMAs after
> > > > kvm_arch_prepare_memory_region() completes?  If nothing, then what is
> > > > the value of the cheks we perform wrt. to VMAs?
> > > 
> > > Arm's prepare_memory_region() holds mmap_sem and mmu_lock while processing
> > > the VMAs and populating the stage 2 page tables.  Holding mmap_sem prevents
> > > the VMAs from being invalidated while the stage 2 tables are populated,
> > > e.g. prevents racing with the mmu notifier.  The VMAs could be modified
> > > after prepare_memory_region(), but the mmu notifier will ensure they are
> > > unmapped from stage2 prior the the host change taking effect.  So I think
> > > you're safe (famous last words).
> > > 
> > 
> > So we for example check:
> > 
> > 	writeable = !(memslot->falgs & KVM_MEM_READONLY);
> > 	if (writeable && !(vma->vm_flags & VM_WRITE))
> > 		return -EPERM;
> > 
> > And yes, user space can then unmap the VMAs and MMU notifiers will
> > unmap the stage 2 entries, but user space can then create a new
> > read-only VMA covering the area of the memslot and the fault-handling
> > path will have to deal with this same check later.Only, the fault
> > handling path, via gfn_to_pfn_prot(), returns an address based on an
> > entirely different set of mechanics, than our prepare_memory_region,
> > which I think indicates we are doing something wrong somewhere, and we
> > should have a common path for faulting things in, for I/O, both if we do
> > this up-front or if we do this at fault time.
> 
> Unconditionally interpreting vm_pgoff as a physical address does not seem
> correct.  There are cases where that might be correct, e.g. if the backing
> (virtual) file is a flat representation of the address space, which appears
> to be the case on some architectures, e.g. for PCI handling.  But even then
> there should be some confirmation that the VMA is actually associated with
> such a file, otherwise KVM is at the mercy of userspace to do the right
> thing (unless there are other guarantees on arm I am unaware of).
> 
> > > > Second, why would arm/arm64 need special handling for I/O mappings
> > > > compared to other architectures, and how is this dealt with for
> > > > x86/s390/power/... ?
> > > 
> > > As Ard mentioned, it looks like an optimization.  The "passthrough"
> > > part from the changelog implies that VM_PFNMAP memory regions are exclusive
> > > to the guest.  Mapping the entire thing would be a nice boot optimization
> > > as it would save taking page faults on every page of the MMIO region.
> > > 
> > > As for how this is different from other archs... at least on x86, VM_PFNMAP
> > > isn't guaranteed to be passthrough or even MMIO, e.g. prefaulting the
> > > pages may actually trigger allocation, and remapping the addresses could be
> > > flat out wrong.
> > 
> > What does VM_PFNMAP mean on x86?  I didn't think we were relying on
> > anything architecture specific in their meaning in the arm code, and I
> > thought the VM_PFNMAP was a generic mm flag with generic mm meaning,
> > but I could be wrong here?
> 
> No, you're correct, VM_PFNMAP is a generic flag that state the VMA doesn't
> have an associated struct page and is being managed directly by something
> other than the core mmu.
> 
> But not having a struct page doesn't guarantee that the PFN is backed by
> MMIO, or that it is exclusive to the guest (although in practice this is
> probably the case 99.9999% of the time).  E.g. x86 supports having guest
> memory backed by regular ram that is hidden from the host kernel via
> 'mem=', which will show up as VM_PFNMAP.
> 
> > Is there any valid semantics for creating a memslot backed by a
> > VM_PFNMAP on x86, and if so, what are those?
> > 
> > Similarly, if you do map a device region straight to the guest on x86,
> > how is that handled?  (A pointer to the right place in the myriad of EPT
> > and shadow code in x86 would be much appreciated.)
> 
> There is no special handling in x86 for VM_PFNMAP memory, we rely on KVM's
> generic __gfn_to_pfn_memslot() to retrieve the PFN on demand, and use
> mmu_notifier_seq to ensure the stale PFNs (invalidated in the host) aren't
> inserted into the guest page tables.  Effectively the same thing arm does,
> sans the prepare_memory_region() shenanigans.

Thanks Sean, I'll have a look at reworking said shenanigans ;)

    Christoffer

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-11-21  9:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-08 11:19 Memory regions and VMAs across architectures Christoffer Dall
2019-11-08 13:59 ` Ard Biesheuvel
2019-11-20  3:44 ` Sean Christopherson
2019-11-20 11:52   ` Christoffer Dall
2019-11-20 15:28     ` Sean Christopherson
2019-11-21  9:40       ` Christoffer Dall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).