* Interaction between host-side mprotect() and KVM MMU
@ 2019-05-21  7:24 Martin Lucina
  2019-05-21  8:14 ` Martin Lucina
  2019-05-21 14:02 ` Sean Christopherson
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Lucina @ 2019-05-21  7:24 UTC
  To: kvm

Hi all,

as part of an effort to enforce W^X for the KVM backend of Solo5 [1], I'm
trying to understand how host-side mprotect() interacts with the KVM MMU.

Take a KVM guest on x86_64, where the guest runs exclusively in long mode,
in virtual ring 0, using 1:1 2MB pages in the guest, and all guest page
tables are RWX, i.e. no memory protection is enforced inside the guest
itself. EPT is enabled on the host.

Instead, our ELF loader applies a host-side mprotect(PROT_...) based on the
protection bits in the guest application (unikernel) ELF PHDRs.
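
As a minimal sketch of the mapping described above (this is not Solo5's
actual loader code; both helper names are hypothetical), the ELF p_flags
bits could be translated into mprotect() protection bits, and a segment
that asks for both write and execute could be rejected up front:

```c
#include <elf.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: translate ELF program header p_flags (PF_R, PF_W,
 * PF_X) into the PROT_* bits that would be passed to mprotect(). */
static int phdr_flags_to_prot(uint32_t p_flags)
{
    int prot = PROT_NONE;

    if (p_flags & PF_R)
        prot |= PROT_READ;
    if (p_flags & PF_W)
        prot |= PROT_WRITE;
    if (p_flags & PF_X)
        prot |= PROT_EXEC;
    return prot;
}

/* Hypothetical W^X policy check: returns 1 if a segment requests both
 * write and execute permission, 0 otherwise. */
static int phdr_flags_violate_wx(uint32_t p_flags)
{
    return (p_flags & PF_W) && (p_flags & PF_X);
}
```

The result would then be passed to mprotect() on the page-aligned host
range backing each PT_LOAD segment.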

The behaviour I observe, from tests run inside the guest:

1. Attempting to WRITE to .text which has had mprotect(PROT_READ |
PROT_EXEC) applied on the host side results in an EFAULT from KVM_RUN in the
userspace tender (our equivalent of a VMM).

2. Attempting to EXECUTE code in .data which has had mprotect(PROT_READ |
PROT_WRITE) applied on the host side succeeds.

Questions:

a. Is this the intended behaviour, and can it be relied on? Note that
KVM/aarch64 behaves the same for me.

b. Why does case (1) fail but case (2) succeed? I spent a day reading
through the KVM MMU code, but failed to understand how this is implemented.

c. In order to enforce W^X both ways I'd like to have case (2) also fail
with EFAULT, is this possible?

Martin

[1] https://github.com/Solo5/solo5

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-21  7:24 Interaction between host-side mprotect() and KVM MMU Martin Lucina
@ 2019-05-21  8:14 ` Martin Lucina
  2019-05-21 14:02 ` Sean Christopherson
  1 sibling, 0 replies; 8+ messages in thread
From: Martin Lucina @ 2019-05-21  8:14 UTC
  To: kvm

On Tuesday, 21.05.2019 at 09:24, Martin Lucina wrote:
> Questions:
> 
> a. Is this the intended behaviour, and can it be relied on? Note that
> KVM/aarch64 behaves the same for me.

As a further data point, I've added a check in the userspace tender binary to
verify that sys_personality does not include READ_IMPLIES_EXEC, though it
appears that my toolchain (Debian stable) is producing binaries with -z
noexecstack by default.
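
For illustration, a check of this kind might look roughly like the sketch
below. It is not the tender's actual code, and the function names are
hypothetical. It relies on the documented behaviour that
personality(0xffffffff) returns the current persona without changing it:

```c
#include <sys/personality.h>

/* Pure helper: is READ_IMPLIES_EXEC set in a given persona value? */
static int persona_read_implies_exec(unsigned long persona)
{
    return (persona & READ_IMPLIES_EXEC) != 0;
}

/* Query the calling process's persona without modifying it.
 * Returns 1 if READ_IMPLIES_EXEC is set, 0 if not, -1 on error. */
static int current_read_implies_exec(void)
{
    int persona = personality(0xffffffff);

    if (persona == -1)
        return -1;
    return persona_read_implies_exec((unsigned long)persona);
}
```

If READ_IMPLIES_EXEC were set, every PROT_READ mapping would implicitly be
executable on the host, so a tender would presumably want to refuse to run
in that case.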

Martin

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-21  7:24 Interaction between host-side mprotect() and KVM MMU Martin Lucina
  2019-05-21  8:14 ` Martin Lucina
@ 2019-05-21 14:02 ` Sean Christopherson
  2019-05-23  9:27   ` Martin Lucina
  1 sibling, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2019-05-21 14:02 UTC
  To: Martin Lucina; +Cc: kvm

On Tue, May 21, 2019 at 09:24:34AM +0200, Martin Lucina wrote:
> Hi all,
> 
> as part of an effort to enforce W^X for the KVM backend of Solo5 [1], I'm
> trying to understand how host-side mprotect() interacts with the KVM MMU.
> 
> Take a KVM guest on x86_64, where the guest runs exclusively in long mode,
> in virtual ring 0, using 1:1 2MB pages in the guest, and all guest page
> tables are RWX, i.e. no memory protection is enforced inside the guest
> itself. EPT is enabled on the host.
> 
> Instead, our ELF loader applies a host-side mprotect(PROT_...) based on the
> protection bits in the guest application (unikernel) ELF PHDRs.
> 
> The observed behaviour I see, from tests run inside the guest:
> 
> 1. Attempting to WRITE to .text which has had mprotect(PROT_READ |
> PROT_EXEC) applied on the host side results in an EFAULT from KVM_RUN in the
> userspace tender (our equivalent of a VMM).
> 
> 2. Attempting to EXECUTE code in .data which has had mprotect(PROT_READ |
> PROT_WRITE) applied on the host side succeeds.
> 
> Questions:
> 
> a. Is this the intended behaviour, and can it be relied on? Note that
> KVM/aarch64 behaves the same for me.
> 
> b. Why does case (1) fail but case (2) succeed? I spent a day reading
> through the KVM MMU code, but failed to understand how this is implemented.

Case (1) fails because KVM explicitly grabs WRITE permissions when
retrieving the HPA.  See __gfn_to_pfn_memslot() and hva_to_pfn().
Note, KVM also allows userspace to set a guest memslot as RO
independent of mprotect().
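
A rough sketch of the read-only memslot alternative mentioned above,
using the KVM_MEM_READONLY flag of KVM_SET_USER_MEMORY_REGION. The helper
names are hypothetical, and the sketch assumes userspace has already
confirmed KVM_CAP_READONLY_MEM via KVM_CHECK_EXTENSION and holds a valid
VM file descriptor:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical helper: describe one guest memory region. A non-writable
 * region gets KVM_MEM_READONLY; there is no equivalent no-execute flag. */
static struct kvm_userspace_memory_region
make_memslot(uint32_t slot, uint64_t gpa, uint64_t size, void *hva,
             int writable)
{
    struct kvm_userspace_memory_region r = {
        .slot = slot,
        .flags = writable ? 0 : KVM_MEM_READONLY,
        .guest_phys_addr = gpa,
        .memory_size = size,
        .userspace_addr = (uint64_t)(uintptr_t)hva,
    };
    return r;
}

/* Hypothetical helper: register the region with an existing VM fd.
 * Returns 0 on success, -1 with errno set on failure. */
static int install_memslot(int vmfd,
                           const struct kvm_userspace_memory_region *r)
{
    return ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, r);
}
```

Note that, per the KVM API documentation, guest writes to a read-only
memslot are reported to userspace as KVM_EXIT_MMIO exits rather than as an
EFAULT from KVM_RUN, so the failure mode differs from the mprotect() case.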

Case (2) doesn't fault because KVM doesn't support execute protection,
i.e. all pages are executable in the guest (at least on x86).  My guess
is that execute protection isn't supported because there isn't a strong
use case for traditional virtualization and so no one has gone through
the effort to add NX support.  E.g. the vast majority of system memory
can be dynamically allocated (for userspace code), which practically
speaking leaves only the guest kernel's data sections, and marking those
NX requires at a minimum:

  - knowing exactly what kernel will be loaded
  - no ASLR in the physical domain
  - no transient execution, e.g. in vBIOS or trampoline code

> c. In order to enforce W^X both ways I'd like to have case (2) also fail
> with EFAULT, is this possible?

Not without modifying KVM and the kernel (if you want to do it through
mprotect()).

> 
> Martin
> 
> [1] https://github.com/Solo5/solo5

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-21 14:02 ` Sean Christopherson
@ 2019-05-23  9:27   ` Martin Lucina
  2019-05-23 14:53     ` Sean Christopherson
  2019-05-24 19:26     ` Sean Christopherson
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Lucina @ 2019-05-23  9:27 UTC
  To: Sean Christopherson; +Cc: kvm

On Tuesday, 21.05.2019 at 07:02, Sean Christopherson wrote:
> > Questions:
> > 
> > a. Is this the intended behaviour, and can it be relied on? Note that
> > KVM/aarch64 behaves the same for me.
> > 
> > b. Why does case (1) fail but case (2) succeed? I spent a day reading
> > through the KVM MMU code, but failed to understand how this is implemented.
> 
> Case (1) fails because KVM explicitly grabs WRITE permissions when
> retrieving the HPA.  See __gfn_to_pfn_memslot() and hva_to_pfn().
> Note, KVM also allows userspace to set a guest memslot as RO
> independent of mprotect().

Thanks for the pointers. I'm aware of the ability to set a memslot as RO,
but currently we use a single memslot + mprotect() as it suits our loader
architecture better (see below).

> Case (2) doesn't fault because KVM doesn't support execute protection,
> i.e. all pages are executable in the guest (at least on x86).  My guess
> is that execute protection isn't supported because there isn't a strong
> use case for traditional virtualization and so no one has gone through
> the effort to add NX support.  E.g. the vast majority of system memory
> can be dynamically allocated (for userspace code), which practically
> speaking leaves only the guest kernel's data sections, and marking those
> NX requires at a minimum:
> 
>   - knowing exactly what kernel will be loaded
>   - no ASLR in the physical domain
>   - no transient execution, e.g. in vBIOS or trampoline code

In the Solo5 case we're using hardware virtualization in a non-traditional
sense, as an isolation layer for a static guest (i.e. no changes to
physical memory layout or page protections after "boot"). The guest is
considered untrusted and all [*] the setup is performed by the loader/VMM
("tender" in our terminology), which has all the knowledge of what gets
loaded into the VM available up front. So your points above are not an
issue.

[*] well, almost all, the guest sets up its own IDT in order to report
exceptions and abort

> 
> > c. In order to enforce W^X both ways I'd like to have case (2) also fail
> > with EFAULT, is this possible?
> 
> Not without modifying KVM and the kernel (if you want to do it through
> mprotect()).

Hooking up the full EPT protection bits available to KVM via mprotect()
would be the best solution for us, and could also give us the ability to
have execute-only pages on x86, which is a nice defence against ROP attacks
in the guest. However, I can see now that this is not a trivial
undertaking, especially across the various MMU models (tdp, softmmu) and
architectures dealt with by the core KVM code.
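
To make "the full EPT protection bits" concrete: in an EPT leaf entry,
bit 0 grants read, bit 1 grants write and bit 2 grants execute access. The
sketch below is purely illustrative and is not KVM code; the macro and
function names are made up for this example. It shows a straightforward
PROT_* to EPT mapping and which combinations the hardware accepts:

```c
#include <stdint.h>
#include <sys/mman.h>

/* EPT leaf permission bits (names chosen for this example only). */
#define EPT_READ  (UINT64_C(1) << 0)
#define EPT_WRITE (UINT64_C(1) << 1)
#define EPT_EXEC  (UINT64_C(1) << 2)

/* Hypothetical one-to-one translation of host PROT_* bits. */
static uint64_t prot_to_ept_perms(int prot)
{
    uint64_t perms = 0;

    if (prot & PROT_READ)
        perms |= EPT_READ;
    if (prot & PROT_WRITE)
        perms |= EPT_WRITE;
    if (prot & PROT_EXEC)
        perms |= EPT_EXEC;
    return perms;
}

/* Returns 1 if the permission combination is usable, 0 if it would be an
 * EPT misconfiguration: write without read is never allowed, and
 * execute-only requires the CPU to support execute-only translations. */
static int ept_perms_valid(uint64_t perms, int exec_only_supported)
{
    uint64_t rwx = perms & (EPT_READ | EPT_WRITE | EPT_EXEC);

    if ((rwx & EPT_WRITE) && !(rwx & EPT_READ))
        return 0;
    if (rwx == EPT_EXEC && !exec_only_supported)
        return 0;
    return 1;
}
```

An execute-only page as mentioned above corresponds to EPT_EXEC set with
EPT_READ and EPT_WRITE clear, which only works on processors that
advertise support for execute-only EPT translations.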

N.B. We also have tender implementations for bhyve and OpenBSD vmm, and at
least in the OpenBSD case some community contributors are looking into
developing an "ept_mprotect" for precisely this use-case, though their vmm
code is much simpler (and does less) compared to KVM.

I take it there's no other way to mark a range of pages as NX for the guest
from the host side, so if we want this without modifying KVM and the
kernel, the only way to get it would be to set up "real" page tables inside
the guest ...?

Martin

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-23  9:27   ` Martin Lucina
@ 2019-05-23 14:53     ` Sean Christopherson
  2019-05-24 12:03       ` Martin Lucina
  2019-05-24 19:26     ` Sean Christopherson
  1 sibling, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2019-05-23 14:53 UTC
  To: kvm

On Thu, May 23, 2019 at 11:27:03AM +0200, Martin Lucina wrote:
> On Tuesday, 21.05.2019 at 07:02, Sean Christopherson wrote:
> > > Questions:
> > > 
> > > a. Is this the intended behaviour, and can it be relied on? Note that
> > > KVM/aarch64 behaves the same for me.
> > > 
> > > b. Why does case (1) fail but case (2) succeed? I spent a day reading
> > > through the KVM MMU code, but failed to understand how this is implemented.
> > 
> > Case (1) fails because KVM explicitly grabs WRITE permissions when
> > retrieving the HPA.  See __gfn_to_pfn_memslot() and hva_to_pfn().
> > Note, KVM also allows userspace to set a guest memslot as RO
> > independent of mprotect().
> 
> Thanks for the pointers. I'm aware of the ability to set a memslot as RO,
> but currently we use a single memslot + mprotect() as it suits our loader
> architecture better (see below).
> 
> > Case (2) doesn't fault because KVM doesn't support execute protection,
> > i.e. all pages are executable in the guest (at least on x86).  My guess
> > is that execute protection isn't supported because there isn't a strong
> > use case for traditional virtualization and so no one has gone through
> > the effort to add NX support.  E.g. the vast majority of system memory
> > can be dynamically allocated (for userspace code), which practically
> > speaking leaves only the guest kernel's data sections, and marking those
> > NX requires at a minimum:
> > 
> >   - knowing exactly what kernel will be loaded
> >   - no ASLR in the physical domain
> >   - no transient execution, e.g. in vBIOS or trampoline code
> 
> In the Solo5 case we're using hardware virtualization in a non-traditional
> sense, as an isolation layer for a static guest (i.e. no changes to
> physical memory layout or page protections after "boot"). The guest is
> considered untrusted and all [*] the setup is performed by the loader/VMM
> ("tender" in our terminology), which has all the knowledge of what gets
> loaded into the VM available up front. So your points above are not an
> issue.

I assumed as much, I was simply pointing out why KVM historically has not
supported NX.

> [*] well, almost all, the guest sets up its own IDT in order to report
> exceptions and abort
> 
> > 
> > > c. In order to enforce W^X both ways I'd like to have case (2) also fail
> > > with EFAULT, is this possible?
> > 
> > Not without modifying KVM and the kernel (if you want to do it through
> > mprotect()).
> 
> Hooking up the full EPT protection bits available to KVM via mprotect()
> would be the best solution for us, and could also give us the ability to
> have execute-only pages on x86, which is a nice defence against ROP attacks
> in the guest. However, I can see now that this is not a trivial
> undertaking, especially across the various MMU models (tdp, softmmu) and
> architectures dealt with by the core KVM code.
> 
> N.B. We also have tender implementations for bhyve and OpenBSD vmm, and at
> least in the OpenBSD case some community contributors are looking into
> developing an "ept_mprotect" for precisely this use-case, though their vmm
> code is much simpler (and does less) compared to KVM.
> 
> I take it there's no other way to mark a range of pages as NX by the guest
> from the host side, so if we want this without modifying KVM and the
> kernel, the only way to get it would be to set up "real" page tables inside
> the guest ...?

Correct, KVM does not currently support marking pages NX from the host.  But
note that when EPT is enabled, KVM does not intercept writes to CR3, i.e.
the guest can configure and load its own page tables to bypass the
restrictions of the tender, which may or may not be an issue.

On the other hand, modifying KVM to support NX via mprotect() in a limited
capacity might be a relatively low effort option, e.g. support it as a
per-module opt-in feature only when using TDP (EPT or NPT).

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-23 14:53     ` Sean Christopherson
@ 2019-05-24 12:03       ` Martin Lucina
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Lucina @ 2019-05-24 12:03 UTC
  To: kvm

On Thursday, 23.05.2019 at 07:53, Sean Christopherson wrote:
> > > > c. In order to enforce W^X both ways I'd like to have case (2) also fail
> > > > with EFAULT, is this possible?
> > > 
> > > Not without modifying KVM and the kernel (if you want to do it through
> > > mprotect()).
> > 
> > Hooking up the full EPT protection bits available to KVM via mprotect()
> > would be the best solution for us, and could also give us the ability to
> > have execute-only pages on x86, which is a nice defence against ROP attacks
> > in the guest. However, I can see now that this is not a trivial
> > undertaking, especially across the various MMU models (tdp, softmmu) and
> > architectures dealt with by the core KVM code.
> > 
> > N.B. We also have tender implementations for bhyve and OpenBSD vmm, and at
> > least in the OpenBSD case some community contributors are looking into
> > developing an "ept_mprotect" for precisely this use-case, though their vmm
> > code is much simpler (and does less) compared to KVM.
> > 
> > I take it there's no other way to mark a range of pages as NX by the guest
> > from the host side, so if we want this without modifying KVM and the
> > kernel, the only way to get it would be to set up "real" page tables inside
> > the guest ...?
> 
> Correct, KVM does not currently support marking pages NX from the host.  But
> note that when EPT is enabled, KVM does not intercept writes to CR3, i.e.
> the guest can configure and load its own page tables to bypass the
> restrictions of the tender, which may or may not be an issue.

I'm aware of that. I've considered various options over time, including
running untrusted guest code in Ring 3, but that would require quite a bit
more work on the loader side to provide Ring 0 infrastructure in the
guest (e.g. exception reporting), which complicates the architecture and
"supply chain".

> On the other hand, modifying KVM to support NX via mprotect() in a limited
> capacity might be a relatively low effort option, e.g. support it as a
> per-module opt-in feature only when using TDP (EPT or NPT).

That would be an interesting feature, especially if it would also enable
marking guest pages as execute-only on a TDP host. Why the opt-in? To avoid
breaking existing userspace relying on the existing mprotect() behaviour?
Do you think it could be implemented as a run-time opt-in, e.g. via a
new KVM_CAP_*?

Martin

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-23  9:27   ` Martin Lucina
  2019-05-23 14:53     ` Sean Christopherson
@ 2019-05-24 19:26     ` Sean Christopherson
  2019-06-06 11:52       ` Martin Lucina
  1 sibling, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2019-05-24 19:26 UTC
  To: kvm

On Thu, May 23, 2019 at 11:27:03AM +0200, Martin Lucina wrote:
> On Tuesday, 21.05.2019 at 07:02, Sean Christopherson wrote:
> > Not without modifying KVM and the kernel (if you want to do it through
> > mprotect()).
> 
> Hooking up the full EPT protection bits available to KVM via mprotect()
> would be the best solution for us, and could also give us the ability to
> have execute-only pages on x86, which is a nice defence against ROP attacks
> in the guest. However, I can see now that this is not a trivial
> undertaking, especially across the various MMU models (tdp, softmmu) and
> architectures dealt with by the core KVM code.

Belated thought on this...

Propagating PROT_EXEC from the host's VMAs to the EPT tables would require
having *guest* memory mapped with PROT_EXEC in the host.  This is a
non-starter for traditional virtualization as it would all but require the
hypervisor to have RWX pages.

For the Solo5 case, since the guest is untrusted, mapping its code as
executable in the host seems almost as bad from a security perspective.

So yeah, mprotect() might be convenient, but adding a KVM_MEM_NOEXEC
flag to KVM_SET_USER_MEMORY_REGION would be more secure (and probably
easier to implement in KVM).

* Re: Interaction between host-side mprotect() and KVM MMU
  2019-05-24 19:26     ` Sean Christopherson
@ 2019-06-06 11:52       ` Martin Lucina
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Lucina @ 2019-06-06 11:52 UTC
  To: kvm

On Friday, 24.05.2019 at 12:26, Sean Christopherson wrote:
> On Thu, May 23, 2019 at 11:27:03AM +0200, Martin Lucina wrote:
> > On Tuesday, 21.05.2019 at 07:02, Sean Christopherson wrote:
> > > Not without modifying KVM and the kernel (if you want to do it through
> > > mprotect()).
> > 
> > Hooking up the full EPT protection bits available to KVM via mprotect()
> > would be the best solution for us, and could also give us the ability to
> > have execute-only pages on x86, which is a nice defence against ROP attacks
> > in the guest. However, I can see now that this is not a trivial
> > undertaking, especially across the various MMU models (tdp, softmmu) and
> > architectures dealt with by the core KVM code.
> 
> Belated thought on this...
> 
> Propagating PROT_EXEC from the host's VMAs to the EPT tables would require
> having *guest* memory mapped with PROT_EXEC in the host.  This is a
> non-starter for traditional virtualization as it would all but require the
> hypervisor to have RWX pages.
> 
> For the Solo5 case, since the guest is untrusted, mapping its code as
> executable in the host seems almost as bad from a security perspective.
> 
> So yeah, mprotect() might be convenient, but adding a KVM_MEM_NOEXEC
> flag to KVM_SET_USER_MEMORY_REGION would be more secure (and probably
> easier to implement in KVM).

This is a good point, and it had slipped my mind. Thanks for bringing it
up. So it looks like the correct way forward would be to use individual
memslots for the different Solo5 guest regions rather than mprotect() from
the host, i.e. splitting the protection bits at all the different layers
(host, EPT/TDP, guest).

This does change our architecture somewhat. I'll think about how it could
work and come back to this; I'm in the middle of some feature work right
now.

Thanks for the feedback.
