* Re: Unmapping KVM Guest Memory from Host Kernel
@ 2024-03-08 21:05 Manwaring, Derek
  2024-03-11  9:26 ` Fuad Tabba
  0 siblings, 1 reply; 21+ messages in thread
From: Manwaring, Derek @ 2024-03-08 21:05 UTC (permalink / raw)
  To: David Woodhouse, David Matlack, Brendan Jackman, tabba, qperret,
	jason.cj.chen
  Cc: Gowans, James, seanjc, akpm, Roy, Patrick, chao.p.peng, rppt,
	pbonzini, Kalyazin, Nikita, lstoakes, Liam.Howlett, linux-mm,
	qemu-devel, kirill.shutemov, vbabka, mst, somlo, Graf (AWS),
	Alexander, kvm, linux-coco, kvmarm, kvmarm

On 2024-03-08 at 10:46-0700, David Woodhouse wrote:
> On Fri, 2024-03-08 at 09:35 -0800, David Matlack wrote:
> > I think what James is looking for (and what we are also interested
> > in), is _eliminating_ the ability to access guest memory from the
> > direct map entirely. And in general, eliminate the ability to access
> > guest memory in as many ways as possible.
>
> Well, pKVM does that...

Yes we've been looking at pKVM and it accomplishes a lot of what we're trying
to do. Our initial inclination is that we want to stick with VHE for the lower
overhead. We also want flexibility across server parts, so we would need to
get pKVM working on Intel & AMD if we went this route.

Certainly there are advantages to pKVM on the perf side, like in-place
memory sharing rather than copying, as well as on the security side, by simply
reducing the TCB. I'd be interested to hear others' thoughts on pKVM vs
memfd_secret or general ASI.

Derek



* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 21:05 Unmapping KVM Guest Memory from Host Kernel Manwaring, Derek
@ 2024-03-11  9:26 ` Fuad Tabba
  2024-03-11  9:29   ` Fuad Tabba
  0 siblings, 1 reply; 21+ messages in thread
From: Fuad Tabba @ 2024-03-11  9:26 UTC (permalink / raw)
  To: Manwaring, Derek
  Cc: David Woodhouse, David Matlack, Brendan Jackman, qperret,
	jason.cj.chen, Gowans, James, seanjc, akpm, Roy, Patrick,
	chao.p.peng, rppt, pbonzini, Kalyazin, Nikita, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Graf (AWS),
	Alexander, kvm, linux-coco, kvmarm, kvmarm

Hi,

On Fri, Mar 8, 2024 at 9:05 PM Manwaring, Derek <derekmn@amazon.com> wrote:
>
> On 2024-03-08 at 10:46-0700, David Woodhouse wrote:
> > On Fri, 2024-03-08 at 09:35 -0800, David Matlack wrote:
> > > I think what James is looking for (and what we are also interested
> > > in), is _eliminating_ the ability to access guest memory from the
> > > direct map entirely. And in general, eliminate the ability to access
> > > guest memory in as many ways as possible.
> >
> > Well, pKVM does that...
>
> Yes we've been looking at pKVM and it accomplishes a lot of what we're trying
> to do. Our initial inclination is that we want to stick with VHE for the lower
> overhead. We also want flexibility across server parts, so we would need to
> get pKVM working on Intel & AMD if we went this route.
>
> Certainly there are advantages to pKVM on the perf side, like in-place
> memory sharing rather than copying, as well as on the security side, by simply
> reducing the TCB. I'd be interested to hear others' thoughts on pKVM vs
> memfd_secret or general ASI.

The work we've done for pKVM is still an RFC [*], but there is nothing
in it that limits it to nVHE (at least not intentionally). It should
work with VHE and hVHE as well. On respinning the patch series [*], we
plan on adding support for normal VMs in arm64 to use guest_memfd() as
well, mainly for testing, and to make it easier for others to base
their work on it.

Cheers,
/fuad

[*] https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com
>
> Derek
>


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-11  9:26 ` Fuad Tabba
@ 2024-03-11  9:29   ` Fuad Tabba
  0 siblings, 0 replies; 21+ messages in thread
From: Fuad Tabba @ 2024-03-11  9:29 UTC (permalink / raw)
  To: Manwaring, Derek
  Cc: David Woodhouse, David Matlack, Brendan Jackman, qperret,
	jason.cj.chen, Gowans, James, seanjc, akpm, Roy, Patrick,
	chao.p.peng, rppt, pbonzini, Kalyazin, Nikita, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Graf (AWS),
	Alexander, kvm, linux-coco, kvmarm, kvmarm

On Mon, Mar 11, 2024 at 9:26 AM Fuad Tabba <tabba@google.com> wrote:
>
> Hi,
>
> On Fri, Mar 8, 2024 at 9:05 PM Manwaring, Derek <derekmn@amazon.com> wrote:
> >
> > On 2024-03-08 at 10:46-0700, David Woodhouse wrote:
> > > On Fri, 2024-03-08 at 09:35 -0800, David Matlack wrote:
> > > > I think what James is looking for (and what we are also interested
> > > > in), is _eliminating_ the ability to access guest memory from the
> > > > direct map entirely. And in general, eliminate the ability to access
> > > > guest memory in as many ways as possible.
> > >
> > > Well, pKVM does that...
> >
> > Yes we've been looking at pKVM and it accomplishes a lot of what we're trying
> > to do. Our initial inclination is that we want to stick with VHE for the lower
> > overhead. We also want flexibility across server parts, so we would need to
> > get pKVM working on Intel & AMD if we went this route.
> >
> > Certainly there are advantages to pKVM on the perf side, like in-place
> > memory sharing rather than copying, as well as on the security side, by simply
> > reducing the TCB. I'd be interested to hear others' thoughts on pKVM vs
> > memfd_secret or general ASI.
>
> The work we've done for pKVM is still an RFC [*], but there is nothing
> in it that limits it to nVHE (at least not intentionally). It should
> work with VHE and hVHE as well. On respinning the patch series [*], we
> plan on adding support for normal VMs in arm64 to use guest_memfd() as
> well, mainly for testing, and to make it easier for others to base
> their work on it.

Just to clarify, I am referring specifically to the work we did in
porting guest_memfd() to pKVM/arm64. pKVM itself works only in nVHE
mode.
>
> Cheers,
> /fuad
>
> [*] https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com
> >
> > Derek
> >


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-05-13 20:36                 ` Sean Christopherson
@ 2024-05-13 22:01                   ` Manwaring, Derek
  0 siblings, 0 replies; 21+ messages in thread
From: Manwaring, Derek @ 2024-05-13 22:01 UTC (permalink / raw)
  To: Sean Christopherson, James Gowans
  Cc: kvm, linux-coco, Nikita Kalyazin, rppt, qemu-devel, Patrick Roy,
	somlo, vbabka, akpm, kirill.shutemov, Liam.Howlett,
	David Woodhouse, pbonzini, linux-mm, Alexander Graf, chao.p.peng,
	lstoakes, mst, Moritz Lipp, Claudio Canella

On 2024-05-13 13:36-0700, Sean Christopherson wrote:
> Hmm, a slightly crazy idea (ok, maybe wildly crazy) would be to support mapping
> all of guest_memfd into kernel address space, but as USER=1 mappings.  I.e. don't
> require a carve-out from userspace, but do require CLAC/STAC when access guest
> memory from the kernel.  I think/hope that would provide the speculative execution
> mitigation properties you're looking for?

This is interesting. I'm hesitant to rely on SMAP since it can be
enforced too late by the microarchitecture. But Canella et al. [1] did
say in 2019 that the kernel->user access route seemed to be free of any
"Meltdown" effects. LASS sounds like it will be even stronger, though
it's not clear to me from Intel's programming reference that speculative
scenarios are in scope [2]. AMD does list SMAP specifically as a
feature that can control speculation [3].

I don't see an equivalent read-access control on ARM. It has PXN for
execute. Read access can probably also be controlled?  But I think for
the non-CoCo case we should favor solutions that are less dependent on
hardware-specific protections.

Derek


[1] https://www.usenix.org/system/files/sec19-canella.pdf
[2] https://cdrdv2.intel.com/v1/dl/getContent/671368
[3] https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/software-techniques-for-managing-speculation.pdf


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-05-13 19:43               ` Gowans, James
@ 2024-05-13 20:36                 ` Sean Christopherson
  2024-05-13 22:01                   ` Manwaring, Derek
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2024-05-13 20:36 UTC (permalink / raw)
  To: James Gowans
  Cc: kvm, linux-coco, Nikita Kalyazin, rppt, qemu-devel, Patrick Roy,
	somlo, vbabka, akpm, kirill.shutemov, Liam.Howlett,
	David Woodhouse, pbonzini, linux-mm, Alexander Graf,
	Derek Manwaring, chao.p.peng, lstoakes, mst

On Mon, May 13, 2024, James Gowans wrote:
> On Mon, 2024-05-13 at 10:09 -0700, Sean Christopherson wrote:
> > On Mon, May 13, 2024, James Gowans wrote:
> > > On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > > > > Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> > > > > Do you have some thoughts about how to make the above cases work in the
> > > > > guest_memfd context?
> > > > 
> > > > Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
> > > > is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
> > > > the basic functionality is also straightforward; the bulk of the discussion is
> > > > around gup(), reclaim, page migration, etc.
> > > 
> > > I still need to read this long thread, but just a thought on the word
> > > "restricted" here: for MMIO the instruction can be anywhere and
> > > similarly the load/store MMIO data can be anywhere. Does this mean that
> > > for running unmodified non-CoCo VMs with guest_memfd backend that we'll
> > > always need to have the whole of guest memory mmapped?
> > 
> > Not necessarily, e.g. KVM could re-establish the direct map or mremap() on-demand.
> > There are variations on that, e.g. if ASI[*] were to ever make its way upstream,
> > which is a huge if, then we could have guest_memfd mapped into a KVM-only CR3.
> 
> Yes, on-demand mapping-in of guest RAM pages is definitely an option. It
> sounds quite challenging to always have to go via interfaces which
> demand-map/fault memory in, and also potentially quite slow, needing to
> unmap and flush afterwards.
> 
> Not too sure what you have in mind with "guest_memfd mapped into KVM-
> only CR3" - could you expand?

Remove guest_memfd from the kernel's direct map, e.g. so that the kernel at-large
can't touch guest memory, but have a separate set of page tables that have the
direct map, userspace page tables, _and_ kernel mappings for guest_memfd.  On
KVM_RUN (or vcpu_load()?), switch to KVM's CR3 so that KVM's map/unmap
operations are always free (literal nops).
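
A minimal sketch of the switch itself (hand-wavy and x86-only; kvm_mm and
its pgd would all be new infrastructure):

	/*
	 * vcpu_load(): switch to KVM-private page tables containing the
	 * direct map, the userspace mappings, _and_ a kernel alias of
	 * guest_memfd.
	 */
	write_cr3(__pa(kvm_mm->pgd));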

That's an imperfect solution as IRQs and NMIs will run kernel code with KVM's
page tables, i.e. guest memory would still be exposed to the host kernel.  And
of course we'd need to get buy-in from multiple architectures and maintainers,
etc.

> > > I guess the idea is that this use case will still be subject to the
> > > normal restriction rules, but for a non-CoCo non-pKVM VM there will be
> > > no restriction in practice, and userspace will need to mmap everything
> > > always?
> > > 
> > > It really seems yucky to need to have all of guest RAM mmapped all the
> > > time just for MMIO to work... But I suppose there is no way around that
> > > for Intel x86.
> > 
> > It's not just MMIO.  Nested virtualization, and more specifically shadowing nested
> > TDP, is also problematic (probably more so than MMIO).  And there are more cases,
> > i.e. we'll need a generic solution for this.  As above, there are a variety of
> > options, it's largely just a matter of doing the work.  I'm not saying it's a
> > trivial amount of work/effort, but it's far from an unsolvable problem.
> 
> I didn't even think of nested virt, but that will absolutely be an even
> bigger problem too. MMIO was just the first roadblock which illustrated
> the problem.
>
> Overall what I'm trying to figure out is whether there is any sane path
> here other than needing to mmap all guest RAM all the time. Trying to
> get nested virt and MMIO and whatever else needs access to guest RAM
> working by doing just-in-time (aka: on-demand) mappings and unmappings
> of guest RAM sounds like a painful game of whack-a-mole, potentially
> really bad for performance too.

It's a whack-a-mole game that KVM already plays, e.g. for dirty tracking, post-copy
demand paging, etc.  There is still plenty of room for improvement, e.g. to reduce
the number of touchpoints and thus the potential for missed cases.  But KVM more
or less needs to solve this basic problem no matter what, so I don't think that
guest_memfd adds much, if any, burden.

> Do you think we should look at doing this on-demand mapping, or, for
> now, simply require that all guest RAM is mmapped all the time and KVM
> be given a valid virtual addr for the memslots?

I don't think "map everything into userspace" is a viable approach, precisely
because it requires reflecting that back into KVM's memslots, which in turn
means guest_memfd needs to allow gup().  And I don't think we want to allow gup(),
because that opens a rather large can of worms (see the long thread I linked).

Hmm, a slightly crazy idea (ok, maybe wildly crazy) would be to support mapping
all of guest_memfd into kernel address space, but as USER=1 mappings.  I.e. don't
require a carve-out from userspace, but do require CLAC/STAC when accessing guest
memory from the kernel.  I think/hope that would provide the speculative execution
mitigation properties you're looking for?
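
In rough pseudo-C, where kvm_gmem_kmap() is a made-up helper returning the
magic USER=1 alias of a gfn:

	void *kva = kvm_gmem_kmap(slot, gfn);	/* made-up helper */

	stac();				/* EFLAGS.AC=1, as for uaccess */
	memcpy(buf, kva, len);		/* access is subject to SMAP */
	clac();				/* EFLAGS.AC=0 */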

Userspace would still have access to guest memory, but it would take a truly
malicious userspace for that to matter.  And when CPUs that support LASS come
along, userspace would be completely unable to access guest memory through KVM's
magic mapping.

This too would require a decent amount of buy-in from outside of KVM, e.g. to
carve out the virtual address range in the kernel.  But the performance overhead
would be identical to the status quo.  And there could be advantages to being
able to identify accesses to guest memory based purely on kernel virtual address.


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-05-13 17:09             ` Sean Christopherson
@ 2024-05-13 19:43               ` Gowans, James
  2024-05-13 20:36                 ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Gowans, James @ 2024-05-13 19:43 UTC (permalink / raw)
  To: seanjc
  Cc: kvm, linux-coco, Kalyazin, Nikita, rppt, qemu-devel, Roy,
	Patrick, somlo, vbabka, akpm, kirill.shutemov, Liam.Howlett,
	Woodhouse, David, pbonzini, linux-mm, Graf (AWS),
	Alexander, Manwaring, Derek, chao.p.peng, lstoakes, mst

On Mon, 2024-05-13 at 10:09 -0700, Sean Christopherson wrote:
> On Mon, May 13, 2024, James Gowans wrote:
> > On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > > > Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> > > > Do you have some thoughts about how to make the above cases work in the
> > > > guest_memfd context?
> > > 
> > > Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
> > > is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
> > > the basic functionality is also straightforward; the bulk of the discussion is
> > > around gup(), reclaim, page migration, etc.
> > 
> > I still need to read this long thread, but just a thought on the word
> > "restricted" here: for MMIO the instruction can be anywhere and
> > similarly the load/store MMIO data can be anywhere. Does this mean that
> > for running unmodified non-CoCo VMs with guest_memfd backend that we'll
> > always need to have the whole of guest memory mmapped?
> 
> Not necessarily, e.g. KVM could re-establish the direct map or mremap() on-demand.
> There are variations on that, e.g. if ASI[*] were to ever make its way upstream,
> which is a huge if, then we could have guest_memfd mapped into a KVM-only CR3.

Yes, on-demand mapping-in of guest RAM pages is definitely an option. It
sounds quite challenging to always have to go via interfaces which
demand-map/fault memory in, and also potentially quite slow, needing to
unmap and flush afterwards.

Not too sure what you have in mind with "guest_memfd mapped into KVM-
only CR3" - could you expand?

> > I guess the idea is that this use case will still be subject to the
> > normal restriction rules, but for a non-CoCo non-pKVM VM there will be
> > no restriction in practice, and userspace will need to mmap everything
> > always?
> > 
> > It really seems yucky to need to have all of guest RAM mmapped all the
> > time just for MMIO to work... But I suppose there is no way around that
> > for Intel x86.
> 
> It's not just MMIO.  Nested virtualization, and more specifically shadowing nested
> TDP, is also problematic (probably more so than MMIO).  And there are more cases,
> i.e. we'll need a generic solution for this.  As above, there are a variety of
> options, it's largely just a matter of doing the work.  I'm not saying it's a
> trivial amount of work/effort, but it's far from an unsolvable problem.

I didn't even think of nested virt, but that will absolutely be an even
bigger problem too. MMIO was just the first roadblock which illustrated
the problem.

Overall what I'm trying to figure out is whether there is any sane path
here other than needing to mmap all guest RAM all the time. Trying to
get nested virt and MMIO and whatever else needs access to guest RAM
working by doing just-in-time (aka: on-demand) mappings and unmappings
of guest RAM sounds like a painful game of whack-a-mole, potentially
really bad for performance too.

Do you think we should look at doing this on-demand mapping, or, for
now, simply require that all guest RAM is mmapped all the time and KVM
be given a valid virtual addr for the memslots?

Note that I'm specifically referring to regular non-CoCo non-enlightened
VMs here. For CoCo we definitely need all the cooperative MMIO and
sharing. What we're trying to do here is to get guest RAM out of the
direct map using guest_memfd, and now tackling the knock-on problem of
whether or not to mmap all of guest RAM all the time in userspace.

JG


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-05-13 16:01           ` Gowans, James
@ 2024-05-13 17:09             ` Sean Christopherson
  2024-05-13 19:43               ` Gowans, James
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2024-05-13 17:09 UTC (permalink / raw)
  To: James Gowans
  Cc: Patrick Roy, kvm, Nikita Kalyazin, qemu-devel, rppt, linux-coco,
	somlo, vbabka, akpm, Liam.Howlett, kirill.shutemov,
	David Woodhouse, pbonzini, linux-mm, Alexander Graf,
	Derek Manwaring, chao.p.peng, lstoakes, mst

On Mon, May 13, 2024, James Gowans wrote:
> On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > > Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> > > Do you have some thoughts about how to make the above cases work in the
> > > guest_memfd context?
> > 
> > Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
> > is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
> > the basic functionality is also straightforward; the bulk of the discussion is
> > around gup(), reclaim, page migration, etc.
> 
> I still need to read this long thread, but just a thought on the word
> "restricted" here: for MMIO the instruction can be anywhere and
> similarly the load/store MMIO data can be anywhere. Does this mean that
> for running unmodified non-CoCo VMs with guest_memfd backend that we'll
> always need to have the whole of guest memory mmapped?

Not necessarily, e.g. KVM could re-establish the direct map or mremap() on-demand.
There are variations on that, e.g. if ASI[*] were to ever make its way upstream,
which is a huge if, then we could have guest_memfd mapped into a KVM-only CR3.

> I guess the idea is that this use case will still be subject to the
> normal restriction rules, but for a non-CoCo non-pKVM VM there will be 
> no restriction in practice, and userspace will need to mmap everything
> always?
> 
> It really seems yucky to need to have all of guest RAM mmapped all the
> time just for MMIO to work... But I suppose there is no way around that
> for Intel x86.

It's not just MMIO.  Nested virtualization, and more specifically shadowing nested
TDP, is also problematic (probably more so than MMIO).  And there are more cases,
i.e. we'll need a generic solution for this.  As above, there are a variety of
options, it's largely just a matter of doing the work.  I'm not saying it's a
trivial amount of work/effort, but it's far from an unsolvable problem.


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-05-13 15:39         ` Sean Christopherson
@ 2024-05-13 16:01           ` Gowans, James
  2024-05-13 17:09             ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Gowans, James @ 2024-05-13 16:01 UTC (permalink / raw)
  To: seanjc, Roy, Patrick
  Cc: kvm, Kalyazin, Nikita, qemu-devel, rppt, linux-coco, somlo,
	vbabka, akpm, Liam.Howlett, kirill.shutemov, Woodhouse, David,
	pbonzini, linux-mm, Graf (AWS),
	Alexander, Manwaring, Derek, chao.p.peng, lstoakes, mst

On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> > Do you have some thoughts about how to make the above cases work in the
> > guest_memfd context?
> 
> Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
> is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
> the basic functionality is also straightforward; the bulk of the discussion is
> around gup(), reclaim, page migration, etc.

I still need to read this long thread, but just a thought on the word
"restricted" here: for MMIO the instruction can be anywhere and
similarly the load/store MMIO data can be anywhere. Does this mean that
for running unmodified non-CoCo VMs with guest_memfd backend that we'll
always need to have the whole of guest memory mmapped?

I guess the idea is that this use case will still be subject to the
normal restriction rules, but for a non-CoCo non-pKVM VM there will be 
no restriction in practice, and userspace will need to mmap everything
always?

It really seems yucky to need to have all of guest RAM mmapped all the
time just for MMIO to work... But I suppose there is no way around that
for Intel x86.

JG

> 
> [*] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org



* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-05-13 10:31       ` Patrick Roy
@ 2024-05-13 15:39         ` Sean Christopherson
  2024-05-13 16:01           ` Gowans, James
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2024-05-13 15:39 UTC (permalink / raw)
  To: Patrick Roy
  Cc: Mike Rapoport, James Gowans, akpm, chao.p.peng, Derek Manwaring,
	pbonzini, David Woodhouse, Nikita Kalyazin, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Alexander Graf, kvm, linux-coco

On Mon, May 13, 2024, Patrick Roy wrote:

> For non-CoCo VMs, where memory is not encrypted, and the threat model assumes a
> trusted host userspace, we would like to avoid changing the VM model so
> completely. If we adopt CoCo’s approaches where KVM / Userspace touches guest
> memory we would get all the complexity, yet none of the encryption.
> Particularly the complexity on the MMIO path seems nasty, but x86 does not

Uber nit, modern AMD CPUs do provide the byte stream, though there is at least
one related erratum.  Intel CPUs don't provide the byte stream or pre-decode in
any way.

> pre-decode instructions on MMIO exits (which are just EPT_VIOLATIONs) like it
> does for PIO exits, so I also don’t really see a way around it in the
> guest_memfd model.

...

> Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> Do you have some thoughts about how to make the above cases work in the
> guest_memfd context?

Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
the basic functionality is also straightforward; the bulk of the discussion is
around gup(), reclaim, page migration, etc.

[*] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-09 11:14     ` Mike Rapoport
@ 2024-05-13 10:31       ` Patrick Roy
  2024-05-13 15:39         ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Patrick Roy @ 2024-05-13 10:31 UTC (permalink / raw)
  To: Mike Rapoport, Sean Christopherson
  Cc: James Gowans, akpm, chao.p.peng, Derek Manwaring, pbonzini,
	David Woodhouse, Nikita Kalyazin, lstoakes, Liam.Howlett,
	linux-mm, qemu-devel, kirill.shutemov, vbabka, mst, somlo,
	Alexander Graf, kvm, linux-coco

Hi all,

On 3/9/24 11:14, Mike Rapoport wrote:

>>> With this in mind, what’s the best way to solve getting guest RAM out of
>>> the direct map? Is memfd_secret integration with KVM the way to go, or
>>> should we build a solution on top of guest_memfd, for example via some
>>> flag that causes it to leave memory in the host userspace’s page tables,
>>> but removes it from the direct map?
>>
>> memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
>> sure you'll be fighting memfd_secret all the way.  E.g. it's not dumpable, it
>> deliberately allocates at 4KiB granularity (though I suspect the bug you found
>> means that it can be inadvertently mapped with 2MiB hugepages), it has no line
>> of sight to taking userspace out of the equation, etc.
>>
>> With guest_memfd on the other hand, everyone contributing to and maintaining it
>> has goals that are *very* closely aligned with what you want to do.
>
> I agree with Sean, guest_memfd seems a better interface to use. It's
> integrated by design with KVM and removing guest memory from the direct map
> looks like a natural enhancement to guest_memfd.
>
> Unless I'm missing something, for a fast-and-dirty POC it'll be a one-liner
> that adds set_memory_np() to kvm_gmem_get_folio() and then figuring out
> what to do with virtio :)

We’ve been playing around with extending guest_memfd to remove guest memory
from the direct map. The direct map removal aspect is indeed fairly
straightforward; since we cannot map guest_memfd, we don’t need to worry about
folios without direct map entries getting to places where they will cause
kernel panics.

However, we ran into problems running non-CoCo VMs with guest_memfd for guest
memory, independent of direct map entries being available or not. There’s a
handful of places where a traditional KVM / Userspace setup currently touches
guest memory:

* Loading the Guest Kernel into guest-owned memory
* Instruction fetch from arbitrary guest addresses and guest page table walks  
  for MMIO emulation (for example for IOAPIC accesses)
* kvm-clock
* I/O devices

With guest_memfd, if the guest is running from guest-private memory, these need
to be rethought, since now the memory is unavailable to userspace, and KVM is
not enlightened about guest_memfd’s existence everywhere (when I was
experimenting with this, it generally read garbage data from the shared VMA,
but I think I’ve since seen some patches floating around that would make it
return -EFAULT instead).

CoCo VMs have various methods for working around these: You load a guest kernel
using some “populate on first access” mechanism [1], kvm-clock and I/O is
solved by having the guest mark the relevant address ranges as “shared” ahead
of time [2] and bounce buffering via swiotlb [4], and Intel TDX solves the
instruction emulation problem for MMIO by injecting a #VE and having the guest
do the emulation itself [3].

For non-CoCo VMs, where memory is not encrypted, and the threat model assumes a
trusted host userspace, we would like to avoid changing the VM model so
completely. If we adopt CoCo’s approaches where KVM / Userspace touches guest
memory we would get all the complexity, yet none of the encryption.
Particularly the complexity on the MMIO path seems nasty, but x86 does not
pre-decode instructions on MMIO exits (which are just EPT_VIOLATIONs) like it
does for PIO exits, so I also don’t really see a way around it in the
guest_memfd model.

We’ve played around a lot with allowing userspace mappings of guest_memfd, and
then having KVM internally access guest_memfd via userspace page tables (and
came up with multiple hacky ways to boot simple Linux initrds from
guest_memfd), but this is fairly awkward for two reasons:

1. Now lots of codepaths in KVM end up accessing guest_memfd, which from my
understanding goes against the guest_memfd goal of making machine checks
because of incorrect accesses to TDX memory impossible, and
2. We need to somehow get a userspace mapping of guest_memfd into KVM (a hacky
way I could make this work was setting up kvm_userspace_memory_region2 with
userspace_addr set to an mmap of guest_memfd, roughly as sketched below, which
actually "works" for everything but kvm-clock, but I also realized later that
this is just memfd_secret with extra steps).
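
That hacky setup, roughly, assuming our out-of-tree patches that allow
mmap() on guest_memfd:

	void *ua = mmap(NULL, size, PROT_READ | PROT_WRITE,
			MAP_SHARED, gmem_fd, 0);

	struct kvm_userspace_memory_region2 r2 = {
		.slot            = 0,
		.flags           = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size     = size,
		.userspace_addr  = (unsigned long)ua,	/* userspace view */
		.guest_memfd     = gmem_fd,		/* KVM's private view */
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &r2);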

We also played around with having KVM access guest_memfd through the direct map
(by temporarily reinserting pages into it when needed), but this again means
lots of KVM code learns about how to access guest RAM via guest_memfd.

There are a few other features we need to support, such as serving page faults
using UFFD, which we are not too sure how to realize with guest_memfd since
UFFD is VMA-based (although to me some sort of “UFFD-for-FD” sounds like
something that’d be useful even outside of our guest_memfd use case).

With these challenges in mind, some variant of memfd_secret continues to look
attractive for the non-CoCo case. Perhaps a variant that supports in-kernel
faults and provides some way for gfn_to_pfn_cache users like kvm-clock to
restore the direct map entries.
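
E.g. something in the gfn_to_pfn_cache refresh/invalidate paths built on
the set_direct_map_* helpers that secretmem itself uses (hypothetical
sketch):

	/* refresh: put the cached page back into the direct map */
	set_direct_map_default_noflush(pfn_to_page(pfn));
	...
	/* invalidate: take it back out again */
	set_direct_map_invalid_noflush(pfn_to_page(pfn));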

Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
Do you have some thoughts about how to make the above cases work in the
guest_memfd context?

> > --
> > Sincerely yours,
> > Mike.

Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240404185034.3184582-1-pbonzini@redhat.com/T/#m4cc08ce3142a313d96951c2b1286eb290c7d1dac
[2]: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kvmclock.c#L227
[3]: https://www.kernel.org/doc/html/next/x86/tdx.html#mmio-handling
[4]: https://www.kernel.org/doc/html/next/x86/tdx.html#shared-memory-conversions



* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-09  2:45       ` Manwaring, Derek
@ 2024-03-18 14:11         ` Brendan Jackman
  0 siblings, 0 replies; 21+ messages in thread
From: Brendan Jackman @ 2024-03-18 14:11 UTC (permalink / raw)
  To: Manwaring, Derek
  Cc: David Matlack, Gowans, James, seanjc, akpm, Roy, Patrick,
	chao.p.peng, rppt, pbonzini, Woodhouse, David, Kalyazin, Nikita,
	lstoakes, Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov,
	vbabka, mst, somlo, Graf (AWS),
	Alexander, kvm, linux-coco, kvmarm, tabba, qperret,
	jason.cj.chen

On Fri, 8 Mar 2024 at 18:36, David Matlack <dmatlack@google.com> wrote:
> I'm not sure if ASI provides a solution to the problem James is trying
> to solve. ASI creates a separate "restricted" address space where, yes,
> guest memory can be left unmapped. But any access to guest memory is
> still allowed. An access will trigger a page fault, the kernel will
> switch to the "full" kernel address space (flushing hardware buffers
> along the way to prevent speculation), and then proceed. I.e. ASI
> doesn't prevent accessing guest memory through the
> direct map, it just prevents speculation of guest memory through the
> direct map.

Yes, there's also a sense in which ASI is a "smaller hammer" in that
it _only_ protects against hardware-bug exploits.

> it just prevents speculation of guest memory through the
> direct map.

(Although, this is not _all_ it does, because when returning to the
restricted address space, i.e. right before VM Enter, we have an
opportunity to flush _data buffers_ too. So ASI also mitigates
Meltdown-style attacks, e.g. L1TF, where the speculation-related stuff
all happens on the attacker side)

On Sat, 9 Mar 2024 at 03:46, Manwaring, Derek <derekmn@amazon.com> wrote:
> Brendan,
> I will look into the general ASI approach, thank you. Did you consider
> memfd_secret or a guest_memfd-based approach for Userspace-ASI?

I might be misunderstanding you here: I guess you mean using
memfd_secret as a way for userspace to communicate about which parts
of userspace memory are "secret"?

If I didn't misunderstand: we have not looked into this so far because
we actually just consider _all_ userspace/guest memory to be "secret"
from the perspective of other processes/guests.

> Based on
> Sean's earlier reply to James it sounds like the vision of guest_memfd
> aligns with ASI's goals.

But yes, the more general point seems to make sense, I think I need to
research this topic some more, thanks!


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 23:22   ` Sean Christopherson
  2024-03-09 11:14     ` Mike Rapoport
@ 2024-03-14 21:45     ` Manwaring, Derek
  1 sibling, 0 replies; 21+ messages in thread
From: Manwaring, Derek @ 2024-03-14 21:45 UTC (permalink / raw)
  To: Sean Christopherson, James Gowans
  Cc: akpm, Patrick Roy, chao.p.peng, rppt, pbonzini, David Woodhouse,
	Nikita Kalyazin, lstoakes, Liam.Howlett, linux-mm, qemu-devel,
	kirill.shutemov, vbabka, mst, somlo, Alexander Graf, kvm,
	linux-coco, xmarcalx, tabba, qperret, kvmarm

On Fri, 8 Mar 2024 15:22:50 -0800, Sean Christopherson wrote:
> On Fri, Mar 08, 2024, James Gowans wrote:
> > We are also aware of ongoing work on guest_memfd. The current
> > implementation unmaps guest memory from VMM address space, but leaves it
> > in the kernel’s direct map. We’re not looking at unmapping from VMM
> > userspace yet; we still need guest RAM there for PV drivers like virtio
> > to continue to work. So KVM’s gmem doesn’t seem like the right solution?
>
> We (and by "we", I really mean the pKVM folks) are also working on allowing
> userspace to mmap() guest_memfd[*].  pKVM aside, the long term vision I have for
> guest_memfd is to be able to use it for non-CoCo VMs, precisely for the security
> and robustness benefits it can bring.
>
> What I am hoping to do with guest_memfd is get userspace to only map memory it
> needs, e.g. for emulated/synthetic devices, on-demand.  I.e. to get to a state
> where guest memory is mapped only when it needs to be.

Thank you for the direction, this is super helpful.

We are new to the guest_memfd space, and for simplicity we'd prefer to
leave guest_memfd completely mapped in userspace. Even in the long term,
we actually don't have any use for unmapping from host userspace. The
current form of marking pages shared doesn't quite align with what we're
trying to do either since it also shares the pages with the host kernel.

What are your thoughts on a flag for KVM_CREATE_GUEST_MEMFD that only
removes from the host kernel's direct map, but leaves everything mapped
in userspace?
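
Concretely, something like this (the flag name is made up; its semantics
are exactly the question):

	struct kvm_create_guest_memfd gmem = {
		.size  = guest_ram_size,
		/* hypothetical: drop the pages from the direct map only */
		.flags = KVM_GMEM_NO_DIRECT_MAP,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);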

Derek


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 23:22   ` Sean Christopherson
@ 2024-03-09 11:14     ` Mike Rapoport
  2024-05-13 10:31       ` Patrick Roy
  2024-03-14 21:45     ` Manwaring, Derek
  1 sibling, 1 reply; 21+ messages in thread
From: Mike Rapoport @ 2024-03-09 11:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Gowans, akpm, Patrick Roy, chao.p.peng, Derek Manwaring,
	pbonzini, David Woodhouse, Nikita Kalyazin, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Alexander Graf, kvm, linux-coco

On Fri, Mar 08, 2024 at 03:22:50PM -0800, Sean Christopherson wrote:
> On Fri, Mar 08, 2024, James Gowans wrote:
> > However, memfd_secret doesn’t work out the box for KVM guest memory; the
> > main reason seems to be that the GUP path is intentionally disabled for
> > memfd_secret, so if we use a memfd_secret backed VMA for a memslot then
> > KVM is not able to fault the memory in. If it’s been pre-faulted in by
> > userspace then it seems to work.
> 
> Huh, that _shouldn't_ work.  The folio_is_secretmem() in gup_pte_range() is
> supposed to prevent the "fast gup" path from getting secretmem pages.

I suspect this works because KVM only calls gup on faults, and if the memory
was pre-faulted via memfd_secret there won't be faults, and hence no gups
from KVM.
 
> > With this in mind, what’s the best way to solve getting guest RAM out of
> > the direct map? Is memfd_secret integration with KVM the way to go, or
> > should we build a solution on top of guest_memfd, for example via some
> > flag that causes it to leave memory in the host userspace’s page tables,
> > but removes it from the direct map? 
> 
> memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
> sure you'll be fighting memfd_secret all the way.  E.g. it's not dumpable, it
> deliberately allocates at 4KiB granularity (though I suspect the bug you found
> means that it can be inadvertantly mapped with 2MiB hugepages), it has no line
> of sight to taking userspace out of the equation, etc.
> 
> With guest_memfd on the other hand, everyone contributing to and maintaining it
> has goals that are *very* closely aligned with what you want to do.

I agree with Sean, guest_memfd seems a better interface to use. It's
integrated by design with KVM and removing guest memory from the direct map
looks like a natural enhancement to guest_memfd. 

Unless I'm missing something, for a fast-and-dirty POC it'll be a one-liner
that adds set_memory_np() to kvm_gmem_get_folio() and then figuring out
what to do with virtio :)
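
Roughly, untested, and ignoring the set_memory_p() needed when the folio
is freed:

	static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
	{
		struct folio *folio;

		folio = filemap_grab_folio(inode->i_mapping, index);
		...
		/* PoC: drop the folio's pages from the kernel direct map */
		set_memory_np((unsigned long)folio_address(folio),
			      folio_nr_pages(folio));

		return folio;
	}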

-- 
Sincerely yours,
Mike.


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 15:50 ` Gowans, James
  2024-03-08 16:25   ` Brendan Jackman
  2024-03-08 23:22   ` Sean Christopherson
@ 2024-03-09  5:01   ` Matthew Wilcox
  2 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2024-03-09  5:01 UTC (permalink / raw)
  To: Gowans, James
  Cc: seanjc, akpm, Roy, Patrick, chao.p.peng, Manwaring, Derek, rppt,
	pbonzini, Woodhouse, David, Kalyazin, Nikita, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Graf (AWS),
	Alexander, kvm, linux-coco

On Fri, Mar 08, 2024 at 03:50:05PM +0000, Gowans, James wrote:
> Currently when using anonymous memory for KVM guest RAM, the memory all
> remains mapped into the kernel direct map. We are looking at options to
> get KVM guest memory out of the kernel’s direct map as a principled
> approach to mitigating speculative execution issues in the host kernel.
> Our goal is to more completely address the class of issues whose leak
> origin is categorized as "Mapped memory" [1].

One of the things that is holding Linux back is the inability to do I/O
to memory which is not part of memmap.  _So Much_ of our infrastructure
is based on having a struct page available to stick into an sglist, bio,
skb_frag, or whatever.  The solution to this is to move to a (phys_addr,
length) tuple instead of (page, offset, len) tuple.  I call this "phyr"
and I've written about it before.  I'm not working on this as I have
quite enough to do with the folio work, but I hope somebody works on it
before I get time to.
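
The core of the idea is tiny; a hypothetical sketch:

	/* a "phyr": a physical address range, no struct page required */
	struct phyr {
		phys_addr_t	addr;
		size_t		len;
	};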


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 17:35     ` David Matlack
  2024-03-08 17:45       ` David Woodhouse
@ 2024-03-09  2:45       ` Manwaring, Derek
  2024-03-18 14:11         ` Brendan Jackman
  1 sibling, 1 reply; 21+ messages in thread
From: Manwaring, Derek @ 2024-03-09  2:45 UTC (permalink / raw)
  To: David Matlack, Brendan Jackman
  Cc: Gowans, James, seanjc, akpm, Roy, Patrick, chao.p.peng, rppt,
	pbonzini, Woodhouse, David, Kalyazin, Nikita, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Graf (AWS),
	Alexander, kvm, linux-coco, kvmarm, tabba, qperret,
	jason.cj.chen

On 2024-03-08 10:36-0700, David Matlack wrote:
> On Fri, Mar 8, 2024 at 8:25 AM Brendan Jackman <jackmanb@google.com> wrote:
> > On Fri, 8 Mar 2024 at 16:50, Gowans, James <jgowans@amazon.com> wrote:
> > > Our goal is to more completely address the class of issues whose leak
> > > origin is categorized as "Mapped memory" [1].
> >
> > Did you forget a link below? I'm interested in hearing about that
> > categorisation.

The paper from Hertogh et al. is
https://download.vusec.net/papers/quarantine_raid23.pdf
(specifically Table 1).

> > It's perhaps a bigger hammer than you are looking for, but the
> > solution we're working on at Google is "Address Space Isolation" (ASI)
> > - the latest posting about that is [2].
>
> I think what James is looking for (and what we are also interested
> in), is _eliminating_ the ability to access guest memory from the
> direct map entirely.

Actually, just preventing speculation of guest memory through the
direct map is sufficient for our current focus.

Brendan,
I will look into the general ASI approach, thank you. Did you consider
memfd_secret or a guest_memfd-based approach for Userspace-ASI? Based on
Sean's earlier reply to James it sounds like the vision of guest_memfd
aligns with ASI's goals.

Derek


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 15:50 ` Gowans, James
  2024-03-08 16:25   ` Brendan Jackman
@ 2024-03-08 23:22   ` Sean Christopherson
  2024-03-09 11:14     ` Mike Rapoport
  2024-03-14 21:45     ` Manwaring, Derek
  2024-03-09  5:01   ` Matthew Wilcox
  2 siblings, 2 replies; 21+ messages in thread
From: Sean Christopherson @ 2024-03-08 23:22 UTC (permalink / raw)
  To: James Gowans
  Cc: akpm, Patrick Roy, chao.p.peng, Derek Manwaring, rppt, pbonzini,
	David Woodhouse, Nikita Kalyazin, lstoakes, Liam.Howlett,
	linux-mm, qemu-devel, kirill.shutemov, vbabka, mst, somlo,
	Alexander Graf, kvm, linux-coco

On Fri, Mar 08, 2024, James Gowans wrote:
> However, memfd_secret doesn’t work out the box for KVM guest memory; the
> main reason seems to be that the GUP path is intentionally disabled for
> memfd_secret, so if we use a memfd_secret backed VMA for a memslot then
> KVM is not able to fault the memory in. If it’s been pre-faulted in by
> userspace then it seems to work.

Huh, that _shouldn't_ work.  The folio_is_secretmem() in gup_pte_range() is
supposed to prevent the "fast gup" path from getting secretmem pages.

Is this on an upstream kernel?  If so, and if you have bandwidth, can you figure
out why that isn't working?  At the very least, I suspect the memfd_secret
maintainers would be very interested to know that it's possible to fast gup
secretmem.

> There are a few other issues around when KVM accesses the guest memory.
> For example the KVM PV clock code goes directly to the PFN via the
> pfncache, and that also breaks if the PFN is not in the direct map, so
> we’d need to change that sort of thing, perhaps going via userspace
> addresses.
> 
> If we remove the memfd_secret check from the GUP path, and disable KVM’s
> pvclock from userspace via KVM_CPUID_FEATURES, we are able to boot a
> simple Linux initrd using a Firecracker VMM modified to use
> memfd_secret.
> 
> We are also aware of ongoing work on guest_memfd. The current
> implementation unmaps guest memory from VMM address space, but leaves it
> in the kernel’s direct map. We’re not looking at unmapping from VMM
> userspace yet; we still need guest RAM there for PV drivers like virtio
> to continue to work. So KVM’s gmem doesn’t seem like the right solution?

We (and by "we", I really mean the pKVM folks) are also working on allowing
userspace to mmap() guest_memfd[*].  pKVM aside, the long term vision I have for
guest_memfd is to be able to use it for non-CoCo VMs, precisely for the security
and robustness benefits it can bring.

What I am hoping to do with guest_memfd is get userspace to only map memory it
needs, e.g. for emulated/synthetic devices, on-demand.  I.e. to get to a state
where guest memory is mapped only when it needs to be.  More below.

> With this in mind, what’s the best way to solve getting guest RAM out of
> the direct map? Is memfd_secret integration with KVM the way to go, or
> should we build a solution on top of guest_memfd, for example via some
> flag that causes it to leave memory in the host userspace’s page tables,
> but removes it from the direct map? 

100% enhance guest_memfd.  If you're willing to wait long enough, pKVM might even
do all the work for you. :-)

The killer feature of guest_memfd is that it allows the guest mappings to be a
superset of the host userspace mappings.  Most obviously, it allows mapping memory
into the guest without first mapping the memory into the userspace page
tables.  More subtly, it also makes it easier (in theory) to do things like map
the memory with 1GiB hugepages for the guest, but selectively map at 4KiB granularity
in the host.  Or map memory as RWX in the guest, but RO in the host (I don't have
a concrete use case for this, just pointing out it'll be trivial to do once
guest_memfd supports mmap()).

Every attempt to allow mapping VMA-based memory into a guest without it being
accessible by host userspace failed; it's literally why we ended up
implementing guest_memfd.  We could teach KVM to do the same with memfd_secret,
but we'd just end up re-implementing guest_memfd.

memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
sure you'll be fighting memfd_secret all the way.  E.g. it's not dumpable, it
deliberately allocates at 4KiB granularity (though I suspect the bug you found
means that it can be inadvertantly mapped with 2MiB hugepages), it has no line
of sight to taking userspace out of the equation, etc.

With guest_memfd on the other hand, everyone contributing to and maintaining it
has goals that are *very* closely aligned with what you want to do.

[*] https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 17:45       ` David Woodhouse
@ 2024-03-08 22:47         ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2024-03-08 22:47 UTC (permalink / raw)
  To: David Woodhouse
  Cc: David Matlack, Brendan Jackman, James Gowans, akpm, Patrick Roy,
	chao.p.peng, Derek Manwaring, rppt, pbonzini, Nikita Kalyazin,
	lstoakes, Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov,
	vbabka, mst, somlo, Alexander Graf, kvm, linux-coco

On Fri, Mar 08, 2024, David Woodhouse wrote:
> On Fri, 2024-03-08 at 09:35 -0800, David Matlack wrote:
> > I think what James is looking for (and what we are also interested
> > in), is _eliminating_ the ability to access guest memory from the
> > direct map entirely. And in general, eliminate the ability to access
> > guest memory in as many ways as possible.
> 
> Well, pKVM does that... 

Out-of-tree :-)

I'm not just being snarky; when pKVM lands this functionality upstream, I fully
expect zapping direct map entries to be generic guest_memfd functionality that
would be opt-in, either by the in-kernel technology, e.g. pKVM, or by userspace,
or by some combination of the two, e.g. I can see making it optional to nuke the
direct map when using guest_memfd for TDX guests so that rogue accesses from the
host generate synchronous #PFs instead of latent #MCs.


* Re:  Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 17:35     ` David Matlack
@ 2024-03-08 17:45       ` David Woodhouse
  2024-03-08 22:47         ` Sean Christopherson
  2024-03-09  2:45       ` Manwaring, Derek
  1 sibling, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2024-03-08 17:45 UTC (permalink / raw)
  To: David Matlack, Brendan Jackman
  Cc: Gowans, James, seanjc, akpm, Roy, Patrick, chao.p.peng,
	Manwaring, Derek, rppt, pbonzini, Kalyazin, Nikita, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Graf (AWS),
	Alexander, kvm, linux-coco

On Fri, 2024-03-08 at 09:35 -0800, David Matlack wrote:
> I think what James is looking for (and what we are also interested
> in), is _eliminating_ the ability to access guest memory from the
> direct map entirely. And in general, eliminate the ability to access
> guest memory in as many ways as possible.

Well, pKVM does that... 



* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 16:25   ` Brendan Jackman
@ 2024-03-08 17:35     ` David Matlack
  2024-03-08 17:45       ` David Woodhouse
  2024-03-09  2:45       ` Manwaring, Derek
  0 siblings, 2 replies; 21+ messages in thread
From: David Matlack @ 2024-03-08 17:35 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Gowans, James, seanjc, akpm, Roy, Patrick, chao.p.peng,
	Manwaring, Derek, rppt, pbonzini, Woodhouse, David, Kalyazin,
	Nikita, lstoakes, Liam.Howlett, linux-mm, qemu-devel,
	kirill.shutemov, vbabka, mst, somlo, Graf (AWS),
	Alexander, kvm, linux-coco

On Fri, Mar 8, 2024 at 8:25 AM Brendan Jackman <jackmanb@google.com> wrote:
>
> Hi James
>
> On Fri, 8 Mar 2024 at 16:50, Gowans, James <jgowans@amazon.com> wrote:
> > Our goal is to more completely address the class of issues whose leak
> > origin is categorized as "Mapped memory" [1].
>
> Did you forget a link below? I'm interested in hearing about that
> categorisation.
>
> > ... what’s the best way to solve getting guest RAM out of
> > the direct map?
>
> It's perhaps a bigger hammer than you are looking for, but the
> solution we're working on at Google is "Address Space Isolation" (ASI)
> - the latest posting about that is [2].
>
> The sense in which it's a bigger hammer is that it doesn't only
> support removing guest memory from the direct map, but rather
> arbitrary data from arbitrary kernel mappings.

I'm not sure if ASI provides a solution to the problem James is trying
to solve. ASI creates a separate "restricted" address space where, yes,
guest memory can be left unmapped. But any access to guest memory is
still allowed. An access will trigger a page fault, the kernel will
switch to the "full" kernel address space (flushing hardware buffers
along the way to prevent speculation), and then proceed. I.e. ASI
doesn't prevent accessing guest memory through the
direct map, it just prevents speculation of guest memory through the
direct map.

I think what James is looking for (and what we are also interested
in), is _eliminating_ the ability to access guest memory from the
direct map entirely. And in general, eliminate the ability to access
guest memory in as many ways as possible.

For that goal, I have been thinking about guest_memfd as a
solution. Yes, guest_memfd today is backed by pages of memory that are
mapped in the direct map. But what we can do is add the ability to
back guest_memfd by pages of memory that aren't in the direct map. I
haven't thought it fully through yet but something like... Hide the
majority of RAM from Linux (I believe there are kernel parameters to
do this) and hand it off to guest_memfd to allocate from as a source
of guest memory. Then the only way to access guest memory is to mmap()
a guest_memfd (e.g. for PV userspace devices).
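
(The "hide RAM from Linux" part could plausibly be done with existing
kernel parameters, e.g. something like

	memmap=512G$0x100000000

per kernel-parameters.txt, to reserve a physical range at boot that a
guest_memfd allocator could then hand out. The '$' usually needs escaping
in the bootloader config.)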


* Re: Unmapping KVM Guest Memory from Host Kernel
  2024-03-08 15:50 ` Gowans, James
@ 2024-03-08 16:25   ` Brendan Jackman
  2024-03-08 17:35     ` David Matlack
  2024-03-08 23:22   ` Sean Christopherson
  2024-03-09  5:01   ` Matthew Wilcox
  2 siblings, 1 reply; 21+ messages in thread
From: Brendan Jackman @ 2024-03-08 16:25 UTC (permalink / raw)
  To: Gowans, James
  Cc: seanjc, akpm, Roy, Patrick, chao.p.peng, Manwaring, Derek, rppt,
	pbonzini, Woodhouse, David, Kalyazin, Nikita, lstoakes,
	Liam.Howlett, linux-mm, qemu-devel, kirill.shutemov, vbabka, mst,
	somlo, Graf (AWS),
	Alexander, kvm, linux-coco

Hi James

On Fri, 8 Mar 2024 at 16:50, Gowans, James <jgowans@amazon.com> wrote:
> Our goal is to more completely address the class of issues whose leak
> origin is categorized as "Mapped memory" [1].

Did you forget a link below? I'm interested in hearing about that
categorisation.

> ... what’s the best way to solve getting guest RAM out of
> the direct map?

It's perhaps a bigger hammer than you are looking for, but the
solution we're working on at Google is "Address Space Isolation" (ASI)
- the latest posting about that is [2].

The sense in which it's a bigger hammer is that it doesn't only
support removing guest memory from the direct map, but rather
arbitrary data from arbitrary kernel mappings.

[2] https://lore.kernel.org/linux-mm/CA+i-1C169s8pyqZDx+iSnFmftmGfssdQA29+pYm-gqySAYWgpg@mail.gmail.com/


* Unmapping KVM Guest Memory from Host Kernel
@ 2024-03-08 15:50 ` Gowans, James
  2024-03-08 16:25   ` Brendan Jackman
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Gowans, James @ 2024-03-08 15:50 UTC (permalink / raw)
  To: seanjc, akpm, Roy, Patrick, chao.p.peng, Manwaring, Derek, rppt,
	pbonzini, Woodhouse, David
  Cc: Kalyazin, Nikita, lstoakes, Liam.Howlett, linux-mm, qemu-devel,
	kirill.shutemov, vbabka, mst, somlo, Graf (AWS),
	Alexander, kvm, linux-coco

Hello KVM, MM and memfd_secret folks,

Currently when using anonymous memory for KVM guest RAM, the memory all
remains mapped into the kernel direct map. We are looking at options to
get KVM guest memory out of the kernel’s direct map as a principled
approach to mitigating speculative execution issues in the host kernel.
Our goal is to more completely address the class of issues whose leak
origin is categorized as "Mapped memory" [1].

We currently have downstream-only solutions to this, but we want to move
to purely upstream code.

So far we have been looking at using memfd_secret, which seems to be
designed exactly for usecases where it is undesirable to have some
memory range accessible through the kernel’s direct map.

However, memfd_secret doesn’t work out the box for KVM guest memory; the
main reason seems to be that the GUP path is intentionally disabled for
memfd_secret, so if we use a memfd_secret backed VMA for a memslot then
KVM is not able to fault the memory in. If it’s been pre-faulted in by
userspace then it seems to work.

There are a few other issues around when KVM accesses the guest memory.
For example the KVM PV clock code goes directly to the PFN via the
pfncache, and that also breaks if the PFN is not in the direct map, so
we’d need to change that sort of thing, perhaps going via userspace
addresses.

If we remove the memfd_secret check from the GUP path, and disable KVM’s
pvclock from userspace via KVM_CPUID_FEATURES, we are able to boot a
simple Linux initrd using a Firecracker VMM modified to use
memfd_secret.
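
For reference, the userspace side of that experiment looks roughly like
this (error handling omitted; memfd_secret(2) has no glibc wrapper, and
vm_fd is the fd returned by KVM_CREATE_VM):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/kvm.h>

	int fd = syscall(__NR_memfd_secret, 0);
	ftruncate(fd, guest_ram_size);

	void *ram = mmap(NULL, guest_ram_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	memset(ram, 0, guest_ram_size);	/* pre-fault: gup is disabled for secretmem */

	struct kvm_userspace_memory_region region = {
		.slot            = 0,
		.guest_phys_addr = 0,
		.memory_size     = guest_ram_size,
		.userspace_addr  = (unsigned long)ram,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);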

We are also aware of ongoing work on guest_memfd. The current
implementation unmaps guest memory from VMM address space, but leaves it
in the kernel’s direct map. We’re not looking at unmapping from VMM
userspace yet; we still need guest RAM there for PV drivers like virtio
to continue to work. So KVM’s gmem doesn’t seem like the right solution?

With this in mind, what’s the best way to solve getting guest RAM out of
the direct map? Is memfd_secret integration with KVM the way to go, or
should we build a solution on top of guest_memfd, for example via some
flag that causes it to leave memory in the host userspace’s page tables,
but removes it from the direct map? 

We are keen to help contribute to getting this working, we’re just
looking for guidance from maintainers on what the correct way to solve
this is.

Cheers,
James + colleagues Derek and Patrick

