* AMD SEV-SNP/Intel TDX: validation of memory pages
@ 2021-02-02  1:51 David Rientjes
  2021-02-02 13:17 ` Matthew Wilcox
                   ` (5 more replies)
  0 siblings, 6 replies; 40+ messages in thread
From: David Rientjes @ 2021-02-02  1:51 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Andi Kleen, Brijesh Singh,
	Tom Lendacky, Jon Grimm, Thomas Gleixner, Christoph Hellwig,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Joerg Roedel
  Cc: x86, linux-mm

Hi everybody,

I'd like to kick-start the discussion on lazy validation of guest memory
for the purposes of AMD SEV-SNP and Intel TDX.

Both AMD SEV-SNP and Intel TDX require validation of guest memory before
it may be used by the guest.  This is needed for integrity protection from
a potentially malicious hypervisor or other host components.

For AMD SEV-SNP, the hypervisor assigns a page to the guest using the new
RMPUPDATE instruction.  The guest then transitions the page to a usable state with
the new PVALIDATE instruction[1].  This sets the Validated flag in the
Reverse Map Table (RMP) for a guest addressable page, which opts into
hardware and firmware integrity protection.  This may only be done by the
guest itself and until that time, the guest cannot access the page.

The guest can only PVALIDATE memory for a gPA once; the RMP then
guarantees for each hPA that there is only a single gPA mapping.  This
validation can either be done all up front at the time the guest is booted
or it can be done lazily at runtime on fault if the guest keeps track of
Valid vs Invalid pages.  Because doing PVALIDATE for all guest memory at
boot would be extremely lengthy, I'd like to discuss the options for doing
it lazily.
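
For reference, a guest-side wrapper could look something like the
untested sketch below, based only on the instruction encoding and
operands described in the APM[1] (the error convention is made up):

	/* PVALIDATE (F2 0F 01 FF): rAX = vaddr, rCX = page size
	 * (0 = 4K, 1 = 2M), rDX = 1 to validate / 0 to rescind.
	 * CF=1 means the RMP entry was already in the requested state. */
	static int pvalidate(unsigned long vaddr, int rmp_psize, int validate)
	{
		u8 no_rmpupdate;
		int rc;

		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
			     "setc %0\n\t"
			     : "=r" (no_rmpupdate), "=a" (rc)
			     : "a" (vaddr), "c" (rmp_psize), "d" (validate)
			     : "memory", "cc");

		return no_rmpupdate ? -EAGAIN : rc;
	}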

Similarly, for Intel TDX, the hypervisor unmaps the gPA from the shared
EPT and invalidates the tlb and all caches for the TD's vcpus; it then
adds a page to the gPA address space for a TD by using the new
TDH.MEM.PAGE.AUG call.  The TDG.MEM.PAGE.ACCEPT TDCALL[2] then allows a
guest to accept a guest page for a gPA and initialize it using the private
key for that TD.  This may only be done by the TD itself and until that
time, the gPA cannot be used within the TD.
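
The accept side could be wrapped similarly; again an untested sketch
using only the TDCALL leaf number and GPA-plus-level operand from the
spec[2]:

	#define TDG_MEM_PAGE_ACCEPT	6	/* TDCALL leaf number, per [2] */

	/* TDCALL (66 0F 01 CC): RAX = leaf, RCX = gPA with the secure-EPT
	 * mapping level encoded in its low bits; status returned in RAX. */
	static u64 tdg_mem_page_accept(u64 gpa, int level)
	{
		u64 ret;

		asm volatile(".byte 0x66, 0x0F, 0x01, 0xCC"
			     : "=a" (ret)
			     : "a" ((u64)TDG_MEM_PAGE_ACCEPT), "c" (gpa | level)
			     : "memory");

		return ret;	/* 0 on success */
	}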

Both AMD SEV-SNP and Intel TDX support hugepages.  SEV-SNP supports 2MB
pages, whereas the TDX accept TDCALL supports both 2MB and 1GB.

I believe the UEFI ECR[3] adding an unaccepted memory type to
EFI_MEMORY_TYPE was accepted in December.  This should enable the guest to
learn what memory has not yet been validated (or accepted) by the firmware
if validation of all guest memory is not done completely up front.

This likely requires a pre-validation of all memory that can be accessed
when handling a #VC (or #VE for TDX) such as IST stacks, including memory
in the x86 boot sequence that must be validated before the core mm
subsystem is up and running to handle the lazy validation.  I believe
lazy validation can be done by the core mm after that, perhaps by
maintaining a new "validated" bit in struct page flags.

Has anybody looked into this or, even better, is anybody currently working
on this?

I think quite invasive changes are needed for the guest to support lazy
validation/acceptance to core areas that lots of people on the recipient
list have strong opinions about.  Some things that come to mind:

 - Annotations for pages that must be pre-validated in the x86 boot
   sequence, including IST stacks

 - Proliferation of these annotations throughout any kernel code that can
   access memory for #VC or #VE

 - Handling lazy validation of guest memory through the core mm layer,
   most likely involving a bit in struct page flags to track their status

 - Any need for validating memory that is not backed by struct page that
   needs to be special-cased

 - Any concerns about this for the DMA layer

One possibility for minimal disruption to the boot entry code is to
require the guest BIOS to validate 4GB and below, and then leave 4GB and
above to be done lazily (the true amount of memory will actually be less
due to the MMIO hole).

Thanks!

 [1] https://www.amd.com/system/files/TechDocs/56860.pdf
 [2] https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
 [3] https://github.com/microsoft/mu_basecore/blob/aa16ee6518b521a7d8101c34d2884ae09ef78bce/Unaccepted%20Memory%20UEFI%20ECR.md



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02  1:51 AMD SEV-SNP/Intel TDX: validation of memory pages David Rientjes
@ 2021-02-02 13:17 ` Matthew Wilcox
  2021-02-02 16:02 ` Kirill A. Shutemov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2021-02-02 13:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Andi Kleen, Brijesh Singh,
	Tom Lendacky, Jon Grimm, Thomas Gleixner, Christoph Hellwig,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Joerg Roedel, x86,
	linux-mm

On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
>  - Handling lazy validation of guest memory through the core mm layer,
>    most likely involving a bit in struct page flags to track their status

I'd like to not use a new bit.  I think it can be adequately expressed as
PageReserved, augmented with another bit (say, PG_uptodate) to indicate
why it's reserved.
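
Something like this sketch, say (the exact encoding is up for grabs):

	/* Reserved says "don't touch"; Uptodate records whether the page
	 * has been validated/accepted yet. */
	static inline bool page_needs_accept(struct page *page)
	{
		return PageReserved(page) && !PageUptodate(page);
	}

	static inline void page_mark_accepted(struct page *page)
	{
		SetPageUptodate(page);
	}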



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02  1:51 AMD SEV-SNP/Intel TDX: validation of memory pages David Rientjes
  2021-02-02 13:17 ` Matthew Wilcox
@ 2021-02-02 16:02 ` Kirill A. Shutemov
  2021-02-03  0:16   ` Brijesh Singh
  2021-02-02 22:37 ` Andi Kleen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2021-02-02 16:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Joerg Roedel, x86, linux-mm

On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
> Hi everybody,
> 
> I'd like to kick-start the discussion on lazy validation of guest memory
> for the purposes of AMD SEV-SNP and Intel TDX.
> 
> Both AMD SEV-SNP and Intel TDX require validation of guest memory before
> it may be used by the guest.  This is needed for integrity protection from
> a potentially malicious hypervisor or other host components.
> 
> For AMD SEV-SNP, the hypervisor assigns a page to the guest using the new
> RMPUPDATE instruction.  The guest then transitions the page to a usable state with
> the new PVALIDATE instruction[1].  This sets the Validated flag in the
> Reverse Map Table (RMP) for a guest addressable page, which opts into
> hardware and firmware integrity protection.  This may only be done by the
> guest itself and until that time, the guest cannot access the page.
> 
> The guest can only PVALIDATE memory for a gPA once; the RMP then
> guarantees for each hPA that there is only a single gPA mapping.  This
> validation can either be done all up front at the time the guest is booted
> or it can be done lazily at runtime on fault if the guest keeps track of
> Valid vs Invalid pages.  Because doing PVALIDATE for all guest memory at
> boot would be extremely lengthy, I'd like to discuss the options for doing
> it lazily.
> 
> Similarly, for Intel TDX, the hypervisor unmaps the gPA from the shared
> EPT and invalidates the tlb and all caches for the TD's vcpus; it then
> adds a page to the gPA address space for a TD by using the new
> TDH.MEM.PAGE.AUG call.  The TDG.MEM.PAGE.ACCEPT TDCALL[2] then allows a
> guest to accept a guest page for a gPA and initialize it using the private
> key for that TD.  This may only be done by the TD itself and until that
> time, the gPA cannot be used within the TD.
> 
> Both AMD SEV-SNP and Intel TDX support hugepages.  SEV-SNP supports 2MB
> pages, whereas the TDX accept TDCALL supports both 2MB and 1GB.
>
> I believe the UEFI ECR[3] adding an unaccepted memory type to
> EFI_MEMORY_TYPE was accepted in December.  This should enable the guest to
> learn what memory has not yet been validated (or accepted) by the firmware
> if validation of all guest memory is not done completely up front.
> 
> This likely requires a pre-validation of all memory that can be accessed
> when handling a #VC (or #VE for TDX) such as IST stacks, including memory
> in the x86 boot sequence that must be validated before the core mm
> subsystem is up and running to handle the lazy validation.  I believe
> lazy validation can be done by the core mm after that, perhaps by
> maintaining a new "validated" bit in struct page flags.
> 
> Has anybody looked into this or, even better, is anybody currently working
> on this?

It's likely I'm going to do this on Intel side, but I have not looked
deeply into it.

> I think quite invasive changes are needed for the guest to support lazy
> validation/acceptance to core areas that lots of people on the recipient
> list have strong opinions about.  Some things that come to mind:
> 
>  - Annotations for pages that must be pre-validated in the x86 boot
>    sequence, including IST stacks
> 
>  - Proliferation of these annotations throughout any kernel code that can
>    access memory for #VC or #VE
> 
>  - Handling lazy validation of guest memory through the core mm layer,
>    most likely involving a bit in struct page flags to track their status
> 
>  - Any need for validating memory that is not backed by struct page that
>    needs to be special-cased
> 
>  - Any concerns about this for the DMA layer
> 
> One possibility for minimal disruption to the boot entry code is to
> require the guest BIOS to validate 4GB and below, and then leave 4GB and
> above to be done lazily (the true amount of memory will actually be less
> due to the MMIO hole).

[ As I haven't looked into the actual code, I may say total garbage below... ]

Pre-validating 4GB would indeed be the easiest way to go, but it's going
to be too slow.

More realistic is for the BIOS to pre-validate the memory where the
kernel and initrd are placed, plus a few dozen megs for runtime.  It means
the decompression code would need to be aware of the validation.

The critical thing is that once memory is validated we must not validate
it again; it's a possible VMM->guest attack vector.  We must track
precisely which memory has been validated and stop the guest on detecting
an unexpected second validation request.

It also means that we have to keep the information when control gets
passed from the decompression code to the real kernel.  A page flag is no
good for this.

My initial thought is that we can use the e820/EFI memmap to keep track
of the information -- remove the unaccepted memory flag from the range
that got accepted.

The decompression code validates the memory it needs for decompression,
modifies the memmap accordingly, and passes control to the main kernel.
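
Something along these lines, with a made-up E820_TYPE_UNACCEPTED and
arch_accept_page() standing in for PVALIDATE/TDG.MEM.PAGE.ACCEPT:

	/* Sketch: accept a prefix of an unaccepted e820 entry and shrink
	 * the entry, so the main kernel never validates that range again. */
	static void accept_prefix(struct boot_e820_entry *e, u64 end)
	{
		u64 p;

		for (p = e->addr; p < end; p += PAGE_SIZE)
			arch_accept_page(p);

		e->size -= end - e->addr;	/* rest stays unaccepted */
		e->addr = end;
	}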

The main kernel may accept the memory via #VE/#VC, but ideally it needs
to stay within the memory accepted by the decompression code for initial
boot.

I think the bulk of memory validation can be done via existing machinery:
we already have deferred struct page initialization code in the kernel
and I believe we can hook into it for this purpose.

Any comments?

-- 
 Kirill A. Shutemov



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02  1:51 AMD SEV-SNP/Intel TDX: validation of memory pages David Rientjes
  2021-02-02 13:17 ` Matthew Wilcox
  2021-02-02 16:02 ` Kirill A. Shutemov
@ 2021-02-02 22:37 ` Andi Kleen
  2021-02-11 20:46 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2021-02-02 22:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Joerg Roedel, x86, linux-mm

On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
> the new PVALIDATE instruction[1].  This sets the Validated flag in the
> Reverse Map Table (RMP) for a guest addressable page, which opts into
> hardware and firmware integrity protection.  This may only be done by the
> guest itself and until that time, the guest cannot access the page.

Another important point is that we need to reject (panic on) any accepts
for memory that has already been accepted, to avoid an attacker replacing
memory. This means that all memory requires some tracking metadata.
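
E.g. something like the sketch below, with the bitmap being that
metadata (all names made up):

	static void accept_page_once(unsigned long pfn)
	{
		/* A second accept of the same pfn means the HV is trying
		 * to replace the page underneath us: refuse to continue. */
		if (test_and_set_bit(pfn, accepted_bitmap))
			panic("double accept of pfn %#lx\n", pfn);

		arch_accept_page(PFN_PHYS(pfn));	/* hypothetical helper */
	}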

>  - Any need for validating memory that is not backed by struct page that
>    needs to be special-cased

We may not have a struct page for firmware structures, for example.

> 
>  - Any concerns about this for the DMA layer

Handling would be needed for directly assigned devices, because they
could do DMA to not-yet-accepted memory.

> 
> One possibility for minimal disruption to the boot entry code is to
> require the guest BIOS to validate 4GB and below, and then leave 4GB and
> above to be done lazily (the true amount of memory will actually be less
> due to the MMIO hole).

This would seem fragile to me, requiring Linux to never access any
memory >4GB early. Better would be if Linux accepts everything it
needs early by itself.

-Andi



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02 16:02 ` Kirill A. Shutemov
@ 2021-02-03  0:16   ` Brijesh Singh
  2021-02-11 17:46     ` Sean Christopherson
  0 siblings, 1 reply; 40+ messages in thread
From: Brijesh Singh @ 2021-02-03  0:16 UTC (permalink / raw)
  To: Kirill A. Shutemov, David Rientjes
  Cc: brijesh.singh, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Andi Kleen, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Joerg Roedel, x86, linux-mm


On 2/2/21 10:02 AM, Kirill A. Shutemov wrote:
> On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
>> Hi everybody,
>>
>> I'd like to kick-start the discussion on lazy validation of guest memory
>> for the purposes of AMD SEV-SNP and Intel TDX.
>>
>> Both AMD SEV-SNP and Intel TDX require validation of guest memory before
>> it may be used by the guest.  This is needed for integrity protection from
>> a potentially malicious hypervisor or other host components.
>>
>> For AMD SEV-SNP, the hypervisor assigns a page to the guest using the new
>> RMPUPDATE instruction.  The guest then transitions the page to a usable state with
>> the new PVALIDATE instruction[1].  This sets the Validated flag in the
>> Reverse Map Table (RMP) for a guest addressable page, which opts into
>> hardware and firmware integrity protection.  This may only be done by the
>> guest itself and until that time, the guest cannot access the page.
>>
>> The guest can only PVALIDATE memory for a gPA once; the RMP then
>> guarantees for each hPA that there is only a single gPA mapping.  This
>> validation can either be done all up front at the time the guest is booted
>> or it can be done lazily at runtime on fault if the guest keeps track of
>> Valid vs Invalid pages.  Because doing PVALIDATE for all guest memory at
>> boot would be extremely lengthy, I'd like to discuss the options for doing
>> it lazily.
>>
>> Similarly, for Intel TDX, the hypervisor unmaps the gPA from the shared
>> EPT and invalidates the tlb and all caches for the TD's vcpus; it then
>> adds a page to the gPA address space for a TD by using the new
>> TDH.MEM.PAGE.AUG call.  The TDG.MEM.PAGE.ACCEPT TDCALL[2] then allows a
>> guest to accept a guest page for a gPA and initialize it using the private
>> key for that TD.  This may only be done by the TD itself and until that
>> time, the gPA cannot be used within the TD.
>>
>> Both AMD SEV-SNP and Intel TDX support hugepages.  SEV-SNP supports 2MB
>> pages, whereas the TDX accept TDCALL supports both 2MB and 1GB.
>>
>> I believe the UEFI ECR[3] adding an unaccepted memory type to
>> EFI_MEMORY_TYPE was accepted in December.  This should enable the guest to
>> learn what memory has not yet been validated (or accepted) by the firmware
>> if validation of all guest memory is not done completely up front.
>>
>> This likely requires a pre-validation of all memory that can be accessed
>> when handling a #VC (or #VE for TDX) such as IST stacks, including memory
>> in the x86 boot sequence that must be validated before the core mm
>> subsystem is up and running to handle the lazy validation.  I believe
>> lazy validation can be done by the core mm after that, perhaps by
>> maintaining a new "validated" bit in struct page flags.
>>
>> Has anybody looked into this or, even better, is anybody currently working
>> on this?
> It's likely I'm going to do this on Intel side, but I have not looked
> deeply into it.
>
>> I think quite invasive changes are needed for the guest to support lazy
>> validation/acceptance to core areas that lots of people on the recipient
>> list have strong opinions about.  Some things that come to mind:
>>
>>  - Annotations for pages that must be pre-validated in the x86 boot
>>    sequence, including IST stacks
>>
>>  - Proliferation of these annotations throughout any kernel code that can
>>    access memory for #VC or #VE
>>
>>  - Handling lazy validation of guest memory through the core mm layer,
>>    most likely involving a bit in struct page flags to track their status
>>
>>  - Any need for validating memory that is not backed by struct page that
>>    needs to be special-cased
>>
>>  - Any concerns about this for the DMA layer
>>
>> One possibility for minimal disruption to the boot entry code is to
>> require the guest BIOS to validate 4GB and below, and then leave 4GB and
>> above to be done lazily (the true amount of memory will actually be less
>> due to the MMIO hole).
> [ As I haven't looked into the actual code, I may say total garbage below... ]
>
> Pre-validating 4GB would indeed be the easiest way to go, but it's going
> to be too slow.
>
> More realistic is for the BIOS to pre-validate the memory where the
> kernel and initrd are placed, plus a few dozen megs for runtime. It means
> the decompression code would need to be aware of the validation.


I was thinking that the BIOS validating the lower 4GB would simplify the
changes to the kernel entry code path as well as provide a clean approach
to supporting kexec.

My initial thought is:

- BIOS or VMM validates the lower 4GB of memory.

- BIOS marks memory above 4GB as unaccepted in the e820/EFI memmap.

- Kernel early boot can be achieved with minimal (or no) changes.

- If an unaccepted type is discovered, then allocate a bitmap that can be
used to keep track of which pages are validated.  We can also explore
whether removing the unaccepted flag from the memmap range will work.

- On #VC/#VE, look at the bitmap to see if we need to validate the pages.
To speed things up, we can validate more than one page per #VC/#VE (see
the sketch below).

- If we get kexec'd, then rebuild the e820/memmap based on the bitmap so
that we don't double-validate.
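
Roughly like this (all names hypothetical, completely untested):

	#define ACCEPT_BATCH	64	/* 64 x 4K = 256K per #VC/#VE; tunable */

	static void vc_lazy_validate(unsigned long fault_pfn)
	{
		unsigned long pfn = ALIGN_DOWN(fault_pfn, ACCEPT_BATCH);
		unsigned int i;

		for (i = 0; i < ACCEPT_BATCH; i++) {
			if (test_bit(pfn + i, validated_bitmap))
				continue;	/* never validate twice */
			arch_accept_page(PFN_PHYS(pfn + i));
			set_bit(pfn + i, validated_bitmap);
		}
	}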


>
> The critical thing is that once memory is validated we must not validate
> it again; it's a possible VMM->guest attack vector. We must track
> precisely which memory has been validated and stop the guest on detecting
> an unexpected second validation request.
>
> It also means that we have to keep the information when control gets
> passed from the decompression code to the real kernel. A page flag is no
> good for this.
>
> My initial thought is that we can use the e820/EFI memmap to keep track
> of the information -- remove the unaccepted memory flag from the range
> that got accepted.
>
> The decompression code validates the memory it needs for decompression,
> modifies the memmap accordingly, and passes control to the main kernel.
>
> The main kernel may accept the memory via #VE/#VC, but ideally it needs
> to stay within the memory accepted by the decompression code for initial
> boot.
>
> I think the bulk of memory validation can be done via existing machinery:
> we already have deferred struct page initialization code in the kernel
> and I believe we can hook into it for this purpose.
>
> Any comments?
>



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-03  0:16   ` Brijesh Singh
@ 2021-02-11 17:46     ` Sean Christopherson
  0 siblings, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2021-02-11 17:46 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: Kirill A. Shutemov, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Andi Kleen, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Joerg Roedel, x86, linux-mm

On Tue, Feb 02, 2021, Brijesh Singh wrote:
> 
> On 2/2/21 10:02 AM, Kirill A. Shutemov wrote:
> > On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
> >> One possibility for minimal disruption to the boot entry code is to
> >> require the guest BIOS to validate 4GB and below, and then leave 4GB and
> >> above to be done lazily (the true amount of memory will actually be less
> >> due to the MMIO hole).
> > [ As I haven't looked into the actual code, I may say total garbage below... ]
> >
> > Pre-validating 4GB would indeed be the easiest way to go, but it's going
> > to be too slow.
> >
> > More realistic is for the BIOS to pre-validate the memory where the
> > kernel and initrd are placed, plus a few dozen megs for runtime. It
> > means the decompression code would need to be aware of the validation.
> 
> I was thinking that the BIOS validating the lower 4GB would simplify the
> changes to the kernel entry code path as well as provide a clean approach
> to supporting kexec.
> 
> My initial thought is
> 
> - BIOS or VMM validates the lower 4GB of memory.

I think we need to treat this as a "plan for the worst, hope for the best"
scenario.  I agree that validating all of memory below 4GB would be simpler, but
there's no guarantee that that approach will be fast enough.  Even if it's
sufficient for traditional VMs, inevitably someone will come up with a use case
that wants/needs even shorter boot times.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02  1:51 AMD SEV-SNP/Intel TDX: validation of memory pages David Rientjes
                   ` (2 preceding siblings ...)
  2021-02-02 22:37 ` Andi Kleen
@ 2021-02-11 20:46 ` Peter Zijlstra
  2021-02-12 13:19 ` Joerg Roedel
  2021-03-23  9:33 ` Joerg Roedel
  5 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-11 20:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Andi Kleen, Brijesh Singh,
	Tom Lendacky, Jon Grimm, Thomas Gleixner, Christoph Hellwig,
	Paolo Bonzini, Ingo Molnar, Joerg Roedel, x86, linux-mm

On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
> I think quite invasive changes are needed for the guest to support lazy
> validation/acceptance to core areas that lots of people on the recipient
> list have strong opinions about.  Some things that come to mind:
> 
>  - Annotations for pages that must be pre-validated in the x86 boot
>    sequence, including IST stacks
> 
>  - Proliferation of these annotations throughout any kernel code that can
>    access memory for #VC or #VE

Kernel code that is critical should already be covered by the noinstr
annotation. Data that is used from noinstr should ideally be placed in
noinstr data sections, but that is currently still a TODO.

This is all required for correct functioning of the entry code on
native, but seems to nicely line up with the TDX requirements.

The thing we'll not accept is making #VE an IST.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02  1:51 AMD SEV-SNP/Intel TDX: validation of memory pages David Rientjes
                   ` (3 preceding siblings ...)
  2021-02-11 20:46 ` Peter Zijlstra
@ 2021-02-12 13:19 ` Joerg Roedel
  2021-02-12 14:17   ` Peter Zijlstra
  2021-03-23  9:33 ` Joerg Roedel
  5 siblings, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-12 13:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Andi Kleen, Brijesh Singh,
	Tom Lendacky, Jon Grimm, Thomas Gleixner, Christoph Hellwig,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, x86, linux-mm

Hi,

On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
> I think quite invasive changes are needed for the guest to support lazy
> validation/acceptance to core areas that lots of people on the recipient
> list have strong opinions about.  Some things that come to mind:
> 
>  - Annotations for pages that must be pre-validated in the x86 boot
>    sequence, including IST stacks

The problem is two-fold:

	1. How to pass the information about pre-validated memory
	   through the whole boot process. This includes firmware,
	   bootloader, kernel decompressor and the early Linux boot
	   code.

	2. How to keep track of validated memory at system runtime.

	(I omit the kexec case for now)

For 1. the best option I see is passing this information as part of the
Linux boot protocol:

	- First we should require from the firmware that everything in
	  the memmap which is not marked as E820 Usable is already
	  validated. This includes any memory used by the firmware
	  itself.

	- Then we can pass this information up the boot process by
	  extending struct boot_params. The bootloader can pass which
	  E820 usable memory it validated, same for the kernel
	  decompressor. The text+data (but not bss) of the running
	  kernel image is by definition validated by the decompressor
	  and does not need to be added explicitly to boot_params.

	- Once the running kernel image took over there are multiple
	  choices. The simplest is probably to keep a validation log,
	  but other data structures can work too until the memmap is set
	  up.

The benefit of passing it up via boot_params is that the whole process
does not necessarily rely on EFI. There are some use-cases for SNP and
TDX like lightweight container runtimes with only minimal or even no
firmware.
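
For illustration only (nothing like this exists yet), the passed
information could look like:

	struct validated_range {
		__u64	start;
		__u64	end;
	};

	/* Hypothetical boot_params addition; each boot stage appends the
	 * ranges it validated before handing over control. */
	struct validated_mem_info {
		__u32			nr_ranges;
		__u32			pad;
		struct validated_range	ranges[32];	/* arbitrary limit */
	};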

>  - Proliferation of these annotations throughout any kernel code that can
>    access memory for #VC or #VE
> 
>  - Handling lazy validation of guest memory through the core mm layer,
>    most likely involving a bit in struct page flags to track their status

Yes, I would also prefer tracking the validation info in 'struct page',
once those are set up. We can discuss different options; there is no
strict need to use struct page, it just seems natural to me.

>  - Any need for validating memory that is not backed by struct page that
>    needs to be special-cased

When everything that is not E820 Usable is already validated, this
should be a no-brainer. Memory shared with the HV and MMIO memory is not
guest-private anyway, so there should be no issues. Just a page_is_ram()
check in the #VC/#VE handler is needed.
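
I.e., as a sketch:

	static bool vc_should_lazy_validate(unsigned long paddr)
	{
		/* MMIO and HV-shared memory is not guest-private:
		 * never validate it on fault. */
		return page_is_ram(PHYS_PFN(paddr));
	}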

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 13:19 ` Joerg Roedel
@ 2021-02-12 14:17   ` Peter Zijlstra
  2021-02-12 14:53     ` Joerg Roedel
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-12 14:17 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 02:19:07PM +0100, Joerg Roedel wrote:
> 	- Then we can pass this information up the boot process by
> 	  extending struct boot_params. The bootloader can pass which
> 	  E820 usable memory it validated, same for the kernel
> 	  decompressor. The text+data (but not bss) of the running
> 	  kernel image is by definition validated by the decompressor
> 	  and does not need to be added explicitly to boot_params.

Even if all text+data is prevalidated, we'll probably still need some
prevalidated bss and certainly some prevalidated percpu data (like the
various stacks, but also crud like the percpu variable we store the DR7
shadow in etc..).



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 14:17   ` Peter Zijlstra
@ 2021-02-12 14:53     ` Joerg Roedel
  2021-02-12 15:19       ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-12 14:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 03:17:57PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 12, 2021 at 02:19:07PM +0100, Joerg Roedel wrote:
> > 	- Then we can pass this information up the boot process by
> > 	  extending struct boot_params. The bootloader can pass which
> > 	  E820 usable memory it validated, same for the kernel
> > 	  decompressor. The text+data (but not bss) of the running
> > 	  kernel image is by definition validated by the decompressor
> > 	  and does not need to be added explicitly to boot_params.
> 
> Even if all text+data is prevalidated, we'll probably still need some
> prevalidated bss and certainly some prevalidated percpu data (like the
> various stacks, but also crud like the percpu variable we store the DR7
> shadow in etc..).

The kernel sets up early exception handling in head_64.S, right after
setting MSR_GS_BASE. So per-cpu data can probably be validated on-demand.
For bss you might be right.

There is a special .bss.decrypted section for SEV which is shared with
the HV. That section also contains the boot_ghcb used for booting and
AP bringup. That one needs to be set up at this point.

So maybe bss should be prevalidated by the decompressor too, and when the
kernel starts it makes the .bss.decrypted section shared again.
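
For reference, the existing declaration looks roughly like this:

	/* From the SEV-ES code: placed in the special section that gets
	 * mapped decrypted (shared with the HV) during early boot. */
	static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);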

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 14:53     ` Joerg Roedel
@ 2021-02-12 15:19       ` Peter Zijlstra
  2021-02-12 15:28         ` Joerg Roedel
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-12 15:19 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 03:53:18PM +0100, Joerg Roedel wrote:
> On Fri, Feb 12, 2021 at 03:17:57PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 12, 2021 at 02:19:07PM +0100, Joerg Roedel wrote:
> > > 	- Then we can pass this information up the boot process by
> > > 	  extending struct boot_params. The bootloader can pass which
> > > 	  E820 usable memory it validated, same for the kernel
> > > 	  decompressor. The text+data (but not bss) of the running
> > > 	  kernel image is by definition validated by the decompressor
> > > 	  and does not need to be added explicitly to boot_params.
> > 
> > Even if all text+data is prevalidated, we'll probably still need some
> > prevalidated bss and certainly some prevalidated percpu data (like the
> > various stacks, but also crud like the percpu variable we store the DR7
> > shadow in etc..).
> 
> The kernel sets up early exception handling in head_64.S, right after
> setting MSR_GS_BASE. So per-cpu data can probably be validated on-demand.
> For bss you might be right.

That's the thing, we don't want #VE to happen in noinstr code *ever*.

noinstr covers the whole entry code, things like the syscall gap and nmi
recursion setup. Getting a #VE there is fail.

So most per-cpu data can be on-demand, but some of it must absolutely
not be.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 15:19       ` Peter Zijlstra
@ 2021-02-12 15:28         ` Joerg Roedel
  2021-02-12 16:12           ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-12 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 04:19:22PM +0100, Peter Zijlstra wrote:
> That's the thing, we don't want #VE to happen in noinstr code *ever*.
> 
> noinstr covers the whole entry code, things like the syscall gap and nmi
> recursion setup. Getting a #VE there is fail.

I don't know the details about TDX and #VE, but could a malicious HV not
trigger a #VE basically everywhere by mapping around pages? So 'fail'
means panic() in this case, right?

> So most per-cpu data can be on-demand, but some of it must absolutely
> not be.

The kernel can validate those itself in the early setup code, the
decompressor shouldn't care about those details.

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 15:28         ` Joerg Roedel
@ 2021-02-12 16:12           ` Peter Zijlstra
  2021-02-12 16:18             ` Joerg Roedel
  2021-02-12 21:42             ` Andi Kleen
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-12 16:12 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 04:28:13PM +0100, Joerg Roedel wrote:
> On Fri, Feb 12, 2021 at 04:19:22PM +0100, Peter Zijlstra wrote:
> > That's the thing, we don't want #VE to happen in noinstr code *ever*.
> > 
> > noinstr covers the whole entry code, things like the syscall gap and nmi
> > recursion setup. Getting a #VE there is fail.
> 
> I don't know the details about TDX and #VE, but could a malicious HV not
> trigger a #VE basically everywhere by mapping around pages? So 'fail'
> means panic() in this case, right?

Right.

> > So most per-cpu data can be on-demand, but some of it must absolutely
> > not be.
> 
> The kernel can validate those itself in the early setup code, the
> decompressor shouldn't care about those details.

Oh sure, I was just pointing out it needs to be done before normal
operation.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 16:12           ` Peter Zijlstra
@ 2021-02-12 16:18             ` Joerg Roedel
  2021-02-12 16:45               ` Peter Zijlstra
  2021-02-12 21:42             ` Andi Kleen
  1 sibling, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-12 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 05:12:41PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 12, 2021 at 04:28:13PM +0100, Joerg Roedel wrote:
> > I don't know the details about TDX and #VE, but could a malicious HV not
> > trigger a #VE basically everywhere by mapping around pages? So 'fail'
> > means panic() in this case, right?
> 
> Right.

To fail reliably, doesn't that mean the #VE vector needs to be IST?
"Everywhere" could also be in the SYSCALL entry path before there is a
trusted stack.

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 16:18             ` Joerg Roedel
@ 2021-02-12 16:45               ` Peter Zijlstra
  2021-02-12 17:48                 ` Dave Hansen
  2021-02-16 10:00                 ` Joerg Roedel
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-12 16:45 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 05:18:49PM +0100, Joerg Roedel wrote:
> On Fri, Feb 12, 2021 at 05:12:41PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 12, 2021 at 04:28:13PM +0100, Joerg Roedel wrote:
> > > I don't know the details about TDX and #VE, but could a malicious HV not
> > > trigger a #VE basically everywhere by mapping around pages? So 'fail'
> > > means panic() in this case, right?
> > 
> > Right.
> 
> To fail reliably, doesn't that mean the #VE vector needs to be IST?
> "Everywhere" could also be in the SYSCALL entry path before there is a
> trusted stack.

I really don't want #VE to be IST.  I really *really* detest ISTs,
they're an unmitigated trainwreck.

But you're right, if a HV injects #VE in the syscall gap and gets a
concurrent CPU to 'fix' the exception frame (which then lives on the
user stack) the handler might never know it went ga-ga.

Is this something the TDX threat model covers? A malicious HV and a TDX
guest co-operating to bring down the guest kernel.




* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 16:45               ` Peter Zijlstra
@ 2021-02-12 17:48                 ` Dave Hansen
  2021-02-12 18:22                   ` Sean Christopherson
  2021-02-16 10:00                 ` Joerg Roedel
  1 sibling, 1 reply; 40+ messages in thread
From: Dave Hansen @ 2021-02-12 17:48 UTC (permalink / raw)
  To: Peter Zijlstra, Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On 2/12/21 8:45 AM, Peter Zijlstra wrote:
> But you're right, if a HV injects #VE in the syscall gap and gets a
> concurrent CPU to 'fix' the exception frame (which then lives on the
> user stack) the handler might never know it went ga-ga.
> 
> Is this something the TDX threat model covers? A malicious HV and a TDX
> guest co-operating to bring down the guest kernel.

I'll say this: The current TDX guest code that Sathya posted is
predicated on the assumption that a malicious HV cannot inject a #VE in
the syscall gap, or any of the other sensitive paths.

A #VE in the syscall gap is just as fatal as a #PF or #GP would be
there.  If TDX can't provide guarantees to the guest that a #VE won't
happen there, then TDX is broken, or the kernel implementation is broken.

If anyone knows of any way for a HV to inject #VE in the syscall gap,
please speak up.  Better to know now.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 17:48                 ` Dave Hansen
@ 2021-02-12 18:22                   ` Sean Christopherson
  2021-02-12 18:38                     ` Andy Lutomirski
  2021-02-12 18:46                     ` Dave Hansen
  0 siblings, 2 replies; 40+ messages in thread
From: Sean Christopherson @ 2021-02-12 18:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Joerg Roedel, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Kirill A. Shutemov, Andi Kleen,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 8:45 AM, Peter Zijlstra wrote:
> > But you're right, if a HV injects #VE in the syscall gap and gets a
> > concurrent CPU to 'fix' the exception frame (which then lives on the
> > user stack) the handler might never know it went ga-ga.
> > 
> > Is this something the TDX threat model covers? A malicious HV and a TDX
> > guest co-operating to bring down the guest kernel.
> 
> I'll say this: The current TDX guest code that Sathya posted is
> predicated on the assumption that a malicious HV cannot inject a #VE in
> the syscall gap, or any of the other sensitive paths.
> 
> A #VE in the syscall gap is just as fatal as a #PF or #GP would be
> there.  If TDX can't provide guarantees to the guest that a #VE won't
> happen there, then TDX is broken, or the kernel implementation is broken.
> 
> If anyone knows of any way for a HV to inject #VE in the syscall gap,
> please speak up.  Better to know now.

Removing and reinserting the SYSCALL page (or any other page touched in
the SYSCALL gap) will result in a #VE, as TDX behavior is to generate a
#VE on an access to an unaccepted page.

Andy L pointed out this conundrum a while back.  My hack idea to "solve" this
was to add an API to the TDX-Module that would allow the guest kernel to define
a set of GPAs that must never #VE.

https://lkml.kernel.org/r/20200825171903.GA20660@sjchrist-ice



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 18:22                   ` Sean Christopherson
@ 2021-02-12 18:38                     ` Andy Lutomirski
  2021-02-12 18:43                       ` Sean Christopherson
  2021-02-12 18:46                     ` Dave Hansen
  1 sibling, 1 reply; 40+ messages in thread
From: Andy Lutomirski @ 2021-02-12 18:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Peter Zijlstra, Joerg Roedel, David Rientjes,
	Borislav Petkov, Andy Lutomirski, Andrew Morton,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Paolo Bonzini,
	Ingo Molnar, x86, linux-mm


> On Feb 12, 2021, at 10:22 AM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, Feb 12, 2021, Dave Hansen wrote:
>>> On 2/12/21 8:45 AM, Peter Zijlstra wrote:
>>> But you're right, if a HV injects #VE in the syscall gap and gets a
>>> concurrent CPU to 'fix' the exception frame (which then lives on the
>>> user stack) the handler might never know it went ga-ga.
>>> 
>>> Is this something the TDX threat model covers? A malicious HV and a TDX
>>> guest co-operating to bring down the guest kernel.
>> 
>> I'll say this: The current TDX guest code that Sathya posted is
>> predicated on the assumption that a malicious HV cannot inject a #VE in
>> the syscall gap, or any of the other sensitive paths.
>> 
>> A #VE in the syscall gap is just as fatal as a #PF or #GP would be
>> there.  If TDX can't provide guarantees to the guest that a #VE won't
>> happen there, then TDX is broken, or the kernel implementation is broken.
>> 
>> If anyone knows of any way for a HV to inject #VE in the syscall gap,
>> please speak up.  Better to know now.
> 
> Removing and reinserting the SYSCALL page (or any other page touched in
> the SYSCALL gap) will result in a #VE, as TDX behavior is to generate a
> #VE on an access to an unaccepted page.
> 
> Andy L pointed out this conundrum a while back.  My hack idea to "solve" this
> was to add an API to the TDX-Module that would allow the guest kernel to define
> a set of GPAs that must never #VE.
> 
> https://lkml.kernel.org/r/20200825171903.GA20660@sjchrist-ice

Is the TDX module involved in #VE delivery?  Just how much cleverness is possible without silicon changes?


* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 18:38                     ` Andy Lutomirski
@ 2021-02-12 18:43                       ` Sean Christopherson
  0 siblings, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2021-02-12 18:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Joerg Roedel, David Rientjes,
	Borislav Petkov, Andy Lutomirski, Andrew Morton,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Paolo Bonzini,
	Ingo Molnar, x86, linux-mm

On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> 
> > On Feb 12, 2021, at 10:22 AM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Fri, Feb 12, 2021, Dave Hansen wrote:
> >>> On 2/12/21 8:45 AM, Peter Zijlstra wrote:
> >>> But you're right, if a HV injects #VE in the syscall gap and gets a
> >>> concurrent CPU to 'fix' the exception frame (which then lives on the
> >>> user stack) the handler might never know it went ga-ga.
> >>> 
> >>> Is this something the TDX threat model covers? A malicious HV and a TDX
> >>> guest co-operating to bring down the guest kernel.
> >> 
> >> I'll say this: The current TDX guest code that Sathya posted is
> >> predicated on the assumption that a malicious HV cannot inject a #VE in
> >> the syscall gap, or any of the other sensitive paths.
> >> 
> >> A #VE in the syscall gap is just as fatal as a #PF or #GP would be
> >> there.  If TDX can't provide guarantees to the guest that a #VE won't
> >> happen there, then TDX is broken, or the kernel implementation is broken.
> >> 
> >> If anyone knows of any way for a HV to inject #VE in the syscall gap,
> >> please speak up.  Better to know now.
> > 
> > Removing and reinserting the SYSCALL page (or any other page touched in
> > the SYSCALL gap) will result in a #VE, as TDX behavior is to generate a
> > #VE on an access to an unaccepted page.
> > 
> > Andy L pointed out this conundrum a while back.  My hack idea to "solve" this
> > was to add an API to the TDX-Module that would allow the guest kernel to define
> > a set of GPAs that must never #VE.
> > 
> > https://lkml.kernel.org/r/20200825171903.GA20660@sjchrist-ice
> 
> Is the TDX module involved in #VE delivery?  Just how much cleverness is
> possible without silicon changes?

In this case, yes.  TDX-Module controls the Secure EPT PTEs, including the
SUPPRESS_VE bit.  Specifically, for unaccepted pages, the SUPPRESS_VE bit is
cleared so that accesses will be reflected by hardware as #VEs instead of
causing an EPT-violation VM-Exit.

The untrusted hypervisor manages resources, but any changes to S-EPT must be
routed through TDX-Module.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 18:22                   ` Sean Christopherson
  2021-02-12 18:38                     ` Andy Lutomirski
@ 2021-02-12 18:46                     ` Dave Hansen
  2021-02-12 19:24                       ` Sean Christopherson
  1 sibling, 1 reply; 40+ messages in thread
From: Dave Hansen @ 2021-02-12 18:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Zijlstra, Joerg Roedel, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Kirill A. Shutemov, Andi Kleen,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On 2/12/21 10:22 AM, Sean Christopherson wrote:
>> If anyone knows of any way for a HV to inject #VE in the syscall gap,
>> please speak up.  Better to know now.
> Removing and reinserting the SYSCALL page (or any other page touched in
> the SYSCALL gap) will result in a #VE, as TDX behavior is to generate a
> #VE on an access to an unaccepted page.
> 
> Andy L pointed out this conundrum a while back.  My hack idea to "solve" this
> was to add an API to the TDX-Module that would allow the guest kernel to define
> a set of GPAs that must never #VE.
> 
> https://lkml.kernel.org/r/20200825171903.GA20660@sjchrist-ice

Reminds me of the "what has to be mapped into userspace?" exercise for
PTI.  That was fun.

Really, the hypervisor shouldn't be able to cause #VE's.  This should be
fatal to the guest, period.  Or, worst case scenario, Linux should be
able to set a bit that says, I will only run under sane hypervisors.  If
I somehow lose a bet and get a crappy, insane hypervisor, I want to take my
ball and go home: don't even bother running me any more.

No way do we want another fragile list of magic pages that we have to
maintain.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 18:46                     ` Dave Hansen
@ 2021-02-12 19:24                       ` Sean Christopherson
  0 siblings, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2021-02-12 19:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Joerg Roedel, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Andrew Morton, Kirill A. Shutemov, Andi Kleen,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 10:22 AM, Sean Christopherson wrote:
> >> If anyone knows of any way for a HV to inject #VE in the syscall gap,
> >> please speak up.  Better to know now.
> > Removing and reinserting the SYSCALL page (or any other page touched in
> > the SYSCALL gap) will result in a #VE, as TDX behavior is to generate a
> > #VE on an access to an unaccepted page.
> > 
> > Andy L pointed out this conundrum a while back.  My hack idea to "solve" this
> > was to add an API to the TDX-Module that would allow the guest kernel to define
> > a set of GPAs that must never #VE.
> > 
> > https://lkml.kernel.org/r/20200825171903.GA20660@sjchrist-ice
> 
> Reminds me of the "what has to be mapped into userspace?" exercise for
> PTI.  That was fun.
> 
> Really, the hypervisor shouldn't be able to cause #VE's.  This should be
> fatal to the guest, period.  Or, worst case scenario, Linux should be
> able to set a bit that says, I will only run under sane hypervisors.  If
> I somehow lose a bet and get a crappy, insane hypervisor, I want to take my
> ball and go home: don't even bother running me any more.

Hmm, #VEs and #VCs are required for lazy-accept/pvalidate, and for converting
pages between private and shared.  Those #VEs/#VCs are technically caused by the
hypervisor.

What I think we want is to prevent the hypervisor from removing an accepted page
without an explicit action from the guest. 

For TDX, IIRC, one of the motivations for allowing the host to remove a page
without an explicit action from the guest was to prevent the guest from holding
pages hostage.  But, if the guest isn't playing nice, the hypervisor can just
kill it.

So for TDX, I _think_ we could dodge this bullet by making MAP_GPA an API
provided by the TDX module so that the module can track which pages are allowed
to be removed by the host.  The S-EPT entry could be made not-present on
MAP_GPA(SHARED) and tagged as being removable.  Since the page is not-present,
there should be plenty of bits available in the entry to do the tagging.

The big question, other than if the above idea is actually feasible, is if page
migration in the host would require the guest to re-accept the new page.  That
would throw a wrench into the whole thing.

> No way do we want another fragile list of magic pages that we have to
> maintain.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 16:12           ` Peter Zijlstra
  2021-02-12 16:18             ` Joerg Roedel
@ 2021-02-12 21:42             ` Andi Kleen
  2021-02-12 21:58               ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2021-02-12 21:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

> > I don't know the details about TDX and #VE, but could a malicious HV not
> > trigger a #VE basically everywhere by mapping around pages? So 'fail'
> > means panic() in this case, right?
> 
> Right.

Well, we might not be able to reliably panic if we don't run on an IST
when it hits the syscall gap.  Otherwise you might end up with panic
running on the ring 3 stack.

Granted, it's a somewhat muddled threat model -- it would need both a
malicious hypervisor and a malicious process inside the secure guest,
but I presume that's possible.

That seems to argue that an IST for #VE is actually required.

-Andi



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 21:42             ` Andi Kleen
@ 2021-02-12 21:58               ` Peter Zijlstra
  2021-02-12 22:39                 ` Andi Kleen
  2021-02-12 23:51                 ` Paolo Bonzini
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-12 21:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On Fri, Feb 12, 2021 at 01:42:05PM -0800, Andi Kleen wrote:
> > > I don't know the details about TDX and #VE, but could a malicious HV not
> > > trigger a #VE basically everywhere by mapping around pages? So 'fail'
> > > means panic() in this case, right?
> > 
> > Right.
> 
> Well we might not be able to reliably panic if we don't run on a IST
> if it hits the syscall gap. Otherwise you might end up with panic
> running on the ring 3 stack.
> 
> Given it's a bit muddled threat model - would need both a
> malicious process in the hypervisor and inside the secure guest,
> but I presume that's possible.
> 
> That seems to argue that an IST for #VE is actually required.

But AFAICT recursive #VE is entirely possible. The moment #VE reads that
ve_info thing, NMIs can happen, which can trigger another #VE which then
clobbers your stack and we're irrecoverably screwed again.

(also, inhibiting NMI is a seriously dodgy hack, the very last thing x86
needs is more duct tape on the recursion rules)

Repeat after me: ISTs aren't a solution but part of the problem.

If TDX requires IST, it's architecturally bankrupt.

There is much talk elsewhere in this thread about validated pages; have
the TDX module hard guarantee certain pages are available and will not
*ever* generate #VE. TDX module can kill the guest, but must not #VE.

Without something like that it's a complete and utter non-starter.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 21:58               ` Peter Zijlstra
@ 2021-02-12 22:39                 ` Andi Kleen
  2021-02-12 22:46                   ` Andy Lutomirski
  2021-02-13  9:38                   ` Peter Zijlstra
  2021-02-12 23:51                 ` Paolo Bonzini
  1 sibling, 2 replies; 40+ messages in thread
From: Andi Kleen @ 2021-02-12 22:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

> But AFAICT recursive #VE is entirely possible. The moment #VE reads that
> ve_info thing, NMIs can happen, which can trigger another #VE which then
> clobbers your stack and we're irrecoverably screwed again.

I don't believe we have anything currently in the NMI handler that
would trigger #VE. While some operations may need TDCALL (like MSR
accesses), those should all be directly hooked.

Also in general to avoid clobbering your stack you would just need
to make sure to adjust the IST stack before you do anything that
could cause another #VE.

-Andi



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 22:39                 ` Andi Kleen
@ 2021-02-12 22:46                   ` Andy Lutomirski
  2021-02-13  9:38                   ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2021-02-12 22:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Joerg Roedel, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	X86 ML, Linux-MM

On Fri, Feb 12, 2021 at 2:39 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > But AFAICT recursive #VE is entirely possible. The moment #VE reads that
> > ve_info thing, NMIs can happen, which can trigger another #VE which then
> > clobbers your stack and we're irrecoverably screwed again.
>
> I don't believe we have anything currently in the NMI handler that
> would trigger a #VE. While some operations may need a TDCALL (like MSR
> accesses), those should all be directly hooked.
>
> Also, in general, to avoid clobbering your stack you would just need
> to make sure to adjust the IST stack before you do anything that
> could cause another #VE.

Except that the world contains more than just #VE.  We could get #VE
and then NMI and then #VE or #VE and MCE (in a future revision?), etc.
The x86 exception situation is a mess.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 21:58               ` Peter Zijlstra
  2021-02-12 22:39                 ` Andi Kleen
@ 2021-02-12 23:51                 ` Paolo Bonzini
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2021-02-12 23:51 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Ingo Molnar, x86, linux-mm

On 12/02/21 22:58, Peter Zijlstra wrote:
> But AFAICT recursive #VE is entirely possible. The moment #VE reads that
> ve_info thing, NMIs can happen, which can trigger another #VE which then
> clobbers your stack and we're irrecoverably screwed again.

Yes, you need to zero the handler-active word in the info structure, and 
at that point recursion can happen.

A while ago Andy proposed re-enabling #VE from an interrupt; that would 
have worked at the time, since we were concerned with asynchronous page 
faults, but it wouldn't extend to TDX.

Unlike NMIs, however, #VE handlers can be written so that only a 
single level of nesting happens. A few months ago, also while discussing #VE for 
asynchronous page faults, I came up with a scheme that did exactly that 
and handled recursion by flipping the IST between two stacks 
(https://lkml.org/lkml/2020/5/15/1239). It should work and it'd be 
almost entirely C code, but I don't expect you or Thomas to be ecstatic 
about it...

> (also, inhibiting NMIs is a seriously dodgy hack; the very last thing x86
> needs is more duct tape on the recursion rules)

I can't disagree about that, but then again I don't see many alternatives.

Paolo




* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 22:39                 ` Andi Kleen
  2021-02-12 22:46                   ` Andy Lutomirski
@ 2021-02-13  9:38                   ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-13  9:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On Fri, Feb 12, 2021 at 02:39:18PM -0800, Andi Kleen wrote:
> > But AFAICT recursive #VE is entirely possible. The moment #VE reads that
> > ve_info thing, NMIs can happen, which can trigger another #VE which then
> > clobbers your stack and we're irrecoverably screwed again.
> 
> I don't believe we have anything currently in the NMI handler that
> would trigger a #VE. While some operations may need a TDCALL (like MSR
> accesses), those should all be directly hooked.
> 
> Also, in general, to avoid clobbering your stack you would just need
> to make sure to adjust the IST stack before you do anything that
> could cause another #VE.

NMI can touch user pages, which can then trigger the on-demand #VE.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-12 16:45               ` Peter Zijlstra
  2021-02-12 17:48                 ` Dave Hansen
@ 2021-02-16 10:00                 ` Joerg Roedel
  2021-02-16 14:27                   ` Andi Kleen
  1 sibling, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-16 10:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Andi Kleen, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Paolo Bonzini, Ingo Molnar,
	x86, linux-mm

On Fri, Feb 12, 2021 at 05:45:11PM +0100, Peter Zijlstra wrote:
> I really don't want #VE to be IST.  I really *really* detest ISTs;
> they're an unmitigated trainwreck.
> 
> But you're right, if an HV injects #VE in the syscall gap and gets a
> concurrent CPU to 'fix' the exception frame (which then lives on the
> user stack), the handler might never know it went ga-ga.
> 
> Is this something the TDX threat model covers? A malicious HV and a TDX
> guest co-operating to bring down the guest kernel.

If the guest is not malicious, and you have a valid user stack in the
SYSCALL gap, then it depends on whether SMAP is active. I guess it
usually is, in which case the #VE would be promoted to a #DF to kill the
machine.

But since we don't trust user-space either in a TDX guest, it must be
expected that user RSP points to something clever within the kernel. And
in that case, making #VE an IST handler is the only thing that helps.

Note that a TDX guest doesn't need to be kept alive in this situation,
but we must make sure it can reliably crash. Otherwise a naughty HV has
a chance to take control of the code-flow in the guest to make it reveal
its secrets.

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 10:00                 ` Joerg Roedel
@ 2021-02-16 14:27                   ` Andi Kleen
  2021-02-16 14:46                     ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2021-02-16 14:27 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Peter Zijlstra, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

> > Is this something the TDX threat model covers? A malicious HV and a TDX
> > guest co-operating to bring down the guest kernel.
> 
> If the guest is not malicious, and you have a valid user stack in the
> SYSCALL gap, then it depends on whether SMAP is active. I guess it
> usually is, in which case the #VE would be promoted to a #DF to kill the
> machine.

A malicious user process in the guest could set user RSP to a kernel address.

But yes, without that it's probably not a problem, due to SMAP.

It really boils down to whether we consider this combination of "malicious
hypervisor and malicious user process together" to be a threat.

At least in the classical TDX model a malicious guest process was
out of scope, so considering it would be a new requirement.

For the kernel exit we could handle it by checking whether the
RSP address is in the kernel and killing the process
before switching the stack.  It's only a problem for the entry,
which probably would really need an IST.
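
Rough sketch of that exit-side check - the function name and the hook
point are invented for illustration:

/*
 * Sketch only: before switching to the user stack on exit, refuse a
 * user RSP that points into the kernel and kill the task instead.
 */
static void ve_check_user_rsp(struct pt_regs *regs)
{
	if (unlikely(regs->sp >= TASK_SIZE_MAX))
		force_sig(SIGKILL);	/* kill before switching stacks */
}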

For the IST we would need to handle two levels of nesting, but I presume
that would be possible. With just a single stack switch it should be much
simpler than the NMI nesting.

I think the IST solution should at least be explored before
dismissing it. It might be simpler than anything else (like
using new APIs).

-Andi



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 14:27                   ` Andi Kleen
@ 2021-02-16 14:46                     ` Peter Zijlstra
  2021-02-16 15:59                       ` Paolo Bonzini
  2021-02-16 16:55                       ` Andi Kleen
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-16 14:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On Tue, Feb 16, 2021 at 06:27:41AM -0800, Andi Kleen wrote:
> I think the IST solution should at least be explored before
> dismissing it. It might be simpler than anything else (like
> using new APIs)

Have you seen the trainwreck bonzini proposed? The very simplest thing
is saying no to TDX.

That 'solution' also hard relies on #VE not nesting more than once, so
lovely things like: #VE -> #DB -> #VE -> #NMI -> #VE, or #VE -> NMI ->
#VE -> #MC -> #VE or any number of other possible 'fun' combinations
_must_ not happen.

And yes, I know #MC isn't supported just now, but the above would
mandate it never be supported _ever_, because otherwise the IST hack
crumbles.

Again, repeat after me: ISTs are a part of the problem.

So how about fixing TDX instead of forcing us to do horrible fragile
things we all know will end up in tears?



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 14:46                     ` Peter Zijlstra
@ 2021-02-16 15:59                       ` Paolo Bonzini
  2021-02-16 16:25                         ` Joerg Roedel
                                           ` (2 more replies)
  2021-02-16 16:55                       ` Andi Kleen
  1 sibling, 3 replies; 40+ messages in thread
From: Paolo Bonzini @ 2021-02-16 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Ingo Molnar, x86, linux-mm

On 16/02/21 15:46, Peter Zijlstra wrote:
> On Tue, Feb 16, 2021 at 06:27:41AM -0800, Andi Kleen wrote:
>> I think the IST solution should at least be explored before
>> dismissing it. It might be simpler than anything else (like
>> using new APIs)
> 
> Have you seen the trainwreck bonzini proposed?

You had been suspiciously silent...

> The very simplest thing is saying no to TDX.
> 
> That 'solution' also hard relies on #VE not nesting more than once, so
> lovely things like: #VE -> #DB -> #VE -> #NMI -> #VE, or #VE -> NMI ->
> #VE -> #MC -> #VE or any number of other possible 'fun' combinations
> _must_ not happen.

... but no, this is not how it works.  It is actually guaranteed that 
#VE does not nest more than once, and that's the big difference with NMIs.

Let's look at the first case you listed, this is what would happen:


#VE handler starts on stack 1
First #VE processing...
clear VE-in-progress flag in the info block (allowing reentrancy)
	#DB handler starts
		nested #VE handler starts on stack 2
		outer #VE handler marks stack 1 for reexecution
		nested #VE handler ends ***
	#DB handler ends
#VE handler IRETs back to the start of the handler itself
Second #VE processing starts (also on stack 1)
clear VE-in-progress flag in the info block
	#NMI handler
		nested #VE handler starts on stack 2
		outer #VE handler marks stack 1 for reexecution
		nested #VE handler ends ***
	#NMI handler ends
#VE handler IRETs back to the start of the handler itself
Third #VE processing starts (also on stack 1)
clear VE-in-progress flag in the info block
#VE handler IRETs back to the caller


Two things of note:

- note that at the points marked *** the nested #VE handler has not yet 
allowed another exception to arrive.  That only happens in the outer handler.

- the inner handler does nothing but tell the outer handler to rerun. 
The way it does it is certainly not pretty, because it has to work at 
any instruction boundary, but at its heart it's basically a do{}while loop.
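
In rough, non-authoritative C - the two fields stand in for whatever
the real TDX module interface provides - the outer handler is just:

/*
 * Model of the scheme above, not real entry code.  'in_progress'
 * stands in for the TDX module's #VE-in-progress word, 'rerun' for
 * "stack 1 marked for reexecution" by the nested handler on stack 2.
 */
struct ve_model {
	int in_progress;	/* set on #VE delivery, blocks nesting */
	int rerun;		/* set by the nested handler */
};

void ve_outer_handler(struct ve_model *ve)
{
	do {
		ve->rerun = 0;
		/* ... consume the #VE information here ... */
		ve->in_progress = 0;	/* reentrancy possible from here */
		/*
		 * A #VE taken past this point runs the inner handler
		 * on the second stack; it only sets ve->rerun and
		 * returns.
		 */
	} while (ve->rerun);	/* the "IRET back to the start" */
}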

Paolo

> And yes, I know #MC isn't supported just now, but the above would
> mandate it never be supported _ever_, because otherwise the IST hack
> crumbles.
> 
> Again, repeat after me: ISTs are a part of the problem.
> 
> So how about fixing TDX instead of forcing us to do horrible fragile
> things we all know will end up in tears?
> 




* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 15:59                       ` Paolo Bonzini
@ 2021-02-16 16:25                         ` Joerg Roedel
  2021-02-16 16:48                           ` Paolo Bonzini
  2021-02-16 16:47                         ` Peter Zijlstra
  2021-02-16 16:57                         ` Andy Lutomirski
  2 siblings, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-16 16:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86, linux-mm

On Tue, Feb 16, 2021 at 04:59:52PM +0100, Paolo Bonzini wrote:
> - the inner handler does nothing but telling the outer handler to rerun.
> The way it does it is certainly not pretty, because it has to work at any
> instruction boundary, but at its heart it's basically a do{}while loop.

That only works if processing of all inner #VEs can be deferred, which is
not the case for instruction emulation #VEs like MSR accesses, io-port
accesses or MMIO accesses. I guess those could all be replaced by direct
TDCALLs, but the question remains whether this is possible with MSR
accesses, meaning that the list of MSRs which will cause #VEs would have
to be statically defined and not change between hypervisors. All in all
this sounds hard to maintain and easy to break by unrelated changes.

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 15:59                       ` Paolo Bonzini
  2021-02-16 16:25                         ` Joerg Roedel
@ 2021-02-16 16:47                         ` Peter Zijlstra
  2021-02-16 16:57                         ` Andy Lutomirski
  2 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2021-02-16 16:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andi Kleen, Joerg Roedel, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86, linux-mm

On Tue, Feb 16, 2021 at 04:59:52PM +0100, Paolo Bonzini wrote:
> On 16/02/21 15:46, Peter Zijlstra wrote:
> > On Tue, Feb 16, 2021 at 06:27:41AM -0800, Andi Kleen wrote:
> > > I think the IST solution should at least be explored before
> > > dismissing it. It might be simpler than anything else (like
> > > using new APIs)
> > 
> > Have you seen the trainwreck bonzini proposed?
> 
> You had been suspiciously silent...

:-)

> > The very simplest thing is saying no to TDX.
> > 
> > That 'solution' also hard relies on #VE not nesting more than once, so
> > lovely things like: #VE -> #DB -> #VE -> #NMI -> #VE, or #VE -> NMI ->
> > #VE -> #MC -> #VE or any number of other possible 'fun' combinations
> > _must_ not happen.
> 
> ... but no, this is not how it works.  It is actually guaranteed that #VE
> does not nest more than once, and that's the big difference with NMIs.

Note that our NMI entry code is broken vs #MC or any other exception
that can land while we're setting up that recursion mess.

> Let's look at the first case you listed, this is what would happen:
> 
> 
> #VE handler starts on stack 1
> First #VE processing...
> clear VE-in-progress flag in the info block (allowing reentrancy)
> 	#DB handler starts
> 		nested #VE handler starts on stack 2

NMI can't land here because of the special duct tape? The inner #VE never
clears VE-in-progress.

> 		outer #VE handler marks stack 1 for reexecution
> 		nested #VE handler ends ***
> 	#DB handler ends

So what does the #DB memop that triggered that #VE actually read? What
if it was a store?

Because clearly it will not have handled the on-demand validation thing.
So how can memops proceed?

> #VE handler IRETs back to the start of the handler itself
> Second #VE processing starts (also on stack 1)
> clear VE-in-progress flag in the info block
> 	#NMI handler
> 		nested #VE handler starts on stack 2
> 		outer #VE handler marks stack 1 for reexecution
> 		nested #VE handler ends ***
> 	#NMI handler ends
> #VE handler IRETs back to the start of the handler itself
> Third #VE processing starts (also on stack 1)
> clear VE-in-progress flag in the info block
> #VE handler IRETs back to the caller
> 
> 
> Two things of note:
> 
> - note that at the points marked *** the nested #VE handler has not yet
> allowed another exception to arrive.  That only happens in the outer handler.
> 
> - the inner handler does nothing but tell the outer handler to rerun.
> The way it does it is certainly not pretty, because it has to work at any
> instruction boundary, but at its heart it's basically a do{}while loop.

So this hard relies on inhibiting NMIs and #MC being busted, right? But
I still don't understand what happens to the memops if you don't handle
the #VE.



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 16:25                         ` Joerg Roedel
@ 2021-02-16 16:48                           ` Paolo Bonzini
  2021-02-16 18:26                             ` Joerg Roedel
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2021-02-16 16:48 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Peter Zijlstra, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86, linux-mm

On 16/02/21 17:25, Joerg Roedel wrote:
> On Tue, Feb 16, 2021 at 04:59:52PM +0100, Paolo Bonzini wrote:
>> - the inner handler does nothing but telling the outer handler to rerun.
>> The way it does it is certainly not pretty, because it has to work at any
>> instruction boundary, but at its heart it's basically a do{}while loop.
> 
> That only works if processing of all inner #VE can be deferred, which is
> not the case for instruction emulation #VEs like MSR accesses, io-port
> or MMIO accesses.

No doubt about that, but that's unrelated to ISTs.  ISTs are ugly, and 
the ugliness is a symptom of the problem, but they are not part of the 
problem itself.  NMIs are, as usual, the most worrisome, since you can 
get those from random perf events.

We should minimize the number of #VEs that we get, as they are very 
slow.  Could almost everything that can invoke a #VE go through pvops 
and be turned into a TDCALL?  And if so, the same should be true for 
SEV-ES #VC as well.
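
As a sketch of that direction - the hypercall helper and the leaf
number are assumptions for illustration, not an existing interface:

/*
 * Sketch only: read an MSR via an explicit TDCALL to the VMM instead
 * of executing RDMSR and taking the #VE.  tdx_hypercall() and
 * TDVMCALL_MSR_READ are hypothetical here.
 */
#define TDVMCALL_MSR_READ	31	/* hypothetical leaf number */

static u64 tdx_read_msr(unsigned int msr)
{
	u64 val;

	if (tdx_hypercall(TDVMCALL_MSR_READ, msr, 0, 0, &val))
		return 0;	/* error handling elided in this sketch */

	return val;
}

/* Wired up through pvops, e.g.: pv_ops.cpu.read_msr = tdx_read_msr; */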

> I guess those could all be replaced by direct TDCALLs,
> but the question remains whether this is possible with MSR accesses,
> meaning that the list of MSRs which will cause #VEs would have to be
> statically defined and not change between hypervisors. All in all this
> sounds hard to maintain and easy to break by unrelated changes.

I would expect that all MSRs except for a handful (SPEC_CTRL/PRED_CMD, 
the FS/GS/kernelGS bases, anything else?) would be redirected to TDCALL.

Paolo




* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 14:46                     ` Peter Zijlstra
  2021-02-16 15:59                       ` Paolo Bonzini
@ 2021-02-16 16:55                       ` Andi Kleen
  1 sibling, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2021-02-16 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Christoph Hellwig, Paolo Bonzini, Ingo Molnar, x86, linux-mm

On Tue, Feb 16, 2021 at 03:46:36PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 16, 2021 at 06:27:41AM -0800, Andi Kleen wrote:
> > I think the IST solution should at least be explored before
> > dismissing it. It might be simpler than anything else (like
> > using new APIs)
> 
> Have you seen the trainwreck bonzini proposed? The very simplest thing
> is saying no to TDX.

#VE cannot nest until TDINFO has been read. I'm thinking of always
switching to the normal interrupt stack before TDINFO. With that,
it should be equivalent to a non-IST #VE, with any
nesting you want supported.
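
Roughly this shape, as pseudo-C for what would really be entry
assembly (all helper names are invented):

/*
 * Sketch of the flow above: while nesting is blocked (before TDINFO),
 * get off the IST stack, then read the info and let #VE nest again.
 */
noinstr void ve_entry(struct pt_regs *regs)
{
	struct ve_info ve;

	/* Nesting is blocked until TDINFO: safe to move stacks here. */
	switch_to_kernel_task_stack(regs);	/* hypothetical helper */

	tdx_get_ve_info(&ve);	/* TDINFO; #VE can nest from here on */

	handle_ve(regs, &ve);	/* runs like a normal, nestable exception */
}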

> So how about fixing TDX instead of forcing us to do horrible fragile
> things we all know will end up in tears?

I think we should explore both. If the IST variant is too horrible we
can see about changing TDX. But we should at least approach it
with an open mind and see what the code looks like.

-Andi



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 15:59                       ` Paolo Bonzini
  2021-02-16 16:25                         ` Joerg Roedel
  2021-02-16 16:47                         ` Peter Zijlstra
@ 2021-02-16 16:57                         ` Andy Lutomirski
  2021-02-16 17:05                           ` Paolo Bonzini
  2 siblings, 1 reply; 40+ messages in thread
From: Andy Lutomirski @ 2021-02-16 16:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Andi Kleen, Joerg Roedel, David Rientjes,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86,
	linux-mm



> On Feb 16, 2021, at 7:59 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 16/02/21 15:46, Peter Zijlstra wrote:
>>> On Tue, Feb 16, 2021 at 06:27:41AM -0800, Andi Kleen wrote:
>>> I think the IST solution should at least be explored before
>>> dismissing it. It might be simpler than anything else (like
>>> using new APIs)
>> Have you seen the trainwreck bonzini proposed?
> 
> You had been suspiciously silent...

Can one of you point me at the original proposal?

> 
>> The very simplest thing is saying no to TDX.
>> That 'solution' also hard relies on #VE not nesting more than once, so
>> lovely things like: #VE -> #DB -> #VE -> #NMI -> #VE, or #VE -> NMI ->
>> #VE -> #MC -> #VE or any number of other possible 'fun' combinations
>> _must_ not happen.
> 
> ... but no, this is not how it works.  It is actually guaranteed that #VE does not nest more than once, and that's the big difference with NMIs.
> 
> Let's look at the first case you listed, this is what would happen:
> 
> 
> #VE handler starts on stack 1
> First #VE processing...
> clear VE-in-progress flag in the info block (allowing reentrancy)
>    #DB handler starts
>        nested #VE handler starts on stack 2
>        outer #VE handler marks stack 1 for reexecution
>        nested #VE handler ends ***
>    #DB handler ends
> #VE handler IRETs back to the start of the handler itself
> Second #VE processing starts (also on stack 1)
> clear VE-in-progress flag in the info block
>    #NMI handler
>        nested #VE handler starts on stack 2
>        outer #VE handler marks stack 1 for reexecution
>        nested #VE handler ends ***
>    #NMI handler ends
> #VE handler IRETs back to the start of the handler itself

This sounds suspiciously like the current NMI code. I want to look at the code. If nothing else, I suspect it’s busted wrt CET, but the current NMI code definitely has bugs.  For example, if we are about to IRET from NMI and we get #VE in the IRET insn itself and then get a new NMI inside the #VE, we are toast.

> Third #VE processing starts (also on stack 1)
> clear VE-in-progress flag in the info block
> #VE handler IRETs back to the caller
> 
> 
> Two things of note:
> 
> - note that at the points marked *** the nested #VE handler has not yet allowed another exception to arrive.  That only happens in the outer handler.
> 
> - the inner handler does nothing but tell the outer handler to rerun.  The way it does it is certainly not pretty, because it has to work at any instruction boundary, but at its heart it's basically a do{}while loop.
> 
> Paolo
> 
>> And yes, I know #MC isn't supported just now, but the above would
>> mandate it never be supported _ever_, because otherwise the IST hack
>> crumbles.
>> Again, repeat after me: ISTs are a part of the problem.
>> So how about fixing TDX instead of forcing us to do horrible fragile
>> things we all know will end up in tears?
> 



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 16:57                         ` Andy Lutomirski
@ 2021-02-16 17:05                           ` Paolo Bonzini
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2021-02-16 17:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andi Kleen, Joerg Roedel, David Rientjes,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86,
	linux-mm

On 16/02/21 17:57, Andy Lutomirski wrote:
> 
> 
>> On Feb 16, 2021, at 7:59 AM, Paolo Bonzini <pbonzini@redhat.com>
>> wrote:
>> 
>> On 16/02/21 15:46, Peter Zijlstra wrote:
>>>> On Tue, Feb 16, 2021 at 06:27:41AM -0800, Andi Kleen wrote: I
>>>> think the IST solution should at least be explored before 
>>>> dismissing it. It might be simpler than anything else (like 
>>>> using new APIs)
>>> Have you seen the trainwreck bonzini proposed?
>> 
>> You had been suspiciously silent...
> 
> Can one of you point me at the original proposal?

https://lkml.org/lkml/2020/5/15/1239

(only pseudocode)

> This sounds suspiciously like the current NMI code.

Yes, it's similar in concept.  The exact circumstances in which a nested 
#VE happens, however, are different from NMI, and the limit of two 
nesting levels simplifies things a bit.

> I want to look at the code. If nothing else, I suspect it’s busted wrt CET,

Yes, that's the obvious part.  You'd have to add some WRSSP or whatnot.

Paolo

> but the current NMI code definitely has bugs.  For example, if we are
> about to IRET from NMI and we get #VE in the IRET insn itself and
> then get a new NMI inside the #VE, we are toast.




* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 16:48                           ` Paolo Bonzini
@ 2021-02-16 18:26                             ` Joerg Roedel
  2021-02-16 18:33                               ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2021-02-16 18:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86, linux-mm

On Tue, Feb 16, 2021 at 05:48:29PM +0100, Paolo Bonzini wrote:
> We should minimize the number of #VEs that we get, as they are very slow.
> Could almost everything that can invoke a #VE go through pvops and be turned
> into a TDCALL?  And if so the same should be true for SEV-ES #VC as well.

The problem with that is that it requires the guest to know what the
hypervisor will intercept or which instruction will cause a #VE. I
considered this kind of paravirtualization for #VC, but stayed away from
it for that exact reason. You can't easily know which MMIO access will
cause a #VE/#VC exception. Probing also doesn't work, as the hypervisor
can change that at runtime. There is just no decent way to handle that
without taking the #VE/#VC. Or take 'hlt', for example: there are
hypervisor configurations which don't intercept it. How do you know that
from within the guest?

> > I guess those could all be replaced by direct TDCALLs,
> > but the question remains whether this is possible with MSR accesses,
> > meaning that the list of MSRs which will cause #VEs would have to be
> > statically defined and not change between hypervisors. All in all this
> > sounds hard to maintain and easy to break by unrelated changes.
> 
> I would expect that all MSRs except for a handful (SPEC_CTRL/PRED_CMD, the
> FS/GS/kernelGS bases, anything else?) would be redirected to TDCALL.

You never know which HV your guest runs under and what it intercepts. It
can certainly be made part of the spec to only allow direct access to a
given set of MSRs in a TDX guest and to require intercepting everything
else. But such a spec probably requires constant updating and will
certainly cause compatibility headaches in the future.

Regards,

	Joerg



* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-16 18:26                             ` Joerg Roedel
@ 2021-02-16 18:33                               ` Paolo Bonzini
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2021-02-16 18:33 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Peter Zijlstra, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Christoph Hellwig, Ingo Molnar, x86, linux-mm

On 16/02/21 19:26, Joerg Roedel wrote:
> On Tue, Feb 16, 2021 at 05:48:29PM +0100, Paolo Bonzini wrote:
>> We should minimize the number of #VEs that we get, as they are very slow.
>> Could almost everything that can invoke a #VE go through pvops and be turned
>> into a TDCALL?  And if so the same should be true for SEV-ES #VC as well.
> 
> The problem with that is that it requires the guest to know what the
> hypervisor will intercept or which instruction will cause a #VE. I
> considered this kind of paravirtualization for #VC, but stayed away from
> it for that exact reason. You can't easily know which MMIO access will
> cause a #VE/#VC exception. Probing also doesn't work, as the hypervisor
> can change that at runtime. There is just no decent way to handle that
> without taking the #VE/#VC. Or take 'hlt', for example: there are
> hypervisor configurations which don't intercept it. How do you know that
> from within the guest?

I'm thinking that the SEV-ES/TDX specs and the hypervisor's PV interface 
(CPUID/MSR) should tell the guest what to invoke directly, not the other 
way round.  TDCALL-ing out should always be possible.

Not saying this is the case right now, but I think the SEV-ES and TDX 
specs should evolve in that direction.
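
For illustration, discovery could look something like this - the leaf
number and the bit are entirely made up:

/*
 * Sketch only: a hypervisor CPUID leaf advertising which accesses the
 * guest may perform directly.  Leaf and bit layout are invented;
 * nothing like this is specified today.
 */
#define HV_CPUID_COCO_FEATURES	0x40000010	/* hypothetical leaf */
#define HV_COCO_DIRECT_MSRS	(1U << 0)	/* direct MSR access OK */

static bool guest_msrs_are_direct(void)
{
	unsigned int eax, ebx, ecx, edx;

	cpuid(HV_CPUID_COCO_FEATURES, &eax, &ebx, &ecx, &edx);
	return eax & HV_COCO_DIRECT_MSRS;
}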

Paolo

>>> I guess those could all be replaced by direct TDCALLs,
>>> but the question remains whether this is possible with MSR accesses,
>>> meaning that the list of MSRs which will cause #VEs would have to be
>>> statically defined and not change between hypervisors. All in all this
>>> sounds hard to maintain and easy to break by unrelated changes.
>>
>> I would expect that all MSRs except for a handful (SPEC_CTRL/PRED_CMD, the
>> FS/GS/kernelGS bases, anything else?) would be redirected to TDCALL.
> 
> You never know which HV your guest runs under and what it intercepts. It
> can certainly be made part of the spec to only allow direct access to a
> given set of MSRs in a TDX guest and to require intercepting everything
> else. But such a spec probably requires constant updating and will
> certainly cause compatibility headaches in the future.
> 
> Regards,
> 
> 	Joerg
> 




* Re: AMD SEV-SNP/Intel TDX: validation of memory pages
  2021-02-02  1:51 AMD SEV-SNP/Intel TDX: validation of memory pages David Rientjes
                   ` (4 preceding siblings ...)
  2021-02-12 13:19 ` Joerg Roedel
@ 2021-03-23  9:33 ` Joerg Roedel
  5 siblings, 0 replies; 40+ messages in thread
From: Joerg Roedel @ 2021-03-23  9:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kirill A. Shutemov, Andi Kleen, Brijesh Singh,
	Tom Lendacky, Jon Grimm, Thomas Gleixner, Christoph Hellwig,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, x86, linux-mm

Hi,

On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
> I'd like to kick-start the discussion on lazy validation of guest memory
> for the purposes of AMD SEV-SNP and Intel TDX.

Just wanted to let you know that there is now a mailing list for
discussions around SEV and TDX topics, the Linux Confidential Computing
mailing list on the new lists.linux.dev service.

The address is linux-coco@lists.linux.dev and everyone interested can go
to https://subspace.kernel.org/lists.linux.dev.html and subscribe there.

The list doesn't allow HTML postings and is archived on lore.kernel.org.
The archive is of course public.

Looking forward to reading from you all there :)

Regards,

	Joerg


