Linux Confidential Computing Development
 help / color / Atom feed
* Runtime Memory Validation in Intel-TDX and AMD-SNP
@ 2021-07-19 12:58 Joerg Roedel
  2021-07-19 13:07 ` Matthew Wilcox
                   ` (6 more replies)
  0 siblings, 7 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-19 12:58 UTC (permalink / raw)
  To: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli,
	Joerg Roedel
  Cc: x86, linux-mm, linux-coco

Hi,

I'd like to get some movement again into the discussion around how to
implement runtime memory validation for confidential guests and wrote up
some thoughts on it.
Below are the results in form of a proposal I put together. Please let
me know your thoughts on it and whether it fits everyones requirements.

Thanks,

	Joerg

Proposal for Runtime Memory Validation in Secure Guests on x86
==============================================================

This proposal describes a method and protocol for runtime validation of
memory in virtualization guests running with Intel Trusted Domain
Extensions (Intel-TDX) or AMD Secure Nested Paging (AMD-SNP).

AMD-SNP and Intel-TDX use different terms to discuss memory page states.
In AMD-SNP memory has to be 'validated' while in Intel-TDX is will be
'accepted'. This document uses the term 'validated' for both.

Problem Statement
-----------------

Virtualization guests which run with AMD-SNP or Intel-TDX need to
validate their memory before using it. The validation assigns a hardware
state to each page which allows the guest to detect when the hypervisor
tries to maliciously access or remap a guest-private page. The guest can
only access validated pages.

There are three ways the guest memory can be validated:

	I.   The firmware validates all of guest memory at boot time. This
	     is the simplest method which requires the least changes to
	     the Linux kernel. But this method is also very slow and
	     causes unwanted delays in the boot process, as verification
	     can take several seconds (depending on guest memory size).

	II.  The firmware only validates its own memory and memory
	     validation happens as the memory is used. This significantly
	     improves the boot time, but needs more intrusive changes to
	     the Linux kernel and its boot process.


	III. Approach I. and II. can be combined. The firmware only
	     validates the first X MB/GB of guest memory and the rest is
	     validated on-demand.

For method II. and III. the guest needs to track which pages have
already been validated to detect hypervisor attacks. This information
needs to be carried through the whole boot process.

This poses challenges on the Linux boot process, as there is currently
no way to forward information about validated memory up the boot chain.
This proposal tries to describe a way to solve these challenges.

Memory Validation through the Boot Process and in the Running System
--------------------------------------------------------------------

The memory is validated throughout the boot process as described below.
These steps assume a firmware is present, but this proposal does not
strictly require a firmware. The tasks done be the firmware can also be
done by the hypervisor before starting the guest. The steps are:

	1. The firmware validates all memory which will not be owned by
	   the boot loader or the OS.

	2. The firmware also validates the first X MB of memory, just
	   enough to run a boot loader and to load the compressed Linux
	   kernel image. X is not expected to be very large, 64 or 128
	   MB should be enough. This pre-validation should not cause
	   significant delays in the boot process.

	3. The validated memory is marked E820-Usable in struct
	   boot_params for the Linux decompressor. The rest of the
	   memory is also passed to Linux via new special E820 entries
	   which mark the memory as Usable-but-Invalid.

	4. When the Linux decompressor takes over control, it evaluates
	   the E820 table and calculates to total amount of memory
	   available to Linux (valid and invalid memory).

	   The decompressor allocates a physically contiguous data
	   structure at a random memory location which is big enough to
	   hold the the validation states of all 4kb pages available to
	   the guest. This data structure will be called the Validation
	   Bitmap through the rest of this document. The Validation
	   Bitmap is indexed by page frame numbers. 

	   It still needs to be determined how many bits are required
	   per page. This depends on the necessity to track validation
	   page-sizes. Two bits per page are enough to track the 3
	   page-sizes currently available on the x86 architecture.

	   The decompressor initializes the Validation Bitmap by first
	   validating its backing memory and then updating it with the
	   information from the E820 table. It will also update the
	   table if it changes the state of pages from invalid to valid
	   (and vice versa, e.g. for mapping a GHCB page).

	5. The 'struct boot_params' is extended to carry the location
	   and size of the Validation Bitmap to the extracted kernel
	   image.
	   In fact, since the decompressor already receives a 'struct
	   boot_params', it will check if it carries a Validation
	   Bitmap. If it does, the decompressor uses the existing one
	   instead of allocating a new one.

	6. When the extracted kernel image takes over control, it will
	   make sure the Validation Bitmap is up to date when memory
	   needs to be validated.

	7. When set up, the memblock and page allocators have to check
	   whether the memory they return is already validated, and
	   validate it if not.

	   This should happen after the memory is allocated and all
	   allocator-locks are dropped, but before the memory is
	   returned to the caller. This way the access to the
	   validation bitmap can be implemented without locking and only
	   using atomic instructions.

	   Under no circumstances the Linux kernel is allowed to
	   validate a page more than once. Doing this might create
	   attack vectors for the Hypervisor towards the guest.

	8. When memory is returned to the memblock or page allocators,
	   it is _not_ invalidated. In fact, all memory which is freed
	   need to be valid. If it was marked invalid in the meantime
	   (e.g. if it the memory was used for DMA buffers), the code
	   owning the memory needs to validate it again before freeing
	   it.

	   The benefit of doing memory validation at allocation time is
	   that it keeps the exception handler for invalid memory
	   simple, because no exceptions of this kind are expected under
	   normal operation.

The Validation Bitmap
---------------------

This document proposes the use of a Validation Bitmap to store the
validation state of guest pages. This section discusses the benefits of
this approach.

The Linux kernel already has an array to store various state for each
memory page in the system: The struct page array. While this would be a
natural place to also store page validation information, the Validation
Bitmap is chosen because having the information separated has some clear
benefits:

	- The Validation Bitmap is allocated in the Linux decompressor
	  and already available long before the struct page array is
	  initialized.

	- Since it is a simple in-memory data structure which is
	  physically contiguous, it can be passed along through the
	  various stages of the boot process.

	- It can even be passed to a new kernel booted via kexec/kdump,
	  making it trivial to enable these features for AMD-SNP and
	  Intel-TDX.

	- When memory validation happens in the memblock and page
	  allocators, there is no need for locking when making changes
	  to the Validation Bitmap, because:
	  
	    - Nobody will try to concurrently access the same bits, as
	      the code-path doing the validation is the only owner of
	      the memory.

	    - Updates can happen via atomic cmpxchg instructions
	      when multiple bits are used per page. If only one bit is
	      needed, atomic bit manipulation instructions will suffice.

	- NUMA-locality is not considered to be a problem for the
	  Validation Bitmap. Since memory is not invalidated upon free,
	  the data structure will become read-mostly over time.

Final Notes
-----------

This proposal does not introduce requirements about the firmware that
has to be used to run Intel-TDX or AMD-SNP guests. It works with UEFI
and non-UEFI firmwares, or with no firmware at all. This is important
for use-cases like Confidential Containers running in VMs, which often
use a very small firmware (or no firmware at all) for reducing boot
times.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
@ 2021-07-19 13:07 ` Matthew Wilcox
  2021-07-19 15:02   ` Joerg Roedel
  2021-07-19 20:39 ` Andi Kleen
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2021-07-19 13:07 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> I'd like to get some movement again into the discussion around how to
> implement runtime memory validation for confidential guests and wrote up
> some thoughts on it.
> Below are the results in form of a proposal I put together. Please let
> me know your thoughts on it and whether it fits everyones requirements.

I think this proposal skips (intentionally?) something that s390 already
implemented: the secure guest deliberately allowing the hypervisor to
access certain pages for a period and then re-validating them.  I hope x86
can use the same interface as s390 for this, or if not, the interface can
be modified to be usable by all architectures.  See commit f28d43636d6f
("mm/gup/writeback: add callbacks for inaccessible pages").

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 13:07 ` Matthew Wilcox
@ 2021-07-19 15:02   ` Joerg Roedel
  0 siblings, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-19 15:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 19, 2021 at 02:07:43PM +0100, Matthew Wilcox wrote:
> I think this proposal skips (intentionally?) something that s390 already
> implemented: the secure guest deliberately allowing the hypervisor to
> access certain pages for a period and then re-validating them.  I hope x86
> can use the same interface as s390 for this, or if not, the interface can
> be modified to be usable by all architectures.  See commit f28d43636d6f
> ("mm/gup/writeback: add callbacks for inaccessible pages").

Yeah, sharing memory with the Hypervisor is not the main scope of the
proposal. The requirement I put in step 8. about returning only
validated memory (which means it is not shared with the HV anymore) to
the memory allocator slightly touches this.

In general, on x86 the hypervisor can only write to eplicitly shared and
unencrypted regions of guest memory. The guest decides where those are
and is responsible for setting these areas up.

For x86 this happens mainly in the DMA-API backend and to some degree in
other code which sets up non-DMA shared data structures with the host
(like the code setting up the GHCBs for SEV-ES).

That said, I don't see an immediate use of the API introduced in the
patch above for x86.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
  2021-07-19 13:07 ` Matthew Wilcox
@ 2021-07-19 20:39 ` Andi Kleen
  2021-07-20  8:55   ` Joerg Roedel
  2021-07-20  0:26 ` Andy Lutomirski
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2021-07-19 20:39 UTC (permalink / raw)
  To: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli
  Cc: x86, linux-mm, linux-coco


> 	III. Approach I. and II. can be combined. The firmware only
> 	     validates the first X MB/GB of guest memory and the rest is
> 	     validated on-demand.


It's actually not just the first X. As I understand there is a proposal 
for a new UEFI memory type, that will allow the firmware (and anyone 
else) to declare memory regions as accepted in a fine grained manner.


>
> For method II. and III. the guest needs to track which pages have
> already been validated to detect hypervisor attacks. This information
> needs to be carried through the whole boot process.

I don't think it's that bad. If we know what has been validated already 
using the memory map, then it's straight forward to check what is a 
valid validation request and what is not. Anything that's in a BIOS 
reserved region or in a region already marked as validated must be 
already validated and and can be rejected (or rather panic'ed). So I 
don't see the need to pass a fine grained validation bitmap around. Of 
course the kernel needs to maintain something (likely not a bitmap, but 
rather some form of page flag) on its own, but it doesn't need to be 
visible in any outside interfaces.

There's one exception to this, which is the previous memory view in 
crash kernels. But that's an relatively obscure case and there might be 
other solutions for this.


> Memory Validation through the Boot Process and in the Running System
> --------------------------------------------------------------------
>
> The memory is validated throughout the boot process as described below.
> These steps assume a firmware is present, but this proposal does not
> strictly require a firmware. The tasks done be the firmware can also be
> done by the hypervisor before starting the guest. The steps are:
>
> 	1. The firmware validates all memory which will not be owned by
> 	   the boot loader or the OS.
>
> 	2. The firmware also validates the first X MB of memory, just
> 	   enough to run a boot loader and to load the compressed Linux
> 	   kernel image. X is not expected to be very large, 64 or 128
> 	   MB should be enough. This pre-validation should not cause
> 	   significant delays in the boot process.
>
> 	3. The validated memory is marked E820-Usable in struct
> 	   boot_params for the Linux decompressor. The rest of the
> 	   memory is also passed to Linux via new special E820 entries
> 	   which mark the memory as Usable-but-Invalid.
>
> 	4. When the Linux decompressor takes over control, it evaluates
> 	   the E820 table and calculates to total amount of memory
> 	   available to Linux (valid and invalid memory).
>
> 	   The decompressor allocates a physically contiguous data
> 	   structure at a random memory location which is big enough to
> 	   hold the the validation states of all 4kb pages available to
> 	   the guest. This data structure will be called the Validation
> 	   Bitmap through the rest of this document. The Validation
> 	   Bitmap is indexed by page frame numbers.

I don't think we need to go that fine grained. The decompressor will 
just pre-validate all the memory it needs (which is relatively) limited 
and the later kernel can know about it in some static way and then fix 
up its mem_map state. We might need a few extra allocations between main 
kernel entry and mem_map init, but that could be handled in some simple 
data structure.


>
> 	8. When memory is returned to the memblock or page allocators,
> 	   it is _not_ invalidated. In fact, all memory which is freed
> 	   need to be valid. If it was marked invalid in the meantime
> 	   (e.g. if it the memory was used for DMA buffers), the code
> 	   owning the memory needs to validate it again before freeing
> 	   it.


I'm not sure about AMD, but in TDX we're certainly have no need to 
reaccept after something was shared.

Also in general i don't think it will really happen, at least initially. 
All the shared buffers we use are allocated and never freed. So such a 
problem could be deferred.

-Andi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
  2021-07-19 13:07 ` Matthew Wilcox
  2021-07-19 20:39 ` Andi Kleen
@ 2021-07-20  0:26 ` Andy Lutomirski
       [not found]   ` <CAAYXXYwFzrf8uY-PFkMRSG28+HztfGdJft8kB3Y3keWCx9K8TQ@mail.gmail.com>
                     ` (2 more replies)
  2021-07-20 17:30 ` Kirill A. Shutemov
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-07-20  0:26 UTC (permalink / raw)
  To: Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli
  Cc: x86, linux-mm, linux-coco

On 7/19/21 5:58 AM, Joerg Roedel wrote:

> Memory Validation through the Boot Process and in the Running System
> --------------------------------------------------------------------
> 
> The memory is validated throughout the boot process as described below.
> These steps assume a firmware is present, but this proposal does not
> strictly require a firmware. The tasks done be the firmware can also be
> done by the hypervisor before starting the guest. The steps are:
> 
> 	1. The firmware validates all memory which will not be owned by
> 	   the boot loader or the OS.
> 
> 	2. The firmware also validates the first X MB of memory, just
> 	   enough to run a boot loader and to load the compressed Linux
> 	   kernel image. X is not expected to be very large, 64 or 128
> 	   MB should be enough. This pre-validation should not cause
> 	   significant delays in the boot process.
> 
> 	3. The validated memory is marked E820-Usable in struct
> 	   boot_params for the Linux decompressor. The rest of the
> 	   memory is also passed to Linux via new special E820 entries
> 	   which mark the memory as Usable-but-Invalid.
> 
> 	4. When the Linux decompressor takes over control, it evaluates
> 	   the E820 table and calculates to total amount of memory
> 	   available to Linux (valid and invalid memory).
> 
> 	   The decompressor allocates a physically contiguous data
> 	   structure at a random memory location which is big enough to
> 	   hold the the validation states of all 4kb pages available to
> 	   the guest. This data structure will be called the Validation
> 	   Bitmap through the rest of this document. The Validation
> 	   Bitmap is indexed by page frame numbers. 

At the risk of asking a potentially silly question, would it be
reasonable to treat non-validated memory as not-present for kernel
purposes and hot-add it in a thread as it gets validated?  Or would this
result in poor system behavior before enough memory is validated?
Perhaps we should block instead of failing allocations if we want more
memory than is currently validated?

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
       [not found]   ` <CAAYXXYwFzrf8uY-PFkMRSG28+HztfGdJft8kB3Y3keWCx9K8TQ@mail.gmail.com>
@ 2021-07-20  2:00     ` Erdem Aktas
  2021-07-20  5:17     ` Andi Kleen
       [not found]     ` <eacb9c1f-2c61-4a7f-b5a3-7bf579e6cbf6@www.fastmail.com>
  2 siblings, 0 replies; 48+ messages in thread
From: Erdem Aktas @ 2021-07-20  2:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

PS: Apologize for sending this twice (resending in plain text mode).

With the new UEFI memory type, option 2 seems like a better option to me.

I was thinking with the lack of new UEFI memory type support yet,
option 3 can be implemented as a temporary solution. IMO, this is
crucial for a reasonable boot performance.

> There's one exception to this, which is the previous memory view in
> crash kernels. But that's an relatively obscure case and there might be
> other solutions for this.

I think this is an important angle. It might cause reliability issues.
if kexec kernel does not know which page is shared or private, it can
use a previously shared page as a code page which will not work. It is
also a security concern. Hosts can always cause crashes which forces
guests to do kexec for crash dump. If the kexec kernel does not know
which pages are validated before, it might be compromised with page
replay attacks.

Also kexec is not only for crash dumps. For warm resets, kexec kernel
needs to know the valid page map.

>> Also in general i don't think it will really happen, at least initially.
>> All the shared buffers we use are allocated and never freed. So such a
>> problem could be deferred.

Does it not depend on kernel configs? Currently, there is a valid
control path in dma_alloc_coherent which might alloc and free shared
pages.

>> At the risk of asking a potentially silly question, would it be
>> reasonable to treat non-validated memory as not-present for kernel
>> purposes and hot-add it in a thread as it gets validated?

My concern with this is, it assumes that all the present memory is
private. UEFI might have some pages which are shared therefore also
are present.

> On Mon, Jul 19, 2021 at 5:26 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On 7/19/21 5:58 AM, Joerg Roedel wrote:
>>
>> > Memory Validation through the Boot Process and in the Running System
>> > --------------------------------------------------------------------
>> >
>> > The memory is validated throughout the boot process as described below.
>> > These steps assume a firmware is present, but this proposal does not
>> > strictly require a firmware. The tasks done be the firmware can also be
>> > done by the hypervisor before starting the guest. The steps are:
>> >
>> >       1. The firmware validates all memory which will not be owned by
>> >          the boot loader or the OS.
>> >
>> >       2. The firmware also validates the first X MB of memory, just
>> >          enough to run a boot loader and to load the compressed Linux
>> >          kernel image. X is not expected to be very large, 64 or 128
>> >          MB should be enough. This pre-validation should not cause
>> >          significant delays in the boot process.
>> >
>> >       3. The validated memory is marked E820-Usable in struct
>> >          boot_params for the Linux decompressor. The rest of the
>> >          memory is also passed to Linux via new special E820 entries
>> >          which mark the memory as Usable-but-Invalid.
>> >
>> >       4. When the Linux decompressor takes over control, it evaluates
>> >          the E820 table and calculates to total amount of memory
>> >          available to Linux (valid and invalid memory).
>> >
>> >          The decompressor allocates a physically contiguous data
>> >          structure at a random memory location which is big enough to
>> >          hold the the validation states of all 4kb pages available to
>> >          the guest. This data structure will be called the Validation
>> >          Bitmap through the rest of this document. The Validation
>> >          Bitmap is indexed by page frame numbers.
>>
>> At the risk of asking a potentially silly question, would it be
>> reasonable to treat non-validated memory as not-present for kernel
>> purposes and hot-add it in a thread as it gets validated?  Or would this
>> result in poor system behavior before enough memory is validated?
>> Perhaps we should block instead of failing allocations if we want more
>> memory than is currently validated?
>>
>> --Andy
>>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
       [not found]   ` <CAAYXXYwFzrf8uY-PFkMRSG28+HztfGdJft8kB3Y3keWCx9K8TQ@mail.gmail.com>
  2021-07-20  2:00     ` Erdem Aktas
@ 2021-07-20  5:17     ` Andi Kleen
  2021-07-20  9:11       ` Joerg Roedel
                         ` (2 more replies)
       [not found]     ` <eacb9c1f-2c61-4a7f-b5a3-7bf579e6cbf6@www.fastmail.com>
  2 siblings, 3 replies; 48+ messages in thread
From: Andi Kleen @ 2021-07-20  5:17 UTC (permalink / raw)
  To: Erdem Aktas, Andy Lutomirski
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco


On 7/19/2021 6:51 PM, Erdem Aktas wrote:
>
> > There's one exception to this, which is the previous memory view in
> > crash kernels. But that's an relatively obscure case and there might be
> > other solutions for this.
>
> I think this is an important angle. It might cause reliability issues. 
> if kexec kernel does not know which page is shared or private, it can 
> use a previously shared page as a code page which will not work. It is 
> also a security concern. Hosts can always cause crashes which forces 
> guests to do kexec for crash dump. If the kexec kernel does not know 
> which pages are validated before, it might be compromised with page 
> replay attacks.

First I suspect for crash it's not a real security problem if a 
malicious hypervisor would inject zeroed pages. That means actual strong 
checks against revalidation/reaccept are not needed. That still leaves 
the issue of triggering an exception when the memory is not there. TDX 
has an option to trigger a #VE in this case, but we will actually force 
disable it to avoid #VE in the system call gap. But the standard crash 
dump tools already know how to parse mem_map to distinguish different 
types of pages. So they already would be able to do that. We just need 
some kind of safety mechanism to prevent any crashes, but that should be 
doable. Actually I'm not sure they're really needed because that's a 
root operation.

>
> Also kexec is not only for crash dumps. For warm resets, kexec kernel 
> needs to know the valid page map.

For non crash kexec it's fine to reaccept/validate memory because we 
don't care about the old contents anymore, except for the kernel itself 
and perhaps your stack/page tables. So something very simple is enough 
for that too.

>
> >> Also in general i don't think it will really happen, at least 
> initially.
> >> All the shared buffers we use are allocated and never freed. So such a
> >> problem could be deferred.
>
> Does it not depend on kernel configs? Currently, there is a valid 
> control path in dma_alloc_coherent which might alloc and free shared 
> pages.

If the device filter is active it won't.



>
> >> At the risk of asking a potentially silly question, would it be
> >> reasonable to treat non-validated memory as not-present for kernel
> >> purposes and hot-add it in a thread as it gets validated?
>
> My concern with this is, it assumes that all the present memory is 
> private. UEFI might have some pages which are shared therefore also 
> are present.


Hot add is nearly always a bad idea.


-Andi



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  0:26 ` Andy Lutomirski
       [not found]   ` <CAAYXXYwFzrf8uY-PFkMRSG28+HztfGdJft8kB3Y3keWCx9K8TQ@mail.gmail.com>
@ 2021-07-20  8:44   ` Joerg Roedel
  2021-07-20 14:14   ` Dave Hansen
  2 siblings, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-20  8:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Rientjes, Borislav Petkov, Sean Christopherson,
	Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Andi Kleen,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Kaplan, David,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco

On Mon, Jul 19, 2021 at 05:26:20PM -0700, Andy Lutomirski wrote:
> At the risk of asking a potentially silly question, would it be
> reasonable to treat non-validated memory as not-present for kernel
> purposes and hot-add it in a thread as it gets validated?  Or would this
> result in poor system behavior before enough memory is validated?
> Perhaps we should block instead of failing allocations if we want more
> memory than is currently validated?

That is basically the idea of pre-validating the first X GB of memory
(X==4 has been proposed) and validate the rest at runtime. I see two
problems with this:

	1) Pre-validating large amounts of memory takes a lot of time
	   (in the range of a few seconds). This is not suitable for all
	   workloads like, e.g., containers which want to boot in a few
	   hundred milliseconds.

	2) It limits the physical address range for KASLR placement,
	   factually reducing the randomness of where the kernel is
	   placed in physical memory.

With the proposal I sent here only enough memory for the boot-loader and
the kernel image is pre-validated, and when the decompressor takes over
it can place the kernel anywhere, even in yet unvalidated/unaccepted
memory.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 20:39 ` Andi Kleen
@ 2021-07-20  8:55   ` Joerg Roedel
  2021-07-20  9:34     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 48+ messages in thread
From: Joerg Roedel @ 2021-07-20  8:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

On Mon, Jul 19, 2021 at 01:39:48PM -0700, Andi Kleen wrote:
> It's actually not just the first X. As I understand there is a proposal for
> a new UEFI memory type, that will allow the firmware (and anyone else) to
> declare memory regions as accepted in a fine grained manner.

Yes, but relying on this means we 1) introduce a dependency to UEFI into
booting confidential guests and 2) the decompressor stub in the kernel
needs to parse UEFI tables. None of this is a good idea for several
reasons.

> I don't think it's that bad. If we know what has been validated already
> using the memory map, then it's straight forward to check what is a valid
> validation request and what is not. Anything that's in a BIOS reserved
> region or in a region already marked as validated must be already validated
> and and can be rejected (or rather panic'ed). So I don't see the need to
> pass a fine grained validation bitmap around. Of course the kernel needs to
> maintain something (likely not a bitmap, but rather some form of page flag)
> on its own, but it doesn't need to be visible in any outside interfaces.

Using page flags means that the information about what is already
validated/accepted needs to be carried in another form until the
struct-page array is initialized. A lot can happen until then, and every
modification in the code that runs before carries the risk of breaking
TDX and SNP guests.

The Validation Bitmap on the other side is set up on the first boot and
kept alive for the rest of the guests life-time (even over kexec/kdump)
and will be updated by the allocators in use. This is a much more robust
solution than carrying the information in some other way forward until
the page array is there.

I must admit that I was also voting for a page-flag in the past, but the
benefits for robustness, supporting kexec/kdump, and the boot process in
general made me re-visit this opinion.

> There's one exception to this, which is the previous memory view in crash
> kernels. But that's an relatively obscure case and there might be other
> solutions for this.

Kexec and kdump are not obscure cases, those are real-world requirements
for TDX and SNP guests.

> I'm not sure about AMD, but in TDX we're certainly have no need to reaccept
> after something was shared.

Re-validation is needed on AMD, if I am not mistaken AMD hardware even
enforces that shared memory is mapped unencrypted and private memory
encrypted.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  5:17     ` Andi Kleen
@ 2021-07-20  9:11       ` Joerg Roedel
  2021-07-20 17:32         ` Andi Kleen
  2021-07-20 23:09       ` Erdem Aktas
  2021-07-22 17:31       ` Marc Orr
  2 siblings, 1 reply; 48+ messages in thread
From: Joerg Roedel @ 2021-07-20  9:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Erdem Aktas, Andy Lutomirski, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

On Mon, Jul 19, 2021 at 10:17:36PM -0700, Andi Kleen wrote:
> For non crash kexec it's fine to reaccept/validate memory because we don't
> care about the old contents anymore, except for the kernel itself and
> perhaps your stack/page tables. So something very simple is enough for that
> too.

I am not sure how it is implemented in TDX hardware, but for SNP the
guest _must_ _not_ double-validate or even double-invalidate memory.

What I sent here is actually v2 of my proposal, v1 had a much more lazy
approach like you are proposing here. But as I learned what can happen
is this:

	* Hypervisor maps GPA X to HPA A
	* Guest validates GPA X
	  Hardware enforces that HPA A always maps to GPA X
	* Hypervisor remaps GPA X to HPA B
	* Guest lazily re-validates GPA X
	  Hardware enforces that HPA B always maps to GPA X
	
The situation we have now is that host pages A and B are validated for
the same guest page, and the hypervisor can switch between them at will,
without the guest being able to notice it.

This can open various attack vectors from the hypervisor towards the
guest, like tricking the guest into a code-path where it accidentially
reveals its secrets.

For that reason the guest needs to know at any time whether a given page
has already been validated/accepted. And the guest needs to pass along
this fine-grained information even in the case of kexec/kdump.

> If the device filter is active it won't.

We are not going to pohibit dma_alloc_coherent() in SNP guests just
because we are too lazy to implement memory re-validation.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  8:55   ` Joerg Roedel
@ 2021-07-20  9:34     ` Dr. David Alan Gilbert
  2021-07-20 11:50       ` Joerg Roedel
  0 siblings, 1 reply; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-20  9:34 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andi Kleen, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

Hi,
  Does the bitmap need to be page granulairty or can we work
on bigger chunks?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  9:34     ` Dr. David Alan Gilbert
@ 2021-07-20 11:50       ` Joerg Roedel
  0 siblings, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-20 11:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Andi Kleen, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

On Tue, Jul 20, 2021 at 10:34:01AM +0100, Dr. David Alan Gilbert wrote:
>   Does the bitmap need to be page granulairty or can we work
> on bigger chunks?

I think page granularity is needed, because some regions shared with the
HV are only one page in size (the GHCBs in SEV-ES for example).

But in general it is worth to discuss whether validating memory in
bigger chunks than a page is beneficial wrt. to allocation latency vs.
required HV round-trips for validation.

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  0:26 ` Andy Lutomirski
       [not found]   ` <CAAYXXYwFzrf8uY-PFkMRSG28+HztfGdJft8kB3Y3keWCx9K8TQ@mail.gmail.com>
  2021-07-20  8:44   ` Joerg Roedel
@ 2021-07-20 14:14   ` Dave Hansen
  2 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2021-07-20 14:14 UTC (permalink / raw)
  To: Andy Lutomirski, Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli
  Cc: x86, linux-mm, linux-coco

On 7/19/21 5:26 PM, Andy Lutomirski wrote:
> At the risk of asking a potentially silly question, would it be
> reasonable to treat non-validated memory as not-present for kernel
> purposes and hot-add it in a thread as it gets validated?  Or would this
> result in poor system behavior before enough memory is validated?
> Perhaps we should block instead of failing allocations if we want more
> memory than is currently validated?

It can't be _that_ big of a problem since we already have
DEFERRED_STRUCT_PAGE_INIT causing the same kind of issue.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
                   ` (2 preceding siblings ...)
  2021-07-20  0:26 ` Andy Lutomirski
@ 2021-07-20 17:30 ` Kirill A. Shutemov
  2021-07-21  9:20   ` Mike Rapoport
                     ` (2 more replies)
  2021-07-22 15:57 ` David Hildenbrand
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-20 17:30 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> Hi,
> 
> I'd like to get some movement again into the discussion around how to
> implement runtime memory validation for confidential guests and wrote up
> some thoughts on it.
> Below are the results in form of a proposal I put together. Please let
> me know your thoughts on it and whether it fits everyones requirements.

Thanks for bringing it up. I'm working on the topic for Intel TDX. See
comments below.

> 
> Thanks,
> 
> 	Joerg
> 
> Proposal for Runtime Memory Validation in Secure Guests on x86
> ==============================================================
> 
> This proposal describes a method and protocol for runtime validation of
> memory in virtualization guests running with Intel Trusted Domain
> Extensions (Intel-TDX) or AMD Secure Nested Paging (AMD-SNP).
> 
> AMD-SNP and Intel-TDX use different terms to discuss memory page states.
> In AMD-SNP memory has to be 'validated' while in Intel-TDX is will be
> 'accepted'. This document uses the term 'validated' for both.
> 
> Problem Statement
> -----------------
> 
> Virtualization guests which run with AMD-SNP or Intel-TDX need to
> validate their memory before using it. The validation assigns a hardware
> state to each page which allows the guest to detect when the hypervisor
> tries to maliciously access or remap a guest-private page. The guest can
> only access validated pages.
> 
> There are three ways the guest memory can be validated:
> 
> 	I.   The firmware validates all of guest memory at boot time. This
> 	     is the simplest method which requires the least changes to
> 	     the Linux kernel. But this method is also very slow and
> 	     causes unwanted delays in the boot process, as verification
> 	     can take several seconds (depending on guest memory size).
> 
> 	II.  The firmware only validates its own memory and memory
> 	     validation happens as the memory is used. This significantly
> 	     improves the boot time, but needs more intrusive changes to
> 	     the Linux kernel and its boot process.
> 
> 
> 	III. Approach I. and II. can be combined. The firmware only
> 	     validates the first X MB/GB of guest memory and the rest is
> 	     validated on-demand.
> 
> For method II. and III. the guest needs to track which pages have
> already been validated to detect hypervisor attacks. This information
> needs to be carried through the whole boot process.
> 
> This poses challenges on the Linux boot process, as there is currently
> no way to forward information about validated memory up the boot chain.
> This proposal tries to describe a way to solve these challenges.

We use EFI unaccepted memory type to pass this information between
firmware and kernel. In my WIP patch I translate it to a new E820 memory
type: E820_TYPE_UNACCEPTED.

E820 can also be used during early boot for tracking what memory got
accepted by kernel too.

> Memory Validation through the Boot Process and in the Running System
> --------------------------------------------------------------------
> 
> The memory is validated throughout the boot process as described below.
> These steps assume a firmware is present, but this proposal does not
> strictly require a firmware. The tasks done be the firmware can also be
> done by the hypervisor before starting the guest. The steps are:
> 
> 	1. The firmware validates all memory which will not be owned by
> 	   the boot loader or the OS.
> 
> 	2. The firmware also validates the first X MB of memory, just
> 	   enough to run a boot loader and to load the compressed Linux
> 	   kernel image. X is not expected to be very large, 64 or 128
> 	   MB should be enough. This pre-validation should not cause
> 	   significant delays in the boot process.

For now, I debug with 256MiB accepted by firmware. It allows to avoid
dealing with decompression code at this stage of the project. I plan to
lower the number later.

> 
> 	3. The validated memory is marked E820-Usable in struct
> 	   boot_params for the Linux decompressor. The rest of the
> 	   memory is also passed to Linux via new special E820 entries
> 	   which mark the memory as Usable-but-Invalid.
> 
> 	4. When the Linux decompressor takes over control, it evaluates
> 	   the E820 table and calculates to total amount of memory
> 	   available to Linux (valid and invalid memory).
> 
> 	   The decompressor allocates a physically contiguous data
> 	   structure at a random memory location which is big enough to
> 	   hold the the validation states of all 4kb pages available to
> 	   the guest. This data structure will be called the Validation
> 	   Bitmap through the rest of this document. The Validation
> 	   Bitmap is indexed by page frame numbers. 
> 
> 	   It still needs to be determined how many bits are required
> 	   per page. This depends on the necessity to track validation
> 	   page-sizes. Two bits per page are enough to track the 3
> 	   page-sizes currently available on the x86 architecture.
> 
> 	   The decompressor initializes the Validation Bitmap by first
> 	   validating its backing memory and then updating it with the
> 	   information from the E820 table. It will also update the
> 	   table if it changes the state of pages from invalid to valid
> 	   (and vice versa, e.g. for mapping a GHCB page).

I would argue for per-range, not per-page, tracking of accepted/validated
memory for decompresser and early boot code, until page allocator is fully
functional. I have reasonable success with this approach so far.

Once page allocator is ready we can switch to fine-grained tracking.

> 	5. The 'struct boot_params' is extended to carry the location
> 	   and size of the Validation Bitmap to the extracted kernel
> 	   image.
> 	   In fact, since the decompressor already receives a 'struct
> 	   boot_params', it will check if it carries a Validation
> 	   Bitmap. If it does, the decompressor uses the existing one
> 	   instead of allocating a new one.
> 
> 	6. When the extracted kernel image takes over control, it will
> 	   make sure the Validation Bitmap is up to date when memory
> 	   needs to be validated.
> 
> 	7. When set up, the memblock and page allocators have to check
> 	   whether the memory they return is already validated, and
> 	   validate it if not.
> 
> 	   This should happen after the memory is allocated and all
> 	   allocator-locks are dropped, but before the memory is
> 	   returned to the caller. This way the access to the
> 	   validation bitmap can be implemented without locking and only
> 	   using atomic instructions.
> 
> 	   Under no circumstances the Linux kernel is allowed to
> 	   validate a page more than once. Doing this might create
> 	   attack vectors for the Hypervisor towards the guest.
> 
> 	8. When memory is returned to the memblock or page allocators,
> 	   it is _not_ invalidated. In fact, all memory which is freed
> 	   need to be valid. If it was marked invalid in the meantime
> 	   (e.g. if it the memory was used for DMA buffers), the code
> 	   owning the memory needs to validate it again before freeing
> 	   it.
> 
> 	   The benefit of doing memory validation at allocation time is
> 	   that it keeps the exception handler for invalid memory
> 	   simple, because no exceptions of this kind are expected under
> 	   normal operation.

During early boot I treat unaccepted memory as a usable RAM. It only
requires special treatment on memblock_reserve(), which used for early
memory allocation: unaccepted usable RAM has to be accepted, before
reserving.

For fine-grained accepting/validation tracking I use PageOffline() flags
(it's encoded into mapcount): before adding an unaccepted page to free
list I set the PageOffline() to indicate that the page has to be accepted
before returning from the page allocator. Currently, we never have
PageOffline() set for pages on free lists, so we won't have confusion with
ballooning or memory hotplug.

I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
MAX_ORDER). It is reasonable compromise on speed/latency.

I still debugging the code, but hopefully will get working PoC this week.

> The Validation Bitmap
> ---------------------
> 
> This document proposes the use of a Validation Bitmap to store the
> validation state of guest pages. This section discusses the benefits of
> this approach.
> 
> The Linux kernel already has an array to store various state for each
> memory page in the system: The struct page array. While this would be a
> natural place to also store page validation information, the Validation
> Bitmap is chosen because having the information separated has some clear
> benefits:
> 
> 	- The Validation Bitmap is allocated in the Linux decompressor
> 	  and already available long before the struct page array is
> 	  initialized.
> 
> 	- Since it is a simple in-memory data structure which is
> 	  physically contiguous, it can be passed along through the
> 	  various stages of the boot process.
> 
> 	- It can even be passed to a new kernel booted via kexec/kdump,
> 	  making it trivial to enable these features for AMD-SNP and
> 	  Intel-TDX.
> 
> 	- When memory validation happens in the memblock and page
> 	  allocators, there is no need for locking when making changes
> 	  to the Validation Bitmap, because:
> 	  
> 	    - Nobody will try to concurrently access the same bits, as
> 	      the code-path doing the validation is the only owner of
> 	      the memory.
> 
> 	    - Updates can happen via atomic cmpxchg instructions
> 	      when multiple bits are used per page. If only one bit is
> 	      needed, atomic bit manipulation instructions will suffice.
> 
> 	- NUMA-locality is not considered to be a problem for the
> 	  Validation Bitmap. Since memory is not invalidated upon free,
> 	  the data structure will become read-mostly over time.
> 

I'm not sure a bitmap is needed. I hope we can use E820 for early
tracking. But let's see if it works.

> Final Notes
> -----------
> 
> This proposal does not introduce requirements about the firmware that
> has to be used to run Intel-TDX or AMD-SNP guests. It works with UEFI
> and non-UEFI firmwares, or with no firmware at all. This is important
> for use-cases like Confidential Containers running in VMs, which often
> use a very small firmware (or no firmware at all) for reducing boot
> times.
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  9:11       ` Joerg Roedel
@ 2021-07-20 17:32         ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2021-07-20 17:32 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Erdem Aktas, Andy Lutomirski, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco


On 7/20/2021 2:11 AM, Joerg Roedel wrote:
>
> I am not sure how it is implemented in TDX hardware, but for SNP the
> guest _must_ _not_ double-validate or even double-invalidate memory.


In TDX it just zeroes the data. If you can tolerate zeroing it's fine. 
Of course for most data that's not tolerable, but for kexec (minus 
kernel itself) it is.


>
> What I sent here is actually v2 of my proposal, v1 had a much more lazy
> approach like you are proposing here. But as I learned what can happen
> is this:
>
> 	* Hypervisor maps GPA X to HPA A
> 	* Guest validates GPA X
> 	  Hardware enforces that HPA A always maps to GPA X
> 	* Hypervisor remaps GPA X to HPA B
> 	* Guest lazily re-validates GPA X
> 	  Hardware enforces that HPA B always maps to GPA X
> 	
> The situation we have now is that host pages A and B are validated for
> the same guest page, and the hypervisor can switch between them at will,
> without the guest being able to notice it.


I don't believe that's possible on TDX

>
> This can open various attack vectors from the hypervisor towards the
> guest, like tricking the guest into a code-path where it accidentially
> reveals its secrets.

Well things would certainly easier if you had a purge interface then.

But for the kexec crash case it would be just attacks against the crash 
dump, which I assume are not a real security concern. The crash kexec 
mostly runs in its own memory, which doesn't need this, or is small 
enough that it can be fully pre-accepted. And for the previous memory 
view probably these issues are acceptable.

That leaves the non crash kexec case, but perhaps it is acceptable to 
just restart the guest in such a case instead of creating complicated 
and fragile new interfaces.


>> If the device filter is active it won't.
> We are not going to pohibit dma_alloc_coherent() in SNP guests just
> because we are too lazy to implement memory re-validation.


dma_alloc_coherent is of course allowed, just not freeing. Or rather if 
you free you would need a pool to recycle there.

If you have anything that free coherent dma frequently the performance 
would be terrible so you should probably avoid that at all costs anyways.

But since pretty much all the current IO models rely on a small number 
of static bounce buffers that's not a problem.

-Andi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
       [not found]     ` <eacb9c1f-2c61-4a7f-b5a3-7bf579e6cbf6@www.fastmail.com>
@ 2021-07-20 19:54       ` Erdem Aktas
  2021-07-20 22:01         ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Erdem Aktas @ 2021-07-20 19:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Kaplan, David, Varad Gautam,
	Dario Faggioli, the arch/x86 maintainers, linux-mm, linux-coco

On Mon, Jul 19, 2021 at 8:30 PM Andy Lutomirski <luto@kernel.org> wrote:
> What’s the attack you have in mind?  With TDX, the guest using the wrong shared vs secure type should, at worst, cause > crashes.  With SEV, I can imagine it’s possible for a guest to read or write the ciphertext of a private page, but actually
> turning that into an attack seems like it would require convincing a guest to use the same page with both modes.

There are a couple of things that can go wrong (maybe more):

Imagine Guest set a page shared. It is assumed that the host should
remove the SEPT entry but it does not have to, so that entry might
stay there with a valid bit set.

Guest should accept a page before it accesses it the first time or
guest can accept a page as part of #VE handler when a PENDING page is
accessed.

Current guest patch series (v3) from intel does not have any #VE
handler to accept pages on the fly. It seems like it has an assumption
that all the pages are accepted by UEFI (I have not reviewed the code
yet in detail).

Now let's say the kernel wants to access a page for the first time, or
after a kexec it wants to make sure all the pages are private. it
needs to call tdx_hcall_gpa_intent or  tdg_accept_page individually.
If the page is already accepted, tdg_accept_page does not return any
error in the current implementation in v3. Depending on how this page
is being used, it's content is now "not zeroed" as opposed to what it
is being expected. Converting this to an attack is not trivial but
possible.

I did not see any #VE implementation to handle SEPT violations when a
page is in PENDING state. I am assuming that this needs to be
supported at some point (If not then we need to discuss the use cases
for such support).

In such an implementation, hypervisor can inject a zeroed page
anytime. This seems like a not big deal but IMO it is and it can be
used as an attack vector to change conditions, masking variables etc.

> >> At the risk of asking a potentially silly question, would it be
> >> reasonable to treat non-validated memory as not-present for kernel
> >> purposes and hot-add it in a thread as it gets validated?
>
> My concern with this is, it assumes that all the present memory is private. UEFI might have some pages which are shared therefore also are present.
>
>
> Why is this a problem?  In TDX, I don’t think shared pages need any sort of validation. The private memory needs acceptance, but only DoS should be possible by getting it wrong. If EFI passed in a messy map with shared and private transitions all over, there will be a lot of extents in the map, but what actually goes wrong?

I mean if the only attack vector is DoS (which is not part of the
threat model that TDX is addressing), then why do we even need
tdaccept? I thought we need tdaccept to prevent VMM to change the
content of a page without guest knowing it.
My comment was about if we assume all non-validated memory as
not-present, how we were going to handle the shared pages transferred
from UEFI to kernel. Those are not validated but the range is present.
How can the guest kernel read those shared pages if it does not know
that they need to be mapped as shared in the first place.

Depending on how the guest kernel is handling shared to private page
conversion or how it is initializing the private pages for the first
time, a lot of things can go wrong but I need to look at the code to
provide more concrete examples.

-Erdem

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 19:54       ` Erdem Aktas
@ 2021-07-20 22:01         ` Andi Kleen
  2021-07-20 23:55           ` Erdem Aktas
  2021-07-21  8:51           ` Joerg Roedel
  0 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2021-07-20 22:01 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Andy Lutomirski, Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Kaplan, David, Varad Gautam,
	Dario Faggioli, the arch/x86 maintainers, linux-mm, linux-coco

On Tue, Jul 20, 2021 at 12:54:16PM -0700, Erdem Aktas wrote:
> Now let's say the kernel wants to access a page for the first time, or
> after a kexec it wants to make sure all the pages are private. it
> needs to call tdx_hcall_gpa_intent or  tdg_accept_page individually.
> If the page is already accepted, tdg_accept_page does not return any
> error in the current implementation in v3. Depending on how this page
> is being used, it's content is now "not zeroed" as opposed to what it
> is being expected. Converting this to an attack is not trivial but
> possible.

You mean when the TDVF is changed? In this case the unaccepted memory
will be a different memory type, so not lazy accept enabled kernels wouldn't
use it.

> 
> I did not see any #VE implementation to handle SEPT violations when a
> page is in PENDING state. I am assuming that this needs to be
> supported at some point (If not then we need to discuss the use cases
> for such support).

We actually plan to disable those #VEs, to avoid any problems with
the system call gap. Instead the plan is that the kernel will know
in advance what memory has been accepted or not, and accept it before
touching.

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  5:17     ` Andi Kleen
  2021-07-20  9:11       ` Joerg Roedel
@ 2021-07-20 23:09       ` Erdem Aktas
  2021-07-21  0:38         ` Andi Kleen
  2021-07-22 17:31       ` Marc Orr
  2 siblings, 1 reply; 48+ messages in thread
From: Erdem Aktas @ 2021-07-20 23:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

On Mon, Jul 19, 2021 at 10:17 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> First I suspect for crash it's not a real security problem if a
> malicious hypervisor would inject zeroed pages. That means actual strong
> checks against revalidation/reaccept are not needed. That still leaves
> the issue of triggering an exception when the memory is not there. TDX
> has an option to trigger a #VE in this case, but we will actually force
> disable it to avoid #VE in the system call gap. But the standard crash
> dump tools already know how to parse mem_map to distinguish different
> types of pages. So they already would be able to do that. We just need
> some kind of safety mechanism to prevent any crashes, but that should be
> doable. Actually I'm not sure they're really needed because that's a
> root operation.

Just to make sure that I am not confused. We are talking about a
scenario where no private/shared page mapping is transferred between
normal kernel and crash kernel.

It is very hard to identify a security issue without seeing an
implementation but if the crash kernel does not revalidate the memory,
it might use a memory which was not accepted before (for example a
previously shared page) and then it needs to handle EPT-violation #VE
to accept it and now the content is gone. - assuming that we want to
dump all the pages. I might be missing something obvious here but I am
not sure how to crash kernel dumps all the memory when #VE handler is
disabled or without having some private/shared page mapping. Once you
have that #VE handler to accept pages, then VMM can inject zeroed
pages to any location unless the guest keeps track of what has been
accepted before.

> > >> Also in general i don't think it will really happen, at least
> > initially.
> > >> All the shared buffers we use are allocated and never freed. So such a
> > >> problem could be deferred.
> >
> > Does it not depend on kernel configs? Currently, there is a valid
> > control path in dma_alloc_coherent which might alloc and free shared
> > pages.
>
> If the device filter is active it won't.
>

If I am not missing something, I do not see that the device filter
checks for CONFIG_DMA_COHERENT_POOL and if it is not enabled,
dma_alloc_coherent will allocate a regular memory, convert it to
shared and convert it back to private when it is freed.

-Erdem


> > >> At the risk of asking a potentially silly question, would it be
> > >> reasonable to treat non-validated memory as not-present for kernel
> > >> purposes and hot-add it in a thread as it gets validated?
> >
> > My concern with this is, it assumes that all the present memory is
> > private. UEFI might have some pages which are shared therefore also
> > are present.
>
>
> Hot add is nearly always a bad idea.
>
>
> -Andi
>
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 22:01         ` Andi Kleen
@ 2021-07-20 23:55           ` Erdem Aktas
  2021-07-21  0:35             ` Andi Kleen
  2021-07-21  8:51           ` Joerg Roedel
  1 sibling, 1 reply; 48+ messages in thread
From: Erdem Aktas @ 2021-07-20 23:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Kaplan, David, Varad Gautam,
	Dario Faggioli, the arch/x86 maintainers, linux-mm, linux-coco

Thank you so much for your answer and sorry for keeping the discussion long.

On Tue, Jul 20, 2021 at 3:01 PM Andi Kleen <ak@linux.intel.com> wrote:
> You mean when the TDVF is changed? In this case the unaccepted memory
> will be a different memory type, so not lazy accept enabled kernels wouldn't
> use it.

Thanks Andi for the clarification. I also saw the Kirill's answer. It
makes sense.

> But for the kexec crash case it would be just attacks against the crash
> dump, which I assume are not a real security concern.

If the crash kernel is compromised, it can be used to dump the
customer memory content  to a shared location which is a real security
concern, is it not?

> The crash kexec
> mostly runs in its own memory, which doesn't need this, or is small
> enough that it can be fully pre-accepted. And for the previous memory
> view probably these issues are acceptable.

I think this is where I am getting confused. I agree that we can copy
the crashkernel to its own memory (all accepted) and run it. My
confusion is: crash kernel will dump the memory which might have some
shared pages between. we have 3 options:
1- We can either accept all the pages again, that includes the shared
pages and lose the content of it. If we do not care about the content
in shared pages, then this is okay.
2- Have a mechanism to transfer the private/shared page mapping and
map all the pages accordingly before dumping.
3- Have a #VE handler and to accept the pages on the flight or
identify if it is a shared page based on EPT-violation #VE
information.

I am not sure what crash kernel can do when it accesses a previously
shared page (no SEPT entry) as private with the lack of one of the
above options or similar one.

>
> We actually plan to disable those #VEs, to avoid any problems with
> the system call gap. Instead the plan is that the kernel will know
> in advance what memory has been accepted or not, and accept it before
> touching.

Make sense. Thanks Andi.

-Erdem

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 23:55           ` Erdem Aktas
@ 2021-07-21  0:35             ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2021-07-21  0:35 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Andy Lutomirski, Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Kaplan, David, Varad Gautam,
	Dario Faggioli, the arch/x86 maintainers, linux-mm, linux-coco

On Tue, Jul 20, 2021 at 04:55:22PM -0700, Erdem Aktas wrote:
> Thank you so much for your answer and sorry for keeping the discussion long.
> 
> On Tue, Jul 20, 2021 at 3:01 PM Andi Kleen <ak@linux.intel.com> wrote:
> > You mean when the TDVF is changed? In this case the unaccepted memory
> > will be a different memory type, so not lazy accept enabled kernels wouldn't
> > use it.
> 
> Thanks Andi for the clarification. I also saw the Kirill's answer. It
> makes sense.
> 
> > But for the kexec crash case it would be just attacks against the crash
> > dump, which I assume are not a real security concern.
> 
> If the crash kernel is compromised, it can be used to dump the
> customer memory content  to a shared location which is a real security
> concern, is it not?

This wouldn't be about compromising the crash kernel, but just about
inserting random zeroed pages into the crash dump. I assume the crash
parsing tools can handle corrupted data, it certainly happens often
enough with real dumps.

The crash kernel itself would need to be properly pre validated/accepted of
course, but that will likely happen when it loads. And the memory
it uses could be re-accepted as long as it only happens before it
is actually used (at least on Intel, there might be still the issue
Joern pointed out on AMD, but I guess there it could be avoided 
by just pre accepting everything and setting up a suitable memory
map)

BTW with our current plan of disabling the #VE i don't think
it can happen anyways.


> > The crash kexec
> > mostly runs in its own memory, which doesn't need this, or is small
> > enough that it can be fully pre-accepted. And for the previous memory
> > view probably these issues are acceptable.
> 
> I think this is where I am getting confused. I agree that we can copy
> the crashkernel to its own memory (all accepted) and run it. My
> confusion is: crash kernel will dump the memory which might have some
> shared pages between. we have 3 options:
> 1- We can either accept all the pages again, that includes the shared
> pages and lose the content of it. If we do not care about the content
> in shared pages, then this is okay.

On TDX this would lead to clearing the pages, which is definitely
not what you want for a crash dump.

> 2- Have a mechanism to transfer the private/shared page mapping and
> map all the pages accordingly before dumping.

FWIW we have very little shared mappings, and I suspect their
content is probably not super important for debugging (no
kernel data structures there). So even if you lost shared
mappings for crash dumps it likely wouldn't be a problem.

But actually it should just work.

-Andi 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 23:09       ` Erdem Aktas
@ 2021-07-21  0:38         ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2021-07-21  0:38 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Andy Lutomirski, Joerg Roedel, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

> > > >> Also in general i don't think it will really happen, at least
> > > initially.
> > > >> All the shared buffers we use are allocated and never freed. So such a
> > > >> problem could be deferred.
> > >
> > > Does it not depend on kernel configs? Currently, there is a valid
> > > control path in dma_alloc_coherent which might alloc and free shared
> > > pages.
> >
> > If the device filter is active it won't.
> >
> 
> If I am not missing something, I do not see that the device filter
> checks for CONFIG_DMA_COHERENT_POOL and if it is not enabled,
> dma_alloc_coherent will allocate a regular memory, convert it to
> shared and convert it back to private when it is freed.

What I meant is that the only devices that will be supported (mainly
virtio) initially don't ever free coherent memory
And the device filter enforces that.

Now we probably want to support freeing anyways just to be able to run
without device filter, but it definitely doesn't have to be fast
or efficient. If there's a problem with it it would be a quite reasonable
implementation to keep it in a pool.

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 22:01         ` Andi Kleen
  2021-07-20 23:55           ` Erdem Aktas
@ 2021-07-21  8:51           ` Joerg Roedel
  1 sibling, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-21  8:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Erdem Aktas, Andy Lutomirski, David Rientjes, Borislav Petkov,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Paolo Bonzini, Ingo Molnar, Kaplan, David, Varad Gautam,
	Dario Faggioli, the arch/x86 maintainers, linux-mm, linux-coco

On Tue, Jul 20, 2021 at 03:01:13PM -0700, Andi Kleen wrote:
> On Tue, Jul 20, 2021 at 12:54:16PM -0700, Erdem Aktas wrote:
> > I did not see any #VE implementation to handle SEPT violations when a
> > page is in PENDING state. I am assuming that this needs to be
> > supported at some point (If not then we need to discuss the use cases
> > for such support).
> 
> We actually plan to disable those #VEs, to avoid any problems with
> the system call gap. Instead the plan is that the kernel will know
> in advance what memory has been accepted or not, and accept it before
> touching.

This confuses me a bit, what happens when the VMM is malicious and
re-maps an already accepted page and the TD tries to access it?

My thinking was that this causes a #VE, but what happens instead when
this #VE can be disabled?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 17:30 ` Kirill A. Shutemov
@ 2021-07-21  9:20   ` Mike Rapoport
  2021-07-21 10:02     ` Kirill A. Shutemov
  2021-07-21  9:25   ` Joerg Roedel
  2021-07-22 15:46   ` David Hildenbrand
  2 siblings, 1 reply; 48+ messages in thread
From: Mike Rapoport @ 2021-07-21  9:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Tue, Jul 20, 2021 at 08:30:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> > Hi,
> > 
> > I'd like to get some movement again into the discussion around how to
> > implement runtime memory validation for confidential guests and wrote up
> > some thoughts on it.
> > Below are the results in form of a proposal I put together. Please let
> > me know your thoughts on it and whether it fits everyones requirements.
> 
> Thanks for bringing it up. I'm working on the topic for Intel TDX. See
> comments below.
> 
> > 
> > Thanks,
> > 
> > 	Joerg
> > 
> > Proposal for Runtime Memory Validation in Secure Guests on x86
> > ==============================================================

[ snip ]

> > 	8. When memory is returned to the memblock or page allocators,
> > 	   it is _not_ invalidated. In fact, all memory which is freed
> > 	   need to be valid. If it was marked invalid in the meantime
> > 	   (e.g. if it the memory was used for DMA buffers), the code
> > 	   owning the memory needs to validate it again before freeing
> > 	   it.
> > 
> > 	   The benefit of doing memory validation at allocation time is
> > 	   that it keeps the exception handler for invalid memory
> > 	   simple, because no exceptions of this kind are expected under
> > 	   normal operation.
> 
> During early boot I treat unaccepted memory as a usable RAM. It only
> requires special treatment on memblock_reserve(), which used for early
> memory allocation: unaccepted usable RAM has to be accepted, before
> reserving.

memblock_reserve() is not always used for early allocations and some of the
early allocations on x86 don't use memblock at all. Hooking
validation/acceptance to memblock_reserve() should be fine for PoC but I
suspect there will be caveats for production.
 
> For fine-grained accepting/validation tracking I use PageOffline() flags
> (it's encoded into mapcount): before adding an unaccepted page to free
> list I set the PageOffline() to indicate that the page has to be accepted
> before returning from the page allocator. Currently, we never have
> PageOffline() set for pages on free lists, so we won't have confusion with
> ballooning or memory hotplug.
>
> I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
> MAX_ORDER). It is reasonable compromise on speed/latency.

Keeping fine grained accepting/validation information in the memory map
means it cannot be reused across reboots/kexec and there should be an
additional data structure to carry this information. It could be the same
structure that is used by firmware to inform kernel about usable memory,
just it needs to live after boot and get updates about new (in)validations.
Doing those in 2M/4M chunks will help to prevent this structure from
exploding.

BTW, as Dave mentioned, the deferred struct page init can also take care of
the validation.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 17:30 ` Kirill A. Shutemov
  2021-07-21  9:20   ` Mike Rapoport
@ 2021-07-21  9:25   ` Joerg Roedel
  2021-07-21 10:25     ` Kirill A. Shutemov
  2021-07-22 15:46   ` David Hildenbrand
  2 siblings, 1 reply; 48+ messages in thread
From: Joerg Roedel @ 2021-07-21  9:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

Hi Kirill,

On Tue, Jul 20, 2021 at 08:30:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> We use EFI unaccepted memory type to pass this information between
> firmware and kernel. In my WIP patch I translate it to a new E820 memory
> type: E820_TYPE_UNACCEPTED.

Yeah, that is what I meant with a new E820 entry type.

> E820 can also be used during early boot for tracking what memory got
> accepted by kernel too.

Won't this get very fragmented? How do you handle overlaps with other
E820 regions?

> For now, I debug with 256MiB accepted by firmware. It allows to avoid
> dealing with decompression code at this stage of the project. I plan to
> lower the number later.

Yes, this can be experimented with, the proposal allows a custom amount
of memory to be pre-validated/accepted.

> I would argue for per-range, not per-page, tracking of accepted/validated
> memory for decompresser and early boot code, until page allocator is fully
> functional. I have reasonable success with this approach so far.

What do you mean by 'reasonable' success? Especially, how robust is that
against unrelated changes to the boot code? As with SEV-SNP, I guess
there will be no broad testing of unrelated kernel changes in a TDX
environment, so some robustness is key to keep things working.

> During early boot I treat unaccepted memory as a usable RAM. It only
> requires special treatment on memblock_reserve(), which used for early
> memory allocation: unaccepted usable RAM has to be accepted, before
> reserving.

What happens before memblock is active, say in the decompressor. Will
unaccepted memory be considered for KASLR placement?

> For fine-grained accepting/validation tracking I use PageOffline() flags
> (it's encoded into mapcount): before adding an unaccepted page to free
> list I set the PageOffline() to indicate that the page has to be accepted
> before returning from the page allocator. Currently, we never have
> PageOffline() set for pages on free lists, so we won't have confusion with
> ballooning or memory hotplug.

Okay, I think that could also easily break with unrelated memory
management changes, but should work for now in TDX.

> I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
> MAX_ORDER). It is reasonable compromise on speed/latency.

Makes sense, SEV-SNP will likely do something similar.

> I'm not sure a bitmap is needed. I hope we can use E820 for early
> tracking. But let's see if it works.

We should find a solution which works for TDX and SNP, given that the
required changes are intrusive and that it is much easier to just
support one way to handle this.

That said, the Validation Bitmap has a clear benefit for SEV-SNP in that
it makes it trivial to support kexec/kdump scenarios. Further the
bitmap makes it trivial to transport the information through the whole
boot process. It also won't be big, SNP (and I think TDX too) would
be okay with one bit per 4k page, so the bitmap would need 32kb of
memory per GB of guest RAM.

And keeping the information separate from struct page will make the code
more robust against unrelated code changes.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-21  9:20   ` Mike Rapoport
@ 2021-07-21 10:02     ` Kirill A. Shutemov
  2021-07-21 10:22       ` Mike Rapoport
  2021-07-21 10:53       ` Joerg Roedel
  0 siblings, 2 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-21 10:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Wed, Jul 21, 2021 at 12:20:17PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 08:30:04PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> > > Hi,
> > > 
> > > I'd like to get some movement again into the discussion around how to
> > > implement runtime memory validation for confidential guests and wrote up
> > > some thoughts on it.
> > > Below are the results in form of a proposal I put together. Please let
> > > me know your thoughts on it and whether it fits everyones requirements.
> > 
> > Thanks for bringing it up. I'm working on the topic for Intel TDX. See
> > comments below.
> > 
> > > 
> > > Thanks,
> > > 
> > > 	Joerg
> > > 
> > > Proposal for Runtime Memory Validation in Secure Guests on x86
> > > ==============================================================
> 
> [ snip ]
> 
> > > 	8. When memory is returned to the memblock or page allocators,
> > > 	   it is _not_ invalidated. In fact, all memory which is freed
> > > 	   need to be valid. If it was marked invalid in the meantime
> > > 	   (e.g. if it the memory was used for DMA buffers), the code
> > > 	   owning the memory needs to validate it again before freeing
> > > 	   it.
> > > 
> > > 	   The benefit of doing memory validation at allocation time is
> > > 	   that it keeps the exception handler for invalid memory
> > > 	   simple, because no exceptions of this kind are expected under
> > > 	   normal operation.
> > 
> > During early boot I treat unaccepted memory as a usable RAM. It only
> > requires special treatment on memblock_reserve(), which used for early
> > memory allocation: unaccepted usable RAM has to be accepted, before
> > reserving.
> 
> memblock_reserve() is not always used for early allocations and some of the
> early allocations on x86 don't use memblock at all.

Do you mean any codepath in particular?

> Hooking
> validation/acceptance to memblock_reserve() should be fine for PoC but I
> suspect there will be caveats for production.

That's why I do PoC. Will see. So far so good. Maybe it will be visible
with smaller pre-accepted memory size.

> > For fine-grained accepting/validation tracking I use PageOffline() flags
> > (it's encoded into mapcount): before adding an unaccepted page to free
> > list I set the PageOffline() to indicate that the page has to be accepted
> > before returning from the page allocator. Currently, we never have
> > PageOffline() set for pages on free lists, so we won't have confusion with
> > ballooning or memory hotplug.
> >
> > I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
> > MAX_ORDER). It is reasonable compromise on speed/latency.
> 
> Keeping fine grained accepting/validation information in the memory map
> means it cannot be reused across reboots/kexec and there should be an
> additional data structure to carry this information. It could be the same
> structure that is used by firmware to inform kernel about usable memory,
> just it needs to live after boot and get updates about new (in)validations.
> Doing those in 2M/4M chunks will help to prevent this structure from
> exploding.

Yeah, we would need to reconstruct the EFI map somehow. Or we can give
most of memory back to the host and accept/validate the memory again after
reboot/kexec. I donno.

> BTW, as Dave mentioned, the deferred struct page init can also take care of
> the validation.

That was my first thought too and I tried it just to realize that it is
not what we want. If we would accept page on page struct init it means we
would make host allocate all memory assigned to the guest on boot even if
guest actually use small portion of it.

Also deferred page init only allows to scale validation across multiple
CPUs, but doesn't allow to get to userspace before we done with it. See
wait_for_completion(&pgdat_init_all_done_comp).

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-21 10:02     ` Kirill A. Shutemov
@ 2021-07-21 10:22       ` Mike Rapoport
  2021-07-21 10:53       ` Joerg Roedel
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Rapoport @ 2021-07-21 10:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Wed, Jul 21, 2021 at 01:02:06PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 21, 2021 at 12:20:17PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 20, 2021 at 08:30:04PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> > > > Hi,
> > > > 
> > > > I'd like to get some movement again into the discussion around how to
> > > > implement runtime memory validation for confidential guests and wrote up
> > > > some thoughts on it.
> > > > Below are the results in form of a proposal I put together. Please let
> > > > me know your thoughts on it and whether it fits everyones requirements.
> > > 
> > > Thanks for bringing it up. I'm working on the topic for Intel TDX. See
> > > comments below.
> > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Joerg
> > > > 
> > > > Proposal for Runtime Memory Validation in Secure Guests on x86
> > > > ==============================================================
> > 
> > [ snip ]
> > 
> > > > 	8. When memory is returned to the memblock or page allocators,
> > > > 	   it is _not_ invalidated. In fact, all memory which is freed
> > > > 	   need to be valid. If it was marked invalid in the meantime
> > > > 	   (e.g. if it the memory was used for DMA buffers), the code
> > > > 	   owning the memory needs to validate it again before freeing
> > > > 	   it.
> > > > 
> > > > 	   The benefit of doing memory validation at allocation time is
> > > > 	   that it keeps the exception handler for invalid memory
> > > > 	   simple, because no exceptions of this kind are expected under
> > > > 	   normal operation.
> > > 
> > > During early boot I treat unaccepted memory as a usable RAM. It only
> > > requires special treatment on memblock_reserve(), which used for early
> > > memory allocation: unaccepted usable RAM has to be accepted, before
> > > reserving.
> > 
> > memblock_reserve() is not always used for early allocations and some of the
> > early allocations on x86 don't use memblock at all.
> 
> Do you mean any codepath in particular?

I don't have examples handy, but in general there are calls to
e820__range_update() that make memory !RAM and it never gets into memblock.
On the other side, memblock_reserve() can be called to reserve memory owned
y firmware that may be already accepted.

> > Hooking
> > validation/acceptance to memblock_reserve() should be fine for PoC but I
> > suspect there will be caveats for production.
> 
> That's why I do PoC. Will see. So far so good. Maybe it will be visible
> with smaller pre-accepted memory size.

Maybe some of my concerns only apply to systems with BIOSes weirder than
usual and for VMs all would be fine. 
I'd suggest to experiment with "memmap=" to manually assign various e820
types to memory chunks to see if there are any strange effects.
 
> > > For fine-grained accepting/validation tracking I use PageOffline() flags
> > > (it's encoded into mapcount): before adding an unaccepted page to free
> > > list I set the PageOffline() to indicate that the page has to be accepted
> > > before returning from the page allocator. Currently, we never have
> > > PageOffline() set for pages on free lists, so we won't have confusion with
> > > ballooning or memory hotplug.
> > >
> > > I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
> > > MAX_ORDER). It is reasonable compromise on speed/latency.
> > 
> > Keeping fine grained accepting/validation information in the memory map
> > means it cannot be reused across reboots/kexec and there should be an
> > additional data structure to carry this information. It could be the same
> > structure that is used by firmware to inform kernel about usable memory,
> > just it needs to live after boot and get updates about new (in)validations.
> > Doing those in 2M/4M chunks will help to prevent this structure from
> > exploding.
> 
> Yeah, we would need to reconstruct the EFI map somehow. Or we can give
> most of memory back to the host and accept/validate the memory again after
> reboot/kexec. I donno.
> 
> > BTW, as Dave mentioned, the deferred struct page init can also take care of
> > the validation.
> 
> That was my first thought too and I tried it just to realize that it is
> not what we want. If we would accept page on page struct init it means we
> would make host allocate all memory assigned to the guest on boot even if
> guest actually use small portion of it.

Yep, you are right.
 
> Also deferred page init only allows to scale validation across multiple
> CPUs, but doesn't allow to get to userspace before we done with it. See
> wait_for_completion(&pgdat_init_all_done_comp).

True.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-21  9:25   ` Joerg Roedel
@ 2021-07-21 10:25     ` Kirill A. Shutemov
  2021-07-21 10:48       ` Joerg Roedel
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-21 10:25 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Wed, Jul 21, 2021 at 11:25:25AM +0200, Joerg Roedel wrote:
> Hi Kirill,
> 
> On Tue, Jul 20, 2021 at 08:30:04PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> > We use EFI unaccepted memory type to pass this information between
> > firmware and kernel. In my WIP patch I translate it to a new E820 memory
> > type: E820_TYPE_UNACCEPTED.
> 
> Yeah, that is what I meant with a new E820 entry type.
> 
> > E820 can also be used during early boot for tracking what memory got
> > accepted by kernel too.
> 
> Won't this get very fragmented? How do you handle overlaps with other
> E820 regions?

I modify E820 as needed:

	e820__range_update(start, end, E820_TYPE_UNACCEPTED, E820_TYPE_RAM);

I also ask memblock for bottom-up allocation as it helps with using
per-accepted pages first and reduces fragmentation:

	memblock_set_bottom_up(true);

> > For now, I debug with 256MiB accepted by firmware. It allows to avoid
> > dealing with decompression code at this stage of the project. I plan to
> > lower the number later.
> 
> Yes, this can be experimented with, the proposal allows a custom amount
> of memory to be pre-validated/accepted.
> 
> > I would argue for per-range, not per-page, tracking of accepted/validated
> > memory for decompresser and early boot code, until page allocator is fully
> > functional. I have reasonable success with this approach so far.
> 
> What do you mean by 'reasonable' success?

It appears to work fine with 256MiB of pre-accepted memory, but more
testing is required.

> Especially, how robust is that against unrelated changes to the boot
> code? As with SEV-SNP, I guess there will be no broad testing of
> unrelated kernel changes in a TDX environment, so some robustness is key
> to keep things working.

Hard to say. Let me get the prototype functional first. It's easier to
discuss with code on hands.

> > During early boot I treat unaccepted memory as a usable RAM. It only
> > requires special treatment on memblock_reserve(), which used for early
> > memory allocation: unaccepted usable RAM has to be accepted, before
> > reserving.
> 
> What happens before memblock is active, say in the decompressor. Will
> unaccepted memory be considered for KASLR placement?

I tried to postpone thinking about decompresser as long as possible :P

I guess we need pass down information about memory accepted in
decompresser to the main kernel so it can record in E820. I think it will
a single range.

> > For fine-grained accepting/validation tracking I use PageOffline() flags
> > (it's encoded into mapcount): before adding an unaccepted page to free
> > list I set the PageOffline() to indicate that the page has to be accepted
> > before returning from the page allocator. Currently, we never have
> > PageOffline() set for pages on free lists, so we won't have confusion with
> > ballooning or memory hotplug.
> 
> Okay, I think that could also easily break with unrelated memory
> management changes, but should work for now in TDX.
> 
> > I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
> > MAX_ORDER). It is reasonable compromise on speed/latency.
> 
> Makes sense, SEV-SNP will likely do something similar.
> 
> > I'm not sure a bitmap is needed. I hope we can use E820 for early
> > tracking. But let's see if it works.
> 
> We should find a solution which works for TDX and SNP, given that the
> required changes are intrusive and that it is much easier to just
> support one way to handle this.
> 
> That said, the Validation Bitmap has a clear benefit for SEV-SNP in that
> it makes it trivial to support kexec/kdump scenarios. Further the
> bitmap makes it trivial to transport the information through the whole
> boot process. It also won't be big, SNP (and I think TDX too) would
> be okay with one bit per 4k page, so the bitmap would need 32kb of
> memory per GB of guest RAM.

Yes, the bitmap is small, but it going to be rather hot structure. It has
to be consulted on every page allocation, right?

How to do plan to make bitmap scalable? What the locking rules around it?

> And keeping the information separate from struct page will make the code
> more robust against unrelated code changes.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-21 10:25     ` Kirill A. Shutemov
@ 2021-07-21 10:48       ` Joerg Roedel
  0 siblings, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-21 10:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Wed, Jul 21, 2021 at 01:25:36PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 21, 2021 at 11:25:25AM +0200, Joerg Roedel wrote:

> I modify E820 as needed:
> 
> 	e820__range_update(start, end, E820_TYPE_UNACCEPTED, E820_TYPE_RAM);
> 
> I also ask memblock for bottom-up allocation as it helps with using
> per-accepted pages first and reduces fragmentation:
> 
> 	memblock_set_bottom_up(true);

This happens already in the decompressed kernel image. The decompressor
also needs to be able to validate memory and pass the information about
it on.

> I tried to postpone thinking about decompresser as long as possible :P
> 
> I guess we need pass down information about memory accepted in
> decompresser to the main kernel so it can record in E820. I think it will
> a single range.

This means modifying the E820 array in boot_params in the decompressor,
handling overlaps, splitting entries and all that.

> Yes, the bitmap is small, but it going to be rather hot structure. It has
> to be consulted on every page allocation, right?
> 
> How to do plan to make bitmap scalable? What the locking rules around it?

It is expected that the bitmap becomes read-mostly over time, so the
cache-lines can be shared. The access should be possible using only
atomic bit manipulation instructions when validation happens in the
memory allocators, because no one is trying to modify the same bits
concurrently.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-21 10:02     ` Kirill A. Shutemov
  2021-07-21 10:22       ` Mike Rapoport
@ 2021-07-21 10:53       ` Joerg Roedel
  1 sibling, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-21 10:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Rapoport, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Wed, Jul 21, 2021 at 01:02:06PM +0300, Kirill A. Shutemov wrote:
> Yeah, we would need to reconstruct the EFI map somehow. Or we can give
> most of memory back to the host and accept/validate the memory again after
> reboot/kexec. I donno.

Invalidating all memory will also take a lot of time (in the range of
seconds). And the EFI map can get pretty large when there is enough
fragmentation. The easiest way to handle this is to just pass on an
up-to-date data structure.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20 17:30 ` Kirill A. Shutemov
  2021-07-21  9:20   ` Mike Rapoport
  2021-07-21  9:25   ` Joerg Roedel
@ 2021-07-22 15:46   ` David Hildenbrand
  2021-07-26 19:02     ` Joerg Roedel
  2 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2021-07-22 15:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Joerg Roedel
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco


>>
>> 	8. When memory is returned to the memblock or page allocators,
>> 	   it is _not_ invalidated. In fact, all memory which is freed
>> 	   need to be valid. If it was marked invalid in the meantime
>> 	   (e.g. if it the memory was used for DMA buffers), the code
>> 	   owning the memory needs to validate it again before freeing
>> 	   it.
>>
>> 	   The benefit of doing memory validation at allocation time is
>> 	   that it keeps the exception handler for invalid memory
>> 	   simple, because no exceptions of this kind are expected under
>> 	   normal operation.
> 
> During early boot I treat unaccepted memory as a usable RAM. It only
> requires special treatment on memblock_reserve(), which used for early
> memory allocation: unaccepted usable RAM has to be accepted, before
> reserving.
> 
> For fine-grained accepting/validation tracking I use PageOffline() flags
> (it's encoded into mapcount): before adding an unaccepted page to free
> list I set the PageOffline() to indicate that the page has to be accepted
> before returning from the page allocator. Currently, we never have
> PageOffline() set for pages on free lists, so we won't have confusion with
> ballooning or memory hotplug.

I was just about to propose something similar. Something like that 
sounds like the best approach to me

1. Sync e820 to memblock
2. Sync memblock to memmap
3. Let the page allocator deal with validation once initializing/handing 
out memory

PageOffline() does exactly what you want, just be aware that 
PageBuddy()+PageOffline() won't be recognized by crash anymore, as it 
tests for a single memmap value. Can be fixed with makedumpfile updates 
once that applies.

Alternatively, you could use any other page flag that is yet unsued 
combined with PageBuddy.

Sure, there might be obstacles, but it certainly sounds like a clean 
approach to me.

> 
> I try to keep pages accepted in 2M or 4M chunks (pageblock_order or
> MAX_ORDER). It is reasonable compromise on speed/latency.
> 
> I still debugging the code, but hopefully will get working PoC this week.
> 

[...]

> 
> I'm not sure a bitmap is needed. I hope we can use E820 for early
> tracking. But let's see if it works.

+1, this smells like an anti-patter. I'm absolutely not in favor of a 
bitmap, we have the sparse memory model for a reason.

Also, I am not convinced that kexec/kdump is actually easy to realize 
with the bitmap? Who will forward that bitmap? Where will it reside? Who 
says it's not corrupted? Just take a look at how we don't even have 
access to memmap of the oldkernel in the newkernel -- and have to locate 
and decipher it in constantly-to-be-updated user space makedumpfile. Any 
time you'd change anything about the bitmap ("hey, let's use larger 
chunks", "hey, let's split it up") you'd break the old_kernel <-> 
new_kernel agreement.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
                   ` (3 preceding siblings ...)
  2021-07-20 17:30 ` Kirill A. Shutemov
@ 2021-07-22 15:57 ` David Hildenbrand
  2021-07-22 19:51 ` Kirill A. Shutemov
  2021-07-23 11:04 ` Varad Gautam
  6 siblings, 0 replies; 48+ messages in thread
From: David Hildenbrand @ 2021-07-22 15:57 UTC (permalink / raw)
  To: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli
  Cc: x86, linux-mm, linux-coco

On 19.07.21 14:58, Joerg Roedel wrote:
> Hi,
> 
> I'd like to get some movement again into the discussion around how to
> implement runtime memory validation for confidential guests and wrote up
> some thoughts on it.
> Below are the results in form of a proposal I put together. Please let
> me know your thoughts on it and whether it fits everyones requirements.
> 
> Thanks,
> 
> 	Joerg
> 
> Proposal for Runtime Memory Validation in Secure Guests on x86
> ==============================================================
> 
> This proposal describes a method and protocol for runtime validation of
> memory in virtualization guests running with Intel Trusted Domain
> Extensions (Intel-TDX) or AMD Secure Nested Paging (AMD-SNP).
> 
> AMD-SNP and Intel-TDX use different terms to discuss memory page states.
> In AMD-SNP memory has to be 'validated' while in Intel-TDX is will be
> 'accepted'. This document uses the term 'validated' for both.
> 
> Problem Statement
> -----------------
> 
> Virtualization guests which run with AMD-SNP or Intel-TDX need to
> validate their memory before using it. The validation assigns a hardware
> state to each page which allows the guest to detect when the hypervisor
> tries to maliciously access or remap a guest-private page. The guest can
> only access validated pages.
> 
> There are three ways the guest memory can be validated:
> 
> 	I.   The firmware validates all of guest memory at boot time. This
> 	     is the simplest method which requires the least changes to
> 	     the Linux kernel. But this method is also very slow and
> 	     causes unwanted delays in the boot process, as verification
> 	     can take several seconds (depending on guest memory size).
> 
> 	II.  The firmware only validates its own memory and memory
> 	     validation happens as the memory is used. This significantly
> 	     improves the boot time, but needs more intrusive changes to
> 	     the Linux kernel and its boot process.
> 
> 
> 	III. Approach I. and II. can be combined. The firmware only
> 	     validates the first X MB/GB of guest memory and the rest is
> 	     validated on-demand.
> 
> For method II. and III. the guest needs to track which pages have
> already been validated to detect hypervisor attacks. This information
> needs to be carried through the whole boot process.
> 
> This poses challenges on the Linux boot process, as there is currently
> no way to forward information about validated memory up the boot chain.
> This proposal tries to describe a way to solve these challenges.
> 
> Memory Validation through the Boot Process and in the Running System
> --------------------------------------------------------------------
> 
> The memory is validated throughout the boot process as described below.
> These steps assume a firmware is present, but this proposal does not
> strictly require a firmware. The tasks done be the firmware can also be
> done by the hypervisor before starting the guest. The steps are:
> 
> 	1. The firmware validates all memory which will not be owned by
> 	   the boot loader or the OS.
> 
> 	2. The firmware also validates the first X MB of memory, just
> 	   enough to run a boot loader and to load the compressed Linux
> 	   kernel image. X is not expected to be very large, 64 or 128
> 	   MB should be enough. This pre-validation should not cause
> 	   significant delays in the boot process.
> 
> 	3. The validated memory is marked E820-Usable in struct
> 	   boot_params for the Linux decompressor. The rest of the
> 	   memory is also passed to Linux via new special E820 entries
> 	   which mark the memory as Usable-but-Invalid.
> 
> 	4. When the Linux decompressor takes over control, it evaluates
> 	   the E820 table and calculates to total amount of memory
> 	   available to Linux (valid and invalid memory).
> 
> 	   The decompressor allocates a physically contiguous data
> 	   structure at a random memory location which is big enough to
> 	   hold the the validation states of all 4kb pages available to
> 	   the guest. This data structure will be called the Validation
> 	   Bitmap through the rest of this document. The Validation
> 	   Bitmap is indexed by page frame numbers.
> 
> 	   It still needs to be determined how many bits are required
> 	   per page. This depends on the necessity to track validation
> 	   page-sizes. Two bits per page are enough to track the 3
> 	   page-sizes currently available on the x86 architecture.
> 
> 	   The decompressor initializes the Validation Bitmap by first
> 	   validating its backing memory and then updating it with the
> 	   information from the E820 table. It will also update the
> 	   table if it changes the state of pages from invalid to valid
> 	   (and vice versa, e.g. for mapping a GHCB page).
> 
> 	5. The 'struct boot_params' is extended to carry the location
> 	   and size of the Validation Bitmap to the extracted kernel
> 	   image.
> 	   In fact, since the decompressor already receives a 'struct
> 	   boot_params', it will check if it carries a Validation
> 	   Bitmap. If it does, the decompressor uses the existing one
> 	   instead of allocating a new one.
> 
> 	6. When the extracted kernel image takes over control, it will
> 	   make sure the Validation Bitmap is up to date when memory
> 	   needs to be validated.
> 
> 	7. When set up, the memblock and page allocators have to check
> 	   whether the memory they return is already validated, and
> 	   validate it if not.
> 
> 	   This should happen after the memory is allocated and all
> 	   allocator-locks are dropped, but before the memory is
> 	   returned to the caller. This way the access to the
> 	   validation bitmap can be implemented without locking and only
> 	   using atomic instructions.
> 
> 	   Under no circumstances the Linux kernel is allowed to
> 	   validate a page more than once. Doing this might create
> 	   attack vectors for the Hypervisor towards the guest.
> 
> 	8. When memory is returned to the memblock or page allocators,
> 	   it is _not_ invalidated. In fact, all memory which is freed
> 	   need to be valid. If it was marked invalid in the meantime
> 	   (e.g. if it the memory was used for DMA buffers), the code
> 	   owning the memory needs to validate it again before freeing
> 	   it.
> 
> 	   The benefit of doing memory validation at allocation time is
> 	   that it keeps the exception handler for invalid memory
> 	   simple, because no exceptions of this kind are expected under
> 	   normal operation.
> 
> The Validation Bitmap
> ---------------------
> 
> This document proposes the use of a Validation Bitmap to store the
> validation state of guest pages. This section discusses the benefits of
> this approach.
> 
> The Linux kernel already has an array to store various state for each
> memory page in the system: The struct page array. While this would be a
> natural place to also store page validation information, the Validation
> Bitmap is chosen because having the information separated has some clear
> benefits:
> 
> 	- The Validation Bitmap is allocated in the Linux decompressor
> 	  and already available long before the struct page array is
> 	  initialized.
> 
> 	- Since it is a simple in-memory data structure which is
> 	  physically contiguous, it can be passed along through the
> 	  various stages of the boot process.
> 
> 	- It can even be passed to a new kernel booted via kexec/kdump,
> 	  making it trivial to enable these features for AMD-SNP and
> 	  Intel-TDX.
> 
> 	- When memory validation happens in the memblock and page
> 	  allocators, there is no need for locking when making changes
> 	  to the Validation Bitmap, because:
> 	
> 	    - Nobody will try to concurrently access the same bits, as
> 	      the code-path doing the validation is the only owner of
> 	      the memory.
> 
> 	    - Updates can happen via atomic cmpxchg instructions
> 	      when multiple bits are used per page. If only one bit is
> 	      needed, atomic bit manipulation instructions will suffice.
> 
> 	- NUMA-locality is not considered to be a problem for the
> 	  Validation Bitmap. Since memory is not invalidated upon free,
> 	  the data structure will become read-mostly over time.
> 
> Final Notes
> -----------
> 
> This proposal does not introduce requirements about the firmware that
> has to be used to run Intel-TDX or AMD-SNP guests. It works with UEFI
> and non-UEFI firmwares, or with no firmware at all. This is important
> for use-cases like Confidential Containers running in VMs, which often
> use a very small firmware (or no firmware at all) for reducing boot
> times.
> 

Although most probably not what people want to have, but I'd just like 
to mention something that might be possible. It essentially hotplugs 
memory during boot what has been suggested here already ...

1. Start the VM with small memory (e.g., 256MiB)
2. Let the firmware validate all boot memory
3. Use virtio-mem to expose additional memory to the VM

As the VM boots up, virtio-mem will add the requested amount of memory 
to the guest. While it gets added, it will get validated and exposed to 
the page allocator.

kexec might need some thought if we end up invalidating parts of our 
validated boot memory (I assume that will happen when sharing memory). 
We would have to express these semantics in the e820 map we forward to 
out new kernel.

Pretty much all you'd need to do is teach virtio-mem encrypted memory 
semantics. Shouldn't be too hard I guess, but we would have to look into 
the details.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-20  5:17     ` Andi Kleen
  2021-07-20  9:11       ` Joerg Roedel
  2021-07-20 23:09       ` Erdem Aktas
@ 2021-07-22 17:31       ` Marc Orr
  2021-07-26 18:55         ` Joerg Roedel
  2 siblings, 1 reply; 48+ messages in thread
From: Marc Orr @ 2021-07-22 17:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Erdem Aktas, Andy Lutomirski, Joerg Roedel, David Rientjes,
	Borislav Petkov, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 19, 2021 at 10:17 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 7/19/2021 6:51 PM, Erdem Aktas wrote:
> >
> > > There's one exception to this, which is the previous memory view in
> > > crash kernels. But that's an relatively obscure case and there might be
> > > other solutions for this.
> >
> > I think this is an important angle. It might cause reliability issues.
> > if kexec kernel does not know which page is shared or private, it can
> > use a previously shared page as a code page which will not work. It is
> > also a security concern. Hosts can always cause crashes which forces
> > guests to do kexec for crash dump. If the kexec kernel does not know
> > which pages are validated before, it might be compromised with page
> > replay attacks.
>
> First I suspect for crash it's not a real security problem if a
> malicious hypervisor would inject zeroed pages. That means actual strong
> checks against revalidation/reaccept are not needed. That still leaves
> the issue of triggering an exception when the memory is not there. TDX
> has an option to trigger a #VE in this case, but we will actually force
> disable it to avoid #VE in the system call gap. But the standard crash
> dump tools already know how to parse mem_map to distinguish different
> types of pages. So they already would be able to do that. We just need
> some kind of safety mechanism to prevent any crashes, but that should be
> doable. Actually I'm not sure they're really needed because that's a
> root operation.

I'm not as familiar with TDX as SNP. So I'm not sure that I follow the
exact security threat that Erdem was alluding to. That being said, I
do want to chime in with the following: I think we should not assume
that the guest getting into a certain state is "probably" not a real
security issue. IMHO, we need to be completely certain that guest data
cannot be compromised if we're going to remove the requirement that
guest memory only be validated once in a certain state (e.g., from
within a crash kernel). Perhaps it is the case that we're certain that
guest data cannot be compromised from within a crash kernel -- but
it's not what I read in the email exchange.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
                   ` (4 preceding siblings ...)
  2021-07-22 15:57 ` David Hildenbrand
@ 2021-07-22 19:51 ` Kirill A. Shutemov
  2021-07-23 15:23   ` Mike Rapoport
                     ` (2 more replies)
  2021-07-23 11:04 ` Varad Gautam
  6 siblings, 3 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-22 19:51 UTC (permalink / raw)
  To: Joerg Roedel, Andi Kleen
  Cc: David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Brijesh Singh, Tom Lendacky, Jon Grimm,
	Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
	Kaplan, David, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco

On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> Hi,
> 
> I'd like to get some movement again into the discussion around how to
> implement runtime memory validation for confidential guests and wrote up
> some thoughts on it.
> Below are the results in form of a proposal I put together. Please let
> me know your thoughts on it and whether it fits everyones requirements.

Okay, below is my first take on the topic.

Please don't mind code structuring, naming, etc. It's early PoC and aimed
to test the approach.

I ended up combing your idea with bitmap with PageOffline(): early boot
code uses bitmap, but on page allocator init I mark unaccepted pages with
PageOffline(). This way page allocator need to touch the bitmap only when
it steps on PageOffline() which shouldn't be often once things settle
after boot.

One bit in the bitmap represents 2M region. Any unaligned chunks gets
accepted when we construct the bitmap. This way one 4K page can represent
64 GiB of physical address space.

I still don't touch decompresser code. This is something I'll look into
next. Given the density of the bitmap. It should be enough to have a
single-page bitmap allocated in decompresser.

Any feedback is welcome.

diff --git a/arch/x86/include/asm/e820/types.h b/arch/x86/include/asm/e820/types.h
index 314f75d886d0..98f281752a30 100644
--- a/arch/x86/include/asm/e820/types.h
+++ b/arch/x86/include/asm/e820/types.h
@@ -28,6 +28,8 @@ enum e820_type {
 	 */
 	E820_TYPE_PRAM		= 12,
 
+	E820_TYPE_UNACCEPTED	= 13, /* XXX: is there a standardized type ? */
+
 	/*
 	 * Special-purpose memory is indicated to the system via the
 	 * EFI_MEMORY_SP attribute. Define an e820 translation of this
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..21972e9159fe 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -71,6 +71,11 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 extern bool __virt_addr_valid(unsigned long kaddr);
 #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
 
+void mark_unaccepted(phys_addr_t start, phys_addr_t end);
+void accept_pages(phys_addr_t start, phys_addr_t end);
+
+void maybe_set_page_offline(struct page *page, unsigned int order);
+void clear_page_offline(struct page *page, unsigned int order);
 #endif	/* __ASSEMBLY__ */
 
 #include <asm-generic/memory_model.h>
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index bc0657f0deed..4cd80d107a06 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -196,6 +196,7 @@ static void __init e820_print_type(enum e820_type type)
 	case E820_TYPE_UNUSABLE:	pr_cont("unusable");			break;
 	case E820_TYPE_PMEM:		/* Fall through: */
 	case E820_TYPE_PRAM:		pr_cont("persistent (type %u)", type);	break;
+	case E820_TYPE_UNACCEPTED:	pr_cont("unaccepted");			break;
 	default:			pr_cont("type %u", type);		break;
 	}
 }
@@ -763,7 +764,9 @@ void __init e820__register_nosave_regions(unsigned long limit_pfn)
 
 		pfn = PFN_DOWN(entry->addr + entry->size);
 
-		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
+		if (entry->type != E820_TYPE_RAM &&
+		    entry->type != E820_TYPE_RESERVED_KERN &&
+		    entry->type != E820_TYPE_UNACCEPTED)
 			register_nosave_region(PFN_UP(entry->addr), pfn);
 
 		if (pfn >= limit_pfn)
@@ -864,7 +867,8 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 
 unsigned long __init e820__end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
+	return max(e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM),
+		   e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_UNACCEPTED));
 }
 
 unsigned long __init e820__end_of_low_ram_pfn(void)
@@ -1064,6 +1068,7 @@ static const char *__init e820_type_to_string(struct e820_entry *entry)
 	case E820_TYPE_PMEM:		return "Persistent Memory";
 	case E820_TYPE_RESERVED:	return "Reserved";
 	case E820_TYPE_SOFT_RESERVED:	return "Soft Reserved";
+	case E820_TYPE_UNACCEPTED:	return "Unaccepted Memory";
 	default:			return "Unknown E820 type";
 	}
 }
@@ -1072,6 +1077,7 @@ static unsigned long __init e820_type_to_iomem_type(struct e820_entry *entry)
 {
 	switch (entry->type) {
 	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
+	case E820_TYPE_UNACCEPTED:	/* Fall-through: */
 	case E820_TYPE_RAM:		return IORESOURCE_SYSTEM_RAM;
 	case E820_TYPE_ACPI:		/* Fall-through: */
 	case E820_TYPE_NVS:		/* Fall-through: */
@@ -1095,6 +1101,7 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
 	case E820_TYPE_SOFT_RESERVED:	return IORES_DESC_SOFT_RESERVED;
 	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
 	case E820_TYPE_RAM:		/* Fall-through: */
+	case E820_TYPE_UNACCEPTED:	/* Fall-through: */
 	case E820_TYPE_UNUSABLE:	/* Fall-through: */
 	default:			return IORES_DESC_NONE;
 	}
@@ -1118,6 +1125,7 @@ static bool __init do_mark_busy(enum e820_type type, struct resource *res)
 		return false;
 	case E820_TYPE_RESERVED_KERN:
 	case E820_TYPE_RAM:
+	case E820_TYPE_UNACCEPTED:
 	case E820_TYPE_ACPI:
 	case E820_TYPE_NVS:
 	case E820_TYPE_UNUSABLE:
@@ -1220,7 +1228,8 @@ void __init e820__reserve_resources_late(void)
 		struct e820_entry *entry = &e820_table->entries[i];
 		u64 start, end;
 
-		if (entry->type != E820_TYPE_RAM)
+		if (entry->type != E820_TYPE_RAM &&
+		    entry->type != E820_TYPE_UNACCEPTED)
 			continue;
 
 		start = entry->addr + entry->size;
@@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
 		if (entry->type == E820_TYPE_SOFT_RESERVED)
 			memblock_reserve(entry->addr, entry->size);
 
-		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
+		if (entry->type != E820_TYPE_RAM &&
+		    entry->type != E820_TYPE_RESERVED_KERN &&
+		    entry->type != E820_TYPE_UNACCEPTED)
 			continue;
 
+		if (entry->type == E820_TYPE_UNACCEPTED)
+			mark_unaccepted(entry->addr, end);
+
 		memblock_add(entry->addr, entry->size);
 	}
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 72920af0b3c0..db9d1bcac9ed 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -944,6 +944,8 @@ void __init setup_arch(char **cmdline_p)
 	if (movable_node_is_enabled())
 		memblock_set_bottom_up(true);
 #endif
+	/* TODO: make conditional */
+	memblock_set_bottom_up(true);
 
 	x86_report_nx();
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e527d829e1ed..582e398f953a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -463,7 +463,9 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end,
 			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
 			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
-					     E820_TYPE_RESERVED_KERN))
+					     E820_TYPE_RESERVED_KERN) &&
+			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
+					     E820_TYPE_UNACCEPTED))
 				set_pte_init(pte, __pte(0), init);
 			continue;
 		}
@@ -518,7 +520,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
 			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
 			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
-					     E820_TYPE_RESERVED_KERN))
+					     E820_TYPE_RESERVED_KERN) &&
+			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
+					     E820_TYPE_UNACCEPTED))
 				set_pmd_init(pmd, __pmd(0), init);
 			continue;
 		}
@@ -606,7 +610,9 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
 			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
-					     E820_TYPE_RESERVED_KERN))
+					     E820_TYPE_RESERVED_KERN) &&
+			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
+					     E820_TYPE_UNACCEPTED))
 				set_pud_init(pud, __pud(0), init);
 			continue;
 		}
@@ -697,7 +703,9 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
 			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
-					     E820_TYPE_RESERVED_KERN))
+					     E820_TYPE_RESERVED_KERN) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_UNACCEPTED))
 				set_p4d_init(p4d, __p4d(0), init);
 			continue;
 		}
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index da94fc2e9b56..f51bb39963f2 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -44,3 +44,86 @@ int arch_has_restricted_virtio_memory_access(void)
 }
 EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
 
+/* TODO: make dynamic. Enough for 64GiB */
+static DECLARE_BITMAP(unaccepted_memory, 32768);
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+#define PMD_NR (1 << PMD_ORDER)
+
+void mark_unaccepted(phys_addr_t start, phys_addr_t end)
+{
+	unsigned int npages;
+
+	if (start & ~PMD_MASK) {
+		npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
+		tdx_hcall_gpa_intent(start, npages, TDX_MAP_PRIVATE);
+		start = round_up(start, PMD_SIZE);
+	}
+
+	if (end & ~PMD_MASK) {
+		npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
+		end = round_down(end, PMD_SIZE);
+		tdx_hcall_gpa_intent(end, npages, TDX_MAP_PRIVATE);
+	}
+
+	npages = (end - start) / PMD_SIZE;
+	spin_lock(&unaccepted_memory_lock);
+	bitmap_set(unaccepted_memory, start / PMD_SIZE, npages);
+	spin_unlock(&unaccepted_memory_lock);
+}
+
+static void __accept_pages(phys_addr_t start, phys_addr_t end)
+{
+	unsigned int rs, re;
+
+	bitmap_for_each_set_region(unaccepted_memory, rs, re,
+				   start / PMD_SIZE, end / PMD_SIZE) {
+		tdx_hcall_gpa_intent(rs * PMD_SIZE, (re - rs) * PMD_NR,
+				     TDX_MAP_PRIVATE);
+
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+}
+
+void accept_pages(phys_addr_t start, phys_addr_t end)
+{
+	spin_lock(&unaccepted_memory_lock);
+	__accept_pages(start, end);
+	spin_unlock(&unaccepted_memory_lock);
+}
+
+void maybe_set_page_offline(struct page *page, unsigned int order)
+{
+	phys_addr_t addr = page_to_phys(page);
+	bool unaccepted = true;
+	unsigned int i;
+
+	spin_lock(&unaccepted_memory_lock);
+	if (order < PMD_ORDER) {
+		BUG_ON(test_bit(addr / PMD_SIZE, unaccepted_memory));
+		goto out;
+	}
+
+	for (i = 0; i < (1 << (order - PMD_ORDER)); i++) {
+		if (!test_bit(addr / PMD_SIZE + i, unaccepted_memory)) {
+			unaccepted = false;
+			break;
+		}
+	}
+
+	if (unaccepted)
+		__SetPageOffline(page);
+	else
+		__accept_pages(addr, addr + (PAGE_SIZE << order));
+out:
+	spin_unlock(&unaccepted_memory_lock);
+}
+
+void clear_page_offline(struct page *page, unsigned int order)
+{
+	phys_addr_t addr = page_to_phys(page);
+	accept_pages(addr, addr + (PAGE_SIZE << order));
+	__ClearPageOffline(page);
+}
+
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 4b7ee3fa9224..86e85e267ee3 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -737,6 +737,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index f14c4ff5839f..dc3aebd493b3 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -504,6 +504,9 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			e820_type = E820_TYPE_UNACCEPTED;
+			break;
 		default:
 			continue;
 		}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 6b5d36babfcc..d43cc872b582 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..f7423ad0a692 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
+	accept_pages(base, base + size);
 	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aaa1655cf682..0356329b1ea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -857,6 +857,9 @@ static inline bool page_is_buddy(struct page *page, struct page *buddy,
 	if (buddy_order(buddy) != order)
 		return false;
 
+	if (PageOffline(buddy) || PageOffline(page))
+		return false;
+
 	/*
 	 * zone check is done late to avoid uselessly calculating
 	 * zone/node ids for pages that could never merge.
@@ -959,6 +962,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	if (page_reported(page))
 		__ClearPageReported(page);
 
+	if (PageOffline(page))
+		clear_page_offline(page, order);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -1123,7 +1129,8 @@ static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageOffline(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1666,6 +1673,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
+
+	maybe_set_page_offline(page, order);
 	__free_pages_core(page, order);
 }
 
@@ -1757,10 +1766,12 @@ static void __init deferred_free_range(unsigned long pfn,
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		maybe_set_page_offline(page, pageblock_order);
 		__free_pages_core(page, pageblock_order);
 		return;
 	}
 
+	accept_pages(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
                   ` (5 preceding siblings ...)
  2021-07-22 19:51 ` Kirill A. Shutemov
@ 2021-07-23 11:04 ` Varad Gautam
  2021-07-23 14:34   ` Kaplan, David
  6 siblings, 1 reply; 48+ messages in thread
From: Varad Gautam @ 2021-07-23 11:04 UTC (permalink / raw)
  To: Joerg Roedel, David Rientjes, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Andi Kleen, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Dario Faggioli
  Cc: x86, linux-mm, linux-coco

On 7/19/21 2:58 PM, Joerg Roedel wrote:
> Hi,
> 
> I'd like to get some movement again into the discussion around how to
> implement runtime memory validation for confidential guests and wrote up
> some thoughts on it.
> Below are the results in form of a proposal I put together. Please let
> me know your thoughts on it and whether it fits everyones requirements.
> 
> Thanks,
> 
> 	Joerg
> 
> Proposal for Runtime Memory Validation in Secure Guests on x86
> ==============================================================
> 
> This proposal describes a method and protocol for runtime validation of
> memory in virtualization guests running with Intel Trusted Domain
> Extensions (Intel-TDX) or AMD Secure Nested Paging (AMD-SNP).
> 
> AMD-SNP and Intel-TDX use different terms to discuss memory page states.
> In AMD-SNP memory has to be 'validated' while in Intel-TDX is will be
> 'accepted'. This document uses the term 'validated' for both.
> 
> Problem Statement
> -----------------
> 
> Virtualization guests which run with AMD-SNP or Intel-TDX need to
> validate their memory before using it. The validation assigns a hardware
> state to each page which allows the guest to detect when the hypervisor
> tries to maliciously access or remap a guest-private page. The guest can
> only access validated pages.
> 
> There are three ways the guest memory can be validated:
> 
> 	I.   The firmware validates all of guest memory at boot time. This
> 	     is the simplest method which requires the least changes to
> 	     the Linux kernel. But this method is also very slow and
> 	     causes unwanted delays in the boot process, as verification
> 	     can take several seconds (depending on guest memory size).
> 
> 	II.  The firmware only validates its own memory and memory
> 	     validation happens as the memory is used. This significantly
> 	     improves the boot time, but needs more intrusive changes to
> 	     the Linux kernel and its boot process.
> 
> 
> 	III. Approach I. and II. can be combined. The firmware only
> 	     validates the first X MB/GB of guest memory and the rest is
> 	     validated on-demand.
> 
> For method II. and III. the guest needs to track which pages have
> already been validated to detect hypervisor attacks. This information
> needs to be carried through the whole boot process.
> 

The need for tracking validity within the guest can be eliminated if:
- the guest has a trusted communication channel with the security
  processor (PSP in the SNP case), and
- the security processor has access to the validation state (RMP table for
  SNP)

The guest kernel (linux or non-linux) can then just ask the security
processor for this information when needed, provided the communication ABI
exists.

I am not familiar with TDX specifics, but for SNP [1], I see that the PSP
firmware is able to dump the page validation state along with some other
information into a per-page metadata entry on the SNP_PAGE_SWAP_OUT ABI
call. This leads me to conclude that the PSP has access to the RMP table,
in which case it can probably be made to export the RMP state for a given
guest in a cleaner layout (eg, a guest 'GET_VALIDATION_TABLE' call)?

[1] https://www.amd.com/system/files/TechDocs/56860.pdf

Regards,
Varad

> This poses challenges on the Linux boot process, as there is currently
> no way to forward information about validated memory up the boot chain.
> This proposal tries to describe a way to solve these challenges.
> 
> Memory Validation through the Boot Process and in the Running System
> --------------------------------------------------------------------
> 
> The memory is validated throughout the boot process as described below.
> These steps assume a firmware is present, but this proposal does not
> strictly require a firmware. The tasks done be the firmware can also be
> done by the hypervisor before starting the guest. The steps are:
> 
> 	1. The firmware validates all memory which will not be owned by
> 	   the boot loader or the OS.
> 
> 	2. The firmware also validates the first X MB of memory, just
> 	   enough to run a boot loader and to load the compressed Linux
> 	   kernel image. X is not expected to be very large, 64 or 128
> 	   MB should be enough. This pre-validation should not cause
> 	   significant delays in the boot process.
> 
> 	3. The validated memory is marked E820-Usable in struct
> 	   boot_params for the Linux decompressor. The rest of the
> 	   memory is also passed to Linux via new special E820 entries
> 	   which mark the memory as Usable-but-Invalid.
> 
> 	4. When the Linux decompressor takes over control, it evaluates
> 	   the E820 table and calculates to total amount of memory
> 	   available to Linux (valid and invalid memory).
> 
> 	   The decompressor allocates a physically contiguous data
> 	   structure at a random memory location which is big enough to
> 	   hold the the validation states of all 4kb pages available to
> 	   the guest. This data structure will be called the Validation
> 	   Bitmap through the rest of this document. The Validation
> 	   Bitmap is indexed by page frame numbers. 
> 
> 	   It still needs to be determined how many bits are required
> 	   per page. This depends on the necessity to track validation
> 	   page-sizes. Two bits per page are enough to track the 3
> 	   page-sizes currently available on the x86 architecture.
> 
> 	   The decompressor initializes the Validation Bitmap by first
> 	   validating its backing memory and then updating it with the
> 	   information from the E820 table. It will also update the
> 	   table if it changes the state of pages from invalid to valid
> 	   (and vice versa, e.g. for mapping a GHCB page).
> 
> 	5. The 'struct boot_params' is extended to carry the location
> 	   and size of the Validation Bitmap to the extracted kernel
> 	   image.
> 	   In fact, since the decompressor already receives a 'struct
> 	   boot_params', it will check if it carries a Validation
> 	   Bitmap. If it does, the decompressor uses the existing one
> 	   instead of allocating a new one.
> 
> 	6. When the extracted kernel image takes over control, it will
> 	   make sure the Validation Bitmap is up to date when memory
> 	   needs to be validated.
> 
> 	7. When set up, the memblock and page allocators have to check
> 	   whether the memory they return is already validated, and
> 	   validate it if not.
> 
> 	   This should happen after the memory is allocated and all
> 	   allocator-locks are dropped, but before the memory is
> 	   returned to the caller. This way the access to the
> 	   validation bitmap can be implemented without locking and only
> 	   using atomic instructions.
> 
> 	   Under no circumstances the Linux kernel is allowed to
> 	   validate a page more than once. Doing this might create
> 	   attack vectors for the Hypervisor towards the guest.
> 
> 	8. When memory is returned to the memblock or page allocators,
> 	   it is _not_ invalidated. In fact, all memory which is freed
> 	   need to be valid. If it was marked invalid in the meantime
> 	   (e.g. if it the memory was used for DMA buffers), the code
> 	   owning the memory needs to validate it again before freeing
> 	   it.
> 
> 	   The benefit of doing memory validation at allocation time is
> 	   that it keeps the exception handler for invalid memory
> 	   simple, because no exceptions of this kind are expected under
> 	   normal operation.
> 
> The Validation Bitmap
> ---------------------
> 
> This document proposes the use of a Validation Bitmap to store the
> validation state of guest pages. This section discusses the benefits of
> this approach.
> 
> The Linux kernel already has an array to store various state for each
> memory page in the system: The struct page array. While this would be a
> natural place to also store page validation information, the Validation
> Bitmap is chosen because having the information separated has some clear
> benefits:
> 
> 	- The Validation Bitmap is allocated in the Linux decompressor
> 	  and already available long before the struct page array is
> 	  initialized.
> 
> 	- Since it is a simple in-memory data structure which is
> 	  physically contiguous, it can be passed along through the
> 	  various stages of the boot process.
> 
> 	- It can even be passed to a new kernel booted via kexec/kdump,
> 	  making it trivial to enable these features for AMD-SNP and
> 	  Intel-TDX.
> 
> 	- When memory validation happens in the memblock and page
> 	  allocators, there is no need for locking when making changes
> 	  to the Validation Bitmap, because:
> 	  
> 	    - Nobody will try to concurrently access the same bits, as
> 	      the code-path doing the validation is the only owner of
> 	      the memory.
> 
> 	    - Updates can happen via atomic cmpxchg instructions
> 	      when multiple bits are used per page. If only one bit is
> 	      needed, atomic bit manipulation instructions will suffice.
> 
> 	- NUMA-locality is not considered to be a problem for the
> 	  Validation Bitmap. Since memory is not invalidated upon free,
> 	  the data structure will become read-mostly over time.
> 
> Final Notes
> -----------
> 
> This proposal does not introduce requirements about the firmware that
> has to be used to run Intel-TDX or AMD-SNP guests. It works with UEFI
> and non-UEFI firmwares, or with no firmware at all. This is important
> for use-cases like Confidential Containers running in VMs, which often
> use a very small firmware (or no firmware at all) for reducing boot
> times.
> 

-- 
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany

HRB 36809, AG Nürnberg
Geschäftsführer: Felix Imendörffer


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-23 11:04 ` Varad Gautam
@ 2021-07-23 14:34   ` Kaplan, David
  0 siblings, 0 replies; 48+ messages in thread
From: Kaplan, David @ 2021-07-23 14:34 UTC (permalink / raw)
  To: Varad Gautam, Joerg Roedel, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Andi Kleen, Singh, Brijesh,
	Lendacky, Thomas, Grimm, Jon, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Dario Faggioli
  Cc: x86, linux-mm, linux-coco

[AMD Official Use Only]



> -----Original Message-----
> From: Varad Gautam <varad.gautam@suse.com>
> Sent: Friday, July 23, 2021 6:04 AM
> To: Joerg Roedel <jroedel@suse.de>; David Rientjes
> <rientjes@google.com>; Borislav Petkov <bp@alien8.de>; Andy Lutomirski
> <luto@kernel.org>; Sean Christopherson <seanjc@google.com>; Andrew
> Morton <akpm@linux-foundation.org>; Vlastimil Babka <vbabka@suse.cz>;
> Kirill A. Shutemov <kirill.shutemov@linux.intel.com>; Andi Kleen
> <ak@linux.intel.com>; Singh, Brijesh <brijesh.singh@amd.com>; Lendacky,
> Thomas <Thomas.Lendacky@amd.com>; Grimm, Jon
> <Jon.Grimm@amd.com>; Thomas Gleixner <tglx@linutronix.de>; Peter
> Zijlstra <peterz@infradead.org>; Paolo Bonzini <pbonzini@redhat.com>;
> Ingo Molnar <mingo@redhat.com>; Kaplan, David
> <David.Kaplan@amd.com>; Dario Faggioli <dfaggioli@suse.com>
> Cc: x86@kernel.org; linux-mm@kvack.org; linux-coco@lists.linux.dev
> Subject: Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
> 
> [CAUTION: External Email]
> 
> On 7/19/21 2:58 PM, Joerg Roedel wrote:
> > Hi,
> >
> > I'd like to get some movement again into the discussion around how to
> > implement runtime memory validation for confidential guests and wrote
> > up some thoughts on it.
> > Below are the results in form of a proposal I put together. Please let
> > me know your thoughts on it and whether it fits everyones requirements.
> >
> > Thanks,
> >
> >       Joerg
> >
> > Proposal for Runtime Memory Validation in Secure Guests on x86
> >
> ==========================================================
> ====
> >
> > This proposal describes a method and protocol for runtime validation
> > of memory in virtualization guests running with Intel Trusted Domain
> > Extensions (Intel-TDX) or AMD Secure Nested Paging (AMD-SNP).
> >
> > AMD-SNP and Intel-TDX use different terms to discuss memory page
> states.
> > In AMD-SNP memory has to be 'validated' while in Intel-TDX is will be
> > 'accepted'. This document uses the term 'validated' for both.
> >
> > Problem Statement
> > -----------------
> >
> > Virtualization guests which run with AMD-SNP or Intel-TDX need to
> > validate their memory before using it. The validation assigns a
> > hardware state to each page which allows the guest to detect when the
> > hypervisor tries to maliciously access or remap a guest-private page.
> > The guest can only access validated pages.
> >
> > There are three ways the guest memory can be validated:
> >
> >       I.   The firmware validates all of guest memory at boot time. This
> >            is the simplest method which requires the least changes to
> >            the Linux kernel. But this method is also very slow and
> >            causes unwanted delays in the boot process, as verification
> >            can take several seconds (depending on guest memory size).
> >
> >       II.  The firmware only validates its own memory and memory
> >            validation happens as the memory is used. This significantly
> >            improves the boot time, but needs more intrusive changes to
> >            the Linux kernel and its boot process.
> >
> >
> >       III. Approach I. and II. can be combined. The firmware only
> >            validates the first X MB/GB of guest memory and the rest is
> >            validated on-demand.
> >
> > For method II. and III. the guest needs to track which pages have
> > already been validated to detect hypervisor attacks. This information
> > needs to be carried through the whole boot process.
> >
> 
> The need for tracking validity within the guest can be eliminated if:
> - the guest has a trusted communication channel with the security
>   processor (PSP in the SNP case), and
> - the security processor has access to the validation state (RMP table for
>   SNP)
> 
> The guest kernel (linux or non-linux) can then just ask the security processor
> for this information when needed, provided the communication ABI exists.
> 
> I am not familiar with TDX specifics, but for SNP [1], I see that the PSP
> firmware is able to dump the page validation state along with some other
> information into a per-page metadata entry on the SNP_PAGE_SWAP_OUT
> ABI call. This leads me to conclude that the PSP has access to the RMP table,
> in which case it can probably be made to export the RMP state for a given
> guest in a cleaner layout (eg, a guest 'GET_VALIDATION_TABLE' call)?
> 

This is not supported currently in the SNP ABI and I would not recommend this path.  The guest to PSP communication is slow and in order for the PSP to gather this information it would have to scan the entire RMP table which can be gigabytes in size.  So I don't really see this being workable performance-wise, instead I believe the guest needs to track validation status itself in some way.

--David Kaplan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-22 19:51 ` Kirill A. Shutemov
@ 2021-07-23 15:23   ` Mike Rapoport
  2021-07-23 16:29     ` Kirill A. Shutemov
  2021-07-26 19:13   ` Joerg Roedel
  2021-07-26 23:02   ` Erdem Aktas
  2 siblings, 1 reply; 48+ messages in thread
From: Mike Rapoport @ 2021-07-23 15:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Thu, Jul 22, 2021 at 10:51:30PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jul 19, 2021 at 02:58:22PM +0200, Joerg Roedel wrote:
> > Hi,
> > 
> > I'd like to get some movement again into the discussion around how to
> > implement runtime memory validation for confidential guests and wrote up
> > some thoughts on it.
> > Below are the results in form of a proposal I put together. Please let
> > me know your thoughts on it and whether it fits everyones requirements.
> 
> Okay, below is my first take on the topic.
> 
> Please don't mind code structuring, naming, etc. It's early PoC and aimed
> to test the approach.
> 
> I ended up combing your idea with bitmap with PageOffline(): early boot
> code uses bitmap, but on page allocator init I mark unaccepted pages with
> PageOffline(). This way page allocator need to touch the bitmap only when
> it steps on PageOffline() which shouldn't be often once things settle
> after boot.
> 
> One bit in the bitmap represents 2M region. Any unaligned chunks gets
> accepted when we construct the bitmap. This way one 4K page can represent
> 64 GiB of physical address space.
> 
> I still don't touch decompresser code. This is something I'll look into
> next. Given the density of the bitmap. It should be enough to have a
> single-page bitmap allocated in decompresser.
> 
> Any feedback is welcome.
> 
> diff --git a/arch/x86/include/asm/e820/types.h b/arch/x86/include/asm/e820/types.h
> index 314f75d886d0..98f281752a30 100644
> --- a/arch/x86/include/asm/e820/types.h
> +++ b/arch/x86/include/asm/e820/types.h
> @@ -28,6 +28,8 @@ enum e820_type {
>  	 */
>  	E820_TYPE_PRAM		= 12,
>  
> +	E820_TYPE_UNACCEPTED	= 13, /* XXX: is there a standardized type ? */
> +
>  	/*
>  	 * Special-purpose memory is indicated to the system via the
>  	 * EFI_MEMORY_SP attribute. Define an e820 translation of this
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 7555b48803a8..21972e9159fe 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -71,6 +71,11 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
>  extern bool __virt_addr_valid(unsigned long kaddr);
>  #define virt_addr_valid(kaddr)	__virt_addr_valid((unsigned long) (kaddr))
>  
> +void mark_unaccepted(phys_addr_t start, phys_addr_t end);
> +void accept_pages(phys_addr_t start, phys_addr_t end);
> +
> +void maybe_set_page_offline(struct page *page, unsigned int order);
> +void clear_page_offline(struct page *page, unsigned int order);
>  #endif	/* __ASSEMBLY__ */
>  
>  #include <asm-generic/memory_model.h>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index bc0657f0deed..4cd80d107a06 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -196,6 +196,7 @@ static void __init e820_print_type(enum e820_type type)
>  	case E820_TYPE_UNUSABLE:	pr_cont("unusable");			break;
>  	case E820_TYPE_PMEM:		/* Fall through: */
>  	case E820_TYPE_PRAM:		pr_cont("persistent (type %u)", type);	break;
> +	case E820_TYPE_UNACCEPTED:	pr_cont("unaccepted");			break;
>  	default:			pr_cont("type %u", type);		break;
>  	}
>  }
> @@ -763,7 +764,9 @@ void __init e820__register_nosave_regions(unsigned long limit_pfn)
>  
>  		pfn = PFN_DOWN(entry->addr + entry->size);
>  
> -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> +		if (entry->type != E820_TYPE_RAM &&
> +		    entry->type != E820_TYPE_RESERVED_KERN &&
> +		    entry->type != E820_TYPE_UNACCEPTED)
>  			register_nosave_region(PFN_UP(entry->addr), pfn);
>  
>  		if (pfn >= limit_pfn)
> @@ -864,7 +867,8 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
>  
>  unsigned long __init e820__end_of_ram_pfn(void)
>  {
> -	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
> +	return max(e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM),
> +		   e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_UNACCEPTED));
>  }
>  
>  unsigned long __init e820__end_of_low_ram_pfn(void)
> @@ -1064,6 +1068,7 @@ static const char *__init e820_type_to_string(struct e820_entry *entry)
>  	case E820_TYPE_PMEM:		return "Persistent Memory";
>  	case E820_TYPE_RESERVED:	return "Reserved";
>  	case E820_TYPE_SOFT_RESERVED:	return "Soft Reserved";
> +	case E820_TYPE_UNACCEPTED:	return "Unaccepted Memory";
>  	default:			return "Unknown E820 type";
>  	}
>  }
> @@ -1072,6 +1077,7 @@ static unsigned long __init e820_type_to_iomem_type(struct e820_entry *entry)
>  {
>  	switch (entry->type) {
>  	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
> +	case E820_TYPE_UNACCEPTED:	/* Fall-through: */
>  	case E820_TYPE_RAM:		return IORESOURCE_SYSTEM_RAM;
>  	case E820_TYPE_ACPI:		/* Fall-through: */
>  	case E820_TYPE_NVS:		/* Fall-through: */
> @@ -1095,6 +1101,7 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
>  	case E820_TYPE_SOFT_RESERVED:	return IORES_DESC_SOFT_RESERVED;
>  	case E820_TYPE_RESERVED_KERN:	/* Fall-through: */
>  	case E820_TYPE_RAM:		/* Fall-through: */
> +	case E820_TYPE_UNACCEPTED:	/* Fall-through: */
>  	case E820_TYPE_UNUSABLE:	/* Fall-through: */
>  	default:			return IORES_DESC_NONE;
>  	}
> @@ -1118,6 +1125,7 @@ static bool __init do_mark_busy(enum e820_type type, struct resource *res)
>  		return false;
>  	case E820_TYPE_RESERVED_KERN:
>  	case E820_TYPE_RAM:
> +	case E820_TYPE_UNACCEPTED:
>  	case E820_TYPE_ACPI:
>  	case E820_TYPE_NVS:
>  	case E820_TYPE_UNUSABLE:
> @@ -1220,7 +1228,8 @@ void __init e820__reserve_resources_late(void)
>  		struct e820_entry *entry = &e820_table->entries[i];
>  		u64 start, end;
>  
> -		if (entry->type != E820_TYPE_RAM)
> +		if (entry->type != E820_TYPE_RAM &&
> +		    entry->type != E820_TYPE_UNACCEPTED)
>  			continue;
>  
>  		start = entry->addr + entry->size;
> @@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
>  		if (entry->type == E820_TYPE_SOFT_RESERVED)
>  			memblock_reserve(entry->addr, entry->size);
>  
> -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> +		if (entry->type != E820_TYPE_RAM &&
> +		    entry->type != E820_TYPE_RESERVED_KERN &&
> +		    entry->type != E820_TYPE_UNACCEPTED)
>  			continue;

If I understand correctly, you assume that

* E820_TYPE_RAM and E820_TYPE_RESERVED_KERN regions are already accepted by
  firmware/booloader
* E820_TYPE_UNACCEPTED would have been E820_SYSTEM_RAM if we'd disabled
  encryption

What happens with other types? Particularly E820_TYPE_ACPI and
E820_TYPE_NVS that may reside in memory and might have been accepted by
BIOS.

>  
> +		if (entry->type == E820_TYPE_UNACCEPTED)
> +			mark_unaccepted(entry->addr, end);
> +
>  		memblock_add(entry->addr, entry->size);
>  	}
>  
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 72920af0b3c0..db9d1bcac9ed 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -944,6 +944,8 @@ void __init setup_arch(char **cmdline_p)
>  	if (movable_node_is_enabled())
>  		memblock_set_bottom_up(true);
>  #endif
> +	/* TODO: make conditional */
> +	memblock_set_bottom_up(true);
  
If memory is accepted during memblock allocations this should not really
matter.
Bottom up would be preferable if we'd like to reuse as much of already
accepted memory as possible before page allocator is up.

>  	x86_report_nx();
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index e527d829e1ed..582e398f953a 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -463,7 +463,9 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end,
>  			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
>  					     E820_TYPE_RAM) &&
>  			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
> -					     E820_TYPE_RESERVED_KERN))
> +					     E820_TYPE_RESERVED_KERN) &&
> +			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
> +					     E820_TYPE_UNACCEPTED))
>  				set_pte_init(pte, __pte(0), init);
>  			continue;
>  		}
> @@ -518,7 +520,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
>  			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
>  					     E820_TYPE_RAM) &&
>  			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
> -					     E820_TYPE_RESERVED_KERN))
> +					     E820_TYPE_RESERVED_KERN) &&
> +			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
> +					     E820_TYPE_UNACCEPTED))
>  				set_pmd_init(pmd, __pmd(0), init);
>  			continue;
>  		}
> @@ -606,7 +610,9 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
>  			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
>  					     E820_TYPE_RAM) &&
>  			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
> -					     E820_TYPE_RESERVED_KERN))
> +					     E820_TYPE_RESERVED_KERN) &&
> +			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
> +					     E820_TYPE_UNACCEPTED))
>  				set_pud_init(pud, __pud(0), init);
>  			continue;
>  		}
> @@ -697,7 +703,9 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
>  			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
>  					     E820_TYPE_RAM) &&
>  			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
> -					     E820_TYPE_RESERVED_KERN))
> +					     E820_TYPE_RESERVED_KERN) &&
> +			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
> +					     E820_TYPE_UNACCEPTED))
>  				set_p4d_init(p4d, __p4d(0), init);
>  			continue;
>  		}
> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> index da94fc2e9b56..f51bb39963f2 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -44,3 +44,86 @@ int arch_has_restricted_virtio_memory_access(void)
>  }
>  EXPORT_SYMBOL_GPL(arch_has_restricted_virtio_memory_access);
>  
> +/* TODO: make dynamic. Enough for 64GiB */
> +static DECLARE_BITMAP(unaccepted_memory, 32768);
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
> +#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
> +#define PMD_NR (1 << PMD_ORDER)
> +
> +void mark_unaccepted(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned int npages;
> +
> +	if (start & ~PMD_MASK) {
> +		npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
> +		tdx_hcall_gpa_intent(start, npages, TDX_MAP_PRIVATE);
> +		start = round_up(start, PMD_SIZE);
> +	}
> +
> +	if (end & ~PMD_MASK) {
> +		npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
> +		end = round_down(end, PMD_SIZE);
> +		tdx_hcall_gpa_intent(end, npages, TDX_MAP_PRIVATE);
> +	}
> +
> +	npages = (end - start) / PMD_SIZE;
> +	spin_lock(&unaccepted_memory_lock);
> +	bitmap_set(unaccepted_memory, start / PMD_SIZE, npages);
> +	spin_unlock(&unaccepted_memory_lock);
> +}
> +
> +static void __accept_pages(phys_addr_t start, phys_addr_t end)
> +{
> +	unsigned int rs, re;
> +
> +	bitmap_for_each_set_region(unaccepted_memory, rs, re,
> +				   start / PMD_SIZE, end / PMD_SIZE) {
> +		tdx_hcall_gpa_intent(rs * PMD_SIZE, (re - rs) * PMD_NR,
> +				     TDX_MAP_PRIVATE);
> +
> +		bitmap_clear(unaccepted_memory, rs, re - rs);
> +	}
> +}
> +
> +void accept_pages(phys_addr_t start, phys_addr_t end)
> +{
> +	spin_lock(&unaccepted_memory_lock);
> +	__accept_pages(start, end);
> +	spin_unlock(&unaccepted_memory_lock);
> +}
> +
> +void maybe_set_page_offline(struct page *page, unsigned int order)
> +{
> +	phys_addr_t addr = page_to_phys(page);
> +	bool unaccepted = true;
> +	unsigned int i;
> +
> +	spin_lock(&unaccepted_memory_lock);
> +	if (order < PMD_ORDER) {
> +		BUG_ON(test_bit(addr / PMD_SIZE, unaccepted_memory));
> +		goto out;
> +	}
> +
> +	for (i = 0; i < (1 << (order - PMD_ORDER)); i++) {
> +		if (!test_bit(addr / PMD_SIZE + i, unaccepted_memory)) {
> +			unaccepted = false;
> +			break;
> +		}
> +	}
> +
> +	if (unaccepted)
> +		__SetPageOffline(page);
> +	else
> +		__accept_pages(addr, addr + (PAGE_SIZE << order));
> +out:
> +	spin_unlock(&unaccepted_memory_lock);
> +}
> +
> +void clear_page_offline(struct page *page, unsigned int order)
> +{
> +	phys_addr_t addr = page_to_phys(page);
> +	accept_pages(addr, addr + (PAGE_SIZE << order));
> +	__ClearPageOffline(page);
> +}
> +
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 4b7ee3fa9224..86e85e267ee3 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -737,6 +737,7 @@ static __initdata char memory_type_name[][13] = {
>  	"MMIO Port",
>  	"PAL Code",
>  	"Persistent",
> +	"Unaccepted",
>  };
>  
>  char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index f14c4ff5839f..dc3aebd493b3 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -504,6 +504,9 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
>  			e820_type = E820_TYPE_PMEM;
>  			break;
>  
> +		case EFI_UNACCEPTED_MEMORY:
> +			e820_type = E820_TYPE_UNACCEPTED;
> +			break;
>  		default:
>  			continue;
>  		}
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 6b5d36babfcc..d43cc872b582 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -108,7 +108,8 @@ typedef	struct {
>  #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
>  #define EFI_PAL_CODE			13
>  #define EFI_PERSISTENT_MEMORY		14
> -#define EFI_MAX_MEMORY_TYPE		15
> +#define EFI_UNACCEPTED_MEMORY		15
> +#define EFI_MAX_MEMORY_TYPE		16
>  
>  /* Attribute values: */
>  #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index afaefa8fc6ab..f7423ad0a692 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
>  	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
>  		     &base, &end, (void *)_RET_IP_);
>  
> +	accept_pages(base, base + size);

Hmm, I'm not sure memblock_reserve() is the right place to accept pages. It
can be called to reserve memory owned by firmware which not necessarily
would be encrypted. Besides, memblock_reserve() may be called for absent
memory, could be it'll confuse TDX/SEV?

Ideally, the call to accept_pages() should live in
memblock_alloc_range_nid(), but unfortunately there still stale
memblock_find_in_range() + memblock_reserve() pairs in x86 setup code.

>  	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aaa1655cf682..0356329b1ea7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -857,6 +857,9 @@ static inline bool page_is_buddy(struct page *page, struct page *buddy,
>  	if (buddy_order(buddy) != order)
>  		return false;
>  
> +	if (PageOffline(buddy) || PageOffline(page))
> +		return false;
> +
>  	/*
>  	 * zone check is done late to avoid uselessly calculating
>  	 * zone/node ids for pages that could never merge.
> @@ -959,6 +962,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  	if (page_reported(page))
>  		__ClearPageReported(page);
>  
> +	if (PageOffline(page))
> +		clear_page_offline(page, order);
> +
>  	list_del(&page->lru);
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);
> @@ -1123,7 +1129,8 @@ static inline void __free_one_page(struct page *page,
>  static inline bool page_expected_state(struct page *page,
>  					unsigned long check_flags)
>  {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> +	    !PageOffline(page))
>  		return false;
>  
>  	if (unlikely((unsigned long)page->mapping |
> @@ -1666,6 +1673,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
>  {
>  	if (early_page_uninitialised(pfn))
>  		return;
> +
> +	maybe_set_page_offline(page, order);
>  	__free_pages_core(page, order);
>  }
>  
> @@ -1757,10 +1766,12 @@ static void __init deferred_free_range(unsigned long pfn,
>  	if (nr_pages == pageblock_nr_pages &&
>  	    (pfn & (pageblock_nr_pages - 1)) == 0) {
>  		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +		maybe_set_page_offline(page, pageblock_order);
>  		__free_pages_core(page, pageblock_order);
>  		return;
>  	}
>  
> +	accept_pages(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> -- 
>  Kirill A. Shutemov
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-23 15:23   ` Mike Rapoport
@ 2021-07-23 16:29     ` Kirill A. Shutemov
  2021-07-25  9:16       ` Mike Rapoport
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-23 16:29 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Fri, Jul 23, 2021 at 06:23:39PM +0300, Mike Rapoport wrote:
> > @@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
> >  		if (entry->type == E820_TYPE_SOFT_RESERVED)
> >  			memblock_reserve(entry->addr, entry->size);
> >  
> > -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > +		if (entry->type != E820_TYPE_RAM &&
> > +		    entry->type != E820_TYPE_RESERVED_KERN &&
> > +		    entry->type != E820_TYPE_UNACCEPTED)
> >  			continue;
> 
> If I understand correctly, you assume that
> 
> * E820_TYPE_RAM and E820_TYPE_RESERVED_KERN regions are already accepted by
>   firmware/booloader
> * E820_TYPE_UNACCEPTED would have been E820_SYSTEM_RAM if we'd disabled
>   encryption
> 
> What happens with other types? Particularly E820_TYPE_ACPI and
> E820_TYPE_NVS that may reside in memory and might have been accepted by
> BIOS.

Any accessible memory that not marked as UNACCEPTED has to be accepted
before kernel gets control.

> >  
> > +		if (entry->type == E820_TYPE_UNACCEPTED)
> > +			mark_unaccepted(entry->addr, end);
> > +
> >  		memblock_add(entry->addr, entry->size);
> >  	}
> >  
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 72920af0b3c0..db9d1bcac9ed 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -944,6 +944,8 @@ void __init setup_arch(char **cmdline_p)
> >  	if (movable_node_is_enabled())
> >  		memblock_set_bottom_up(true);
> >  #endif
> > +	/* TODO: make conditional */
> > +	memblock_set_bottom_up(true);
>   
> If memory is accepted during memblock allocations this should not really
> matter.
> Bottom up would be preferable if we'd like to reuse as much of already
> accepted memory as possible before page allocator is up.

One of the main reason for this feature is to speed up boot time and
re-usinging preaccepted memory fits the goal.

> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
> >  	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
> >  		     &base, &end, (void *)_RET_IP_);
> >  
> > +	accept_pages(base, base + size);
> 
> Hmm, I'm not sure memblock_reserve() is the right place to accept pages. It
> can be called to reserve memory owned by firmware which not necessarily
> would be encrypted. Besides, memblock_reserve() may be called for absent
> memory, could be it'll confuse TDX/SEV?

Such memory will not be marked as unaccepted and accept_pages() will do
nothing.

> Ideally, the call to accept_pages() should live in
> memblock_alloc_range_nid(), but unfortunately there still stale
> memblock_find_in_range() + memblock_reserve() pairs in x86 setup code.

memblock_reserve() is the root of memory allocation in the early boot and
it is natual place to do the trick. Unless we have a good reason to move
it somewhere I would keep it here.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-23 16:29     ` Kirill A. Shutemov
@ 2021-07-25  9:16       ` Mike Rapoport
  2021-07-25 18:28         ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Rapoport @ 2021-07-25  9:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Fri, Jul 23, 2021 at 07:29:59PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jul 23, 2021 at 06:23:39PM +0300, Mike Rapoport wrote:
> > > @@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
> > >  		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > >  			memblock_reserve(entry->addr, entry->size);
> > >  
> > > -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > +		if (entry->type != E820_TYPE_RAM &&
> > > +		    entry->type != E820_TYPE_RESERVED_KERN &&
> > > +		    entry->type != E820_TYPE_UNACCEPTED)
> > >  			continue;
> > 
> > If I understand correctly, you assume that
> > 
> > * E820_TYPE_RAM and E820_TYPE_RESERVED_KERN regions are already accepted by
> >   firmware/booloader
> > * E820_TYPE_UNACCEPTED would have been E820_SYSTEM_RAM if we'd disabled
> >   encryption
> > 
> > What happens with other types? Particularly E820_TYPE_ACPI and
> > E820_TYPE_NVS that may reside in memory and might have been accepted by
> > BIOS.
> 
> Any accessible memory that not marked as UNACCEPTED has to be accepted
> before kernel gets control.

Hmm, that would mean that everything that runs before the kernel must
maintain precise E820 map. If we use 2M chunk as basic unit for accepting
memory, the firmware must also use the same basic unit. E.g. we can't have
an ACPI table squeezed between E820_TYPE_UNACCEPTED.

Using e820 table would also mean that bootloader must be able to modify
e820 and it also must follow the 2M rule.

I think that using a dedicated data structure would be more robust than
hooking into e820 table.

> > > +		if (entry->type == E820_TYPE_UNACCEPTED)
> > > +			mark_unaccepted(entry->addr, end);
> > > +
> > >  		memblock_add(entry->addr, entry->size);
> > >  	}
> > >  
> > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > index 72920af0b3c0..db9d1bcac9ed 100644
> > > --- a/arch/x86/kernel/setup.c
> > > +++ b/arch/x86/kernel/setup.c
> > > @@ -944,6 +944,8 @@ void __init setup_arch(char **cmdline_p)
> > >  	if (movable_node_is_enabled())
> > >  		memblock_set_bottom_up(true);
> > >  #endif
> > > +	/* TODO: make conditional */
> > > +	memblock_set_bottom_up(true);
> >   
> > If memory is accepted during memblock allocations this should not really
> > matter.
> > Bottom up would be preferable if we'd like to reuse as much of already
> > accepted memory as possible before page allocator is up.
> 
> One of the main reason for this feature is to speed up boot time and
> re-usinging preaccepted memory fits the goal.

Using bottom up also means that early allocations end up in DMA zones,
which probably not a problem for VMs in general, but who knows what path
through devices people would want to use...
 
> > > --- a/mm/memblock.c
> > > +++ b/mm/memblock.c
> > > @@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
> > >  	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
> > >  		     &base, &end, (void *)_RET_IP_);
> > >  
> > > +	accept_pages(base, base + size);
> > 
> > Hmm, I'm not sure memblock_reserve() is the right place to accept pages. It
> > can be called to reserve memory owned by firmware which not necessarily
> > would be encrypted. Besides, memblock_reserve() may be called for absent
> > memory, could be it'll confuse TDX/SEV?
> 
> Such memory will not be marked as unaccepted and accept_pages() will do
> nothing.
> 
> > Ideally, the call to accept_pages() should live in
> > memblock_alloc_range_nid(), but unfortunately there still stale
> > memblock_find_in_range() + memblock_reserve() pairs in x86 setup code.
> 
> memblock_reserve() is the root of memory allocation in the early boot and
> it is natual place to do the trick. Unless we have a good reason to move
> it somewhere I would keep it here.

I think it is better to accept memory that is actually allocated rather
than marked as being used. It'll make it more robust against future changes
in memblock_reserve() callers and in what is accept_pages() in your patch. 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-25  9:16       ` Mike Rapoport
@ 2021-07-25 18:28         ` Kirill A. Shutemov
  2021-07-26 10:00           ` Mike Rapoport
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-25 18:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Sun, Jul 25, 2021 at 12:16:45PM +0300, Mike Rapoport wrote:
> On Fri, Jul 23, 2021 at 07:29:59PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jul 23, 2021 at 06:23:39PM +0300, Mike Rapoport wrote:
> > > > @@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
> > > >  		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > > >  			memblock_reserve(entry->addr, entry->size);
> > > >  
> > > > -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > > +		if (entry->type != E820_TYPE_RAM &&
> > > > +		    entry->type != E820_TYPE_RESERVED_KERN &&
> > > > +		    entry->type != E820_TYPE_UNACCEPTED)
> > > >  			continue;
> > > 
> > > If I understand correctly, you assume that
> > > 
> > > * E820_TYPE_RAM and E820_TYPE_RESERVED_KERN regions are already accepted by
> > >   firmware/booloader
> > > * E820_TYPE_UNACCEPTED would have been E820_SYSTEM_RAM if we'd disabled
> > >   encryption
> > > 
> > > What happens with other types? Particularly E820_TYPE_ACPI and
> > > E820_TYPE_NVS that may reside in memory and might have been accepted by
> > > BIOS.
> > 
> > Any accessible memory that not marked as UNACCEPTED has to be accepted
> > before kernel gets control.
> 
> Hmm, that would mean that everything that runs before the kernel must
> maintain precise E820 map. If we use 2M chunk as basic unit for accepting
> memory, the firmware must also use the same basic unit. E.g. we can't have
> an ACPI table squeezed between E820_TYPE_UNACCEPTED.

No. See mark_unaccepted(). Any chunks that cannot be accepted with 2M, get
accepted upfront, so we will not need to track them.

(I've just realized that mark_unaccepted() is buggy if 'start' and 'end'
are in the same 2M. Will fix.)


> Using e820 table would also mean that bootloader must be able to modify
> e820 and it also must follow the 2M rule.
> 
> I think that using a dedicated data structure would be more robust than
> hooking into e820 table.

Maybe. We can construct the bitmap in the decompresser and translate
EFI_UNACCEPTED_MEMORY to E820_TYPE_RAM. I will look into this.

> > > > +		if (entry->type == E820_TYPE_UNACCEPTED)
> > > > +			mark_unaccepted(entry->addr, end);
> > > > +
> > > >  		memblock_add(entry->addr, entry->size);
> > > >  	}
> > > >  
> > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > > index 72920af0b3c0..db9d1bcac9ed 100644
> > > > --- a/arch/x86/kernel/setup.c
> > > > +++ b/arch/x86/kernel/setup.c
> > > > @@ -944,6 +944,8 @@ void __init setup_arch(char **cmdline_p)
> > > >  	if (movable_node_is_enabled())
> > > >  		memblock_set_bottom_up(true);
> > > >  #endif
> > > > +	/* TODO: make conditional */
> > > > +	memblock_set_bottom_up(true);
> > >   
> > > If memory is accepted during memblock allocations this should not really
> > > matter.
> > > Bottom up would be preferable if we'd like to reuse as much of already
> > > accepted memory as possible before page allocator is up.
> > 
> > One of the main reason for this feature is to speed up boot time and
> > re-usinging preaccepted memory fits the goal.
> 
> Using bottom up also means that early allocations end up in DMA zones,
> which probably not a problem for VMs in general, but who knows what path
> through devices people would want to use...

Good point. Maybe we can drop it. Will see based on performance
evaluation.
>
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
> > > >  	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
> > > >  		     &base, &end, (void *)_RET_IP_);
> > > >  
> > > > +	accept_pages(base, base + size);
> > > 
> > > Hmm, I'm not sure memblock_reserve() is the right place to accept pages. It
> > > can be called to reserve memory owned by firmware which not necessarily
> > > would be encrypted. Besides, memblock_reserve() may be called for absent
> > > memory, could be it'll confuse TDX/SEV?
> > 
> > Such memory will not be marked as unaccepted and accept_pages() will do
> > nothing.
> > 
> > > Ideally, the call to accept_pages() should live in
> > > memblock_alloc_range_nid(), but unfortunately there still stale
> > > memblock_find_in_range() + memblock_reserve() pairs in x86 setup code.
> > 
> > memblock_reserve() is the root of memory allocation in the early boot and
> > it is natual place to do the trick. Unless we have a good reason to move
> > it somewhere I would keep it here.
> 
> I think it is better to accept memory that is actually allocated rather
> than marked as being used. It'll make it more robust against future changes
> in memblock_reserve() callers and in what is accept_pages() in your patch. 

I disagree.

If we move accept_pages() up to callers we will make less robust: any new
user of memblock_reserve() has to consider if accept_pages() is needed and
like would ignore it since it's not essential for any non-TDX/non-SEV use
case.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-25 18:28         ` Kirill A. Shutemov
@ 2021-07-26 10:00           ` Mike Rapoport
  2021-07-26 11:53             ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Rapoport @ 2021-07-26 10:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Sun, Jul 25, 2021 at 09:28:28PM +0300, Kirill A. Shutemov wrote:
> On Sun, Jul 25, 2021 at 12:16:45PM +0300, Mike Rapoport wrote:
> > On Fri, Jul 23, 2021 at 07:29:59PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Jul 23, 2021 at 06:23:39PM +0300, Mike Rapoport wrote:
> > > > > @@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
> > > > >  		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > > > >  			memblock_reserve(entry->addr, entry->size);
> > > > >  
> > > > > -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > > > +		if (entry->type != E820_TYPE_RAM &&
> > > > > +		    entry->type != E820_TYPE_RESERVED_KERN &&
> > > > > +		    entry->type != E820_TYPE_UNACCEPTED)
> > > > >  			continue;
> > > > 
> > > > If I understand correctly, you assume that
> > > > 
> > > > * E820_TYPE_RAM and E820_TYPE_RESERVED_KERN regions are already accepted by
> > > >   firmware/booloader
> > > > * E820_TYPE_UNACCEPTED would have been E820_SYSTEM_RAM if we'd disabled
> > > >   encryption
> > > > 
> > > > What happens with other types? Particularly E820_TYPE_ACPI and
> > > > E820_TYPE_NVS that may reside in memory and might have been accepted by
> > > > BIOS.
> > > 
> > > Any accessible memory that not marked as UNACCEPTED has to be accepted
> > > before kernel gets control.
> > 
> > Hmm, that would mean that everything that runs before the kernel must
> > maintain precise E820 map. If we use 2M chunk as basic unit for accepting
> > memory, the firmware must also use the same basic unit. E.g. we can't have
> > an ACPI table squeezed between E820_TYPE_UNACCEPTED.
> 
> No. See mark_unaccepted(). Any chunks that cannot be accepted with 2M, get
> accepted upfront, so we will not need to track them.

What will happen with the following E820 table:

0x400000 - 0x401000 - ACPI (accepted by BIOS)
0x401000 - 0x408000 - UNACCEPTED
0x408000 - 0x409000 - ACPI (accepted by BIOS)

> (I've just realized that mark_unaccepted() is buggy if 'start' and 'end'
> are in the same 2M. Will fix.)
> 
 
> > > > > --- a/mm/memblock.c
> > > > > +++ b/mm/memblock.c
> > > > > @@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
> > > > >  	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
> > > > >  		     &base, &end, (void *)_RET_IP_);
> > > > >  
> > > > > +	accept_pages(base, base + size);
> > > > 
> > > > Hmm, I'm not sure memblock_reserve() is the right place to accept pages. It
> > > > can be called to reserve memory owned by firmware which not necessarily
> > > > would be encrypted. Besides, memblock_reserve() may be called for absent
> > > > memory, could be it'll confuse TDX/SEV?
> > > 
> > > Such memory will not be marked as unaccepted and accept_pages() will do
> > > nothing.
> > > 
> > > > Ideally, the call to accept_pages() should live in
> > > > memblock_alloc_range_nid(), but unfortunately there still stale
> > > > memblock_find_in_range() + memblock_reserve() pairs in x86 setup code.
> > > 
> > > memblock_reserve() is the root of memory allocation in the early boot and
> > > it is natual place to do the trick. Unless we have a good reason to move
> > > it somewhere I would keep it here.
>
> > I think it is better to accept memory that is actually allocated rather
> > than marked as being used. It'll make it more robust against future changes
> > in memblock_reserve() callers and in what is accept_pages() in your patch. 
> 
> I disagree.
> 
> If we move accept_pages() up to callers we will make less robust: any new
> user of memblock_reserve() has to consider if accept_pages() is needed and
> like would ignore it since it's not essential for any non-TDX/non-SEV use
> case.

I do not suggest to move accept_pages() to all the callers of
memblock_reserve(). I suggest to replace memblock_find_in_range() +
memblock_reserve() pairs with an appropriate memblock_alloc call, make
memblock_find_in_range() static and put accept_pages() there.

This essentially makes memblock_find_in_range() the root of early memory
*allocations* while memblock_reserve() would be only used to mark the
memory that is already used before the allocations can start.

Then we only deal with acceptance of the memory kernel actually allocates.

I can't think now of a concrete example of what may go wrong with calling
accept_pages() from memblock_reserve(), it's more of a gut feeling.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-26 10:00           ` Mike Rapoport
@ 2021-07-26 11:53             ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-26 11:53 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 26, 2021 at 01:00:12PM +0300, Mike Rapoport wrote:
> On Sun, Jul 25, 2021 at 09:28:28PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Jul 25, 2021 at 12:16:45PM +0300, Mike Rapoport wrote:
> > > On Fri, Jul 23, 2021 at 07:29:59PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Jul 23, 2021 at 06:23:39PM +0300, Mike Rapoport wrote:
> > > > > > @@ -1318,9 +1327,14 @@ void __init e820__memblock_setup(void)
> > > > > >  		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > > > > >  			memblock_reserve(entry->addr, entry->size);
> > > > > >  
> > > > > > -		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > > > > +		if (entry->type != E820_TYPE_RAM &&
> > > > > > +		    entry->type != E820_TYPE_RESERVED_KERN &&
> > > > > > +		    entry->type != E820_TYPE_UNACCEPTED)
> > > > > >  			continue;
> > > > > 
> > > > > If I understand correctly, you assume that
> > > > > 
> > > > > * E820_TYPE_RAM and E820_TYPE_RESERVED_KERN regions are already accepted by
> > > > >   firmware/booloader
> > > > > * E820_TYPE_UNACCEPTED would have been E820_SYSTEM_RAM if we'd disabled
> > > > >   encryption
> > > > > 
> > > > > What happens with other types? Particularly E820_TYPE_ACPI and
> > > > > E820_TYPE_NVS that may reside in memory and might have been accepted by
> > > > > BIOS.
> > > > 
> > > > Any accessible memory that not marked as UNACCEPTED has to be accepted
> > > > before kernel gets control.
> > > 
> > > Hmm, that would mean that everything that runs before the kernel must
> > > maintain precise E820 map. If we use 2M chunk as basic unit for accepting
> > > memory, the firmware must also use the same basic unit. E.g. we can't have
> > > an ACPI table squeezed between E820_TYPE_UNACCEPTED.
> > 
> > No. See mark_unaccepted(). Any chunks that cannot be accepted with 2M, get
> > accepted upfront, so we will not need to track them.
> 
> What will happen with the following E820 table:
> 
> 0x400000 - 0x401000 - ACPI (accepted by BIOS)
> 0x401000 - 0x408000 - UNACCEPTED
> 0x408000 - 0x409000 - ACPI (accepted by BIOS)

We will accept the on consructing the bitmap and don't mark as unaccepted
in the bitmap.

> > (I've just realized that mark_unaccepted() is buggy if 'start' and 'end'
> > are in the same 2M. Will fix.)
> > 
>  
> > > > > > --- a/mm/memblock.c
> > > > > > +++ b/mm/memblock.c
> > > > > > @@ -814,6 +814,7 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
> > > > > >  	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
> > > > > >  		     &base, &end, (void *)_RET_IP_);
> > > > > >  
> > > > > > +	accept_pages(base, base + size);
> > > > > 
> > > > > Hmm, I'm not sure memblock_reserve() is the right place to accept pages. It
> > > > > can be called to reserve memory owned by firmware which not necessarily
> > > > > would be encrypted. Besides, memblock_reserve() may be called for absent
> > > > > memory, could be it'll confuse TDX/SEV?
> > > > 
> > > > Such memory will not be marked as unaccepted and accept_pages() will do
> > > > nothing.
> > > > 
> > > > > Ideally, the call to accept_pages() should live in
> > > > > memblock_alloc_range_nid(), but unfortunately there still stale
> > > > > memblock_find_in_range() + memblock_reserve() pairs in x86 setup code.
> > > > 
> > > > memblock_reserve() is the root of memory allocation in the early boot and
> > > > it is natual place to do the trick. Unless we have a good reason to move
> > > > it somewhere I would keep it here.
> >
> > > I think it is better to accept memory that is actually allocated rather
> > > than marked as being used. It'll make it more robust against future changes
> > > in memblock_reserve() callers and in what is accept_pages() in your patch. 
> > 
> > I disagree.
> > 
> > If we move accept_pages() up to callers we will make less robust: any new
> > user of memblock_reserve() has to consider if accept_pages() is needed and
> > like would ignore it since it's not essential for any non-TDX/non-SEV use
> > case.
> 
> I do not suggest to move accept_pages() to all the callers of
> memblock_reserve(). I suggest to replace memblock_find_in_range() +
> memblock_reserve() pairs with an appropriate memblock_alloc call, make
> memblock_find_in_range() static and put accept_pages() there.
> 
> This essentially makes memblock_find_in_range() the root of early memory
> *allocations* while memblock_reserve() would be only used to mark the
> memory that is already used before the allocations can start.
> 
> Then we only deal with acceptance of the memory kernel actually allocates.

Okay, fair enough. I'll look into this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-22 17:31       ` Marc Orr
@ 2021-07-26 18:55         ` Joerg Roedel
  0 siblings, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-26 18:55 UTC (permalink / raw)
  To: Marc Orr
  Cc: Andi Kleen, Erdem Aktas, Andy Lutomirski, Joerg Roedel,
	David Rientjes, Borislav Petkov, Sean Christopherson,
	Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Kaplan, David,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco

On Thu, Jul 22, 2021 at 10:31:27AM -0700, Marc Orr wrote:
> IMHO, we need to be completely certain that guest data cannot be
> compromised if we're going to remove the requirement that guest memory
> only be validated once in a certain state (e.g., from within a crash
> kernel). Perhaps it is the case that we're certain that guest data
> cannot be compromised from within a crash kernel -- but it's not what
> I read in the email exchange.

Right, at least SNP has a strict requirement that no memory could be
validated or invalidated twice without giving up security guarantees for
that memory.

Regards,

	Jörg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-22 15:46   ` David Hildenbrand
@ 2021-07-26 19:02     ` Joerg Roedel
  2021-07-27  9:34       ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Joerg Roedel @ 2021-07-26 19:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Joerg Roedel, David Rientjes,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Andi Kleen,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Kaplan, David,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco

On Thu, Jul 22, 2021 at 05:46:13PM +0200, David Hildenbrand wrote:
> +1, this smells like an anti-patter. I'm absolutely not in favor of a
> bitmap, we have the sparse memory model for a reason.

Well, I doubt that TDX or SNP guests will be set up with a sparse memory
layout.

> Also, I am not convinced that kexec/kdump is actually easy to realize with
> the bitmap?

> Who will forward that bitmap?

The kernel decompressor will create it and forward it to the
decompressed kernel image. The running kernel will pass it on to
kexec'ed kernels for the lifetime of the system.

> Where will it reside?

In Linux kernel owned memory, location decided by the kernel
decompressor.

>Who says it's not corrupted?

If the hypervisor corrupts it we can notice it. The guest kernel can
corrupt it on its own, but that is true for all data in the guest, also
the memmap.

> Just take a look at how we don't even have access to memmap of the
> oldkernel in the newkernel -- and have to locate and decipher it in
> constantly-to-be-updated user space makedumpfile. Any time you'd
> change anything about the bitmap ("hey, let's use larger chunks",
> "hey, let's split it up") you'd break the old_kernel
> <-> new_kernel agreement.

Im not sure if makedumpfile needs to know about that bitmap. If we
mirror the same information into the memmap, then there is definitly no
need for it.

Regards,

	Jörg

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-22 19:51 ` Kirill A. Shutemov
  2021-07-23 15:23   ` Mike Rapoport
@ 2021-07-26 19:13   ` Joerg Roedel
  2021-07-26 23:02   ` Erdem Aktas
  2 siblings, 0 replies; 48+ messages in thread
From: Joerg Roedel @ 2021-07-26 19:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

Hi Kirill,

On Thu, Jul 22, 2021 at 10:51:30PM +0300, Kirill A. Shutemov wrote:
> Okay, below is my first take on the topic.

Thanks, I havn't looked deeply into the patch yet, but will do so
tomorrow and reply separatly.

> I ended up combing your idea with bitmap with PageOffline(): early boot
> code uses bitmap, but on page allocator init I mark unaccepted pages with
> PageOffline(). This way page allocator need to touch the bitmap only when
> it steps on PageOffline() which shouldn't be often once things settle
> after boot.

I still need to understand the benefit of having this information in the
memmap, but I also don't object to it. For AMD-SNP the bitmap needs to
stay around at least, unless there is another way to implement
kexec/kdump.

> One bit in the bitmap represents 2M region. Any unaligned chunks gets
> accepted when we construct the bitmap. This way one 4K page can represent
> 64 GiB of physical address space.

Yeah, a 2MB chunk size makes sense when it comes to how much we validate
at once. I think it will be good choice for AMD too. On the other side
there is a need for SNP to track shared pages on a 4k granularity. There
are a couple of shared (or at least not valid) pages (GHCB, #HV shared page,
VMSA page) per vCPU which are 4k in size. Oh, and then there is the
.bss_decrypted section, which is also not 2M aligend.

In case of kexec/kdump this information needs to be passed on to the
next kernel.

Regards,

	Jörg


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-22 19:51 ` Kirill A. Shutemov
  2021-07-23 15:23   ` Mike Rapoport
  2021-07-26 19:13   ` Joerg Roedel
@ 2021-07-26 23:02   ` Erdem Aktas
  2021-07-26 23:54     ` Kirill A. Shutemov
  2 siblings, 1 reply; 48+ messages in thread
From: Erdem Aktas @ 2021-07-26 23:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Thu, Jul 22, 2021 at 12:51 PM Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> +void mark_unaccepted(phys_addr_t start, phys_addr_t end)
> +{
> +       unsigned int npages;
> +
> +       if (start & ~PMD_MASK) {
> +               npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
> +               tdx_hcall_gpa_intent(start, npages, TDX_MAP_PRIVATE);
> +               start = round_up(start, PMD_SIZE);
> +       }
> +
> +       if (end & ~PMD_MASK) {
> +               npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
> +               end = round_down(end, PMD_SIZE);
> +               tdx_hcall_gpa_intent(end, npages, TDX_MAP_PRIVATE);
> +       }

Is not the above code will accept the pages that are already accepted?
It is accepting the pages in the same 2MB region that is before start
and after end. We do not know what code/data is stored on those pages,
right? This might cause security issues depending on what is stored on
those pages.

> +static void __accept_pages(phys_addr_t start, phys_addr_t end)
> +{
> +       unsigned int rs, re;
> +
> +       bitmap_for_each_set_region(unaccepted_memory, rs, re,
> +                                  start / PMD_SIZE, end / PMD_SIZE) {
> +               tdx_hcall_gpa_intent(rs * PMD_SIZE, (re - rs) * PMD_NR,
> +                                    TDX_MAP_PRIVATE);
> +
This assumes that the granularity of the unaccepted pages is always in
PMD_SIZE. I  have seen the answer above saying that mark_unaccepted
makes sure that we have only 2MB unaccepted pages in our bitmap but it
is not enough IMO. This function, as it is, will do double TDACCEPT
for the already accepted 4KB pages in the same 2MB region.

> +void maybe_set_page_offline(struct page *page, unsigned int order)
> +{
> +       phys_addr_t addr = page_to_phys(page);
> +       bool unaccepted = true;
> +       unsigned int i;
> +
> +       spin_lock(&unaccepted_memory_lock);
> +       if (order < PMD_ORDER) {
> +               BUG_ON(test_bit(addr / PMD_SIZE, unaccepted_memory));
> +               goto out;
> +       }
don't we need to throw a bug when order is < PMD_ORDER, independent of
what test_bit() is saying? If the page is accepted or not accepted,
there is a possibility of double accepting pages.
> +       for (i = 0; i < (1 << (order - PMD_ORDER)); i++) {

and if order < PMD_ORDER, this will be a wrong shift operation, right?

> +       if (unaccepted)
> +               __SetPageOffline(page);
> +       else
> +               __accept_pages(addr, addr + (PAGE_SIZE << order));
so all the pages that were accepted will be reaccepted?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-26 23:02   ` Erdem Aktas
@ 2021-07-26 23:54     ` Kirill A. Shutemov
  2021-07-27  1:35       ` Erdem Aktas
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2021-07-26 23:54 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 26, 2021 at 04:02:56PM -0700, Erdem Aktas wrote:
> On Thu, Jul 22, 2021 at 12:51 PM Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > +void mark_unaccepted(phys_addr_t start, phys_addr_t end)
> > +{
> > +       unsigned int npages;
> > +
> > +       if (start & ~PMD_MASK) {
> > +               npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
> > +               tdx_hcall_gpa_intent(start, npages, TDX_MAP_PRIVATE);
> > +               start = round_up(start, PMD_SIZE);
> > +       }
> > +
> > +       if (end & ~PMD_MASK) {
> > +               npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
> > +               end = round_down(end, PMD_SIZE);
> > +               tdx_hcall_gpa_intent(end, npages, TDX_MAP_PRIVATE);
> > +       }
> 
> Is not the above code will accept the pages that are already accepted?

No. This code will get called for all UNACCEPTED ranges in EFI table.
If such memory is accepted it is a bug.

> It is accepting the pages in the same 2MB region that is before start
> and after end. We do not know what code/data is stored on those pages,
> right? This might cause security issues depending on what is stored on
> those pages.

As I told above, it only get called for unaccepted memory and nothing can
be stored there before the point.

> > +static void __accept_pages(phys_addr_t start, phys_addr_t end)
> > +{
> > +       unsigned int rs, re;
> > +
> > +       bitmap_for_each_set_region(unaccepted_memory, rs, re,
> > +                                  start / PMD_SIZE, end / PMD_SIZE) {
> > +               tdx_hcall_gpa_intent(rs * PMD_SIZE, (re - rs) * PMD_NR,
> > +                                    TDX_MAP_PRIVATE);
> > +
> This assumes that the granularity of the unaccepted pages is always in
> PMD_SIZE.

Yes, because we constructed the bitmap this way. Non-2M-aligned chunks get
accepted when we accept upfront when we populate the bitmap.

See mark_unaccepted().

(mark_unaccepted() has few bugs that will be fixed in the next version)

> I  have seen the answer above saying that mark_unaccepted
> makes sure that we have only 2MB unaccepted pages in our bitmap but it
> is not enough IMO. This function, as it is, will do double TDACCEPT
> for the already accepted 4KB pages in the same 2MB region.

> > +void maybe_set_page_offline(struct page *page, unsigned int order)
> > +{
> > +       phys_addr_t addr = page_to_phys(page);
> > +       bool unaccepted = true;
> > +       unsigned int i;
> > +
> > +       spin_lock(&unaccepted_memory_lock);
> > +       if (order < PMD_ORDER) {
> > +               BUG_ON(test_bit(addr / PMD_SIZE, unaccepted_memory));
> > +               goto out;
> > +       }
> don't we need to throw a bug when order is < PMD_ORDER, independent of
> what test_bit() is saying? If the page is accepted or not accepted,
> there is a possibility of double accepting pages.

No. maybe_set_page_offline() get called on all pages that get added to the
free list on boot. Any pages with order < 9 has to belong to already
accepted regions.

> > +       for (i = 0; i < (1 << (order - PMD_ORDER)); i++) {
> 
> and if order < PMD_ORDER, this will be a wrong shift operation, right?

order < PMD_ORDER handed by the 'if' above.

> > +       if (unaccepted)
> > +               __SetPageOffline(page);
> > +       else
> > +               __accept_pages(addr, addr + (PAGE_SIZE << order));
> so all the pages that were accepted will be reaccepted?

Have you looked at what __accept_pages() does? It only accept unaccepted
pages, according to the bitmap.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-26 23:54     ` Kirill A. Shutemov
@ 2021-07-27  1:35       ` Erdem Aktas
  0 siblings, 0 replies; 48+ messages in thread
From: Erdem Aktas @ 2021-07-27  1:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, David Rientjes, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Brijesh Singh, Tom Lendacky,
	Jon Grimm, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Kaplan, David, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco

On Mon, Jul 26, 2021 at 4:54 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:

> >
> > Is not the above code will accept the pages that are already accepted?
>
> No. This code will get called for all UNACCEPTED ranges in EFI table.
> If such memory is accepted it is a bug.
>
> > It is accepting the pages in the same 2MB region that is before start
> > and after end. We do not know what code/data is stored on those pages,
> > right? This might cause security issues depending on what is stored on
> > those pages.
>
> As I told above, it only get called for unaccepted memory and nothing can
> be stored there before the point.

Thanks Kirill! You are right, it looks like I messed up with
round_up/down in my mind. Thanks for the clarification.

> Yes, because we constructed the bitmap this way. Non-2M-aligned chunks get
> accepted when we accept upfront when we populate the bitmap.
>
> See mark_unaccepted().
>
> (mark_unaccepted() has few bugs that will be fixed in the next version)
>
> Have you looked at what __accept_pages() does? It only accept unaccepted
> pages, according to the bitmap.

Ahh, makes sense!
Thanks for the explanation and sorry for my confusion, Kirill!

-Erdem

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Runtime Memory Validation in Intel-TDX and AMD-SNP
  2021-07-26 19:02     ` Joerg Roedel
@ 2021-07-27  9:34       ` David Hildenbrand
  0 siblings, 0 replies; 48+ messages in thread
From: David Hildenbrand @ 2021-07-27  9:34 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Kirill A. Shutemov, Joerg Roedel, David Rientjes,
	Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Andi Kleen,
	Brijesh Singh, Tom Lendacky, Jon Grimm, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Kaplan, David,
	Varad Gautam, Dario Faggioli, x86, linux-mm, linux-coco

On 26.07.21 21:02, Joerg Roedel wrote:
> On Thu, Jul 22, 2021 at 05:46:13PM +0200, David Hildenbrand wrote:
>> +1, this smells like an anti-patter. I'm absolutely not in favor of a
>> bitmap, we have the sparse memory model for a reason.
> 
> Well, I doubt that TDX or SNP guests will be set up with a sparse memory
> layout.

What makes you think that? I already heard people express desires for 
memory hot(un)plug, especially in the context of running containers 
inside encrypted VMs. And static bitmaps are naturally a bad choice for 
changing memory layouts.

> 
>> Also, I am not convinced that kexec/kdump is actually easy to realize with
>> the bitmap?
> 
>> Who will forward that bitmap?
> 
> The kernel decompressor will create it and forward it to the
> decompressed kernel image. The running kernel will pass it on to
> kexec'ed kernels for the lifetime of the system.

How will the second kernel figure out the location? Similar to how we 
pass the physical address of the vmcore header via the cmdline to the 
new kernel?

> 
>> Where will it reside?
> 
> In Linux kernel owned memory, location decided by the kernel
> decompressor.

Okay, owned by the old kernel, not initially mapped by new kernel in the 
identity mapping. Is there a prototype/code that implements that?

> 
>> Who says it's not corrupted?
> 
> If the hypervisor corrupts it we can notice it. The guest kernel can
> corrupt it on its own, but that is true for all data in the guest, also
> the memmap.

Yes, but it does not affect the kdump kernel booting, only makedumpfile 
might bail out later when it detects a corruption.

I'm wondering, why exactly would a kdump kernel (not touching memory of 
the old kernel while booting up) need access to the bitmap? Just 
wondering, for ACPI tables and such? I can understand why makedumpfile 
would need that information when actually dumping memory of the old 
kernel, but it would have access to the memmap of the old kernel to 
obtain that information.

> 
>> Just take a look at how we don't even have access to memmap of the
>> oldkernel in the newkernel -- and have to locate and decipher it in
>> constantly-to-be-updated user space makedumpfile. Any time you'd
>> change anything about the bitmap ("hey, let's use larger chunks",
>> "hey, let's split it up") you'd break the old_kernel
>> <-> new_kernel agreement.
> 
> Im not sure if makedumpfile needs to know about that bitmap. If we
> mirror the same information into the memmap, then there is definitly no
> need for it.

Mirroring is a good point. But I'd suggest using the bitmap only during 
early boot if really necessary and after syncing it to the bitmap, get 
rid of it. Sure, kexec is more challenging, but at least it's a clean 
design. We can always try expressing the state of validated memory in 
the e820 map we present to the kexec kernel.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, back to index

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-19 12:58 Runtime Memory Validation in Intel-TDX and AMD-SNP Joerg Roedel
2021-07-19 13:07 ` Matthew Wilcox
2021-07-19 15:02   ` Joerg Roedel
2021-07-19 20:39 ` Andi Kleen
2021-07-20  8:55   ` Joerg Roedel
2021-07-20  9:34     ` Dr. David Alan Gilbert
2021-07-20 11:50       ` Joerg Roedel
2021-07-20  0:26 ` Andy Lutomirski
     [not found]   ` <CAAYXXYwFzrf8uY-PFkMRSG28+HztfGdJft8kB3Y3keWCx9K8TQ@mail.gmail.com>
2021-07-20  2:00     ` Erdem Aktas
2021-07-20  5:17     ` Andi Kleen
2021-07-20  9:11       ` Joerg Roedel
2021-07-20 17:32         ` Andi Kleen
2021-07-20 23:09       ` Erdem Aktas
2021-07-21  0:38         ` Andi Kleen
2021-07-22 17:31       ` Marc Orr
2021-07-26 18:55         ` Joerg Roedel
     [not found]     ` <eacb9c1f-2c61-4a7f-b5a3-7bf579e6cbf6@www.fastmail.com>
2021-07-20 19:54       ` Erdem Aktas
2021-07-20 22:01         ` Andi Kleen
2021-07-20 23:55           ` Erdem Aktas
2021-07-21  0:35             ` Andi Kleen
2021-07-21  8:51           ` Joerg Roedel
2021-07-20  8:44   ` Joerg Roedel
2021-07-20 14:14   ` Dave Hansen
2021-07-20 17:30 ` Kirill A. Shutemov
2021-07-21  9:20   ` Mike Rapoport
2021-07-21 10:02     ` Kirill A. Shutemov
2021-07-21 10:22       ` Mike Rapoport
2021-07-21 10:53       ` Joerg Roedel
2021-07-21  9:25   ` Joerg Roedel
2021-07-21 10:25     ` Kirill A. Shutemov
2021-07-21 10:48       ` Joerg Roedel
2021-07-22 15:46   ` David Hildenbrand
2021-07-26 19:02     ` Joerg Roedel
2021-07-27  9:34       ` David Hildenbrand
2021-07-22 15:57 ` David Hildenbrand
2021-07-22 19:51 ` Kirill A. Shutemov
2021-07-23 15:23   ` Mike Rapoport
2021-07-23 16:29     ` Kirill A. Shutemov
2021-07-25  9:16       ` Mike Rapoport
2021-07-25 18:28         ` Kirill A. Shutemov
2021-07-26 10:00           ` Mike Rapoport
2021-07-26 11:53             ` Kirill A. Shutemov
2021-07-26 19:13   ` Joerg Roedel
2021-07-26 23:02   ` Erdem Aktas
2021-07-26 23:54     ` Kirill A. Shutemov
2021-07-27  1:35       ` Erdem Aktas
2021-07-23 11:04 ` Varad Gautam
2021-07-23 14:34   ` Kaplan, David

Linux Confidential Computing Development

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-coco/0 linux-coco/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-coco linux-coco/ https://lore.kernel.org/linux-coco \
		linux-coco@lists.linux.dev
	public-inbox-index linux-coco

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/dev.linux.lists.linux-coco


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git