From: Brijesh Singh <brijesh.singh@amd.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	David Rientjes <rientjes@google.com>
Cc: brijesh.singh@amd.com, Borislav Petkov <bp@alien8.de>,
	Andy Lutomirski <luto@kernel.org>,
	Sean Christopherson <seanjc@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andi Kleen <ak@linux.intel.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Jon Grimm <jon.grimm@amd.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Hellwig <hch@lst.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Ingo Molnar <mingo@redhat.com>, Joerg Roedel <jroedel@suse.de>,
	x86@kernel.org, linux-mm@kvack.org
Subject: Re: AMD SEV-SNP/Intel TDX: validation of memory pages
Date: Tue, 2 Feb 2021 18:16:41 -0600
Message-ID: <961a2736-9bc9-43e1-1e75-6d373fe9590b@amd.com>
In-Reply-To: <20210202160205.3wfchtibq2sd7pe5@black.fi.intel.com>


On 2/2/21 10:02 AM, Kirill A. Shutemov wrote:
> On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
>> Hi everybody,
>>
>> I'd like to kick-start the discussion on lazy validation of guest memory
>> for the purposes of AMD SEV-SNP and Intel TDX.
>>
>> Both AMD SEV-SNP and Intel TDX require validation of guest memory before
>> it may be used by the guest.  This is needed for integrity protection from
>> a potentially malicious hypervisor or other host components.
>>
>> For AMD SEV-SNP, the hypervisor assigns a page to the guest using the new
>> RMPUPDATE instruction.  The guest then transitions the page to usable with
>> the new PVALIDATE instruction[1].  This sets the Validated flag in the
>> Reverse Map Table (RMP) for a guest addressable page, which opts into
>> hardware and firmware integrity protection.  This may only be done by the
>> guest itself and until that time, the guest cannot access the page.
>>
>> The guest can only PVALIDATE memory for a gPA once; the RMP then
>> guarantees for each hPA that there is only a single gPA mapping.  This
>> validation can either be done all up front at the time the guest is booted
>> or it can be done lazily at runtime on fault if the guest keeps track of
>> Valid vs Invalid pages.  Because doing PVALIDATE for all guest memory at
>> boot would be extremely lengthy, I'd like to discuss the options for doing
>> it lazily.
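(Purely as an illustrative aside: a rough sketch of what a guest-side
PVALIDATE wrapper might look like, assuming the register interface described
in the SNP spec [1]: RAX = guest virtual address, RCX = page size (0 = 4K,
1 = 2M), RDX = 1 to validate / 0 to rescind, return code in EAX, and CF set
when the RMP entry was already in the requested state. The function name and
error convention are made up.)

/* Raw opcode bytes, since older assemblers do not know the mnemonic. */
static int snp_pvalidate(unsigned long vaddr, int page_size, int validate)
{
	unsigned char no_change;
	int rc;

	asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
		     "setc %[no_change]"
		     : "=a" (rc), [no_change] "=qm" (no_change)
		     : "a" (vaddr), "c" (page_size), "d" (validate)
		     : "memory", "cc");

	/* CF with a zero return code hints at a double-validation attempt. */
	if (!rc && no_change)
		return -1;

	return rc;
}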
>>
>> Similarly, for Intel TDX, the hypervisor unmaps the gPA from the shared
>> EPT and invalidates the tlb and all caches for the TD's vcpus; it then
>> adds a page to the gPA address space for a TD by using the new
>> TDH.MEM.PAGE.AUG call.  The TDG.MEM.PAGE.ACCEPT TDCALL[2] then allows a
>> guest to accept a guest page for a gPA and initialize it using the private
>> key for that TD.  This may only be done by the TD itself and until that
>> time, the gPA cannot be used within the TD.
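(Similarly, a hypothetical sketch of the TDX side, assuming the
TDG.MEM.PAGE.ACCEPT interface from the module spec [2]: leaf number in RAX,
RCX = GPA with the mapping level in the low bits (0 = 4K, 1 = 2M, 2 = 1G),
status returned in RAX. The names here are illustrative only.)

#define TDCALL_ACCEPT_PAGE	6UL	/* TDG.MEM.PAGE.ACCEPT leaf per the spec */

static unsigned long tdx_accept_page(unsigned long gpa, unsigned long level)
{
	unsigned long rcx = gpa | level;
	unsigned long ret;

	/* "tdcall" == opcode bytes 0x66 0x0F 0x01 0xCC on older assemblers. */
	asm volatile("tdcall"
		     : "=a" (ret), "+c" (rcx)
		     : "a" (TDCALL_ACCEPT_PAGE)
		     : "memory");

	return ret;	/* 0 means success */
}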
>>
>> Both AMD SEV-SNP and Intel TDX support hugepages.  SEV-SNP supports 2MB
>> whereas TDX has accept TDCALL support for 2MB and 1GB.
>>
>> I believe the UEFI ECR[3] adding the unaccepted memory type to
>> EFI_MEMORY_TYPE was accepted in December.  This should enable the guest to
>> learn what memory has not yet been validated (or accepted) by the firmware
>> if validation of all guest memory is not done completely up front.
>>
>> This likely requires a pre-validation of all memory that can be accessed
>> when handling a #VC (or #VE for TDX) such as IST stacks, including memory
>> in the x86 boot sequence that must be validated before the core mm
>> subsystem is up and running to handle the lazy validation.  I believe
>> lazy validation can be done by the core mm after that, perhaps by
>> maintaining a new "validated" bit in struct page flags.
>>
>> Has anybody looked into this or, even better, is anybody currently working
>> on this?
> It's likely I'm going to do this on Intel side, but I have not looked
> deeply into it.
>
>> I think quite invasive changes are needed for the guest to support lazy
>> validation/acceptance to core areas that lots of people on the recipient
>> list have strong opinions about.  Some things that come to mind:
>>
>>  - Annotations for pages that must be pre-validated in the x86 boot
>>    sequence, including IST stacks
>>
>>  - Proliferation of these annotations throughout any kernel code that can
>>    access memory for #VC or #VE
>>
>>  - Handling lazy validation of guest memory through the core mm layer,
>>    most likely involving a bit in struct page flags to track their status
>>
>>  - Any need for validating memory that is not backed by struct page that
>>    needs to be special-cased
>>
>>  - Any concerns about this for the DMA layer
>>
>> One possibility for minimal disruption to the boot entry code is to
>> require the guest BIOS to validate 4GB and below, and then leave 4GB and
>> above to be done lazily (the true amount of memory will actually be less
>> due to the MMIO hole).
> [ As I haven't looked into the actual code, I may say total garbage below... ]
>
> Pre-validating 4GB would indeed be the easiest way to go, but it's going to be
> too slow.
>
> The more realistic option is for the BIOS to pre-validate the memory where
> the kernel and initrd are placed, plus a few dozen megs for runtime. It
> means the decompression code would need to be aware of the validation.


I was thinking that having the BIOS validate the lower 4GB will simplify the
changes to the kernel entry code path as well as provide a clean approach to
supporting kexec.

My initial thoughts are:

- The BIOS or VMM validates the lower 4GB of memory.

- The BIOS marks the memory above 4GB as unaccepted in the e820/EFI memmap.

- Kernel early boot can be achieved with minimal (or no) changes.

- If an unaccepted memory type is discovered, then allocate a bitmap that
can be used to keep track of the state (e.g., which pages are validated).
We can also explore whether removing the unaccepted flag from the memmap
range will work.

- On #VC/#VE, look at the bitmap to see whether we need to validate the
pages. To speed things up, we can validate more than one page per #VC/#VE
(see the rough sketch after this list).

- If we get kexec'd, then rebuild the e820/memmap based on the bitmap so
that we don't double-validate.
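
To make the bitmap idea above a bit more concrete, here is a very rough
sketch; every name in it (arch_accept_page(), the batch size, the globals) is
hypothetical, and locking and bounds checks are omitted:

#include <linux/bitops.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/* Hypothetical arch hook: PVALIDATE on SNP, TDG.MEM.PAGE.ACCEPT on TDX. */
int arch_accept_page(unsigned long paddr);

#define ACCEPT_BATCH	32	/* pages validated per fault; tunable */

static unsigned long *unaccepted_bitmap;  /* one bit per 4K pfn; set = unaccepted */
static unsigned long bitmap_base_pfn;     /* first pfn the bitmap covers */

/* Called from the #VC/#VE handler when the faulting page may be unaccepted. */
static void lazy_validate(unsigned long fault_pfn)
{
	unsigned long i;

	for (i = 0; i < ACCEPT_BATCH; i++) {
		unsigned long pfn = fault_pfn + i;

		/* Never validate twice: only act if the bit was still set. */
		if (!test_and_clear_bit(pfn - bitmap_base_pfn, unaccepted_bitmap))
			continue;

		if (arch_accept_page(pfn << PAGE_SHIFT))
			panic("unexpected validation failure for pfn 0x%lx", pfn);
	}
}

The same bitmap would be what the kexec path consults to rebuild the
e820/memmap, so the next kernel never issues a second validation for an
already validated range.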


>
> The critical thing is that once memory is validated we must not validate
> it again. It's a possible VMM->guest attack vector. We must track precisely
> what memory has been validated and stop the guest upon detecting an
> unexpected second validation request.
>
> It also means that we have to keep the information when control gets
> passed from the decompression code to the real kernel. A page flag is no
> good for this.
>
> My initial thought is that we can use the e820/EFI memmap to keep track of
> the information: remove the unaccepted memory flag from the range that got
> accepted.
>
> The decompression code validates the memory that it needs for
> decompression, modifies the memmap accordingly, and passes control to the
> main kernel.
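(To illustrate that, a hypothetical sketch of how the decompression stub
could adjust the zero-page e820 table after validating [base, base + size).
E820_TYPE_UNACCEPTED does not exist yet and stands in for whatever the UEFI
ECR [3] ends up defining; the sketch also assumes the accepted chunk starts
at the entry's base, that a free table slot exists, and glosses over edge
cases.)

#include <linux/types.h>
#include <asm/bootparam.h>
#include <asm/e820/types.h>

#define E820_TYPE_UNACCEPTED	0x55aa	/* placeholder; no real value assigned */

static void e820_mark_accepted(struct boot_params *bp, u64 base, u64 size)
{
	int i;

	for (i = 0; i < bp->e820_entries; i++) {
		struct boot_e820_entry *e = &bp->e820_table[i];

		if (e->type != E820_TYPE_UNACCEPTED || e->addr != base)
			continue;

		/* Shrink the unaccepted entry by the amount just validated... */
		e->addr += size;
		e->size -= size;

		/* ...and account the validated chunk as normal RAM. */
		bp->e820_table[bp->e820_entries].addr = base;
		bp->e820_table[bp->e820_entries].size = size;
		bp->e820_table[bp->e820_entries].type = E820_TYPE_RAM;
		bp->e820_entries++;
		return;
	}
}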
>
> The main kernel may accept memory via #VE/#VC, but ideally it needs to
> stay within the memory accepted by the decompression code for the initial
> boot.
>
> I think the bulk of the memory validation can be done via existing machinery:
> we already have deferred struct page initialization code in the kernel, and
> I believe we can hook into it for this purpose.
>
> Any comments?
>

