* [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
@ 2023-07-06 16:41 Michael Kelley
  2023-08-02 21:57 ` Tom Lendacky
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Michael Kelley @ 2023-07-06 16:41 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, hpa,
	luto, peterz, thomas.lendacky, sathyanarayanan.kuppuswamy,
	kirill.shutemov, seanjc, rick.p.edgecombe, linux-kernel,
	linux-hyperv, x86
  Cc: mikelley

In a CoCo VM when a page transitions from private to shared, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent.  Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could access a transitioning page during
the window.  In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use).  Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad().  Unfortunately, this exception handling and
fixup code is tricky and somewhat fragile.  At the moment, it is
broken for both TDX and SEV-SNP.
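
For background, load_unaligned_zeropad() does a word-sized read that may
intentionally run past the end of a buffer into whatever page follows,
relying on fault fixup to zero the bytes it could not read.  A rough
sketch of the fixup semantics (illustrative only; the real arch/x86
implementation is inline asm in <asm/word-at-a-time.h> with an
exception-table fixup):

/*
 * Illustration only: what the zeropad fixup computes when the word read
 * at 'addr' faults because it crosses into an inaccessible page.  The
 * value is rebuilt from the aligned word that is readable, and the
 * unreadable bytes become zeroes (little-endian).
 */
static inline unsigned long zeropad_fixup_result(unsigned long aligned_word,
						 unsigned long addr)
{
	unsigned long offset = addr & (sizeof(unsigned long) - 1);

	return aligned_word >> (offset * 8);
}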

There's also a problem with the current code in paravisor scenarios:
TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
to be forwarded from the paravisor to the Linux guest, but there
are no architectural specs for how to do that.

To avoid these complexities of the CoCo exception handlers, change
the core transition code in __set_memory_enc_pgtable() to do the
following:

1.  Remove aliasing mappings
2.  Remove the PRESENT bit from the PTEs of all transitioning pages
3.  Flush the TLB globally
4.  Flush the data cache if needed
5.  Set/clear the encryption attribute as appropriate
6.  Notify the hypervisor of the page status change
7.  Add back the PRESENT bit

With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled.  CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.

This approach may make __set_memory_enc_pgtable() slightly slower
because of touching the PTEs three times instead of just once. But
the run time of this function is already dominated by the hypercall
and the need to flush the TLB at least once and maybe twice. In any
case, this function is only used for CoCo VM page transitions, and
is already unsuitable for hot paths.

The architecture specific callback function for notifying the
hypervisor typically must translate guest kernel virtual addresses
into guest physical addresses to pass to the hypervisor.  Because
the PTEs are invalid at the time of callback, the code for doing the
translation needs updating.  virt_to_phys() or equivalent continues
to work for direct map addresses.  But vmalloc addresses cannot use
vmalloc_to_page() because that function requires the leaf PTE to be
valid. Instead, slow_virt_to_phys() must be used. Both functions
manually walk the page table hierarchy, so performance is the same.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---

I'm assuming the TDX handling of the data cache flushing is the
same with this new approach, and that it doesn't need to be paired
with a TLB flush as in the current code.  If that's not a correct
assumption, let me know.

I've left the two hypervisor callbacks, before and after Step 5
above. If the PTEs are invalid, it seems like the order of Step 5
and Step 6 shouldn't matter, so perhaps one of the callbacks could
be dropped.  Let me know if there are reasons to do otherwise.

It may well be possible to optimize the new implementation of
__set_memory_enc_pgtable(). The existing set_memory_np() and
set_memory_p() functions do all the right things in a very clear
fashion, but perhaps not as optimally as having all three PTE
manipulations directly in the same function. It doesn't appear
that optimizing the performance really matters here, but I'm open
to suggestions.

I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
test on SEV-SNP VMs without vTOM. But I'll probably need help
covering SEV and SEV-ES VMs.

This RFC patch does *not* remove code that would no longer be
needed. If there's agreement to take this new approach, I'll
add patches to remove the unneeded code.

This patch is built against linux-next20230704.

 arch/x86/hyperv/ivm.c        |  3 ++-
 arch/x86/kernel/sev.c        |  2 +-
 arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
 3 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 28be6df..2859ec3 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 		return false;
 
 	for (i = 0, pfn = 0; i < pagecount; i++) {
-		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
+		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
+					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;
 		pfn++;
 
 		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1ee7bed..59db55e 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
 		hdr->end_entry = i;
 
 		if (is_vmalloc_addr((void *)vaddr)) {
-			pfn = vmalloc_to_pfn((void *)vaddr);
+			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
 			use_large_entry = false;
 		} else {
 			pfn = __pa(vaddr) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f12..8a194c7 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
 		addr &= PAGE_MASK;
 
+	/* set_memory_np() removes aliasing mappings and flushes the TLB */
+	ret = set_memory_np(addr, numpages);
+	if (ret)
+		return ret;
+
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
@@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
 	cpa.pgd = init_mm.pgd;
 
-	/* Must avoid aliasing mappings in the highmem code */
-	kmap_flush_unused();
-	vm_unmap_aliases();
-
 	/* Flush the caches as needed before changing the encryption attribute. */
-	if (x86_platform.guest.enc_tlb_flush_required(enc))
-		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
+	if (x86_platform.guest.enc_cache_flush_required())
+		cpa_flush(&cpa, 1);
 
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
 	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
 		return -EIO;
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
-
-	/*
-	 * After changing the encryption attribute, we need to flush TLBs again
-	 * in case any speculative TLB caching occurred (but no need to flush
-	 * caches again).  We could just use cpa_flush_all(), but in case TLB
-	 * flushing gets optimized in the cpa_flush() path use the same logic
-	 * as above.
-	 */
-	cpa_flush(&cpa, 0);
+	if (ret)
+		return ret;
 
 	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
-	if (!ret) {
-		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
-			ret = -EIO;
-	}
+	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
+		return -EIO;
 
-	return ret;
+	return set_memory_p(&addr, numpages);
 }
 
 static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
-- 
1.8.3.1



* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-07-06 16:41 [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared Michael Kelley
@ 2023-08-02 21:57 ` Tom Lendacky
  2023-08-05 14:38   ` Michael Kelley (LINUX)
  2023-08-06 22:19 ` kirill.shutemov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Tom Lendacky @ 2023-08-02 21:57 UTC (permalink / raw)
  To: Michael Kelley, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, hpa, luto, peterz, sathyanarayanan.kuppuswamy,
	kirill.shutemov, seanjc, rick.p.edgecombe, linux-kernel,
	linux-hyperv, x86

On 7/6/23 11:41, Michael Kelley wrote:
> In a CoCo VM when a page transitions from private to shared, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change. Because there are two separate steps, there's
> a window where the settings are inconsistent.  Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could access a transitioning page during
> the window.  In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use).  Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad().  Unfortunately, this exception handling and
> fixup code is tricky and somewhat fragile.  At the moment, it is
> broken for both TDX and SEV-SNP.
> 
> There's also a problem with the current code in paravisor scenarios:
> TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> to be forwarded from the paravisor to the Linux guest, but there
> are no architectural specs for how to do that.
> 
> To avoid these complexities of the CoCo exception handlers, change
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
> 
> 1.  Remove aliasing mappings
> 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> 3.  Flush the TLB globally
> 4.  Flush the data cache if needed
> 5.  Set/clear the encryption attribute as appropriate
> 6.  Notify the hypervisor of the page status change
> 7.  Add back the PRESENT bit
> 
> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled.  CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
> 
> This approach may make __set_memory_enc_pgtable() slightly slower
> because of touching the PTEs three times instead of just once. But
> the run time of this function is already dominated by the hypercall
> and the need to flush the TLB at least once and maybe twice. In any
> case, this function is only used for CoCo VM page transitions, and
> is already unsuitable for hot paths.
> 
> The architecture specific callback function for notifying the
> hypervisor typically must translate guest kernel virtual addresses
> into guest physical addresses to pass to the hypervisor.  Because
> the PTEs are invalid at the time of callback, the code for doing the
> translation needs updating.  virt_to_phys() or equivalent continues
> to work for direct map addresses.  But vmalloc addresses cannot use
> vmalloc_to_page() because that function requires the leaf PTE to be
> valid. Instead, slow_virt_to_phys() must be used. Both functions
> manually walk the page table hierarchy, so performance is the same.

This all seems reasonable if it allows the paravisor approach to run 
without issues.

> 
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
> 
> I'm assuming the TDX handling of the data cache flushing is the
> same with this new approach, and that it doesn't need to be paired
> with a TLB flush as in the current code.  If that's not a correct
> assumption, let me know.
> 
> I've left the two hypervisor callbacks, before and after Step 5
> above. If the PTEs are invalid, it seems like the order of Step 5
> and Step 6 shouldn't matter, so perhaps one of the callbacks could
> be dropped.  Let me know if there are reasons to do otherwise.
> 
> It may well be possible to optimize the new implementation of
> __set_memory_enc_pgtable(). The existing set_memory_np() and
> set_memory_p() functions do all the right things in a very clear
> fashion, but perhaps not as optimally as having all three PTE
> manipulations directly in the same function. It doesn't appear
> that optimizing the performance really matters here, but I'm open
> to suggestions.
> 
> I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
> test on SEV-SNP VMs without vTOM. But I'll probably need help
> covering SEV and SEV-ES VMs.

I wouldn't think that SEV and SEV-ES VMs would have any issues. However, 
these types of VMs don't make hypercalls at the moment, so I don't know 
whether any slowdowns would be noticed.

> 
> This RFC patch does *not* remove code that would no longer be
> needed. If there's agreement to take this new approach, I'll
> add patches to remove the unneeded code.
> 
> This patch is built against linux-next20230704.


> 
>   arch/x86/hyperv/ivm.c        |  3 ++-
>   arch/x86/kernel/sev.c        |  2 +-
>   arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
>   3 files changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 28be6df..2859ec3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>   		return false;
>   
>   	for (i = 0, pfn = 0; i < pagecount; i++) {
> -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;

Definitely needs a comment here (and below) that slow_virt_to_phys() is 
being used because the page is being made not present.

>   		pfn++;
>   
>   		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 1ee7bed..59db55e 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
>   		hdr->end_entry = i;
>   
>   		if (is_vmalloc_addr((void *)vaddr)) {
> -			pfn = vmalloc_to_pfn((void *)vaddr);
> +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
>   			use_large_entry = false;
>   		} else {
>   			pfn = __pa(vaddr) >> PAGE_SHIFT;
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f12..8a194c7 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>   		addr &= PAGE_MASK;
>   
> +	/* set_memory_np() removes aliasing mappings and flushes the TLB */

Is there any case where the TLB wouldn't be flushed when it should? Since, 
for SEV at least, the TLB flush being removed below was always performed.

Thanks,
Tom

> +	ret = set_memory_np(addr, numpages);
> +	if (ret)
> +		return ret;
> +
>   	memset(&cpa, 0, sizeof(cpa));
>   	cpa.vaddr = &addr;
>   	cpa.numpages = numpages;
> @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
>   	cpa.pgd = init_mm.pgd;
>   
> -	/* Must avoid aliasing mappings in the highmem code */
> -	kmap_flush_unused();
> -	vm_unmap_aliases();
> -
>   	/* Flush the caches as needed before changing the encryption attribute. */
> -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> +	if (x86_platform.guest.enc_cache_flush_required())
> +		cpa_flush(&cpa, 1);
>   
>   	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>   	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
>   		return -EIO;
>   
>   	ret = __change_page_attr_set_clr(&cpa, 1);
> -
> -	/*
> -	 * After changing the encryption attribute, we need to flush TLBs again
> -	 * in case any speculative TLB caching occurred (but no need to flush
> -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> -	 * flushing gets optimized in the cpa_flush() path use the same logic
> -	 * as above.
> -	 */
> -	cpa_flush(&cpa, 0);
> +	if (ret)
> +		return ret;
>   
>   	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		return -EIO;
>   
> -	return ret;
> +	return set_memory_p(&addr, numpages);
>   }
>   
>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)


* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-02 21:57 ` Tom Lendacky
@ 2023-08-05 14:38   ` Michael Kelley (LINUX)
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-08-05 14:38 UTC (permalink / raw)
  To: Tom Lendacky, KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui,
	tglx, mingo, bp, dave.hansen, hpa, luto, peterz,
	sathyanarayanan.kuppuswamy, kirill.shutemov, seanjc,
	rick.p.edgecombe, linux-kernel, linux-hyperv, x86

From: Tom Lendacky <thomas.lendacky@amd.com> Sent: Wednesday, August 2, 2023 2:58 PM
> 

[snip]

> >
> > diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> > index 28be6df..2859ec3 100644
> > --- a/arch/x86/hyperv/ivm.c
> > +++ b/arch/x86/hyperv/ivm.c
> > @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> >   		return false;
> >
> >   	for (i = 0, pfn = 0; i < pagecount; i++) {
> > -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> > +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> > +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;
> 
> Definitely needs a comment here (and below) that slow_virt_to_phys() is
> being used because the page is being made not present.

Agreed.

> 
> >   		pfn++;
> >
> >   		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> > diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> > index 1ee7bed..59db55e 100644
> > --- a/arch/x86/kernel/sev.c
> > +++ b/arch/x86/kernel/sev.c
> > @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
> >   		hdr->end_entry = i;
> >
> >   		if (is_vmalloc_addr((void *)vaddr)) {
> > -			pfn = vmalloc_to_pfn((void *)vaddr);
> > +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
> >   			use_large_entry = false;
> >   		} else {
> >   			pfn = __pa(vaddr) >> PAGE_SHIFT;
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index bda9f12..8a194c7 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >   	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> >   		addr &= PAGE_MASK;
> >
> > +	/* set_memory_np() removes aliasing mappings and flushes the TLB */
> 
> Is there any case where the TLB wouldn't be flushed when it should? Since,
> for SEV at least, the TLB flush being removed below was always performed.

The TLB is flushed as long as set_memory_np() actually changes some
PTEs.  It doesn’t make any sense to be doing a private<->shared transition
on pages that aren't marked PRESENT, so clearing the PRESENT bit will
always change the PTEs and cause the TLB to be flushed.  The decision is
made several levels down in __change_page_attr() where the CPA_FLUSHTLB
flag is set.  The flush is done in change_page_attr_set_clr() based on that
flag.  The data cache is *not* flushed.

Also, even if the memory *was* already not PRESENT, then the PTEs
would not be cached in the TLB anyway, and the TLB would not need
to be flushed.
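
To make that concrete, the relevant logic, abridged and paraphrased from
arch/x86/mm/pat/set_memory.c (not verbatim, and other cases elided), is:

	/*
	 * __change_page_attr(): a TLB flush is requested only when a
	 * PTE actually changes.
	 */
	if (pte_val(old_pte) != pte_val(new_pte)) {
		set_pte_atomic(kpte, new_pte);
		cpa->flags |= CPA_FLUSHTLB;
	}

	/* change_page_attr_set_clr(): skip the flush if nothing changed */
	if (!(cpa.flags & CPA_FLUSHTLB))
		goto out;

	/*
	 * set_memory_np() sets no caching attributes, so cache == 0 and
	 * cpa_flush() does a TLB flush but no data cache flush.
	 */
	cache = !!pgprot2cachemode(mask_set);
	cpa_flush(&cpa, cache);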

Michael

> 
> > +	ret = set_memory_np(addr, numpages);
> > +	if (ret)
> > +		return ret;
> > +
> >   	memset(&cpa, 0, sizeof(cpa));
> >   	cpa.vaddr = &addr;
> >   	cpa.numpages = numpages;
> > @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >   	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> >   	cpa.pgd = init_mm.pgd;
> >
> > -	/* Must avoid aliasing mappings in the highmem code */
> > -	kmap_flush_unused();
> > -	vm_unmap_aliases();
> > -
> >   	/* Flush the caches as needed before changing the encryption attribute. */
> > -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> > -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> > +	if (x86_platform.guest.enc_cache_flush_required())
> > +		cpa_flush(&cpa, 1);
> >
> >   	/* Notify hypervisor that we are about to set/clr encryption attribute. */
> >   	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> >   		return -EIO;
> >
> >   	ret = __change_page_attr_set_clr(&cpa, 1);
> > -
> > -	/*
> > -	 * After changing the encryption attribute, we need to flush TLBs again
> > -	 * in case any speculative TLB caching occurred (but no need to flush
> > -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> > -	 * flushing gets optimized in the cpa_flush() path use the same logic
> > -	 * as above.
> > -	 */
> > -	cpa_flush(&cpa, 0);
> > +	if (ret)
> > +		return ret;
> >
> >   	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> > -	if (!ret) {
> > -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > -			ret = -EIO;
> > -	}
> > +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > +		return -EIO;
> >
> > -	return ret;
> > +	return set_memory_p(&addr, numpages);
> >   }
> >
> >   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)


* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-07-06 16:41 [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared Michael Kelley
  2023-08-02 21:57 ` Tom Lendacky
@ 2023-08-06 22:19 ` kirill.shutemov
  2023-08-16  2:54   ` Michael Kelley (LINUX)
  2023-08-30  0:02 ` Edgecombe, Rick P
  2023-08-30 23:40 ` Edgecombe, Rick P
  3 siblings, 1 reply; 17+ messages in thread
From: kirill.shutemov @ 2023-08-06 22:19 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, hpa,
	luto, peterz, thomas.lendacky, sathyanarayanan.kuppuswamy,
	seanjc, rick.p.edgecombe, linux-kernel, linux-hyperv, x86

On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> In a CoCo VM when a page transitions from private to shared, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change. Because there are two separate steps, there's
> a window where the settings are inconsistent.  Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could access a transitioning page during
> the window.  In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use).  Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad().  Unfortunately, this exception handling and
> fixup code is tricky and somewhat fragile.  At the moment, it is
> broken for both TDX and SEV-SNP.

I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?

> There's also a problem with the current code in paravisor scenarios:
> TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> to be forwarded from the paravisor to the Linux guest, but there
> are no architectural specs for how to do that.
> 
> To avoid these complexities of the CoCo exception handlers, change
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
> 
> 1.  Remove aliasing mappings
> 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> 3.  Flush the TLB globally
> 4.  Flush the data cache if needed
> 5.  Set/clear the encryption attribute as appropriate
> 6.  Notify the hypervisor of the page status change
> 7.  Add back the PRESENT bit

Okay, looks safe.

> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled.  CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
> 
> This approach may make __set_memory_enc_pgtable() slightly slower
> because of touching the PTEs three times instead of just once. But
> the run time of this function is already dominated by the hypercall
> and the need to flush the TLB at least once and maybe twice. In any
> case, this function is only used for CoCo VM page transitions, and
> is already unsuitable for hot paths.
> 
> The architecture specific callback function for notifying the
> hypervisor typically must translate guest kernel virtual addresses
> into guest physical addresses to pass to the hypervisor.  Because
> the PTEs are invalid at the time of callback, the code for doing the
> translation needs updating.  virt_to_phys() or equivalent continues
> to work for direct map addresses.  But vmalloc addresses cannot use
> vmalloc_to_page() because that function requires the leaf PTE to be
> valid. Instead, slow_virt_to_phys() must be used. Both functions
> manually walk the page table hierarchy, so performance is the same.

Uhmm.. But why do we expect slow_virt_to_phys() to work on non-present
page table entries? It seems like an accident to me that it works now.
Somebody forgot a pte_present() check.

Generally, if the present bit is clear we cannot really assume anything
about the rest of the bits without external hints.

This smells bad.

Maybe the interface has to be reworked to operate on GPAs?

 
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
> 
> I'm assuming the TDX handling of the data cache flushing is the
> same with this new approach, and that it doesn't need to be paired
> with a TLB flush as in the current code.  If that's not a correct
> assumption, let me know.
> 
> I've left the two hypervisor callbacks, before and after Step 5
> above. If the PTEs are invalid, it seems like the order of Step 5
> and Step 6 shouldn't matter, so perhaps one of the callbacks could
> be dropped.  Let me know if there are reasons to do otherwise.
> 
> It may well be possible to optimize the new implementation of
> __set_memory_enc_pgtable(). The existing set_memory_np() and
> set_memory_p() functions do all the right things in a very clear
> fashion, but perhaps not as optimally as having all three PTE
> manipulations directly in the same function. It doesn't appear
> that optimizing the performance really matters here, but I'm open
> to suggestions.
> 
> I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
> test on SEV-SNP VMs without vTOM. But I'll probably need help
> covering SEV and SEV-ES VMs.
> 
> This RFC patch does *not* remove code that would no longer be
> needed. If there's agreement to take this new approach, I'll
> add patches to remove the unneeded code.
> 
> This patch is built against linux-next20230704.
> 
>  arch/x86/hyperv/ivm.c        |  3 ++-
>  arch/x86/kernel/sev.c        |  2 +-
>  arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
>  3 files changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 28be6df..2859ec3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>  		return false;
>  
>  	for (i = 0, pfn = 0; i < pagecount; i++) {
> -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;
>  		pfn++;
>  
>  		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 1ee7bed..59db55e 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
>  		hdr->end_entry = i;
>  
>  		if (is_vmalloc_addr((void *)vaddr)) {
> -			pfn = vmalloc_to_pfn((void *)vaddr);
> +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
>  			use_large_entry = false;
>  		} else {
>  			pfn = __pa(vaddr) >> PAGE_SHIFT;
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f12..8a194c7 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>  		addr &= PAGE_MASK;
>  
> +	/* set_memory_np() removes aliasing mappings and flushes the TLB */
> +	ret = set_memory_np(addr, numpages);
> +	if (ret)
> +		return ret;
> +
>  	memset(&cpa, 0, sizeof(cpa));
>  	cpa.vaddr = &addr;
>  	cpa.numpages = numpages;
> @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
>  	cpa.pgd = init_mm.pgd;
>  
> -	/* Must avoid aliasing mappings in the highmem code */
> -	kmap_flush_unused();
> -	vm_unmap_aliases();


Why did you drop this?

> -
>  	/* Flush the caches as needed before changing the encryption attribute. */
> -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> +	if (x86_platform.guest.enc_cache_flush_required())
> +		cpa_flush(&cpa, 1);

Do you remove the last enc_cache_flush_required() user?

>  	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>  	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
>  		return -EIO;
>  
>  	ret = __change_page_attr_set_clr(&cpa, 1);
> -
> -	/*
> -	 * After changing the encryption attribute, we need to flush TLBs again
> -	 * in case any speculative TLB caching occurred (but no need to flush
> -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> -	 * flushing gets optimized in the cpa_flush() path use the same logic
> -	 * as above.
> -	 */
> -	cpa_flush(&cpa, 0);
> +	if (ret)
> +		return ret;
>  
>  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		return -EIO;
>  
> -	return ret;
> +	return set_memory_p(&addr, numpages);
>  }
>  
>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> -- 
> 1.8.3.1
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-06 22:19 ` kirill.shutemov
@ 2023-08-16  2:54   ` Michael Kelley (LINUX)
  2023-08-28 14:22     ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-08-16  2:54 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Sunday, August 6, 2023 3:20 PM
> 
> On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > In a CoCo VM when a page transitions from private to shared, or vice
> > versa, attributes in the PTE must be updated *and* the hypervisor must
> > be notified of the change. Because there are two separate steps, there's
> > a window where the settings are inconsistent.  Normally the code that
> > initiates the transition (via set_memory_decrypted() or
> > set_memory_encrypted()) ensures that the memory is not being accessed
> > during a transition, so the window of inconsistency is not a problem.
> > However, the load_unaligned_zeropad() function can read arbitrary memory
> > pages at arbitrary times, which could access a transitioning page during
> > the window.  In such a case, CoCo VM specific exceptions are taken
> > (depending on the CoCo architecture in use).  Current code in those
> > exception handlers recovers and does "fixup" on the result returned by
> > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > fixup code is tricky and somewhat fragile.  At the moment, it is
> > broken for both TDX and SEV-SNP.
> 

Thanks for looking at this.  I'm finally catching up after being out on
vacation for a week.

> I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?

I presume you meant "now fixed for TDX", which I agree with.  Tom
Lendacky has indicated that there's still a problem with SEV-SNP.   He
could fix that problem, but this approach of marking the pages
invalid obviates the need for Tom's fix.

> 
> > There's also a problem with the current code in paravisor scenarios:
> > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > to be forwarded from the paravisor to the Linux guest, but there
> > are no architectural specs for how to do that.

The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
is sent to the containing VM, not the main Linux guest VM.  For
everything except an EPT violation, the containing VM can handle
the exception on behalf of the main Linux guest by doing the
appropriate emulation.  But for an EPT violation, the containing
VM can only terminate the guest.  It doesn't have sufficient context
to handle a "valid" #VE with EPT violation generated due to
load_unaligned_zeropad().  My proposed scheme of marking the
pages invalid avoids generating those #VEs and lets TD Partitioning
(and similarly for SEV-SNP vTOM) work as intended with a paravisor.

> >
> > To avoid these complexities of the CoCo exception handlers, change
> > the core transition code in __set_memory_enc_pgtable() to do the
> > following:
> >
> > 1.  Remove aliasing mappings
> > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > 3.  Flush the TLB globally
> > 4.  Flush the data cache if needed
> > 5.  Set/clear the encryption attribute as appropriate
> > 6.  Notify the hypervisor of the page status change
> > 7.  Add back the PRESENT bit
> 
> Okay, looks safe.
> 
> > With this approach, load_unaligned_zeropad() just takes its normal
> > page-fault-based fixup path if it touches a page that is transitioning.
> > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > are completely decoupled.  CoCo VM page transitions can proceed
> > without needing to handle architecture-specific exceptions and fix
> > things up. This decoupling reduces the complexity due to separate
> > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > introduce new capabilities in future versions of the TDX and SEV-SNP
> > architectures. Paravisor scenarios work properly without needing
> > to forward exceptions.
> >
> > This approach may make __set_memory_enc_pgtable() slightly slower
> > because of touching the PTEs three times instead of just once. But
> > the run time of this function is already dominated by the hypercall
> > and the need to flush the TLB at least once and maybe twice. In any
> > case, this function is only used for CoCo VM page transitions, and
> > is already unsuitable for hot paths.
> >
> > The architecture specific callback function for notifying the
> > hypervisor typically must translate guest kernel virtual addresses
> > into guest physical addresses to pass to the hypervisor.  Because
> > the PTEs are invalid at the time of callback, the code for doing the
> > translation needs updating.  virt_to_phys() or equivalent continues
> > to work for direct map addresses.  But vmalloc addresses cannot use
> > vmalloc_to_page() because that function requires the leaf PTE to be
> > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > manually walk the page table hierarchy, so performance is the same.
> 
> Uhmm.. But why do we expect slow_virt_to_phys() to work on non-present
> page table entries? It seems like an accident to me that it works now.
> Somebody forgot a pte_present() check.

I didn't research the history of slow_virt_to_phys(), but I'll do so.

> 
> Generally, if the present bit is clear we cannot really assume anything
> about the rest of the bits without external hints.
> 
> This smells bad.

Maybe so.  But if we've just cleared the present bit, then we really
do know something about the rest of the bits.  We could either
document a constraint that slow_virt_to_phys() should not assume
the present bit, or write a separate but equivalent function that
doesn't assume the present bit.

> 
> Maybe the interface has to be reworked to operate on GPAs?

I'll experiment and see what that might look like, and if it is
better than doing the page table walk after the present bit
is cleared.

> 
> 
> > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> > ---
> >
> > I'm assuming the TDX handling of the data cache flushing is the
> > same with this new approach, and that it doesn't need to be paired
> > with a TLB flush as in the current code.  If that's not a correct
> > assumption, let me know.

Kirill -- could you comment on the above?  I don't fully understand
the TDX scenarios that need data cache flushing.

> >
> > I've left the two hypervisor callbacks, before and after Step 5
> > above. If the PTEs are invalid, it seems like the order of Step 5
> > and Step 6 shouldn't matter, so perhaps one of the callbacks could
> > be dropped.  Let me know if there are reasons to do otherwise.
> >
> > It may well be possible to optimize the new implementation of
> > __set_memory_enc_pgtable(). The existing set_memory_np() and
> > set_memory_p() functions do all the right things in a very clear
> > fashion, but perhaps not as optimally as having all three PTE
> > manipulations directly in the same function. It doesn't appear
> > that optimizing the performance really matters here, but I'm open
> > to suggestions.
> >
> > I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
> > test on SEV-SNP VMs without vTOM. But I'll probably need help
> > covering SEV and SEV-ES VMs.
> >
> > This RFC patch does *not* remove code that would no longer be
> > needed. If there's agreement to take this new approach, I'll
> > add patches to remove the unneeded code.
> >
> > This patch is built against linux-next20230704.
> >
> >  arch/x86/hyperv/ivm.c        |  3 ++-
> >  arch/x86/kernel/sev.c        |  2 +-
> >  arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
> >  3 files changed, 15 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> > index 28be6df..2859ec3 100644
> > --- a/arch/x86/hyperv/ivm.c
> > +++ b/arch/x86/hyperv/ivm.c
> > @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> >  		return false;
> >
> >  	for (i = 0, pfn = 0; i < pagecount; i++) {
> > -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> > +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> > +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;
> >  		pfn++;
> >
> >  		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> > diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> > index 1ee7bed..59db55e 100644
> > --- a/arch/x86/kernel/sev.c
> > +++ b/arch/x86/kernel/sev.c
> > @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
> >  		hdr->end_entry = i;
> >
> >  		if (is_vmalloc_addr((void *)vaddr)) {
> > -			pfn = vmalloc_to_pfn((void *)vaddr);
> > +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
> >  			use_large_entry = false;
> >  		} else {
> >  			pfn = __pa(vaddr) >> PAGE_SHIFT;
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index bda9f12..8a194c7 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >  	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> >  		addr &= PAGE_MASK;
> >
> > +	/* set_memory_np() removes aliasing mappings and flushes the TLB */
> > +	ret = set_memory_np(addr, numpages);
> > +	if (ret)
> > +		return ret;
> > +
> >  	memset(&cpa, 0, sizeof(cpa));
> >  	cpa.vaddr = &addr;
> >  	cpa.numpages = numpages;
> > @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >  	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> >  	cpa.pgd = init_mm.pgd;
> >
> > -	/* Must avoid aliasing mappings in the highmem code */
> > -	kmap_flush_unused();
> > -	vm_unmap_aliases();
> 
> 
> Why did you drop this?

Both functions are already called by set_memory_np() ->
change_page_attr_clear() -> change_page_attr_set_clr().
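
In other words, paraphrasing the relevant part of
change_page_attr_set_clr() (not verbatim):

	/* Must avoid aliasing mappings in the highmem code */
	kmap_flush_unused();
	vm_unmap_aliases();

	/* ... then the PTE changes and any needed TLB flush happen */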

> 
> > -
> >  	/* Flush the caches as needed before changing the encryption attribute. */
> > -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> > -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> > +	if (x86_platform.guest.enc_cache_flush_required())
> > +		cpa_flush(&cpa, 1);
> 
> Do you remove the last enc_cache_flush_required() user?

Yes, I think so.  As I commented above, this RFC version doesn't remove
all the unneeded code.  If there's agreement to move forward, I'll do that.

> 
> >  	/* Notify hypervisor that we are about to set/clr encryption attribute. */
> >  	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> >  		return -EIO;
> >
> >  	ret = __change_page_attr_set_clr(&cpa, 1);
> > -
> > -	/*
> > -	 * After changing the encryption attribute, we need to flush TLBs again
> > -	 * in case any speculative TLB caching occurred (but no need to flush
> > -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> > -	 * flushing gets optimized in the cpa_flush() path use the same logic
> > -	 * as above.
> > -	 */
> > -	cpa_flush(&cpa, 0);
> > +	if (ret)
> > +		return ret;
> >
> >  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> > -	if (!ret) {
> > -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > -			ret = -EIO;
> > -	}
> > +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > +		return -EIO;
> >
> > -	return ret;
> > +	return set_memory_p(&addr, numpages);
> >  }
> >
> >  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> > --
> > 1.8.3.1
> >
> 
> --
>   Kiryl Shutsemau / Kirill A. Shutemov


* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-16  2:54   ` Michael Kelley (LINUX)
@ 2023-08-28 14:22     ` Michael Kelley (LINUX)
  2023-08-28 16:13       ` kirill.shutemov
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-08-28 14:22 UTC (permalink / raw)
  To: Michael Kelley (LINUX), kirill.shutemov
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

From: Michael Kelley (LINUX) <mikelley@microsoft.com> Sent: Tuesday, August 15, 2023 7:54 PM
> 
> From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Sunday,
> August 6, 2023 3:20 PM
> >
> > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > In a CoCo VM when a page transitions from private to shared, or vice
> > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > be notified of the change. Because there are two separate steps, there's
> > > a window where the settings are inconsistent.  Normally the code that
> > > initiates the transition (via set_memory_decrypted() or
> > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > during a transition, so the window of inconsistency is not a problem.
> > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > pages at arbitrary times, which could access a transitioning page during
> > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > (depending on the CoCo architecture in use).  Current code in those
> > > exception handlers recovers and does "fixup" on the result returned by
> > > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > > fixup code is tricky and somewhat fragile.  At the moment, it is
> > > broken for both TDX and SEV-SNP.
> >
> 
> Thanks for looking at this.  I'm finally catching up after being out on
> vacation for a week.
> 
> > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> 
> I presume you meant "now fixed for TDX", which I agree with.  Tom
> Lendacky has indicated that there's still a problem with SEV-SNP.   He
> could fix that problem, but this approach of marking the pages
> invalid obviates the need for Tom's fix.
> 
> >
> > > There's also a problem with the current code in paravisor scenarios:
> > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > to be forwarded from the paravisor to the Linux guest, but there
> > > are no architectural specs for how to do that.
> 
> The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
> is sent to the containing VM, not the main Linux guest VM.  For
> everything except an EPT violation, the containing VM can handle
> the exception on behalf of the main Linux guest by doing the
> appropriate emulation.  But for an EPT violation, the containing
> VM can only terminate the guest.  It doesn't have sufficient context
> to handle a "valid" #VE with EPT violation generated due to
> load_unaligned_zeropad().  My proposed scheme of marking the
> pages invalid avoids generating those #VEs and lets TD Partitioning
> (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> 
> > >
> > > To avoid these complexities of the CoCo exception handlers, change
> > > the core transition code in __set_memory_enc_pgtable() to do the
> > > following:
> > >
> > > 1.  Remove aliasing mappings
> > > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > > 3.  Flush the TLB globally
> > > 4.  Flush the data cache if needed
> > > 5.  Set/clear the encryption attribute as appropriate
> > > 6.  Notify the hypervisor of the page status change
> > > 7.  Add back the PRESENT bit
> >
> > Okay, looks safe.
> >
> > > With this approach, load_unaligned_zeropad() just takes its normal
> > > page-fault-based fixup path if it touches a page that is transitioning.
> > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > are completely decoupled.  CoCo VM page transitions can proceed
> > > without needing to handle architecture-specific exceptions and fix
> > > things up. This decoupling reduces the complexity due to separate
> > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > architectures. Paravisor scenarios work properly without needing
> > > to forward exceptions.
> > >
> > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > because of touching the PTEs three times instead of just once. But
> > > the run time of this function is already dominated by the hypercall
> > > and the need to flush the TLB at least once and maybe twice. In any
> > > case, this function is only used for CoCo VM page transitions, and
> > > is already unsuitable for hot paths.
> > >
> > > The architecture specific callback function for notifying the
> > > hypervisor typically must translate guest kernel virtual addresses
> > > into guest physical addresses to pass to the hypervisor.  Because
> > > the PTEs are invalid at the time of callback, the code for doing the
> > > translation needs updating.  virt_to_phys() or equivalent continues
> > > to work for direct map addresses.  But vmalloc addresses cannot use
> > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > manually walk the page table hierarchy, so performance is the same.
> >
> > Uhmm.. But why do we expect slow_virt_to_phys() to work on non-present
> > page table entries? It seems like an accident to me that it works now.
> > Somebody forgot a pte_present() check.
> 
> I didn't research the history of slow_virt_to_phys(), but I'll do so.
> 

The history of slow_virt_to_phys() doesn't show any discussion of
whether it is expected to work for non-present PTEs.  However, the
page table walking is done by lookup_address(), which explicitly
*does* work for non-present PTEs.  For example, lookup_address()
is used in cpa_flush() to find a PTE.  cpa_flush() then checks to see if
the PTE is present.  The comparable vmalloc_to_page() *does* require
the PTE to be present, but arguably that's because it is returning a
struct page * rather than a phys_addr_t.

So I don't know that the pte_present() check was forgotten in
slow_virt_to_phys().  FWIW, Dave Hansen originally added
slow_virt_to_phys() about 10 years ago. :-)
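
For reference, the core of slow_virt_to_phys() boils down to something
like this (condensed, 4K-leaf case only; the real function also handles
2M and 1G mappings with the corresponding pfn helpers and offset masks):

	pte = lookup_address(virt_addr, &level);	/* no pte_present() check */
	BUG_ON(!pte);

	/*
	 * After set_memory_np() only _PAGE_PRESENT was cleared, so the
	 * PFN bits are still in the PTE and the translation still works.
	 */
	phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
	return phys_addr | (virt_addr & ~PAGE_MASK);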

> >
> > Generally, if the present bit is clear we cannot really assume anything
> > about the rest of the bits without external hints.
> >
> > This smells bad.
> 
> Maybe so.  But if we've just cleared the present bit, then we really
> do know something about the rest of the bits.  We could either
> document a constraint that slow_virt_to_phys() should not assume
> the present bit, or write a separate but equivalent function that
> doesn't assume the present bit.
> 
> >
> > Maybe the interface has to be reworked to operate on GPAs?
> 
> I'll experiment and see what that might look like, and if it is
> better than doing the page table walk after the present bit
> is cleared.

Converting the enc_status_change_prepare() and
enc_status_change_finish() callbacks to use GPAs instead of
virtual addresses is doable, but makes the code more complex.
The detection of a vmalloc address must be done in
__set_memory_enc_pgtable() where the private<->shared
conversion is done, with looping over the individual pages.  This
is what Dexuan Cui's patch [1] for the TDX implementation of the
callbacks does to handle vmalloc addresses.

There are three implementations of the callbacks:  TDX, SEV-SNP,
and SEV-SNP + vTOM (for Hyper-V).  Changing the TDX callbacks to use
GPAs is easy.  Changing the SEV-SNP and SEV-SNP + vTOM callbacks is
messier because in both cases the underlying hypercalls support
supplying a batch of multiple GPAs that aren't necessarily contiguous
(TDX only supports a single contiguous range).  Batching helps the
performance of the vmalloc case, but the  top-level loop in
__set_memory_enc_pgtable() must provide a batch of GPAs instead
of doing them one-at-a-time.  The callback implementations must also
loop because the batch size they support might be different from
what __set_memory_enc_pgtable() supplies.

All of that is doable.  But adding the complexity doesn't seem like
the right tradeoff vs. just documenting that slow_virt_to_phys() is 
expected to work even if the PTE is not present, especially since
lookup_address() already has that requirement.  And when
we're doing private<->shared conversions, we really do know the
state of the PTEs even after the present bit is cleared.

If it would help to see the actual code for the callbacks to use
GPAs instead of virtual addresses, I'll do that work.  But I'm
hoping just using slow_virt_to_phys() will be judged to be the
better solution.

Michael

[1] https://lore.kernel.org/lkml/20230811214826.9609-3-decui@microsoft.com/

> >
> >
> > > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> > > ---
> > >
> > > I'm assuming the TDX handling of the data cache flushing is the
> > > same with this new approach, and that it doesn't need to be paired
> > > with a TLB flush as in the current code.  If that's not a correct
> > > assumption, let me know.
> 
> Kirill -- could you comment on the above?  I don't fully understand
> the TDX scenarios that need data cache flushing.
> 
> > >
> > > I've left the two hypervisor callbacks, before and after Step 5
> > > above. If the PTEs are invalid, it seems like the order of Step 5
> > > and Step 6 shouldn't matter, so perhaps one of the callbacks could
> > > be dropped.  Let me know if there are reasons to do otherwise.
> > >
> > > It may well be possible to optimize the new implementation of
> > > __set_memory_enc_pgtable(). The existing set_memory_np() and
> > > set_memory_p() functions do all the right things in a very clear
> > > fashion, but perhaps not as optimally as having all three PTE
> > > manipulations directly in the same function. It doesn't appear
> > > that optimizing the performance really matters here, but I'm open
> > > to suggestions.
> > >
> > > I've tested this on TDX VMs and SEV-SNP + vTOM VMs.  I can also
> > > test on SEV-SNP VMs without vTOM. But I'll probably need help
> > > covering SEV and SEV-ES VMs.
> > >
> > > This RFC patch does *not* remove code that would no longer be
> > > needed. If there's agreement to take this new approach, I'll
> > > add patches to remove the unneeded code.
> > >
> > > This patch is built against linux-next20230704.
> > >
> > >  arch/x86/hyperv/ivm.c        |  3 ++-
> > >  arch/x86/kernel/sev.c        |  2 +-
> > >  arch/x86/mm/pat/set_memory.c | 32 ++++++++++++--------------------
> > >  3 files changed, 15 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> > > index 28be6df..2859ec3 100644
> > > --- a/arch/x86/hyperv/ivm.c
> > > +++ b/arch/x86/hyperv/ivm.c
> > > @@ -308,7 +308,8 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> > >  		return false;
> > >
> > >  	for (i = 0, pfn = 0; i < pagecount; i++) {
> > > -		pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
> > > +		pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
> > > +					i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;
> > >  		pfn++;
> > >
> > >  		if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
> > > diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> > > index 1ee7bed..59db55e 100644
> > > --- a/arch/x86/kernel/sev.c
> > > +++ b/arch/x86/kernel/sev.c
> > > @@ -784,7 +784,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
> > >  		hdr->end_entry = i;
> > >
> > >  		if (is_vmalloc_addr((void *)vaddr)) {
> > > -			pfn = vmalloc_to_pfn((void *)vaddr);
> > > +			pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
> > >  			use_large_entry = false;
> > >  		} else {
> > >  			pfn = __pa(vaddr) >> PAGE_SHIFT;
> > > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > > index bda9f12..8a194c7 100644
> > > --- a/arch/x86/mm/pat/set_memory.c
> > > +++ b/arch/x86/mm/pat/set_memory.c
> > > @@ -2136,6 +2136,11 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> > >  	if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> > >  		addr &= PAGE_MASK;
> > >
> > > +	/* set_memory_np() removes aliasing mappings and flushes the TLB */
> > > +	ret = set_memory_np(addr, numpages);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > >  	memset(&cpa, 0, sizeof(cpa));
> > >  	cpa.vaddr = &addr;
> > >  	cpa.numpages = numpages;
> > > @@ -2143,36 +2148,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> > >  	cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> > >  	cpa.pgd = init_mm.pgd;
> > >
> > > -	/* Must avoid aliasing mappings in the highmem code */
> > > -	kmap_flush_unused();
> > > -	vm_unmap_aliases();
> >
> >
> > Why did you drop this?
> 
> Both functions are already called by set_memory_np() ->
> change_page_attr_clear() -> change_page_attr_set_clr().
> 
> >
> > > -
> > >  	/* Flush the caches as needed before changing the encryption attribute. */
> > > -	if (x86_platform.guest.enc_tlb_flush_required(enc))
> > > -		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> > > +	if (x86_platform.guest.enc_cache_flush_required())
> > > +		cpa_flush(&cpa, 1);
> >
> > Do you remove the last enc_cache_flush_required() user?
> 
> Yes, I think so.  As I commented above, this RFC version doesn't remove
> all the unneeded code.  If there's agreement to move forward, I'll do that.
> 
> >
> > >  	/* Notify hypervisor that we are about to set/clr encryption attribute. */
> > >  	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> > >  		return -EIO;
> > >
> > >  	ret = __change_page_attr_set_clr(&cpa, 1);
> > > -
> > > -	/*
> > > -	 * After changing the encryption attribute, we need to flush TLBs again
> > > -	 * in case any speculative TLB caching occurred (but no need to flush
> > > -	 * caches again).  We could just use cpa_flush_all(), but in case TLB
> > > -	 * flushing gets optimized in the cpa_flush() path use the same logic
> > > -	 * as above.
> > > -	 */
> > > -	cpa_flush(&cpa, 0);
> > > +	if (ret)
> > > +		return ret;
> > >
> > >  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> > > -	if (!ret) {
> > > -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > > -			ret = -EIO;
> > > -	}
> > > +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> > > +		return -EIO;
> > >
> > > -	return ret;
> > > +	return set_memory_p(&addr, numpages);
> > >  }
> > >
> > >  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> > > --
> > > 1.8.3.1
> > >
> >
> > --
> >   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-28 14:22     ` Michael Kelley (LINUX)
@ 2023-08-28 16:13       ` kirill.shutemov
  2023-08-28 21:00         ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 17+ messages in thread
From: kirill.shutemov @ 2023-08-28 16:13 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> From: Michael Kelley (LINUX) <mikelley@microsoft.com> Sent: Tuesday, August 15, 2023 7:54 PM
> > 
> > From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Sunday,
> > August 6, 2023 3:20 PM
> > >
> > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > be notified of the change. Because there are two separate steps, there's
> > > > a window where the settings are inconsistent.  Normally the code that
> > > > initiates the transition (via set_memory_decrypted() or
> > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > during a transition, so the window of inconsistency is not a problem.
> > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > pages at arbitrary times, which could access a transitioning page during
> > > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > > (depending on the CoCo architecture in use).  Current code in those
> > > > exception handlers recovers and does "fixup" on the result returned by
> > > > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > > > fixup code is tricky and somewhat fragile.  At the moment, it is
> > > > broken for both TDX and SEV-SNP.
> > >
> > 
> > Thanks for looking at this.  I'm finally catching up after being out on
> > vacation for a week.
> > 
> > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > 
> > I presume you meant "now fixed for TDX", which I agree with.  Tom
> > Lendacky has indicated that there's still a problem with SEV-SNP.   He
> > could fix that problem, but this approach of marking the pages
> > invalid obviates the need for Tom's fix.
> > 
> > >
> > > > There's also a problem with the current code in paravisor scenarios:
> > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > are no architectural specs for how to do that.
> > 
> > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
> > is sent to the containing VM, not the main Linux guest VM.  For
> > everything except an EPT violation, the containing VM can handle
> > the exception on behalf of the main Linux guest by doing the
> > appropriate emulation.  But for an EPT violation, the containing
> > VM can only terminate the guest.  It doesn't have sufficient context
> > to handle a "valid" #VE with EPT violation generated due to
> > load_unaligned_zeropad().  My proposed scheme of marking the
> > pages invalid avoids generating those #VEs and lets TD Partitioning
> > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > 
> > > >
> > > > To avoid these complexities of the CoCo exception handlers, change
> > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > following:
> > > >
> > > > 1.  Remove aliasing mappings
> > > > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > 3.  Flush the TLB globally
> > > > 4.  Flush the data cache if needed
> > > > 5.  Set/clear the encryption attribute as appropriate
> > > > 6.  Notify the hypervisor of the page status change
> > > > 7.  Add back the PRESENT bit
> > >
> > > Okay, looks safe.
> > >
> > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > are completely decoupled.  CoCo VM page transitions can proceed
> > > > without needing to handle architecture-specific exceptions and fix
> > > > things up. This decoupling reduces the complexity due to separate
> > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > architectures. Paravisor scenarios work properly without needing
> > > > to forward exceptions.
> > > >
> > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > because of touching the PTEs three times instead of just once. But
> > > > the run time of this function is already dominated by the hypercall
> > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > case, this function is only used for CoCo VM page transitions, and
> > > > is already unsuitable for hot paths.
> > > >
> > > > The architecture specific callback function for notifying the
> > > > hypervisor typically must translate guest kernel virtual addresses
> > > > into guest physical addresses to pass to the hypervisor.  Because
> > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > translation needs updating.  virt_to_phys() or equivalent continues
> > > > to work for direct map addresses.  But vmalloc addresses cannot use
> > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > manually walk the page table hierarchy, so performance is the same.
> > >
> > > Uhmm.. But why do we expected slow_virt_to_phys() to work on non-present
> > > page table entries? It seems accident for me that it works now. Somebody
> > > forgot pte_present() check.
> > 
> > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > 
> 
> The history of slow_virt_to_phys() doesn't show any discussion of
> whether it is expected to work for non-present PTEs.  However, the
> page table walking is done by lookup_address(), which explicitly
> *does* work for non-present PTEs.  For example, lookup_address()
> is used in cpa_flush() to find a PTE.  cpa_flush() then checks to see if
> the PTE is present.

Which is a totally fine thing to do. The present bit is the only info you
can always rely on to be valid in a non-present PTE.

But this is nitpicking on my side. Here you control the lifecycle of the
PTE, so you know that the PFN will stay valid for the PTE.

I guess the right thing to do is to use lookup_address() and get the pfn
from the PTE, with a comment on why it is valid.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-28 16:13       ` kirill.shutemov
@ 2023-08-28 21:00         ` Michael Kelley (LINUX)
  2023-08-28 22:13           ` kirill.shutemov
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-08-28 21:00 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Monday, August 28, 2023 9:13 AM
> 
> On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> > From: Michael Kelley (LINUX) <mikelley@microsoft.com> Sent: Tuesday, August 15, 2023 7:54 PM
> > >
> > > From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Sunday,
> > > August 6, 2023 3:20 PM
> > > >
> > > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > > be notified of the change. Because there are two separate steps, there's
> > > > > a window where the settings are inconsistent.  Normally the code that
> > > > > initiates the transition (via set_memory_decrypted() or
> > > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > > during a transition, so the window of inconsistency is not a problem.
> > > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > > pages at arbitrary times, which could access a transitioning page during
> > > > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > > > (depending on the CoCo architecture in use).  Current code in those
> > > > > exception handlers recovers and does "fixup" on the result returned by
> > > > > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > > > > fixup code is tricky and somewhat fragile.  At the moment, it is
> > > > > broken for both TDX and SEV-SNP.
> > > >
> > >
> > > Thanks for looking at this.  I'm finally catching up after being out on
> > > vacation for a week.
> > >
> > > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > >
> > > I presume you meant "now fixed for TDX", which I agree with.  Tom
> > > Lendacky has indicated that there's still a problem with SEV-SNP.   He
> > > could fix that problem, but this approach of marking the pages
> > > invalid obviates the need for Tom's fix.
> > >
> > > >
> > > > > There's also a problem with the current code in paravisor scenarios:
> > > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > > are no architectural specs for how to do that.
> > >
> > > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > > what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
> > > is sent to the containing VM, not the main Linux guest VM.  For
> > > everything except an EPT violation, the containing VM can handle
> > > the exception on behalf of the main Linux guest by doing the
> > > appropriate emulation.  But for an EPT violation, the containing
> > > VM can only terminate the guest.  It doesn't have sufficient context
> > > to handle a "valid" #VE with EPT violation generated due to
> > > load_unaligned_zeropad().  My proposed scheme of marking the
> > > pages invalid avoids generating those #VEs and lets TD Partitioning
> > > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > >
> > > > >
> > > > > To avoid these complexities of the CoCo exception handlers, change
> > > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > > following:
> > > > >
> > > > > 1.  Remove aliasing mappings
> > > > > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > > 3.  Flush the TLB globally
> > > > > 4.  Flush the data cache if needed
> > > > > 5.  Set/clear the encryption attribute as appropriate
> > > > > 6.  Notify the hypervisor of the page status change
> > > > > 7.  Add back the PRESENT bit
> > > >
> > > > Okay, looks safe.
> > > >
> > > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > > are completely decoupled.  CoCo VM page transitions can proceed
> > > > > without needing to handle architecture-specific exceptions and fix
> > > > > things up. This decoupling reduces the complexity due to separate
> > > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > > architectures. Paravisor scenarios work properly without needing
> > > > > to forward exceptions.
> > > > >
> > > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > > because of touching the PTEs three times instead of just once. But
> > > > > the run time of this function is already dominated by the hypercall
> > > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > > case, this function is only used for CoCo VM page transitions, and
> > > > > is already unsuitable for hot paths.
> > > > >
> > > > > The architecture specific callback function for notifying the
> > > > > hypervisor typically must translate guest kernel virtual addresses
> > > > > into guest physical addresses to pass to the hypervisor.  Because
> > > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > > translation needs updating.  virt_to_phys() or equivalent continues
> > > > > to work for direct map addresses.  But vmalloc addresses cannot use
> > > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > > manually walk the page table hierarchy, so performance is the same.
> > > >
> > > > Uhmm.. But why do we expected slow_virt_to_phys() to work on non-present
> > > > page table entries? It seems accident for me that it works now. Somebody
> > > > forgot pte_present() check.
> > >
> > > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > >
> >
> > The history of slow_virt_to_phys() doesn't show any discussion of
> > whether it is expected to work for non-present PTEs.  However, the
> > page table walking is done by lookup_address(), which explicitly
> > *does* work for non-present PTEs.  For example, lookup_address()
> > is used in cpa_flush() to find a PTE.  cpa_flush() then checks to see if
> > the PTE is present.
> 
> Which is totally fine thing to do. Present bit is the only info you can
> always rely to be valid for non-present PTE.
> 
> But it is nitpicking on my side. Here you control lifecycle of the PTE, so
> you know that PFN will stay valid for the PTE.
> 
> I guess the right thing to do is to use lookup_address() and get pfn from
> the PTE with the comment why it is valid.

Each of the three implementations of the enc_status_change_prepare()/
enc_status_change_finish() callbacks needs to translate from vaddr to
pfn, and so would use lookup_address().  For convenience, I would
create a helper function that wraps lookup_address() and returns
a PFN.  This helper function would be very similar to slow_virt_to_phys(),
except without the shift that creates a phys_addr_t from the PFN and
without adding back the offset within the page.  Is that the approach you
had in mind?
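
Roughly like this, where the helper name is just a placeholder and the
large page levels would need the same handling as in slow_virt_to_phys()
(sketch only, not tested):

	static unsigned long enc_virt_to_pfn(unsigned long vaddr)
	{
		unsigned int level;
		pte_t *pte = lookup_address(vaddr, &level);

		BUG_ON(!pte);

		/*
		 * The leaf PTE may have PRESENT clear while a
		 * private<->shared transition is in progress, but the PFN
		 * in it is still valid because this path controls the
		 * lifecycle of the PTE.
		 */
		return pte_pfn(*pte);
	}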

Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-28 21:00         ` Michael Kelley (LINUX)
@ 2023-08-28 22:13           ` kirill.shutemov
  2023-08-28 23:23             ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 17+ messages in thread
From: kirill.shutemov @ 2023-08-28 22:13 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

On Mon, Aug 28, 2023 at 09:00:03PM +0000, Michael Kelley (LINUX) wrote:
> From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Monday, August 28, 2023 9:13 AM
> > 
> > On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> > > From: Michael Kelley (LINUX) <mikelley@microsoft.com> Sent: Tuesday, August 15, 2023 7:54 PM
> > > >
> > > > From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Sunday,
> > > > August 6, 2023 3:20 PM
> > > > >
> > > > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > > > be notified of the change. Because there are two separate steps, there's
> > > > > > a window where the settings are inconsistent.  Normally the code that
> > > > > > initiates the transition (via set_memory_decrypted() or
> > > > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > > > during a transition, so the window of inconsistency is not a problem.
> > > > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > > > pages at arbitrary times, which could access a transitioning page during
> > > > > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > > > > (depending on the CoCo architecture in use).  Current code in those
> > > > > > exception handlers recovers and does "fixup" on the result returned by
> > > > > > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > > > > > fixup code is tricky and somewhat fragile.  At the moment, it is
> > > > > > broken for both TDX and SEV-SNP.
> > > > >
> > > >
> > > > Thanks for looking at this.  I'm finally catching up after being out on
> > > > vacation for a week.
> > > >
> > > > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > > >
> > > > I presume you meant "now fixed for TDX", which I agree with.  Tom
> > > > Lendacky has indicated that there's still a problem with SEV-SNP.   He
> > > > could fix that problem, but this approach of marking the pages
> > > > invalid obviates the need for Tom's fix.
> > > >
> > > > >
> > > > > > There's also a problem with the current code in paravisor scenarios:
> > > > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > > > are no architectural specs for how to do that.
> > > >
> > > > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > > > what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
> > > > is sent to the containing VM, not the main Linux guest VM.  For
> > > > everything except an EPT violation, the containing VM can handle
> > > > the exception on behalf of the main Linux guest by doing the
> > > > appropriate emulation.  But for an EPT violation, the containing
> > > > VM can only terminate the guest.  It doesn't have sufficient context
> > > > to handle a "valid" #VE with EPT violation generated due to
> > > > load_unaligned_zeropad().  My proposed scheme of marking the
> > > > pages invalid avoids generating those #VEs and lets TD Partitioning
> > > > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > > >
> > > > > >
> > > > > > To avoid these complexities of the CoCo exception handlers, change
> > > > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > > > following:
> > > > > >
> > > > > > 1.  Remove aliasing mappings
> > > > > > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > > > 3.  Flush the TLB globally
> > > > > > 4.  Flush the data cache if needed
> > > > > > 5.  Set/clear the encryption attribute as appropriate
> > > > > > 6.  Notify the hypervisor of the page status change
> > > > > > 7.  Add back the PRESENT bit
> > > > >
> > > > > Okay, looks safe.
> > > > >
> > > > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > > > are completely decoupled.  CoCo VM page transitions can proceed
> > > > > > without needing to handle architecture-specific exceptions and fix
> > > > > > things up. This decoupling reduces the complexity due to separate
> > > > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > > > architectures. Paravisor scenarios work properly without needing
> > > > > > to forward exceptions.
> > > > > >
> > > > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > > > because of touching the PTEs three times instead of just once. But
> > > > > > the run time of this function is already dominated by the hypercall
> > > > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > > > case, this function is only used for CoCo VM page transitions, and
> > > > > > is already unsuitable for hot paths.
> > > > > >
> > > > > > The architecture specific callback function for notifying the
> > > > > > hypervisor typically must translate guest kernel virtual addresses
> > > > > > into guest physical addresses to pass to the hypervisor.  Because
> > > > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > > > translation needs updating.  virt_to_phys() or equivalent continues
> > > > > > to work for direct map addresses.  But vmalloc addresses cannot use
> > > > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > > > manually walk the page table hierarchy, so performance is the same.
> > > > >
> > > > > Uhmm.. But why do we expected slow_virt_to_phys() to work on non-present
> > > > > page table entries? It seems accident for me that it works now. Somebody
> > > > > forgot pte_present() check.
> > > >
> > > > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > > >
> > >
> > > The history of slow_virt_to_phys() doesn't show any discussion of
> > > whether it is expected to work for non-present PTEs.  However, the
> > > page table walking is done by lookup_address(), which explicitly
> > > *does* work for non-present PTEs.  For example, lookup_address()
> > > is used in cpa_flush() to find a PTE.  cpa_flush() then checks to see if
> > > the PTE is present.
> > 
> > Which is totally fine thing to do. Present bit is the only info you can
> > always rely to be valid for non-present PTE.
> > 
> > But it is nitpicking on my side. Here you control lifecycle of the PTE, so
> > you know that PFN will stay valid for the PTE.
> > 
> > I guess the right thing to do is to use lookup_address() and get pfn from
> > the PTE with the comment why it is valid.
> 
> Each of the three implementations of the enc_status_change_prepare()/
> enc_status_change_finish() callbacks needs to translate from vaddr to
> pfn, and so would use lookup_address().  For convenience, I would
> create a helper function that wraps lookup_address() and returns
> a PFN.  This helper function would be very similar to slow_virt_to_phys()
> except for removing the shift to create a phys_addr_t from the PFN,
> and adding back the offset within the page.  Is that the approach you
> had in mind?

I guess in this case it is better to use slow_virt_to_phys(), but have a
comment somewhere about why it is safe. Maybe also add a comment inside
slow_virt_to_phys() to indicate that we don't check the present bit for a
reason.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-28 22:13           ` kirill.shutemov
@ 2023-08-28 23:23             ` Michael Kelley (LINUX)
  2023-08-28 23:57               ` kirill.shutemov
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-08-28 23:23 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Monday, August 28, 2023 3:14 PM
> 
> On Mon, Aug 28, 2023 at 09:00:03PM +0000, Michael Kelley (LINUX) wrote:
> > From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Monday, August 28, 2023 9:13 AM
> > >
> > > On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> > > > From: Michael Kelley (LINUX) <mikelley@microsoft.com> Sent: Tuesday, August 15, 2023 7:54 PM
> > > > >
> > > > > From: kirill.shutemov@linux.intel.com <kirill.shutemov@linux.intel.com> Sent: Sunday,
> > > > > August 6, 2023 3:20 PM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > > > > be notified of the change. Because there are two separate steps, there's
> > > > > > > a window where the settings are inconsistent.  Normally the code that
> > > > > > > initiates the transition (via set_memory_decrypted() or
> > > > > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > > > > during a transition, so the window of inconsistency is not a problem.
> > > > > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > > > > pages at arbitrary times, which could access a transitioning page during
> > > > > > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > > > > > (depending on the CoCo architecture in use).  Current code in those
> > > > > > > exception handlers recovers and does "fixup" on the result returned by
> > > > > > > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > > > > > > fixup code is tricky and somewhat fragile.  At the moment, it is
> > > > > > > broken for both TDX and SEV-SNP.
> > > > > >
> > > > >
> > > > > Thanks for looking at this.  I'm finally catching up after being out on
> > > > > vacation for a week.
> > > > >
> > > > > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > > > >
> > > > > I presume you meant "now fixed for TDX", which I agree with.  Tom
> > > > > Lendacky has indicated that there's still a problem with SEV-SNP.   He
> > > > > could fix that problem, but this approach of marking the pages
> > > > > invalid obviates the need for Tom's fix.
> > > > >
> > > > > >
> > > > > > > There's also a problem with the current code in paravisor scenarios:
> > > > > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > > > > are no architectural specs for how to do that.
> > > > >
> > > > > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > > > > what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
> > > > > is sent to the containing VM, not the main Linux guest VM.  For
> > > > > everything except an EPT violation, the containing VM can handle
> > > > > the exception on behalf of the main Linux guest by doing the
> > > > > appropriate emulation.  But for an EPT violation, the containing
> > > > > VM can only terminate the guest.  It doesn't have sufficient context
> > > > > to handle a "valid" #VE with EPT violation generated due to
> > > > > load_unaligned_zeropad().  My proposed scheme of marking the
> > > > > pages invalid avoids generating those #VEs and lets TD Partitioning
> > > > > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > > > >
> > > > > > >
> > > > > > > To avoid these complexities of the CoCo exception handlers, change
> > > > > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > > > > following:
> > > > > > >
> > > > > > > 1.  Remove aliasing mappings
> > > > > > > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > > > > 3.  Flush the TLB globally
> > > > > > > 4.  Flush the data cache if needed
> > > > > > > 5.  Set/clear the encryption attribute as appropriate
> > > > > > > 6.  Notify the hypervisor of the page status change
> > > > > > > 7.  Add back the PRESENT bit
> > > > > >
> > > > > > Okay, looks safe.
> > > > > >
> > > > > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > > > > are completely decoupled.  CoCo VM page transitions can proceed
> > > > > > > without needing to handle architecture-specific exceptions and fix
> > > > > > > things up. This decoupling reduces the complexity due to separate
> > > > > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > > > > architectures. Paravisor scenarios work properly without needing
> > > > > > > to forward exceptions.
> > > > > > >
> > > > > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > > > > because of touching the PTEs three times instead of just once. But
> > > > > > > the run time of this function is already dominated by the hypercall
> > > > > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > > > > case, this function is only used for CoCo VM page transitions, and
> > > > > > > is already unsuitable for hot paths.
> > > > > > >
> > > > > > > The architecture specific callback function for notifying the
> > > > > > > hypervisor typically must translate guest kernel virtual addresses
> > > > > > > into guest physical addresses to pass to the hypervisor.  Because
> > > > > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > > > > translation needs updating.  virt_to_phys() or equivalent continues
> > > > > > > to work for direct map addresses.  But vmalloc addresses cannot use
> > > > > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > > > > manually walk the page table hierarchy, so performance is the same.
> > > > > >
> > > > > > Uhmm.. But why do we expected slow_virt_to_phys() to work on non-present
> > > > > > page table entries? It seems accident for me that it works now. Somebody
> > > > > > forgot pte_present() check.
> > > > >
> > > > > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > > > >
> > > >
> > > > The history of slow_virt_to_phys() doesn't show any discussion of
> > > > whether it is expected to work for non-present PTEs.  However, the
> > > > page table walking is done by lookup_address(), which explicitly
> > > > *does* work for non-present PTEs.  For example, lookup_address()
> > > > is used in cpa_flush() to find a PTE.  cpa_flush() then checks to see if
> > > > the PTE is present.
> > >
> > > Which is totally fine thing to do. Present bit is the only info you can
> > > always rely to be valid for non-present PTE.
> > >
> > > But it is nitpicking on my side. Here you control lifecycle of the PTE, so
> > > you know that PFN will stay valid for the PTE.
> > >
> > > I guess the right thing to do is to use lookup_address() and get pfn from
> > > the PTE with the comment why it is valid.
> >
> > Each of the three implementations of the enc_status_change_prepare()/
> > enc_status_change_finish() callbacks needs to translate from vaddr to
> > pfn, and so would use lookup_address().  For convenience, I would
> > create a helper function that wraps lookup_address() and returns
> > a PFN.  This helper function would be very similar to slow_virt_to_phys()
> > except for removing the shift to create a phys_addr_t from the PFN,
> > and adding back the offset within the page.  Is that the approach you
> > had in mind?
> 
> I guess in this case it is better to use slow_virt_to_phys(), but have a
> comment somewhere why it is safe. Maybe also add comment inside
> slow_virt_to_phys() to indicate that we don't check present bit for a
> reason.
> 

OK, works for me.  I'll turn my original RFC patch into a small patch
set that's pretty much as the RFC version is coded, along with the
appropriate comments in slow_virt_to_phys().  Additional patches
in the set will remove code that becomes unnecessary or that can
be simplified.
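
For slow_virt_to_phys(), the comment would be something along these
lines (exact wording to be refined):

	/*
	 * Deliberately no pte_present() check here: callers such as the
	 * CoCo private<->shared transition code look up the PFN while
	 * the PRESENT bit is temporarily cleared, and depend on the PFN
	 * in the non-present PTE still being valid.
	 */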

Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-28 23:23             ` Michael Kelley (LINUX)
@ 2023-08-28 23:57               ` kirill.shutemov
  0 siblings, 0 replies; 17+ messages in thread
From: kirill.shutemov @ 2023-08-28 23:57 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu, Dexuan Cui, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, thomas.lendacky,
	sathyanarayanan.kuppuswamy, seanjc, rick.p.edgecombe,
	linux-kernel, linux-hyperv, x86

On Mon, Aug 28, 2023 at 11:23:55PM +0000, Michael Kelley (LINUX) wrote:
> > I guess in this case it is better to use slow_virt_to_phys(), but have a
> > comment somewhere why it is safe. Maybe also add comment inside
> > slow_virt_to_phys() to indicate that we don't check present bit for a
> > reason.
> > 
> 
> OK, works for me.  I'll turn my original RFC patch into a small patch
> set that's pretty much as the RFC version is coded, along with the
> appropriate comments in slow_virt_to_phys().  Additional patches
> in the set will remove code that becomes unnecessary or that can
> be simplified.

Oh, I just realized that it conflicts with another patchset that I am
currently working on. This patchset adds kexec support for TDX. One thing
I do in this patchset is revert all shared pages to private during kexec,
so that the next kernel can use them[1]. I determine the shared status by
checking the shared bit in the kernel page tables.

If I understand correctly, your change will work with my patch because
my patch is buggy and does not check the present bit (irony detected).

Ugh. This gets messy.

[1] https://github.com/intel/tdx/commit/4b63531fc315551c3c42023ea655206c77eef5a1
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-07-06 16:41 [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared Michael Kelley
  2023-08-02 21:57 ` Tom Lendacky
  2023-08-06 22:19 ` kirill.shutemov
@ 2023-08-30  0:02 ` Edgecombe, Rick P
  2023-08-30  3:33   ` Michael Kelley (LINUX)
  2023-08-30 23:40 ` Edgecombe, Rick P
  3 siblings, 1 reply; 17+ messages in thread
From: Edgecombe, Rick P @ 2023-08-30  0:02 UTC (permalink / raw)
  To: linux-hyperv, Lutomirski, Andy, dave.hansen, thomas.lendacky,
	haiyangz, kirill.shutemov, Christopherson,,
	Sean, mingo, linux-kernel, kys, Cui, Dexuan, mikelley, hpa,
	peterz, wei.liu, bp, sathyanarayanan.kuppuswamy, tglx, x86

On Thu, 2023-07-06 at 09:41 -0700, Michael Kelley wrote:
> In a CoCo VM when a page transitions from private to shared, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor
> must
> be notified of the change. Because there are two separate steps,
> there's
> a window where the settings are inconsistent.  Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary
> memory
> pages at arbitrary times, which could access a transitioning page
> during
> the window.  In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use).  Current code in those
> exception handlers recovers and does "fixup" on the result returned
> by
> load_unaligned_zeropad().  Unfortunately, this exception handling and
> fixup code is tricky and somewhat fragile.  At the moment, it is
> broken for both TDX and SEV-SNP.
> 
> There's also a problem with the current code in paravisor scenarios:
> TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> to be forwarded from the paravisor to the Linux guest, but there
> are no architectural specs for how to do that.
> 
> To avoid these complexities of the CoCo exception handlers, change
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
> 
> 1.  Remove aliasing mappings
> 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> 3.  Flush the TLB globally
> 4.  Flush the data cache if needed
> 5.  Set/clear the encryption attribute as appropriate
> 6.  Notify the hypervisor of the page status change
> 7.  Add back the PRESENT bit
> 
> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is
> transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled.  CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
> 
> This approach may make __set_memory_enc_pgtable() slightly slower
> because of touching the PTEs three times instead of just once. But
> the run time of this function is already dominated by the hypercall
> and the need to flush the TLB at least once and maybe twice. In any
> case, this function is only used for CoCo VM page transitions, and
> is already unsuitable for hot paths.

Excluding vm_unmap_aliases(), and just looking at the TLB flushes, it
kind of looks like this:
1. Clear present
2. TLB flush
3. Set C bit
4. Set Present bit
5. TLB flush

But if you instead did:
1. Clear Present and set C bit
2. TLB flush
3. Set Present bit (no flush)

Then you could still have only 1 TLB flush and 2 operations instead of
3. Apart from that, it gets the same load_unaligned_zeropad() benefits you
are looking for, I think. But I'm not very educated on the private<->shared
conversion HW rules, so maybe not.
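
In __set_memory_enc_pgtable() terms, I imagine that's roughly folding
_PAGE_PRESENT into the existing clear mask (very rough and untested; I
may well be missing a HW ordering rule here):

	cpa.mask_set = enc ? pgprot_encrypted(empty) : pgprot_decrypted(empty);
	cpa.mask_clr = __pgprot(pgprot_val(enc ? pgprot_decrypted(empty) :
					pgprot_encrypted(empty)) | _PAGE_PRESENT);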

> 
> The architecture specific callback function for notifying the
> hypervisor typically must translate guest kernel virtual addresses
> into guest physical addresses to pass to the hypervisor.  Because
> the PTEs are invalid at the time of callback, the code for doing the
> translation needs updating.  virt_to_phys() or equivalent continues
> to work for direct map addresses.  But vmalloc addresses cannot use
> vmalloc_to_page() because that function requires the leaf PTE to be
> valid. Instead, slow_virt_to_phys() must be used. Both functions
> manually walk the page table hierarchy, so performance is the same.

Just curious. Are vmalloc addresses supported here? It looks like in
SEV, but not TDX.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-30  0:02 ` Edgecombe, Rick P
@ 2023-08-30  3:33   ` Michael Kelley (LINUX)
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-08-30  3:33 UTC (permalink / raw)
  To: Edgecombe, Rick P, linux-hyperv, Lutomirski, Andy, dave.hansen,
	thomas.lendacky, Haiyang Zhang, kirill.shutemov, Christopherson,,
	Sean, mingo, linux-kernel, KY Srinivasan, Dexuan Cui, hpa,
	peterz, wei.liu, bp, sathyanarayanan.kuppuswamy, tglx, x86

From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Tuesday, August 29, 2023 5:03 PM
> 
> On Thu, 2023-07-06 at 09:41 -0700, Michael Kelley wrote:
> > In a CoCo VM when a page transitions from private to shared, or vice
> > versa, attributes in the PTE must be updated *and* the hypervisor must
> > be notified of the change. Because there are two separate steps, there's
> > a window where the settings are inconsistent.  Normally the code that
> > initiates the transition (via set_memory_decrypted() or
> > set_memory_encrypted()) ensures that the memory is not being accessed
> > during a transition, so the window of inconsistency is not a problem.
> > However, the load_unaligned_zeropad() function can read arbitrary memory
> > pages at arbitrary times, which could access a transitioning page during
> > the window.  In such a case, CoCo VM specific exceptions are taken
> > (depending on the CoCo architecture in use).  Current code in those
> > exception handlers recovers and does "fixup" on the result returned by
> > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > fixup code is tricky and somewhat fragile.  At the moment, it is
> > broken for both TDX and SEV-SNP.
> >
> > There's also a problem with the current code in paravisor scenarios:
> > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > to be forwarded from the paravisor to the Linux guest, but there
> > are no architectural specs for how to do that.
> >
> > To avoid these complexities of the CoCo exception handlers, change
> > the core transition code in __set_memory_enc_pgtable() to do the
> > following:
> >
> > 1.  Remove aliasing mappings
> > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > 3.  Flush the TLB globally
> > 4.  Flush the data cache if needed
> > 5.  Set/clear the encryption attribute as appropriate
> > 6.  Notify the hypervisor of the page status change
> > 7.  Add back the PRESENT bit
> >
> > With this approach, load_unaligned_zeropad() just takes its normal
> > page-fault-based fixup path if it touches a page that is transitioning.
> > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > are completely decoupled.  CoCo VM page transitions can proceed
> > without needing to handle architecture-specific exceptions and fix
> > things up. This decoupling reduces the complexity due to separate
> > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > introduce new capabilities in future versions of the TDX and SEV-SNP
> > architectures. Paravisor scenarios work properly without needing
> > to forward exceptions.
> >
> > This approach may make __set_memory_enc_pgtable() slightly slower
> > because of touching the PTEs three times instead of just once. But
> > the run time of this function is already dominated by the hypercall
> > and the need to flush the TLB at least once and maybe twice. In any
> > case, this function is only used for CoCo VM page transitions, and
> > is already unsuitable for hot paths.
> 
> Excluding vm_unmap_aliases(), and just looking at the TLB flushes, it
> kind of looks like this:
> 1. Clear present
> 2. TLB flush
> 3. Set C bit

This step is either "set C bit" or "clear C bit", depending on whether
the transition is from shared->private, or private->shared.

> 4. Set Present bit
> 5. TLB flush

I was originally thinking that set_memory_p() would be smart
enough to realize that just setting the PRESENT bit doesn't
require flushing the TLB.  But looking at the code now, that's
wrong.  __change_page_attr() will set the CPA_FLUSHTLB flag
even when the only change is to set the PRESENT bit.
change_page_attr_set_clr() will then act on the CPA_FLUSHTLB
flag and do the flush.

> 
> But if you instead did:
> 1. Clear Present and set C bit

Interesting idea.   As noted above, it could be either
setting or clearing the C bit.   Currently there are some things
done before the C bit set/clear, such as flushing the data
cache on TDX.  I don't fully understand that requirement, so
I'm not sure of its actual sequencing needs.  Hopefully
someone with TDX expertise can clarify.  The current
code also allows callbacks pre and post set/clear of the
C bit, though I was considering simplifying to only
having a post callback.

> 2. TLB flush
> 3. Set Present bit (no flush)

I'll look at whether there's a way to prevent the flush
without creating an alternate version of
change_page_attr_set_clr().   Maybe set_memory_p()
could pass in a new CPA_NO_FLUSHTLB flag that
overrides CPA_FLUSHTLB.

In the bigger picture, the performance of these
private<->shared transitions is dominated by the
hypercalls to sync things with the hypervisor.  The Linux
manipulation of the PTEs is a second order cost, though
the TLB flushes are more expensive.  It might be worth
some effort to avoid the extra TLB flush.

> 
> Then you could still have only 1 TLB flush and 2 operations instead of
> 3. Otherwise it's the same load_unaligned_zeropad() benefits you are
> looking for I think. But I'm not very educated on the private<->shared
> conversion HW rules though, so maybe not.

Eliminating the TLB flush after setting the PRESENT bit seems
worthwhile.  I'm less sure about combining the clearing of
PRESENT and the C bit set/clear.  We might want to keep
both the pre and post callbacks for future flexibility in how
the hypervisor is invoked.  In that case the pre callback
needs to be done between clearing PRESENT and the C
bit set/clear operation, and so they would need to stay
separate.  As I said previously, the manipulation of the
PTEs is not a significant cost in the big picture.

> 
> >
> > The architecture specific callback function for notifying the
> > hypervisor typically must translate guest kernel virtual addresses
> > into guest physical addresses to pass to the hypervisor.  Because
> > the PTEs are invalid at the time of callback, the code for doing the
> > translation needs updating.  virt_to_phys() or equivalent continues
> > to work for direct map addresses.  But vmalloc addresses cannot use
> > vmalloc_to_page() because that function requires the leaf PTE to be
> > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > manually walk the page table hierarchy, so performance is the same.
> 
> Just curious. Are vmalloc addresses supported here? It looks like in
> SEV, but not TDX.

Dexuan Cui has a separate patch to support vmalloc address in TDX.
See https://lore.kernel.org/lkml/20230811214826.9609-3-decui@microsoft.com/

Thanks for looking at this ...

Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-07-06 16:41 [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared Michael Kelley
                   ` (2 preceding siblings ...)
  2023-08-30  0:02 ` Edgecombe, Rick P
@ 2023-08-30 23:40 ` Edgecombe, Rick P
  2023-08-31 17:29   ` Edgecombe, Rick P
  3 siblings, 1 reply; 17+ messages in thread
From: Edgecombe, Rick P @ 2023-08-30 23:40 UTC (permalink / raw)
  To: linux-hyperv, Lutomirski, Andy, dave.hansen, thomas.lendacky,
	haiyangz, kirill.shutemov, Christopherson,,
	Sean, mingo, linux-kernel, kys, Cui, Dexuan, mikelley, hpa,
	peterz, wei.liu, bp, sathyanarayanan.kuppuswamy, tglx, x86

On Thu, 2023-07-06 at 09:41 -0700, Michael Kelley wrote:
> To avoid these complexities of the CoCo exception handlers, change
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
> 
> 1.  Remove aliasing mappings
> 2.  Remove the PRESENT bit from the PTEs of all transitioning pages

This is a bit of an existing problem, but the failure cases of these
set_memory_en/decrypted() operations do not look to be in great
shape. It could fail halfway through if it needs to split the direct
map under memory pressure, in which case some of the callers will see
the error and free the unmapped pages to the direct map. (I was looking
at dma_direct_alloc().) Others just leak the pages.

But the situation before the patch is not much better, since the direct
map change or enc_status_change_prepare/finish() could fail and leave
the pages in an inconsistent state, like this patch is trying to
address.

This lack of rollback on failure for CPA calls requires somewhat odd
handling in all the set_memory() callers. The pattern is to make a CPA call
to restore the previous permission, regardless of the error code
returned by the initial call that failed. The callers depend on any PTE
change successfully made having any needed splits already done for
those PTEs, so the restore can succeed at least as far as the failed
CPA call got.

In this COCO case apparently the enc_status_change_prepare/finish()
could fail too (and maybe not have the same forward progress
behavior?). So I'm not sure what you can do in that case.

I'm also not sure how bad it is to free encryption mismatched pages. Is
it the same as freeing unmapped pages? (likely oops or panic)

> 3.  Flush the TLB globally
> 4.  Flush the data cache if needed
> 5.  Set/clear the encryption attribute as appropriate
> 6.  Notify the hypervisor of the page status change
> 7.  Add back the PRESENT bit


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-30 23:40 ` Edgecombe, Rick P
@ 2023-08-31 17:29   ` Edgecombe, Rick P
  2023-09-01 14:44     ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 17+ messages in thread
From: Edgecombe, Rick P @ 2023-08-31 17:29 UTC (permalink / raw)
  To: linux-hyperv, Lutomirski, Andy, dave.hansen, thomas.lendacky,
	haiyangz, kirill.shutemov, Christopherson,,
	Sean, mingo, linux-kernel, kys, Cui, Dexuan, mikelley, hpa,
	peterz, wei.liu, bp, sathyanarayanan.kuppuswamy, tglx, x86

On Wed, 2023-08-30 at 16:40 -0700, Rick Edgecombe wrote:
> This is a bit of an existing problem, but the failure cases of these
> set_memory_en/decrypted() operations does not look to be in great
> shape. It could fail halfway through if it needs to split the direct
> map under memory pressure, in which case some of the callers will see
> the error and free the unmapped pages to the direct map. (I was
> looking
> at dma_direct_alloc()) Other's just leak the pages.
> 
> But the situation before the patch is not much better, since the
> direct
> map change or enc_status_change_prepare/finish() could fail and leave
> the pages in an inconsistent state, like this patch is trying to
> address.
> 
> This lack of rollback on failure for CPA calls needs particular odd
> handling in all the set_memory() callers. The way is to make a CPA
> call
> to restore it to the previous permission, regardless of the error
> code
> returned in the initial call that failed. The callers depend on any
> PTE
> change successfully made having any needed splits already done for
> those PTEs, so the restore can succeed at least as far as the failed
> CPA call got.

Wait, since this does set_memory_np() as the first step for both
set_memory_encrypted() and set_memory_decrypted(), that pattern in the
callers wouldn't work. I wonder if it should try to roll back itself if
set_memory_np() fails (call set_memory_p() before returning the error).
At least that will handle failures that happen on the guest side.

> 
> In this COCO case apparently the enc_status_change_prepare/finish()
> could fail too (and maybe not have the same forward progress
> behavior?). So I'm not sure what you can do in that case.
> 
> I'm also not sure how bad it is to free encryption mismatched pages.
> Is
> it the same as freeing unmapped pages? (likely oops or panic)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-08-31 17:29   ` Edgecombe, Rick P
@ 2023-09-01 14:44     ` Michael Kelley (LINUX)
  2023-09-01 16:33       ` Edgecombe, Rick P
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley (LINUX) @ 2023-09-01 14:44 UTC (permalink / raw)
  To: Edgecombe, Rick P, linux-hyperv, Lutomirski, Andy, dave.hansen,
	thomas.lendacky, Haiyang Zhang, kirill.shutemov, Christopherson,,
	Sean, mingo, linux-kernel, KY Srinivasan, Dexuan Cui, hpa,
	peterz, wei.liu, bp, sathyanarayanan.kuppuswamy, tglx, x86

From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Thursday, August 31, 2023 10:30 AM
> 
> On Wed, 2023-08-30 at 16:40 -0700, Rick Edgecombe wrote:
> > This is a bit of an existing problem, but the failure cases of these
> > set_memory_en/decrypted() operations do not look to be in great
> > shape. They could fail halfway through if they need to split the
> > direct map under memory pressure, in which case some of the callers
> > will see the error and free the unmapped pages back to the direct map
> > (I was looking at dma_direct_alloc()). Others just leak the pages.

See further comments below.

> >
> > But the situation before the patch is not much better, since the
> > direct map change or enc_status_change_prepare/finish() could fail and
> > leave the pages in an inconsistent state, like this patch is trying to
> > address.
> >
> > This lack of rollback on failure for CPA calls requires particular
> > (and somewhat odd) handling in all the set_memory() callers. The way
> > to handle it is to make a CPA call to restore the previous permission,
> > regardless of the error code returned by the initial call that failed.
> > The callers depend on any PTE change that was successfully made having
> > any needed splits already done for those PTEs, so the restore can
> > succeed at least as far as the failed CPA call got.
> 
> Wait, since this does set_memory_np() as the first step for both
> set_memory_encrypted() and set_memory_decrypted(), that pattern in the
> callers wouldn't work. I wonder if it should try to roll back itself if
> set_memory_np() fails (call set_memory_p() before returning the error).
> At least that would handle failures that happen on the guest side.

Yes, I agree the error handling is very limited.  I'll try to make my
patch clean up properly if set_memory_np() fails in step 1.  In general,
complete error cleanup on private <-> shared transitions looks to be
pretty hard, and the original implementation obviously didn't deal
with it.  For most of the steps in the sequence, a failure indicates
something is pretty seriously broken with the CoCo aspects of the
VM, and it's not clear that trying to clean up is likely to succeed or
will make things any better.

> 
> >
> > In this CoCo case apparently the enc_status_change_prepare/finish()
> > could fail too (and maybe not have the same forward-progress
> > behavior?), so I'm not sure what you can do in that case.

Exactly.

> >
> > I'm also not sure how bad it is to free encryption-mismatched pages.
> > Is it the same as freeing unmapped pages? (likely an oops or panic)

It's bad to free mismatched pages.  You're just setting a time bomb
for some other code to encounter later.  It will either cause an
oops/panic, or will allow the VM to unknowingly put data in a page
that is visible to the hypervisor and violate the tenets of being a
CoCo VM.  In the code I've worked with, we don't free any memory
where set_memory_encrypted() or set_memory_decrypted() has failed.
We assume there hasn't been a cleanup and hence the state and
consistency of the memory are unknown.
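
As an illustration (not any specific caller in the tree; vaddr and
npages are just stand-ins for whatever the caller allocated), that
rule looks roughly like:

#include <linux/set_memory.h>
#include <linux/gfp.h>
#include <linux/printk.h>

/*
 * Hypothetical helper: free a previously-decrypted buffer, or leak it
 * if the transition back to private fails.
 */
static void free_shared_buffer(void *vaddr, int npages)
{
	if (set_memory_encrypted((unsigned long)vaddr, npages)) {
		/* State of the pages is unknown; don't reuse them. */
		pr_warn("leaking %d pages after failed re-encryption\n",
			npages);
		return;
	}
	free_pages((unsigned long)vaddr,
		   get_order((unsigned long)npages << PAGE_SHIFT));
}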

I think there's a decent argument for not investing too much effort
in the cleanup paths for private <-> shared transitions.  A failure
indicates that something is seriously wrong, from which full recovery
is unlikely.  Callers of set_memory_encrypted()/decrypted() should
assume that a failure leaves the memory in an inconsistent state,
and the memory should just be leaked until the VM can be shut down.
Some strong comments along these lines could be added to
set_memory_encrypted()/decrypted().
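
For example, something along these lines (just a first cut at the
wording) above set_memory_decrypted()/set_memory_encrypted():

/*
 * Transition pages between guest-private and hypervisor-shared in a
 * CoCo VM.  On failure, the pages may be left in an inconsistent
 * state: partially transitioned, not-present in the direct map, or
 * with the guest and hypervisor views out of sync.  Callers must not
 * free such pages back to the allocator; leak them instead, until the
 * VM can be shut down.
 */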

I'm going to be out on vacation until mid-September, so it will be
a couple of weeks before I get back to doing a full patch set that
responds to these and other comments.

Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared
  2023-09-01 14:44     ` Michael Kelley (LINUX)
@ 2023-09-01 16:33       ` Edgecombe, Rick P
  0 siblings, 0 replies; 17+ messages in thread
From: Edgecombe, Rick P @ 2023-09-01 16:33 UTC (permalink / raw)
  To: Lutomirski, Andy, dave.hansen, thomas.lendacky, haiyangz,
	kirill.shutemov, Christopherson,,
	Sean, mingo, linux-kernel, kys, Cui, Dexuan, mikelley, hpa,
	peterz, linux-hyperv, wei.liu, bp, sathyanarayanan.kuppuswamy,
	tglx, x86
  Cc: Yamahata, Isaku

+Isaku

On Fri, 2023-09-01 at 14:44 +0000, Michael Kelley (LINUX) wrote:
> > Wait, since this does set_memory_np() as the first step for both
> > set_memory_encrypted() and set_memory_decrypted(), that pattern in
> > the callers wouldn't work. I wonder if it should try to roll back
> > itself if set_memory_np() fails (call set_memory_p() before
> > returning the error). At least that would handle failures that
> > happen on the guest side.
> 
> Yes, I agree the error handling is very limited.  I'll try to make my
> patch clean up properly if set_memory_np() fails in step 1.  In
> general, complete error cleanup on private <-> shared transitions
> looks to be pretty hard, and the original implementation obviously
> didn't deal with it.  For most of the steps in the sequence, a
> failure indicates something is pretty seriously broken with the CoCo
> aspects of the VM, and it's not clear that trying to clean up is
> likely to succeed or will make things any better.

Ah, I see. Direct map split failures are not totally unexpected though,
so the kernel should be able to handle them somewhat, like it does in
other places where set_memory() is used. I also wonder if the VMM might
need to split the EPT/NPT and fail in the same way, which would be a
somewhat normal situation.

And yes, I see that this is an existing problem, so I don't mean to
suggest it should hold up this improvement.

It seems there are three ongoing improvements to these operations:
 - Handling load_unaligned_zeropad()
 - Making it work with vmalloc
 - Re-marking everything private when doing kexec

And now I'm adding "lack of failure handling" to the list. The
solutions for each could affect the others, so I thought it might be
worth considering them together. I'm not very up to speed with the
CoCo specifics here though, so please take that part with a grain of
salt.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-09-01 16:34 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-06 16:41 [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared Michael Kelley
2023-08-02 21:57 ` Tom Lendacky
2023-08-05 14:38   ` Michael Kelley (LINUX)
2023-08-06 22:19 ` kirill.shutemov
2023-08-16  2:54   ` Michael Kelley (LINUX)
2023-08-28 14:22     ` Michael Kelley (LINUX)
2023-08-28 16:13       ` kirill.shutemov
2023-08-28 21:00         ` Michael Kelley (LINUX)
2023-08-28 22:13           ` kirill.shutemov
2023-08-28 23:23             ` Michael Kelley (LINUX)
2023-08-28 23:57               ` kirill.shutemov
2023-08-30  0:02 ` Edgecombe, Rick P
2023-08-30  3:33   ` Michael Kelley (LINUX)
2023-08-30 23:40 ` Edgecombe, Rick P
2023-08-31 17:29   ` Edgecombe, Rick P
2023-09-01 14:44     ` Michael Kelley (LINUX)
2023-09-01 16:33       ` Edgecombe, Rick P

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).