* [PATCH v7 0/5] x86: Fix SEV guest regression
@ 2018-09-10 21:49 Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 1/5] x86/mm: Restructure sme_encrypt_kernel() Brijesh Singh
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Brijesh Singh @ 2018-09-10 21:49 UTC (permalink / raw)
  To: x86, linux-kernel, kvm
  Cc: Brijesh Singh, Tom Lendacky, Thomas Gleixner, Borislav Petkov,
	Paolo Bonzini, Sean Christopherson, Radim Krčmář

The following commit

"
x86/kvmclock: Remove memblock dependency

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=368a540e0232ad446931f5a4e8a5e06f69f21343
"

introduced a SEV guest regression.

The guest physical addresses holding the wall_clock and hv_clock_boot
variables are shared with the hypervisor and must be mapped with C=0 when
SEV is active. To clear the C-bit we use kernel_physical_mapping_init() to
split the large pages. The above commit moved the kvmclock initialization
very early, and at that point kernel_physical_mapping_init() fails to
allocate memory while splitting the large page.

To solve it, add a special .data..decrypted section which can be used to
hold such shared variables. Early boot code maps this section with C=0. The
section is PMD-aligned and PMD-sized to avoid the need to split the pages.
Callers can use the __decrypted attribute to place variables in the
.data..decrypted section.
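
For example, a variable that is shared with the hypervisor is simply tagged
with the attribute (this is exactly how patch 4/5 below uses it):

static struct pvclock_wall_clock wall_clock __decrypted;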

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>

Changes since v6:
 - improve commit messages
 - rename free_decrypted_mem -> mem_encrypt_free_decrypted_mem

Changes since v5:
 - rename hvclock_boot_dec -> hvclock_boot_aux.
 - rename section from .data..decrypted_aux -> .data..decrypted.aux.
 - use NR_CPUS instead of an arbitrary number of elements in the hv_clock_aux variable.

Changes since v4:
 - define a few static pages in .data..decrypted which can be used
   for CPUs > HVC_BOOT_ARRAY_SIZE when SEV is active.

Changes since v3:
 - commit message improvements (based on Sean's feedback)

Changes since v2:
 - commit message and code comment improvements (based on Boris' feedback)
 - move the sme_populate_pgd() fixes into a new patch.
 - drop stable Cc - will submit to stable after patch is upstreamed.

Changes since v1:
 - move the logic to re-arrange the mapping into a new patch
 - move the definition of __start_data_* into mem_encrypt.h
 - map the workarea buffer as encrypted when SEV is enabled
 - enhance sme_populate_pgd() to update the PTE/PMD flags when a mapping exists

Brijesh Singh (5):
  x86/mm: Restructure sme_encrypt_kernel()
  x86/mm: Enhance sme_populate_pgd() to update page flags
  x86/mm: add .data..decrypted section to hold shared variables
  x86/kvm: use __decrypted attribute in shared variables
  x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active

 arch/x86/include/asm/mem_encrypt.h |  10 ++
 arch/x86/kernel/head64.c           |  11 ++
 arch/x86/kernel/kvmclock.c         |  18 ++-
 arch/x86/kernel/vmlinux.lds.S      |  20 ++++
 arch/x86/mm/init.c                 |   3 +
 arch/x86/mm/mem_encrypt.c          |  10 ++
 arch/x86/mm/mem_encrypt_identity.c | 232 +++++++++++++++++++++++++++----------
 7 files changed, 240 insertions(+), 64 deletions(-)

-- 
2.7.4



* [PATCH v7 1/5] x86/mm: Restructure sme_encrypt_kernel()
  2018-09-10 21:49 [PATCH v7 0/5] x86: Fix SEV guest regression Brijesh Singh
@ 2018-09-10 21:49 ` Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 2/5] x86/mm: Enhance sme_populate_pgd() to update page flags Brijesh Singh
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Brijesh Singh @ 2018-09-10 21:49 UTC (permalink / raw)
  To: x86, linux-kernel, kvm
  Cc: Brijesh Singh, Tom Lendacky, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Paolo Bonzini, Sean Christopherson,
	Radim Krčmář

Re-arrange sme_encrypt_kernel() by moving the workarea map/unmap logic
into separate static functions. There are no functional changes in this
patch. The restructuring will allow us to expand sme_encrypt_kernel()
in the future.
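
After this patch, sme_encrypt_kernel() reduces to roughly the following
(a condensed view of the diff below):

	build_workarea_map(bp, &wa, &ppd);

	/* Encrypt the kernel and initrd in place using the workarea */
	sme_encrypt_execute(wa.kernel_start,
			    wa.kernel_start + wa.decrypted_base,
			    wa.kernel_len, wa.workarea_start,
			    (unsigned long)ppd.pgd);

	if (wa.initrd_len)
		sme_encrypt_execute(wa.initrd_start,
				    wa.initrd_start + wa.decrypted_base,
				    wa.initrd_len, wa.workarea_start,
				    (unsigned long)ppd.pgd);

	teardown_workarea_map(&wa, &ppd);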

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-kernel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/mm/mem_encrypt_identity.c | 160 ++++++++++++++++++++++++-------------
 1 file changed, 104 insertions(+), 56 deletions(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index a19ef1a..f488d46 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -73,6 +73,22 @@ struct sme_populate_pgd_data {
 	unsigned long vaddr_end;
 };
 
+struct sme_workarea_data {
+	unsigned long kernel_start;
+	unsigned long kernel_end;
+	unsigned long kernel_len;
+
+	unsigned long initrd_start;
+	unsigned long initrd_end;
+	unsigned long initrd_len;
+
+	unsigned long workarea_start;
+	unsigned long workarea_end;
+	unsigned long workarea_len;
+
+	unsigned long decrypted_base;
+};
+
 static char sme_cmdline_arg[] __initdata = "mem_encrypt";
 static char sme_cmdline_on[]  __initdata = "on";
 static char sme_cmdline_off[] __initdata = "off";
@@ -267,19 +283,17 @@ static unsigned long __init sme_pgtable_calc(unsigned long len)
 	return entries + tables;
 }
 
-void __init sme_encrypt_kernel(struct boot_params *bp)
+static void __init build_workarea_map(struct boot_params *bp,
+				      struct sme_workarea_data *wa,
+				      struct sme_populate_pgd_data *ppd)
 {
 	unsigned long workarea_start, workarea_end, workarea_len;
 	unsigned long execute_start, execute_end, execute_len;
 	unsigned long kernel_start, kernel_end, kernel_len;
 	unsigned long initrd_start, initrd_end, initrd_len;
-	struct sme_populate_pgd_data ppd;
 	unsigned long pgtable_area_len;
 	unsigned long decrypted_base;
 
-	if (!sme_active())
-		return;
-
 	/*
 	 * Prepare for encrypting the kernel and initrd by building new
 	 * pagetables with the necessary attributes needed to encrypt the
@@ -359,17 +373,17 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
 	 * pagetables and when the new encrypted and decrypted kernel
 	 * mappings are populated.
 	 */
-	ppd.pgtable_area = (void *)execute_end;
+	ppd->pgtable_area = (void *)execute_end;
 
 	/*
 	 * Make sure the current pagetable structure has entries for
 	 * addressing the workarea.
 	 */
-	ppd.pgd = (pgd_t *)native_read_cr3_pa();
-	ppd.paddr = workarea_start;
-	ppd.vaddr = workarea_start;
-	ppd.vaddr_end = workarea_end;
-	sme_map_range_decrypted(&ppd);
+	ppd->pgd = (pgd_t *)native_read_cr3_pa();
+	ppd->paddr = workarea_start;
+	ppd->vaddr = workarea_start;
+	ppd->vaddr_end = workarea_end;
+	sme_map_range_decrypted(ppd);
 
 	/* Flush the TLB - no globals so cr3 is enough */
 	native_write_cr3(__native_read_cr3());
@@ -380,9 +394,9 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
 	 * then be populated with new PUDs and PMDs as the encrypted and
 	 * decrypted kernel mappings are created.
 	 */
-	ppd.pgd = ppd.pgtable_area;
-	memset(ppd.pgd, 0, sizeof(pgd_t) * PTRS_PER_PGD);
-	ppd.pgtable_area += sizeof(pgd_t) * PTRS_PER_PGD;
+	ppd->pgd = ppd->pgtable_area;
+	memset(ppd->pgd, 0, sizeof(pgd_t) * PTRS_PER_PGD);
+	ppd->pgtable_area += sizeof(pgd_t) * PTRS_PER_PGD;
 
 	/*
 	 * A different PGD index/entry must be used to get different
@@ -400,75 +414,109 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
 	decrypted_base <<= PGDIR_SHIFT;
 
 	/* Add encrypted kernel (identity) mappings */
-	ppd.paddr = kernel_start;
-	ppd.vaddr = kernel_start;
-	ppd.vaddr_end = kernel_end;
-	sme_map_range_encrypted(&ppd);
+	ppd->paddr = kernel_start;
+	ppd->vaddr = kernel_start;
+	ppd->vaddr_end = kernel_end;
+	sme_map_range_encrypted(ppd);
 
 	/* Add decrypted, write-protected kernel (non-identity) mappings */
-	ppd.paddr = kernel_start;
-	ppd.vaddr = kernel_start + decrypted_base;
-	ppd.vaddr_end = kernel_end + decrypted_base;
-	sme_map_range_decrypted_wp(&ppd);
+	ppd->paddr = kernel_start;
+	ppd->vaddr = kernel_start + decrypted_base;
+	ppd->vaddr_end = kernel_end + decrypted_base;
+	sme_map_range_decrypted_wp(ppd);
 
 	if (initrd_len) {
 		/* Add encrypted initrd (identity) mappings */
-		ppd.paddr = initrd_start;
-		ppd.vaddr = initrd_start;
-		ppd.vaddr_end = initrd_end;
-		sme_map_range_encrypted(&ppd);
+		ppd->paddr = initrd_start;
+		ppd->vaddr = initrd_start;
+		ppd->vaddr_end = initrd_end;
+		sme_map_range_encrypted(ppd);
 		/*
 		 * Add decrypted, write-protected initrd (non-identity) mappings
 		 */
-		ppd.paddr = initrd_start;
-		ppd.vaddr = initrd_start + decrypted_base;
-		ppd.vaddr_end = initrd_end + decrypted_base;
-		sme_map_range_decrypted_wp(&ppd);
+		ppd->paddr = initrd_start;
+		ppd->vaddr = initrd_start + decrypted_base;
+		ppd->vaddr_end = initrd_end + decrypted_base;
+		sme_map_range_decrypted_wp(ppd);
 	}
 
 	/* Add decrypted workarea mappings to both kernel mappings */
-	ppd.paddr = workarea_start;
-	ppd.vaddr = workarea_start;
-	ppd.vaddr_end = workarea_end;
-	sme_map_range_decrypted(&ppd);
+	ppd->paddr = workarea_start;
+	ppd->vaddr = workarea_start;
+	ppd->vaddr_end = workarea_end;
+	sme_map_range_decrypted(ppd);
 
-	ppd.paddr = workarea_start;
-	ppd.vaddr = workarea_start + decrypted_base;
-	ppd.vaddr_end = workarea_end + decrypted_base;
-	sme_map_range_decrypted(&ppd);
+	ppd->paddr = workarea_start;
+	ppd->vaddr = workarea_start + decrypted_base;
+	ppd->vaddr_end = workarea_end + decrypted_base;
+	sme_map_range_decrypted(ppd);
 
-	/* Perform the encryption */
-	sme_encrypt_execute(kernel_start, kernel_start + decrypted_base,
-			    kernel_len, workarea_start, (unsigned long)ppd.pgd);
+	wa->kernel_start = kernel_start;
+	wa->kernel_end = kernel_end;
+	wa->kernel_len = kernel_len;
 
-	if (initrd_len)
-		sme_encrypt_execute(initrd_start, initrd_start + decrypted_base,
-				    initrd_len, workarea_start,
-				    (unsigned long)ppd.pgd);
+	wa->initrd_start = initrd_start;
+	wa->initrd_end = initrd_end;
+	wa->initrd_len = initrd_len;
+
+	wa->workarea_start = workarea_start;
+	wa->workarea_end = workarea_end;
+	wa->workarea_len = workarea_len;
+
+	wa->decrypted_base = decrypted_base;
+}
 
+static void __init teardown_workarea_map(struct sme_workarea_data *wa,
+					 struct sme_populate_pgd_data *ppd)
+{
 	/*
 	 * At this point we are running encrypted.  Remove the mappings for
 	 * the decrypted areas - all that is needed for this is to remove
 	 * the PGD entry/entries.
 	 */
-	ppd.vaddr = kernel_start + decrypted_base;
-	ppd.vaddr_end = kernel_end + decrypted_base;
-	sme_clear_pgd(&ppd);
-
-	if (initrd_len) {
-		ppd.vaddr = initrd_start + decrypted_base;
-		ppd.vaddr_end = initrd_end + decrypted_base;
-		sme_clear_pgd(&ppd);
+	ppd->vaddr = wa->kernel_start + wa->decrypted_base;
+	ppd->vaddr_end = wa->kernel_end + wa->decrypted_base;
+	sme_clear_pgd(ppd);
+
+	if (wa->initrd_len) {
+		ppd->vaddr = wa->initrd_start + wa->decrypted_base;
+		ppd->vaddr_end = wa->initrd_end + wa->decrypted_base;
+		sme_clear_pgd(ppd);
 	}
 
-	ppd.vaddr = workarea_start + decrypted_base;
-	ppd.vaddr_end = workarea_end + decrypted_base;
-	sme_clear_pgd(&ppd);
+	ppd->vaddr = wa->workarea_start + wa->decrypted_base;
+	ppd->vaddr_end = wa->workarea_end + wa->decrypted_base;
+	sme_clear_pgd(ppd);
 
 	/* Flush the TLB - no globals so cr3 is enough */
 	native_write_cr3(__native_read_cr3());
 }
 
+void __init sme_encrypt_kernel(struct boot_params *bp)
+{
+	struct sme_populate_pgd_data ppd;
+	struct sme_workarea_data wa;
+
+	if (!sme_active())
+		return;
+
+	build_workarea_map(bp, &wa, &ppd);
+
+	/* When SEV is active, encrypt kernel and initrd */
+	sme_encrypt_execute(wa.kernel_start,
+			    wa.kernel_start + wa.decrypted_base,
+			    wa.kernel_len, wa.workarea_start,
+			    (unsigned long)ppd.pgd);
+
+	if (wa.initrd_len)
+		sme_encrypt_execute(wa.initrd_start,
+				    wa.initrd_start + wa.decrypted_base,
+				    wa.initrd_len, wa.workarea_start,
+				    (unsigned long)ppd.pgd);
+
+	teardown_workarea_map(&wa, &ppd);
+}
+
 void __init sme_enable(struct boot_params *bp)
 {
 	const char *cmdline_ptr, *cmdline_arg, *cmdline_on, *cmdline_off;
-- 
2.7.4



* [PATCH v7 2/5] x86/mm: Enhance sme_populate_pgd() to update page flags
  2018-09-10 21:49 [PATCH v7 0/5] x86: Fix SEV guest regression Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 1/5] x86/mm: Restructure sme_encrypt_kernel() Brijesh Singh
@ 2018-09-10 21:49 ` Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 3/5] x86/mm: add .data..decrypted section to hold shared variables Brijesh Singh
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Brijesh Singh @ 2018-09-10 21:49 UTC (permalink / raw)
  To: x86, linux-kernel, kvm
  Cc: Brijesh Singh, Tom Lendacky, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Paolo Bonzini, Sean Christopherson,
	Radim Krčmář

Enhance sme_populate_pgd() to update page flags if the PMD/PTE entry
already exists.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-kernel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/mm/mem_encrypt_identity.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index f488d46..2b245af 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -155,9 +155,6 @@ static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
 		return;
 
 	pmd = pmd_offset(pud, ppd->vaddr);
-	if (pmd_large(*pmd))
-		return;
-
 	set_pmd(pmd, __pmd(ppd->paddr | ppd->pmd_flags));
 }
 
@@ -183,8 +180,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
 		return;
 
 	pte = pte_offset_map(pmd, ppd->vaddr);
-	if (pte_none(*pte))
-		set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
+	set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
 }
 
 static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
-- 
2.7.4



* [PATCH v7 3/5] x86/mm: add .data..decrypted section to hold shared variables
  2018-09-10 21:49 [PATCH v7 0/5] x86: Fix SEV guest regression Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 1/5] x86/mm: Restructure sme_encrypt_kernel() Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 2/5] x86/mm: Enhance sme_populate_pgd() to update page flags Brijesh Singh
@ 2018-09-10 21:49 ` Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 4/5] x86/kvm: use __decrypted attribute in " Brijesh Singh
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Brijesh Singh @ 2018-09-10 21:49 UTC (permalink / raw)
  To: x86, linux-kernel, kvm
  Cc: Brijesh Singh, Tom Lendacky, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Paolo Bonzini, Sean Christopherson,
	Radim Krčmář

kvmclock defines a few static variables which are shared with the
hypervisor during the kvmclock initialization.

When SEV is active, memory is encrypted with a guest-specific key, and
if the guest OS wants to share the memory region with the hypervisor
then it must clear the C-bit before sharing it.

Currently, we use kernel_physical_mapping_init() to split large pages
before clearing the C-bit on shared pages. But it fails when called from
the kvmclock initialization (mainly because the memblock allocator is
not ready that early during boot).

Add a __decrypted attribute which can be used when defining such shared
variables. The so-defined variables will be placed in the .data..decrypted
section. This section is mapped with C=0 early during boot; we also ensure
that the initialized values are updated to match with C=0 (i.e. an
in-place decryption is performed). The .data..decrypted section is
PMD-aligned and PMD-sized so that we avoid the need to split the large
pages when mapping the section.

sme_encrypt_kernel() performs the in-place encryption of the Linux
kernel and initrd when SME is active. The routine has been enhanced
to decrypt the .data..decrypted section for both SME and SEV cases.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-kernel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/include/asm/mem_encrypt.h |  6 +++
 arch/x86/kernel/head64.c           | 11 +++++
 arch/x86/kernel/vmlinux.lds.S      | 17 +++++++
 arch/x86/mm/mem_encrypt_identity.c | 94 ++++++++++++++++++++++++++++++++------
 4 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index c064383..802b2eb 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -52,6 +52,8 @@ void __init mem_encrypt_init(void);
 bool sme_active(void);
 bool sev_active(void);
 
+#define __decrypted __attribute__((__section__(".data..decrypted")))
+
 #else	/* !CONFIG_AMD_MEM_ENCRYPT */
 
 #define sme_me_mask	0ULL
@@ -77,6 +79,8 @@ early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 0;
 static inline int __init
 early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0; }
 
+#define __decrypted
+
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
 
 /*
@@ -88,6 +92,8 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0;
 #define __sme_pa(x)		(__pa(x) | sme_me_mask)
 #define __sme_pa_nodebug(x)	(__pa_nodebug(x) | sme_me_mask)
 
+extern char __start_data_decrypted[], __end_data_decrypted[];
+
 #endif	/* __ASSEMBLY__ */
 
 #endif	/* __X86_MEM_ENCRYPT_H__ */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 8047379..af39d68 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -112,6 +112,7 @@ static bool __head check_la57_support(unsigned long physaddr)
 unsigned long __head __startup_64(unsigned long physaddr,
 				  struct boot_params *bp)
 {
+	unsigned long vaddr, vaddr_end;
 	unsigned long load_delta, *p;
 	unsigned long pgtable_flags;
 	pgdval_t *pgd;
@@ -234,6 +235,16 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	/* Encrypt the kernel and related (if SME is active) */
 	sme_encrypt_kernel(bp);
 
+	/* Clear the memory encryption mask from the .data..decrypted section. */
+	if (mem_encrypt_active()) {
+		vaddr = (unsigned long)__start_data_decrypted;
+		vaddr_end = (unsigned long)__end_data_decrypted;
+		for (; vaddr < vaddr_end; vaddr += PMD_SIZE) {
+			i = pmd_index(vaddr);
+			pmd[i] -= sme_get_me_mask();
+		}
+	}
+
 	/*
 	 * Return the SME encryption mask (if SME is active) to be used as a
 	 * modifier for the initial pgdir entry programmed into CR3.
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 9c77d2d..ae8153e 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -65,6 +65,21 @@ jiffies_64 = jiffies;
 #define ALIGN_ENTRY_TEXT_BEGIN	. = ALIGN(PMD_SIZE);
 #define ALIGN_ENTRY_TEXT_END	. = ALIGN(PMD_SIZE);
 
+/*
+ * This section contains data which will be mapped as decrypted. Memory
+ * encryption operates on a page basis. Make this section PMD-aligned
+ * to avoid splitting the pages while mapping the section early.
+ *
+ * Note: We use a separate section so that only this section gets
+ * decrypted to avoid exposing more than we wish.
+ */
+#define DATA_DECRYPTED						\
+	. = ALIGN(PMD_SIZE);					\
+	__start_data_decrypted = .;				\
+	*(.data..decrypted);					\
+	. = ALIGN(PMD_SIZE);					\
+	__end_data_decrypted = .;				\
+
 #else
 
 #define X86_ALIGN_RODATA_BEGIN
@@ -74,6 +89,7 @@ jiffies_64 = jiffies;
 
 #define ALIGN_ENTRY_TEXT_BEGIN
 #define ALIGN_ENTRY_TEXT_END
+#define DATA_DECRYPTED
 
 #endif
 
@@ -161,6 +177,7 @@ SECTIONS
 		/* rarely changed data like cpu maps */
 		READ_MOSTLY_DATA(INTERNODE_CACHE_BYTES)
 
+		DATA_DECRYPTED
 		/* End of data section */
 		_edata = .;
 	} :data
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 2b245af..65da705 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -52,6 +52,8 @@
 				 (_PAGE_PAT | _PAGE_PWT))
 
 #define PMD_FLAGS_ENC		(PMD_FLAGS_LARGE | _PAGE_ENC)
+#define PMD_FLAGS_ENC_WP	((PMD_FLAGS_ENC & ~_PAGE_CACHE_MASK) | \
+				 (_PAGE_PAT | _PAGE_PWT))
 
 #define PTE_FLAGS		(__PAGE_KERNEL_EXEC & ~_PAGE_GLOBAL)
 
@@ -60,6 +62,8 @@
 				 (_PAGE_PAT | _PAGE_PWT))
 
 #define PTE_FLAGS_ENC		(PTE_FLAGS | _PAGE_ENC)
+#define PTE_FLAGS_ENC_WP	((PTE_FLAGS_ENC & ~_PAGE_CACHE_MASK) | \
+				 (_PAGE_PAT | _PAGE_PWT))
 
 struct sme_populate_pgd_data {
 	void    *pgtable_area;
@@ -232,6 +236,11 @@ static void __init sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
 	__sme_map_range(ppd, PMD_FLAGS_ENC, PTE_FLAGS_ENC);
 }
 
+static void __init sme_map_range_encrypted_wp(struct sme_populate_pgd_data *ppd)
+{
+	__sme_map_range(ppd, PMD_FLAGS_ENC_WP, PTE_FLAGS_ENC_WP);
+}
+
 static void __init sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
 {
 	__sme_map_range(ppd, PMD_FLAGS_DEC, PTE_FLAGS_DEC);
@@ -379,7 +388,10 @@ static void __init build_workarea_map(struct boot_params *bp,
 	ppd->paddr = workarea_start;
 	ppd->vaddr = workarea_start;
 	ppd->vaddr_end = workarea_end;
-	sme_map_range_decrypted(ppd);
+	if (sev_active())
+		sme_map_range_encrypted(ppd);
+	else
+		sme_map_range_decrypted(ppd);
 
 	/* Flush the TLB - no globals so cr3 is enough */
 	native_write_cr3(__native_read_cr3());
@@ -436,16 +448,27 @@ static void __init build_workarea_map(struct boot_params *bp,
 		sme_map_range_decrypted_wp(ppd);
 	}
 
-	/* Add decrypted workarea mappings to both kernel mappings */
+	/*
+	 * When SEV is active, kernel is already encrypted hence mapping
+	 * the initial workarea_start as encrypted. When SME is active,
+	 * the kernel is not encrypted hence add decrypted workarea
+	 * mappings to both kernel mappings.
+	 */
 	ppd->paddr = workarea_start;
 	ppd->vaddr = workarea_start;
 	ppd->vaddr_end = workarea_end;
-	sme_map_range_decrypted(ppd);
+	if (sev_active())
+		sme_map_range_encrypted(ppd);
+	else
+		sme_map_range_decrypted(ppd);
 
 	ppd->paddr = workarea_start;
 	ppd->vaddr = workarea_start + decrypted_base;
 	ppd->vaddr_end = workarea_end + decrypted_base;
-	sme_map_range_decrypted(ppd);
+	if (sev_active())
+		sme_map_range_encrypted(ppd);
+	else
+		sme_map_range_decrypted(ppd);
 
 	wa->kernel_start = kernel_start;
 	wa->kernel_end = kernel_end;
@@ -488,28 +511,69 @@ static void __init teardown_workarea_map(struct sme_workarea_data *wa,
 	native_write_cr3(__native_read_cr3());
 }
 
+static void __init decrypt_shared_data(struct sme_workarea_data *wa,
+				       struct sme_populate_pgd_data *ppd)
+{
+	unsigned long decrypted_start, decrypted_end, decrypted_len;
+
+	/* Physical addresses of decrypted data section */
+	decrypted_start = __pa_symbol(__start_data_decrypted);
+	decrypted_end = __pa_symbol(__end_data_decrypted);
+	decrypted_len = decrypted_end - decrypted_start;
+
+	if (!decrypted_len)
+		return;
+
+	/* Add decrypted mapping for the section (identity) */
+	ppd->paddr = decrypted_start;
+	ppd->vaddr = decrypted_start;
+	ppd->vaddr_end = decrypted_end;
+	sme_map_range_decrypted(ppd);
+
+	/* Add encrypted-wp mapping for the section (non-identity) */
+	ppd->paddr = decrypted_start;
+	ppd->vaddr = decrypted_start + wa->decrypted_base;
+	ppd->vaddr_end = decrypted_end + wa->decrypted_base;
+	sme_map_range_encrypted_wp(ppd);
+
+	/* Perform in-place decryption */
+	sme_encrypt_execute(decrypted_start,
+			    decrypted_start + wa->decrypted_base,
+			    decrypted_len, wa->workarea_start,
+			    (unsigned long)ppd->pgd);
+
+	ppd->vaddr = decrypted_start + wa->decrypted_base;
+	ppd->vaddr_end = decrypted_end + wa->decrypted_base;
+	sme_clear_pgd(ppd);
+}
+
 void __init sme_encrypt_kernel(struct boot_params *bp)
 {
 	struct sme_populate_pgd_data ppd;
 	struct sme_workarea_data wa;
 
-	if (!sme_active())
+	if (!mem_encrypt_active())
 		return;
 
 	build_workarea_map(bp, &wa, &ppd);
 
-	/* When SEV is active, encrypt kernel and initrd */
-	sme_encrypt_execute(wa.kernel_start,
-			    wa.kernel_start + wa.decrypted_base,
-			    wa.kernel_len, wa.workarea_start,
-			    (unsigned long)ppd.pgd);
-
-	if (wa.initrd_len)
-		sme_encrypt_execute(wa.initrd_start,
-				    wa.initrd_start + wa.decrypted_base,
-				    wa.initrd_len, wa.workarea_start,
+	/* When SME is active, encrypt kernel and initrd */
+	if (sme_active()) {
+		sme_encrypt_execute(wa.kernel_start,
+				    wa.kernel_start + wa.decrypted_base,
+				    wa.kernel_len, wa.workarea_start,
 				    (unsigned long)ppd.pgd);
 
+		if (wa.initrd_len)
+			sme_encrypt_execute(wa.initrd_start,
+					    wa.initrd_start + wa.decrypted_base,
+					    wa.initrd_len, wa.workarea_start,
+					    (unsigned long)ppd.pgd);
+	}
+
+	/* Decrypt the contents of .data..decrypted section */
+	decrypt_shared_data(&wa, &ppd);
+
 	teardown_workarea_map(&wa, &ppd);
 }
 
-- 
2.7.4



* [PATCH v7 4/5] x86/kvm: use __decrypted attribute in shared variables
  2018-09-10 21:49 [PATCH v7 0/5] x86: Fix SEV guest regression Brijesh Singh
                   ` (2 preceding siblings ...)
  2018-09-10 21:49 ` [PATCH v7 3/5] x86/mm: add .data..decrypted section to hold shared variables Brijesh Singh
@ 2018-09-10 21:49 ` Brijesh Singh
  2018-09-10 21:49 ` [PATCH v7 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active Brijesh Singh
  2018-09-13 16:22 ` [PATCH v7 0/5] x86: Fix SEV guest regression Thomas Gleixner
  5 siblings, 0 replies; 9+ messages in thread
From: Brijesh Singh @ 2018-09-10 21:49 UTC (permalink / raw)
  To: x86, linux-kernel, kvm
  Cc: Brijesh Singh, Tom Lendacky, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Paolo Bonzini, Sean Christopherson,
	Radim Krčmář

The recent removal of the memblock dependency from kvmclock caused a SEV
guest regression because the wall_clock and hv_clock_boot variables are
no longer mapped decrypted when SEV is active.

Use the __decrypted attribute to put the wall_clock and hv_clock_boot in
the .data..decrypted section so that they are mapped decrypted during boot.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 368a540e0232 ("x86/kvmclock: Remove memblock dependency")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-kernel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/kernel/kvmclock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index a36b93a..0b3110b 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -61,8 +61,8 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 	(PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))
 
 static struct pvclock_vsyscall_time_info
-			hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
-static struct pvclock_wall_clock wall_clock;
+			hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __decrypted __aligned(PAGE_SIZE);
+static struct pvclock_wall_clock wall_clock __decrypted;
 static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
 
 static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
-- 
2.7.4



* [PATCH v7 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active
  2018-09-10 21:49 [PATCH v7 0/5] x86: Fix SEV guest regression Brijesh Singh
                   ` (3 preceding siblings ...)
  2018-09-10 21:49 ` [PATCH v7 4/5] x86/kvm: use __decrypted attribute in " Brijesh Singh
@ 2018-09-10 21:49 ` Brijesh Singh
  2018-09-13 16:22 ` [PATCH v7 0/5] x86: Fix SEV guest regression Thomas Gleixner
  5 siblings, 0 replies; 9+ messages in thread
From: Brijesh Singh @ 2018-09-10 21:49 UTC (permalink / raw)
  To: x86, linux-kernel, kvm
  Cc: Brijesh Singh, Tom Lendacky, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, Paolo Bonzini, Sean Christopherson,
	Radim Krčmář

In the preparatory stage of CPU hotplug, the per-CPU pvclock data pointer
is assigned either an element of the static array or dynamically allocated
memory. Currently, the dynamically allocated memory is not mapped
decrypted. However, when SEV is active this memory range must be mapped
decrypted.

The C-bit determines the encryption status at 4K page granularity, hence a
full 4K page allocation would be required to store a single 32-byte pvclock
variable. This could waste a fairly sizeable amount of memory since each
CPU would perform a separate 4K allocation.

Instead, define a second static array which will be used when SEV is
active. This array will be put in the .data..decrypted section so that it
is mapped decrypted during boot.

The .data..decrypted section has a big chunk of memory that is currently
unused. Since the second array will be used only when memory
encryption is active, free it when memory encryption is not active.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-kernel@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/include/asm/mem_encrypt.h |  4 ++++
 arch/x86/kernel/kvmclock.c         | 14 ++++++++++++++
 arch/x86/kernel/vmlinux.lds.S      |  3 +++
 arch/x86/mm/init.c                 |  3 +++
 arch/x86/mm/mem_encrypt.c          | 10 ++++++++++
 5 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 802b2eb..3f2a5e3 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -48,11 +48,13 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
 
 /* Architecture __weak replacement functions */
 void __init mem_encrypt_init(void);
+void __init mem_encrypt_free_decrypted_mem(void);
 
 bool sme_active(void);
 bool sev_active(void);
 
 #define __decrypted __attribute__((__section__(".data..decrypted")))
+#define __decrypted_aux __attribute__((__section__(".data..decrypted.aux")))
 
 #else	/* !CONFIG_AMD_MEM_ENCRYPT */
 
@@ -80,6 +82,7 @@ static inline int __init
 early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0; }
 
 #define __decrypted
+#define __decrypted_aux
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
 
@@ -93,6 +96,7 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0;
 #define __sme_pa_nodebug(x)	(__pa_nodebug(x) | sme_me_mask)
 
 extern char __start_data_decrypted[], __end_data_decrypted[];
+extern char __start_data_decrypted_aux[];
 
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 0b3110b..9d8bad5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -65,6 +65,15 @@ static struct pvclock_vsyscall_time_info
 static struct pvclock_wall_clock wall_clock __decrypted;
 static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+/*
+ * The auxiliary array will be used when SEV is active. In non-SEV case,
+ * it will be freed by mem_encrypt_free_decrypted_mem().
+ */
+static struct pvclock_vsyscall_time_info
+			hv_clock_aux[NR_CPUS] __decrypted_aux;
+#endif
+
 static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
 {
 	return &this_cpu_read(hv_clock_per_cpu)->pvti;
@@ -269,6 +278,11 @@ static int kvmclock_setup_percpu(unsigned int cpu)
 	/* Use the static page for the first CPUs, allocate otherwise */
 	if (cpu < HVC_BOOT_ARRAY_SIZE)
 		p = &hv_clock_boot[cpu];
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	/* Use the static page from auxiliary array instead of allocating it. */
+	else if (sev_active())
+		p = &hv_clock_aux[cpu - HVC_BOOT_ARRAY_SIZE];
+#endif
 	else
 		p = kzalloc(sizeof(*p), GFP_KERNEL);
 
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index ae8153e..b78e117 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -77,6 +77,9 @@ jiffies_64 = jiffies;
 	. = ALIGN(PMD_SIZE);					\
 	__start_data_decrypted = .;				\
 	*(.data..decrypted);					\
+	. = ALIGN(PAGE_SIZE);					\
+	__start_data_decrypted_aux = .;				\
+	*(.data..decrypted.aux);				\
 	. = ALIGN(PMD_SIZE);					\
 	__end_data_decrypted = .;				\
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7a8fc26..b3cc33d 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -815,9 +815,12 @@ void free_kernel_image_pages(void *begin, void *end)
 		set_memory_np_noalias(begin_ul, len_pages);
 }
 
+void __weak mem_encrypt_free_decrypted_mem(void) { }
+
 void __ref free_initmem(void)
 {
 	e820__reallocate_tables();
+	mem_encrypt_free_decrypted_mem();
 
 	free_kernel_image_pages(&__init_begin, &__init_end);
 }
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index b2de398..f1ab7f5 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -348,6 +348,16 @@ bool sev_active(void)
 EXPORT_SYMBOL(sev_active);
 
 /* Architecture __weak replacement functions */
+void __init mem_encrypt_free_decrypted_mem(void)
+{
+	if (mem_encrypt_active())
+		return;
+
+	free_init_pages("unused decrypted",
+			(unsigned long)__start_data_decrypted_aux,
+			(unsigned long)__end_data_decrypted);
+}
+
 void __init mem_encrypt_init(void)
 {
 	if (!sme_me_mask)
-- 
2.7.4



* Re: [PATCH v7 0/5] x86: Fix SEV guest regression
  2018-09-10 21:49 [PATCH v7 0/5] x86: Fix SEV guest regression Brijesh Singh
                   ` (4 preceding siblings ...)
  2018-09-10 21:49 ` [PATCH v7 5/5] x86/kvm: Avoid dynamic allocation of pvclock data when SEV is active Brijesh Singh
@ 2018-09-13 16:22 ` Thomas Gleixner
  2018-09-13 17:22   ` Brijesh Singh
  5 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2018-09-13 16:22 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, Tom Lendacky, Borislav Petkov,
	Paolo Bonzini, Sean Christopherson, Radim Krčmář

On Mon, 10 Sep 2018, Brijesh Singh wrote:
> x86/kvmclock: Remove memblock dependency
> introduced a SEV guest regression.
> 
> The guest physical addresses holding the wall_clock and hv_clock_boot
> variables are shared with the hypervisor and must be mapped with C=0 when
> SEV is active. To clear the C-bit we use kernel_physical_mapping_init() to
> split the large pages. The above commit moved the kvmclock initialization
> very early, and at that point kernel_physical_mapping_init() fails to
> allocate memory while splitting the large page.
> 
> To solve it, add a special .data..decrypted section which can be used to
> hold such shared variables. Early boot code maps this section with C=0.
> The section is PMD-aligned and PMD-sized to avoid the need to split the
> pages. Callers can use the __decrypted attribute to place variables in the
> .data..decrypted section.

I went through this patch series and to be honest this is in no way -rc3+
material.

I really have to ask why this whole change to sme_encrypt_kernel() is
required at all.

Lets look at the problem at hand:

  We need exactly TWO pages to be shareable at early boot when the
  kernel runs in a guest and wants to utilize kvmclock.

  No, we do not need the whole NR_CPUS sized thing at that point at
  all. The secondary CPUs are brought up way later and there is absolutely
  no point in having the ugly decrypted..aux hack. That conditional freeing
  is really horrible and prone to break sooner than later.

  Further the TWO pages are BSS pages which do not require any special
  handling other than fixing up the PMD and doing a memset(0) on the
  actually utilized TWO pages.

We really can enforce that such decrypted variables have to be in the BSS
for now and revisit this when there is an absolute requirement to have
statically initialized ones, which I doubt we have.

So the way simpler solution is:

1) Add a .bss..decrypted section which is PMD aligned and mark the
   kvmclock hv_clock_boot and wallclock struct with __bss_decrypted.

2) Fixup the .bss..decrypted section PMD in the SEV case at the end of
   sme_encrypt_kernel() and do a memset(0) on that. That does not require
   any of the restructuring, really.

3) Have a function which is called after the page allocator is up which
   does:

static struct pvclock_vsyscall_time_info *hvclock_mem;

void __init kvmclock_init_mem(void)
{
	unsigned int ncpus = num_possible_cpus() - HVC_BOOT_ARRAY_SIZE;
	unsigned int order = get_order(ncpus * sizeof(*hvclock_mem));
	struct page *p;

	p = alloc_pages(order, GFP_KERNEL);
	if (p) {
		hvclock_mem = page_address(p);
		set_memory_decrypted((unsigned long) hvclock_mem,
				     PAGE_SIZE << order);
	}
}

4) In kvmclock_setup_percpu() do the following:

        if (cpu < HVC_BOOT_ARRAY_SIZE)
                p = &hv_clock_boot[cpu];
	else if (hvclock_mem)
		p = hvclock_mem + cpu - HVC_BOOT_ARRAY_SIZE;
	else
		return -ENOMEM;

5) Unconditionally free the pages after the used .bss..decrypted variables,
   so the unused rest of the 2M page is not lost.

That should be a halfway slim-sized and non-intrusive changeset.
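
A minimal sketch of (1), modeled on the DATA_DECRYPTED linker macro and the
__decrypted attribute from patch 3/5 (the names below are illustrative, not
something that exists in the tree):

#define __bss_decrypted __attribute__((__section__(".bss..decrypted")))

#define BSS_DECRYPTED						\
	. = ALIGN(PMD_SIZE);					\
	__start_bss_decrypted = .;				\
	*(.bss..decrypted);					\
	. = ALIGN(PMD_SIZE);					\
	__end_bss_decrypted = .;

The PMD fixup and memset(0) from (2) would then operate on the
__start_bss_decrypted..__end_bss_decrypted range at the end of
sme_encrypt_kernel(), along the lines of the pmd[i] loop already added to
__startup_64() in patch 3/5.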

Thanks,

	tglx


* Re: [PATCH v7 0/5] x86: Fix SEV guest regression
  2018-09-13 16:22 ` [PATCH v7 0/5] x86: Fix SEV guest regression Thomas Gleixner
@ 2018-09-13 17:22   ` Brijesh Singh
  2018-09-13 18:47     ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Brijesh Singh @ 2018-09-13 17:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: brijesh.singh, x86, linux-kernel, kvm, Tom Lendacky,
	Borislav Petkov, Paolo Bonzini, Sean Christopherson,
	Radim Krčmář



On 09/13/2018 11:22 AM, Thomas Gleixner wrote:
> On Mon, 10 Sep 2018, Brijesh Singh wrote:
>> x86/kvmclock: Remove memblock dependency
>> introduced a SEV guest regression.
>>
>> The guest physical addresses holding the wall_clock and hv_clock_boot
>> variables are shared with the hypervisor and must be mapped with C=0 when
>> SEV is active. To clear the C-bit we use kernel_physical_mapping_init() to
>> split the large pages. The above commit moved the kvmclock initialization
>> very early, and at that point kernel_physical_mapping_init() fails to
>> allocate memory while splitting the large page.
>>
>> To solve it, add a special .data..decrypted section which can be used to
>> hold such shared variables. Early boot code maps this section with C=0.
>> The section is PMD-aligned and PMD-sized to avoid the need to split the
>> pages. Callers can use the __decrypted attribute to place variables in the
>> .data..decrypted section.
> 
> I went through this patch series and to be honest this is in no way -rc3+
> material.
> 
> I really have to ask why this whole change to sme_encrypt_kernel() is
> required at all.


The main reason behind making the changes in sme_encrypt_kernel() was
to preserve the initialized values (if any). Since we are adding the
__decrypted attribute, I thought it would make sense to support the use
case where in the future someone may use this attribute on initialized
variables.

Now the question is, do we really need this to fix the regression? The
answer is NO. Since there is only one user of this new attribute and
luckily it does not initialize the variables, it is safe to move the
variables into a .bss..decrypted section and memset(0) them.

> 
> Lets look at the problem at hand:
> 
>    We need exactly TWO pages to be shareable at early boot when the
>    kernel runs in a guest and wants to utilize kvmclock.
> 
>    No, we do not need the whole NR_CPUS sized thing at that point at
>    all. The secondary CPUs are brought up way later and there is absolutely
>    no point in having the ugly decrypted..aux hack. That conditional freeing
>    is really horrible and prone to break sooner than later.
> 
>    Further the TWO pages are BSS pages which do not require any special
>    handling other than fixing up the PMD and doing a memset(0) on the
>    actually utilized TWO pages.
> 
> We really can enforce that such decrypted variables have to be in the BSS
> for now and revisit this when there is an absolute requirement to have
> statically initialized ones, which I doubt we have.
> 
> So the way simpler solution is:
> 
> 1) Add a .bss..decrypted section which is PMD aligned and mark the
>     kvmclock hv_clock_boot and wallclock struct with __bss_decrypted.
> 
> 2) Fixup the .bss..decrypted section PMD in the SEV case at the end of
>     sme_encrypt_kernel() and do a memset(0) on that. That does not require
>     any of the restructuring, really.
> 

Somewhere during the discussion, I was asked to make sure that the
__decrypted attribute can be used by others in the future and not to tie
it to just the kvmclock.

To fix the regression we don't need this complexity. I am okay with
implementing your proposal. I would like to fix this regression sooner
rather than later ;)


> 3) Have a function which is called after the page allocator is up which
>     does:
> 

I am glad you are pointing this one out. In my initial patch I was doing a
kmalloc() of page size; during review we did discuss doing the allocation
once using num_possible_cpus() [like what you have proposed]. But then the
discussion moved towards saving memory, and that is when the secondary
array concept came into the picture. Since .data..decrypted has some unused
memory, we were getting creative in reusing it.



> static struct pvclock_vsyscall_time_info *hvclock_mem;
> 
> void __init kvmclock_init_mem(void)
> {
> 	unsigned int ncpus = num_possible_cpus() - HVC_BOOT_ARRAY_SIZE;
> 	unsigned int order = get_order(ncpus * sizeof(*hvclock_mem));
> 	struct page *p;
> 
> 	p = alloc_pages(order, GFP_KERNEL);
> 	if (p) {
> 		hvclock_mem = page_address(p);
> 		set_memory_decrypted((unsigned long) hvclock_mem,
> 				     PAGE_SIZE << order);
> 	}
> }
> 
> 4) In kvmclock_setup_percpu() do the following:
> 
>          if (cpu < HVC_BOOT_ARRAY_SIZE)
>                  p = &hv_clock_boot[cpu];
> 	else if (hvclock_mem)
> 		p = hvclock_mem + cpu - HVC_BOOT_ARRAY_SIZE;
> 	else
> 		return -ENOMEM;
> 
> 5) Unconditionally free the pages after the used .bss..decrypted variables,
>     so the unused rest of the 2M page is not lost.
> 
> That should be a halfway slim-sized and non-intrusive changeset.
> 

I am okay to implement and test your recommendation; if anyone
disagrees then please let me know. Thanks



* Re: [PATCH v7 0/5] x86: Fix SEV guest regression
  2018-09-13 17:22   ` Brijesh Singh
@ 2018-09-13 18:47     ` Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2018-09-13 18:47 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, Tom Lendacky, Borislav Petkov,
	Paolo Bonzini, Sean Christopherson, Radim Krčmář

On Thu, 13 Sep 2018, Brijesh Singh wrote:
> On 09/13/2018 11:22 AM, Thomas Gleixner wrote:
> > I really have to ask why this whole change to sme_encrypt_kernel() is
> > required at all.
> 
> The main reason behind making the changes in sme_encrypt_kernel() was
> to preserve the initialized values (if any). Since we are adding the
> __decrypted attribute, I thought it would make sense to support the use
> case where in the future someone may use this attribute on initialized
> variables.
> 
> Now the question is, do we really need this to fix the regression? The
> answer is NO. Since there is only one user of this new attribute and
> luckily it does not initialize the variables, it is safe to move the
> variables into a .bss..decrypted section and memset(0) them.

And even if it would, it is usually trivial enough to make it runtime
initialized.

> > 1) Add a .bss..decrypted section which is PMD aligned and mark the
> >     kvmclock hv_clock_boot and wallclock struct with __bss_decrypted.
> > 
> > 2) Fixup the .bss..decrypted section PMD in the SEV case at the end of
> >     sme_encrypt_kernel() and do a memset(0) on that. That does not require
> >     any of the restructuring, really.
> > 
> 
> Somewhere during the discussion, I was asked to make sure that the
> __decrypted attribute can be used by others in the future and not to tie
> it to just the kvmclock.

Well, .bss..decrypted can be used by other things and we really only want
to do the .data..decrypted thing when there is a fundamental reason to do
so.

> To fix the regression we don't need this complexity. I am okay with
> implementing your proposal. I would like to fix this regression sooner
> rather than later ;)
> 
> 
> > 3) Have a function which is called after the page allocator is up which
> >     does:
> > 
> 
> I am glad you are pointing this one out. In my initial patch I was doing a
> kmalloc() of page size; during review we did discuss doing the allocation
> once using num_possible_cpus() [like what you have proposed]. But then the
> discussion moved towards saving memory, and that is when the secondary
> array concept came into the picture. Since .data..decrypted has some unused
> memory, we were getting creative in reusing it.

Sorry I didn't pay attention to that, was busy with other urgent stuff.

> > That should be a halfway slim-sized and non-intrusive changeset.
> > 
> 
> I am okay to implement and test your recommendation; if anyone
> disagrees then please let me know. Thanks

It results in simpler code and if the whole thing ends up allocating a few
KB too much due to the page allocations then so be it. It's not going to
waste 256K in one go.

Thanks,

	tglx

