kexec.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "Rafael J. Wysocki" <rafael@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Kuppuswamy Sathyanarayanan
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	Elena Reshetova <elena.reshetova@intel.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Rick Edgecombe <rick.p.edgecombe@intel.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	"Kalra, Ashish" <ashish.kalra@amd.com>,
	Sean Christopherson <seanjc@google.com>,
	"Huang, Kai" <kai.huang@intel.com>, Baoquan He <bhe@redhat.com>,
	kexec@lists.infradead.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Tao Liu <ltao@redhat.com>
Subject: Re: [PATCHv10 10/18] x86/tdx: Convert shared memory back to private on kexec
Date: Sun, 5 May 2024 14:13:19 +0200	[thread overview]
Message-ID: <20240505121319.GAZjd337gHC0uhk6aM@fat_crate.local> (raw)
In-Reply-To: <20240409113010.465412-11-kirill.shutemov@linux.intel.com>

On Tue, Apr 09, 2024 at 02:30:02PM +0300, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
> 
> The second kernel has no idea what memory is converted this way. It only

"The kexec-ed kernel..."

is more precise.

> sees E820_TYPE_RAM.
> 
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
> 
> On kexec walk direct mapping and convert all shared memory back to
> private. It makes all RAM private again and second kernel may use it
> normally.
> 
> The conversion occurs in two steps: stopping new conversions and
> unsharing all memory. In the case of normal kexec, the stopping of
> conversions takes place while scheduling is still functioning. This
> allows for waiting until any ongoing conversions are finished. The
> second step is carried out when all CPUs except one are inactive and
> interrupts are disabled. This prevents any conflicts with code that may
> access shared memory.

This is the missing bit of information I was looking for in the previous
patch. This needs to be documented in the code.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Tao Liu <ltao@redhat.com>
> ---
>  arch/x86/coco/tdx/tdx.c           | 72 +++++++++++++++++++++++++++++++
>  arch/x86/include/asm/pgtable.h    |  5 +++
>  arch/x86/include/asm/set_memory.h |  3 ++
>  arch/x86/mm/pat/set_memory.c      | 35 +++++++++++++--
>  4 files changed, 112 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 979891e97d83..59776ce1c1d7 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -7,6 +7,7 @@
>  #include <linux/cpufeature.h>
>  #include <linux/export.h>
>  #include <linux/io.h>
> +#include <linux/kexec.h>
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> @@ -14,6 +15,7 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/set_memory.h>
>  
>  /* MMIO direction */
>  #define EPT_READ	0
> @@ -831,6 +833,73 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
>  	return 0;
>  }
>  
> +/* Stop new private<->shared conversions */
> +static void tdx_kexec_stop_conversion(bool crash)
> +{
> +	/*
> +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> +	 * conversions to finish.
> +	 *
> +	 * If race happened, just report and proceed.
> +	 */
> +	bool wait_for_lock = !crash;

You don't need that bool - use crash.

> +
> +	if (!stop_memory_enc_conversion(wait_for_lock))
> +		pr_warn("Failed to stop shared<->private conversions\n");
> +}
> +
> +static void tdx_kexec_unshare_mem(void)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	/*
> +	 * Walk direct mapping and convert all shared memory back to private,
> +	 */

Over the function name and end with a fullstop.

> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {
> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 */

lockdep_assert_irqs_disabled() ?

> +			set_pte(pte, __pte(0));
> +
> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);

Why are we printing something here if we're not really acting up on it?

Who should care here?

> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	__flush_tlb_all();
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}

Ok, we failed unsharing. And yet we don't do anything.

But if we fail unsharing, we will die on a unrecoverable TD exit or
whatever.

Why aren't we failing kexec here?

> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	struct tdx_module_args args = {
> @@ -890,6 +959,9 @@ void __init tdx_early_init(void)
>  	x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
>  	x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
>  
> +	x86_platform.guest.enc_kexec_stop_conversion = tdx_kexec_stop_conversion;
> +	x86_platform.guest.enc_kexec_unshare_mem     = tdx_kexec_unshare_mem;
> +
>  	/*
>  	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
>  	 * bringup low level code. That raises #VE which cannot be handled
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 315535ffb258..17f4d97fae06 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -140,6 +140,11 @@ static inline int pte_young(pte_t pte)
>  	return pte_flags(pte) & _PAGE_ACCESSED;
>  }
>  
> +static inline bool pte_decrypted(pte_t pte)
> +{
> +	return cc_mkdec(pte_val(pte)) == pte_val(pte);
> +}
> +
>  #define pmd_dirty pmd_dirty
>  static inline bool pmd_dirty(pmd_t pmd)
>  {
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 9aee31862b4a..44b6d711296c 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -49,8 +49,11 @@ int set_memory_wb(unsigned long addr, int numpages);
>  int set_memory_np(unsigned long addr, int numpages);
>  int set_memory_p(unsigned long addr, int numpages);
>  int set_memory_4k(unsigned long addr, int numpages);
> +
> +bool stop_memory_enc_conversion(bool wait);
>  int set_memory_encrypted(unsigned long addr, int numpages);
>  int set_memory_decrypted(unsigned long addr, int numpages);
> +
>  int set_memory_np_noalias(unsigned long addr, int numpages);
>  int set_memory_nonglobal(unsigned long addr, int numpages);
>  int set_memory_global(unsigned long addr, int numpages);
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 6c49f69c0368..21835339c0e6 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2188,12 +2188,41 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	return ret;
>  }
>  

<--- insert comment here what this thing is guarding.

> +static DECLARE_RWSEM(mem_enc_lock);
> +
> +/*
> + * Stop new private<->shared conversions.
> + *
> + * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
> + * The lock is not released to prevent new conversions from being started.
> + *
> + * If sleep is not allowed, as in a crash scenario, try to take the lock.
> + * Failure indicates that there is a race with the conversion.
> + */
> +bool stop_memory_enc_conversion(bool wait)

This is a global function which means, it should be called:

set_memory_enc_stop_conversion()

or so. With the proper prefix and so on.

> +{
> +	if (!wait)
> +		return down_write_trylock(&mem_enc_lock);
> +
> +	down_write(&mem_enc_lock);
> +
> +	return true;
> +}
> +
>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  {
> -	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> -		return __set_memory_enc_pgtable(addr, numpages, enc);
> +	int ret = 0;
>  
> -	return 0;
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> +		if (!down_read_trylock(&mem_enc_lock))
> +			return -EBUSY;

This function is called on SEV* and HyperV and the respective folks need
to at least ack this new approach.

I see Ashish's patch adds the respective stuff:

https://lore.kernel.org/r/c24516a4636a36d57186ea90ae26495b3c1cfb8b.1714148366.git.ashish.kalra@amd.com

which leaves HyperV. You'd need to Cc them on the next submission.

Thx.


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

  parent reply	other threads:[~2024-05-05 12:14 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20240409113010.465412-1-kirill.shutemov@linux.intel.com>
     [not found] ` <20240409113010.465412-6-kirill.shutemov@linux.intel.com>
2024-04-09 12:38   ` [PATCHv10 05/18] x86/kexec: Keep CR4.MCE set during kexec for TDX guest Huang, Kai
2024-04-09 14:22   ` Sean Christopherson
2024-04-09 15:26     ` Kirill A. Shutemov
2024-04-28 17:11       ` Borislav Petkov
2024-04-29 13:17         ` Kirill A. Shutemov
2024-04-29 14:45           ` Borislav Petkov
2024-04-29 15:16             ` Kirill A. Shutemov
2024-04-30 12:57               ` Borislav Petkov
2024-04-30 13:03   ` Borislav Petkov
2024-04-30 14:49     ` Kirill A. Shutemov
2024-05-02 13:22       ` Borislav Petkov
2024-05-02 13:38         ` Borislav Petkov
2024-04-09 20:42 ` [PATCH v4 0/4] x86/snp: Add kexec support Ashish Kalra
2024-04-09 20:42   ` [PATCH v4 1/4] efi/x86: skip efi_arch_mem_reserve() in case of kexec Ashish Kalra
2024-04-09 20:42   ` [PATCH v4 2/4] x86/sev: add sev_es_enabled() function Ashish Kalra
2024-04-09 21:21     ` Borislav Petkov
2024-04-09 20:42   ` [PATCH v4 3/4] x86/boot/compressed: Skip Video Memory access in Decompressor for SEV-ES/SNP Ashish Kalra
2024-04-09 20:43   ` [PATCH v4 4/4] x86/snp: Convert shared memory back to private on kexec Ashish Kalra
2024-04-10 14:17     ` kernel test robot
2024-04-15 23:22   ` [PATCH v5 0/3] x86/snp: Add kexec support Ashish Kalra
2024-04-15 23:22     ` [PATCH v5 1/3] efi/x86: skip efi_arch_mem_reserve() in case of kexec Ashish Kalra
2024-04-24 14:48       ` Borislav Petkov
2024-04-24 21:17         ` Kalra, Ashish
2024-04-25 16:45           ` Kalra, Ashish
2024-04-26 14:21           ` Borislav Petkov
2024-04-26 14:47             ` Kalra, Ashish
2024-04-26 15:22               ` Borislav Petkov
2024-04-26 15:28                 ` Kalra, Ashish
2024-04-26 15:34                   ` Borislav Petkov
2024-04-26 16:32                     ` Kalra, Ashish
2024-04-15 23:23     ` [PATCH v5 2/3] x86/boot/compressed: Skip Video Memory access in Decompressor for SEV-ES/SNP Ashish Kalra
2024-04-15 23:23     ` [PATCH v5 3/3] x86/snp: Convert shared memory back to private on kexec Ashish Kalra
2024-04-26 16:33   ` [PATCH v6 0/3] x86/snp: Add kexec support Ashish Kalra
2024-04-26 16:33     ` [PATCH v6 1/3] efi/x86: Fix EFI memory map corruption with kexec Ashish Kalra
2024-05-09  9:56       ` Ruirui Yang
2024-05-09 10:00         ` Dave Young
2024-05-10 18:36         ` Kalra, Ashish
2024-04-26 16:34     ` [PATCH v6 2/3] x86/boot/compressed: Skip Video Memory access in Decompressor for SEV-ES/SNP Ashish Kalra
2024-04-26 16:35     ` [PATCH v6 3/3] x86/snp: Convert shared memory back to private on kexec Ashish Kalra
2024-05-02 12:01   ` [PATCH v4 0/4] x86/snp: Add kexec support Alexander Graf
2024-05-02 12:18     ` Vitaly Kuznetsov
2024-05-03  8:32       ` Alexander Graf
2024-05-09  9:19         ` Vitaly Kuznetsov
2024-05-02 21:54     ` Kalra, Ashish
     [not found] ` <20240409113010.465412-4-kirill.shutemov@linux.intel.com>
2024-04-18 14:37   ` [PATCHv10 03/18] cpu/hotplug: Add support for declaring CPU offlining not supported Borislav Petkov
2024-04-19 13:31     ` Kirill A. Shutemov
2024-04-23 13:17       ` Borislav Petkov
     [not found] ` <20240409113010.465412-2-kirill.shutemov@linux.intel.com>
2024-04-18 16:03   ` [PATCHv10 01/18] x86/acpi: Extract ACPI MADT wakeup code into a separate file Borislav Petkov
2024-04-19 13:28     ` Kirill A. Shutemov
     [not found] ` <20240409113010.465412-5-kirill.shutemov@linux.intel.com>
2024-04-23 16:02   ` [PATCHv10 04/18] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup Borislav Petkov
2024-04-24  8:38     ` Kirill A. Shutemov
2024-04-24 13:50       ` Borislav Petkov
2024-04-24 14:35         ` Kirill A. Shutemov
2024-04-24 14:40           ` Dave Hansen
2024-04-24 14:51             ` Borislav Petkov
     [not found] ` <20240409113010.465412-10-kirill.shutemov@linux.intel.com>
2024-04-27 16:47   ` [PATCHv10 09/18] x86/mm: Adding callbacks to prepare encrypted memory for kexec Borislav Petkov
     [not found]     ` <20240427170634.2397725-1-kirill.shutemov@linux.intel.com>
2024-05-02 13:45       ` [PATCHv10.1 " Borislav Petkov
2024-05-06 13:22         ` Kirill A. Shutemov
2024-05-06 14:21           ` Borislav Petkov
     [not found] ` <20240409113010.465412-7-kirill.shutemov@linux.intel.com>
2024-04-28 17:25   ` [PATCHv10 06/18] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno Borislav Petkov
2024-04-29 14:29     ` Kirill A. Shutemov
2024-04-29 14:53       ` Borislav Petkov
2024-05-03 16:29   ` Michael Kelley
     [not found] ` <20240409113010.465412-11-kirill.shutemov@linux.intel.com>
2024-05-05 12:13   ` Borislav Petkov [this message]
2024-05-06 15:37     ` [PATCHv10 10/18] x86/tdx: Convert shared memory back to private on kexec Kirill A. Shutemov
2024-05-08 12:04       ` Borislav Petkov
2024-05-08 13:30         ` Kirill A. Shutemov
     [not found] ` <20240409113010.465412-12-kirill.shutemov@linux.intel.com>
2024-05-08 12:12   ` [PATCHv10 11/18] x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges Borislav Petkov
     [not found] ` <20240409113010.465412-14-kirill.shutemov@linux.intel.com>
2024-05-08 12:18   ` [PATCHv10 13/18] x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240505121319.GAZjd337gHC0uhk6aM@fat_crate.local \
    --to=bp@alien8.de \
    --cc=adrian.hunter@intel.com \
    --cc=ashish.kalra@amd.com \
    --cc=bhe@redhat.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=elena.reshetova@intel.com \
    --cc=jun.nakajima@intel.com \
    --cc=kai.huang@intel.com \
    --cc=kexec@lists.infradead.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-coco@lists.linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ltao@redhat.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rafael@kernel.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).