All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Hansen <dave.hansen@intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com,
	ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com,
	hpa@zytor.com, jgross@suse.com, jmattson@google.com,
	joro@8bytes.org, jpoimboe@redhat.com, knsathya@kernel.org,
	pbonzini@redhat.com, sdeep@vmware.com, seanjc@google.com,
	tony.luck@intel.com, vkuznets@redhat.com, wanpengli@tencent.com,
	thomas.lendacky@amd.com, brijesh.singh@amd.com, x86@kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
Date: Tue, 8 Mar 2022 13:26:28 -0800	[thread overview]
Message-ID: <81a7ad6d-6bd9-7674-3229-67a5cd2e485a@intel.com> (raw)
In-Reply-To: <20220302142806.51844-12-kirill.shutemov@linux.intel.com>

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> which will cause a VMEXIT on access and then the VMM emulating the
> instruction that caused the VMEXIT. That's not possible for TDX VM.
> 
> To emulate an instruction an emulator needs two things:
> 
>   - R/W access to the register file to read/modify instruction arguments
>     and see RIP of the faulted instruction.
> 
>   - Read access to memory where instruction is placed to see what to
>     emulate. In this case it is guest kernel text.
> 
> Both of them are not available to VMM in TDX environment:
> 
>   - Register file is never exposed to VMM. When a TD exits to the module,
>     it saves registers into the state-save area allocated for that TD.
>     The module then scrubs these registers before returning execution
>     control to the VMM, to help prevent leakage of TD state.
> 
>   - Memory is encrypted a TD-private key. The CPU disallows software
>     other than the TDX module and TDs from making memory accesses using
>     the private key.

Memory encryption has zero to do with this.  The TDX isolation
mechanisms are totally discrete from memory encryption, although they
are "neighbors" of sorts.

> In TDX the MMIO regions are instead configured by VMM to trigger a #VE
> exception in the guest.
> 
> Add #VE handling that emulates the MMIO instruction inside the guest and
> converts it into a controlled hypercall to the host.
> 
> MMIO addresses can be used with any CPU instruction that accesses
> memory. Address only MMIO accesses done via io.h helpers, such as
> 'readl()' or 'writeq()'.
> 
> Any CPU instruction that accesses memory can also be used to access
> MMIO.  However, by convention, MMIO access are typically performed via
> io.h helpers such as 'readl()' or 'writeq()'.
> 
> The io.h helpers intentionally use a limited set of instructions when
> accessing MMIO.  This known, limited set of instructions makes MMIO
> instruction decoding and emulation feasible in KVM hosts and SEV guests
> today.
> 
> MMIO accesses are performed without the io.h helpers are at the mercy of

		^ s/are//

> the compiler.  Compilers can and will generate a much more broad set of
> instructions which can not practically be decoded and emulated.  TDX
> guests will oops if they encounter one of these decoding failures.
> 
> This means that TDX guests *must* use the io.h helpers to access MMIO.
> 
> This requirement is not new.  Both KVM hosts and AMD SEV guests have the
> same limitations on MMIO access.
> 
> === Potential alternative approaches ===
> 
> == Paravirtualizing all MMIO ==
> 
> An alternative to letting MMIO induce a #VE exception is to avoid
> the #VE in the first place. Similar to the port I/O case, it is
> theoretically possible to paravirtualize MMIO accesses.
> 
> Like the exception-based approach offered here, a fully paravirtualized
> approach would be limited to MMIO users that leverage common
> infrastructure like the io.h macros.
> 
> However, any paravirtual approach would be patching approximately 120k
> call sites. Any paravirtual approach would need to replace a bare memory
> access instruction with (at least) a function call. With a conservative
> overhead estimation of 5 bytes per call site (CALL instruction),
> it leads to bloating code by 600k.
> 
> Many drivers will never be used in the TDX environment and the bloat
> cannot be justified.
> 
> == Patching TDX drivers ==
> 
> Rather than touching the entire kernel, it might also be possible to
> just go after drivers that use MMIO in TDX guests.  Right now, that's
> limited only to virtio and some x86-specific drivers.
> 
> All virtio MMIO appears to be done through a single function, which
> makes virtio eminently easy to patch.
> 
> This approach will be adopted in the future, removing the bulk of
> MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.

This still doesn't *quite* do it for me for a justification.  Why can't
the non-virtio cases be converted as well?  Why doesn't the "patching
MMIO sites" work for x86 code too?

You really need to convince us that *this* approach will be required
forever.

> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index d00b367f8052..e6163e7e3247 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -8,11 +8,17 @@
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <asm/insn-eval.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
>  #define TDX_GET_VEINFO			3
>  
> +/* MMIO direction */
> +#define EPT_READ	0
> +#define EPT_WRITE	1
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -200,6 +206,112 @@ static bool handle_cpuid(struct pt_regs *regs)
>  	return true;
>  }
>  
> +static bool mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> +	struct tdx_hypercall_args args = {
> +		.r10 = TDX_HYPERCALL_STANDARD,
> +		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
> +		.r12 = size,
> +		.r13 = EPT_READ,
> +		.r14 = addr,
> +		.r15 = *val,
> +	};
> +
> +	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> +		return false;
> +	*val = args.r11;
> +	return true;
> +}
> +
> +static bool mmio_write(int size, unsigned long addr, unsigned long val)
> +{
> +	return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
> +			       EPT_WRITE, addr, val);
> +}
> +
> +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	char buffer[MAX_INSN_SIZE];
> +	unsigned long *reg, val;
> +	struct insn insn = {};
> +	enum mmio_type mmio;
> +	int size, extend_size;
> +	u8 extend_val = 0;
> +
> +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> +		return false;
> +
> +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> +		return false;
> +
> +	mmio = insn_decode_mmio(&insn, &size);
> +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> +		return false;
> +
> +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> +		if (!reg)
> +			return false;
> +	}
> +
> +	ve->instr_len = insn.length;
> +
> +	switch (mmio) {
> +	case MMIO_WRITE:
> +		memcpy(&val, reg, size);
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_WRITE_IMM:
> +		val = insn.immediate.value;
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_READ:
> +	case MMIO_READ_ZERO_EXTEND:
> +	case MMIO_READ_SIGN_EXTEND:
> +		break;
> +	case MMIO_MOVS:
> +	case MMIO_DECODE_FAILED:
> +		/*
> +		 * MMIO was accessed with an instruction that could not be
> +		 * decoded or handled properly. It was likely not using io.h
> +		 * helpers or accessed MMIO accidentally.
> +		 */
> +		return false;
> +	default:
> +		/* Unknown insn_decode_mmio() decode value? */
> +		BUG();
> +	}

BUG()s are bad.  The set of insn_decode_mmio() return codes is known at
compile time.  If we're really on the lookout for unknown values, why
not just:

	BUILD_BUG_ON(NR_MMIO_TYPES != 6); // or whatever

Also, there are *lots* of ways for this function to just fall over and
fail.  Why does this particular failure mode deserve a BUG()?

Is there a reason a BUG() is better than returning failure which
presumably sets off the #GP-like logic?

Also, now that I've read this a few times, I've been confused by the
same thing a few times.  This is handling instructions that might read
or write or do both, correct?

Should that be made explicit in a function comment?

> +	/* Handle reads */
> +	if (!mmio_read(size, ve->gpa, &val))
> +		return false;
> +
> +	switch (mmio) {
> +	case MMIO_READ:
> +		/* Zero-extend for 32-bit operation */
> +		extend_size = size == 4 ? sizeof(*reg) : 0;
> +		break;
> +	case MMIO_READ_ZERO_EXTEND:
> +		/* Zero extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		break;
> +	case MMIO_READ_SIGN_EXTEND:
> +		/* Sign extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		if (size == 1 && val & BIT(7))
> +			extend_val = 0xFF;
> +		else if (size > 1 && val & BIT(15))
> +			extend_val = 0xFF;
> +		break;
> +	default:
> +		/* All other cases has to be covered with the first switch() */
> +		BUG();
> +	}
> +
> +	if (extend_size)
> +		memset(reg, extend_val, extend_size);
> +	memcpy(reg, &val, size);
> +	return true;
> +}
> +
>  void tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -247,6 +359,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  		return write_msr(regs);
>  	case EXIT_REASON_CPUID:
>  		return handle_cpuid(regs);
> +	case EXIT_REASON_EPT_VIOLATION:
> +		return handle_mmio(regs, ve);
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>  		return false;


  reply	other threads:[~2022-03-08 21:26 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
2022-03-04 15:43   ` Borislav Petkov
2022-03-04 15:47     ` Dave Hansen
2022-03-04 16:02       ` Borislav Petkov
2022-03-07 22:24         ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-09 18:22           ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers Kirill A. Shutemov
2022-03-08 19:56   ` Dave Hansen
2022-03-10 12:32   ` Borislav Petkov
2022-03-10 14:44     ` Kirill A. Shutemov
2022-03-10 14:51       ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
2022-03-08 20:03   ` Dave Hansen
2022-03-10 15:30   ` Borislav Petkov
2022-03-10 21:20     ` Kirill A. Shutemov
2022-03-10 21:48       ` Kirill A. Shutemov
2022-03-15 15:56         ` Borislav Petkov
2022-03-12 10:41       ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests Kirill A. Shutemov
2022-03-08 20:17   ` Dave Hansen
2022-03-09 16:01     ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-09 18:36       ` Dave Hansen
2022-03-09 23:51         ` [PATCHv5.2 " Kirill A. Shutemov
2022-03-10  0:07           ` Dave Hansen
2022-03-15 19:41           ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 05/30] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 06/30] x86/traps: Refactor exc_general_protection() Kirill A. Shutemov
2022-03-08 20:18   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
2022-03-08 20:29   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 08/30] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 09/30] x86/tdx: Add MSR " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
2022-03-08 20:33   ` Dave Hansen
2022-03-09 16:15     ` [PATCH] " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
2022-03-08 21:26   ` Dave Hansen [this message]
2022-03-10  0:51     ` Kirill A. Shutemov
2022-03-10  1:06       ` Dave Hansen
2022-03-10 16:48         ` Kirill A. Shutemov
2022-03-10 17:53           ` Dave Hansen
2022-03-11 17:18             ` Kirill A. Shutemov
2022-03-11 17:22               ` Dave Hansen
2022-03-11 18:01               ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 12/30] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
2022-03-07 22:27   ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 13/30] x86: Adjust types used in port I/O helpers Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 14/30] x86: Consolidate " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers Kirill A. Shutemov
2022-03-02 17:42   ` Josh Poimboeuf
2022-03-02 19:41     ` Dave Hansen
2022-03-02 20:02       ` Josh Poimboeuf
2022-03-02 14:27 ` [PATCHv5 16/30] x86/boot: Port I/O: add decompression-time support for TDX Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls Kirill A. Shutemov
2022-03-08 21:30   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 18/30] x86/tdx: Port I/O: add early boot support Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 19/30] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 20/30] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 21/30] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot Kirill A. Shutemov
2022-03-08 21:37   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
2022-03-07  9:29   ` Xiaoyao Li
2022-03-07 22:33     ` Kirill A. Shutemov
2022-03-08  1:19       ` Xiaoyao Li
2022-03-08 16:41         ` Kirill A. Shutemov
2022-03-07 22:36     ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-02 14:28 ` [PATCHv5 24/30] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
2022-03-02 14:28 ` [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
2022-03-08 22:02   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
2022-03-09 19:44   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
2022-03-09 20:07   ` Dave Hansen
2022-03-10 14:29     ` Tom Lendacky
2022-03-10 14:51       ` Christoph Hellwig
2022-03-02 14:28 ` [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
2022-03-09 20:39   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines Kirill A. Shutemov
2022-03-02 16:13   ` Dan Williams
2022-03-09 20:56   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
2022-03-09 21:49   ` Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=81a7ad6d-6bd9-7674-3229-67a5cd2e485a@intel.com \
    --to=dave.hansen@intel.com \
    --cc=aarcange@redhat.com \
    --cc=ak@linux.intel.com \
    --cc=bp@alien8.de \
    --cc=brijesh.singh@amd.com \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=jpoimboe@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=knsathya@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=sdeep@vmware.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=tony.luck@intel.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.