From: Dave Hansen <dave.hansen@intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com,
	ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com,
	hpa@zytor.com, jgross@suse.com, jmattson@google.com,
	joro@8bytes.org, jpoimboe@redhat.com, knsathya@kernel.org,
	pbonzini@redhat.com, sdeep@vmware.com, seanjc@google.com,
	tony.luck@intel.com, vkuznets@redhat.com, wanpengli@tencent.com,
	thomas.lendacky@amd.com, brijesh.singh@amd.com, x86@kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
Date: Tue, 8 Mar 2022 13:26:28 -0800
Message-ID: <81a7ad6d-6bd9-7674-3229-67a5cd2e485a@intel.com>
In-Reply-To: <20220302142806.51844-12-kirill.shutemov@linux.intel.com>

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> which will cause a VMEXIT on access, with the VMM then emulating the
> instruction that caused the VMEXIT. That's not possible for a TDX VM.
> 
> To emulate an instruction an emulator needs two things:
> 
>   - R/W access to the register file to read/modify instruction arguments
>     and see the RIP of the faulting instruction.
> 
>   - Read access to the memory where the instruction is located, to see
>     what to emulate. In this case it is the guest kernel text.
> 
> Neither of these is available to the VMM in a TDX environment:
> 
>   - The register file is never exposed to the VMM. When a TD exits to the module,
>     it saves registers into the state-save area allocated for that TD.
>     The module then scrubs these registers before returning execution
>     control to the VMM, to help prevent leakage of TD state.
> 
>   - Memory is encrypted with a TD-private key. The CPU disallows software
>     other than the TDX module and TDs from making memory accesses using
>     the private key.

Memory encryption has zero to do with this.  The TDX isolation
mechanisms are totally discrete from memory encryption, although they
are "neighbors" of sorts.

> In TDX the MMIO regions are instead configured by VMM to trigger a #VE
> exception in the guest.
> 
> Add #VE handling that emulates the MMIO instruction inside the guest and
> converts it into a controlled hypercall to the host.
> 
> Any CPU instruction that accesses memory can also be used to access
> MMIO.  However, by convention, MMIO accesses are typically performed via
> io.h helpers such as 'readl()' or 'writeq()'.
> 
> The io.h helpers intentionally use a limited set of instructions when
> accessing MMIO.  This known, limited set of instructions makes MMIO
> instruction decoding and emulation feasible in KVM hosts and SEV guests
> today.
> 
> MMIO accesses are performed without the io.h helpers are at the mercy of

		^ s/are//

> the compiler.  Compilers can and will generate a much broader set of
> instructions which cannot practically be decoded and emulated.  TDX
> guests will oops if they encounter one of these decoding failures.
> 
> This means that TDX guests *must* use the io.h helpers to access MMIO.
> 
> This requirement is not new.  Both KVM hosts and AMD SEV guests have the
> same limitations on MMIO access.
> 
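
For anyone reading along, a concrete example of the distinction (device
name and register offsets below are made up):

	void __iomem *base = ioremap(FAKE_DEV_BAR0_PHYS, FAKE_DEV_BAR0_SIZE);
	u32 status;

	/* io.h helpers compile to simple MOVs the #VE handler can decode */
	status = readl(base + FAKE_DEV_STATUS);
	writel(0x1, base + FAKE_DEV_CTRL);

	/* Raw dereference: instruction selection is up to the compiler */
	status = *(volatile u32 *)(base + FAKE_DEV_STATUS);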
> === Potential alternative approaches ===
> 
> == Paravirtualizing all MMIO ==
> 
> An alternative to letting MMIO induce a #VE exception is to avoid
> the #VE in the first place. Similar to the port I/O case, it is
> theoretically possible to paravirtualize MMIO accesses.
> 
> Like the exception-based approach offered here, a fully paravirtualized
> approach would be limited to MMIO users that leverage common
> infrastructure like the io.h macros.
> 
> However, any paravirtual approach would mean patching approximately
> 120k call sites, replacing each bare memory access instruction with (at
> least) a function call. With a conservative overhead estimate of 5
> bytes per call site (a CALL instruction), that bloats the kernel text
> by roughly 600k.
> 
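
To make the cost concrete: every io.h call site would turn into a call
to something like the below (a sketch only -- tdx_pv_readl() is made
up, mmio_read() is the helper added by this patch):

	/* Sketch only, not a proposed interface */
	static u32 tdx_pv_readl(const void __iomem *addr)
	{
		unsigned long val;

		/* One TDVMCALL per access instead of one #VE per access */
		if (!mmio_read(4, (unsigned long)addr, &val))
			return ~0;

		return val;
	}

A 5-byte CALL to something like that at ~120k sites is where the 600k
figure comes from.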
> Many drivers will never be used in the TDX environment and the bloat
> cannot be justified.
> 
> == Patching TDX drivers ==
> 
> Rather than touching the entire kernel, it might also be possible to
> just go after drivers that use MMIO in TDX guests.  Right now, that's
> limited only to virtio and some x86-specific drivers.
> 
> All virtio MMIO appears to be done through a single function, which
> makes virtio eminently easy to patch.
> 
> This approach will be adopted in the future, removing the bulk of
> MMIO #VEs. The #VE-based MMIO path will remain to serve non-virtio use
> cases.

This still doesn't *quite* do it for me as a justification.  Why can't
the non-virtio cases be converted as well?  Why doesn't the "patching
MMIO sites" approach work for x86 code too?

You really need to convince us that *this* approach will be required
forever.

> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index d00b367f8052..e6163e7e3247 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -8,11 +8,17 @@
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <asm/insn-eval.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
>  #define TDX_GET_VEINFO			3
>  
> +/* MMIO direction */
> +#define EPT_READ	0
> +#define EPT_WRITE	1
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -200,6 +206,112 @@ static bool handle_cpuid(struct pt_regs *regs)
>  	return true;
>  }
>  
> +static bool mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> +	struct tdx_hypercall_args args = {
> +		.r10 = TDX_HYPERCALL_STANDARD,
> +		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
> +		.r12 = size,
> +		.r13 = EPT_READ,
> +		.r14 = addr,
> +		.r15 = *val,
> +	};
> +
> +	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> +		return false;
> +	*val = args.r11;
> +	return true;
> +}
> +
> +static bool mmio_write(int size, unsigned long addr, unsigned long val)
> +{
> +	return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
> +			       EPT_WRITE, addr, val);
> +}
> +
> +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	char buffer[MAX_INSN_SIZE];
> +	unsigned long *reg, val;
> +	struct insn insn = {};
> +	enum mmio_type mmio;
> +	int size, extend_size;
> +	u8 extend_val = 0;
> +
> +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> +		return false;
> +
> +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> +		return false;
> +
> +	mmio = insn_decode_mmio(&insn, &size);
> +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> +		return false;
> +
> +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> +		if (!reg)
> +			return false;
> +	}
> +
> +	ve->instr_len = insn.length;
> +
> +	switch (mmio) {
> +	case MMIO_WRITE:
> +		memcpy(&val, reg, size);
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_WRITE_IMM:
> +		val = insn.immediate.value;
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_READ:
> +	case MMIO_READ_ZERO_EXTEND:
> +	case MMIO_READ_SIGN_EXTEND:
> +		break;
> +	case MMIO_MOVS:
> +	case MMIO_DECODE_FAILED:
> +		/*
> +		 * MMIO was accessed with an instruction that could not be
> +		 * decoded or handled properly. It was likely not using io.h
> +		 * helpers or accessed MMIO accidentally.
> +		 */
> +		return false;
> +	default:
> +		/* Unknown insn_decode_mmio() decode value? */
> +		BUG();
> +	}

BUG()s are bad.  The set of insn_decode_mmio() return codes is known at
compile time.  If we're really on the lookout for unknown values, why
not just:

	BUILD_BUG_ON(NR_MMIO_TYPES != 6); // or whatever

Also, there are *lots* of ways for this function to just fall over and
fail.  Why does this particular failure mode deserve a BUG()?

Is there a reason a BUG() is better than returning failure which
presumably sets off the #GP-like logic?
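
Something like this instead, maybe (untested, just to show what I mean):

	default:
		/*
		 * Unknown insn_decode_mmio() return value.  Refuse the
		 * emulation and let the normal #VE failure path deal with
		 * it instead of killing the machine.
		 */
		WARN_ONCE(1, "Unknown MMIO decode type: %d\n", mmio);
		return false;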

Also, now that I've read this a few times, the same thing keeps
confusing me.  This is handling instructions that might read or write,
or do both, correct?

Should that be made explicit in a function comment?
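
Maybe something like this above the function (wording is just a
suggestion):

	/*
	 * Emulate MMIO loads *and* stores: the first switch() handles the
	 * store flavors and weeds out anything that can't be emulated,
	 * the second switch() finishes the loads by zero/sign-extending
	 * the value before writing it back to the register.
	 */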

> +	/* Handle reads */
> +	if (!mmio_read(size, ve->gpa, &val))
> +		return false;
> +
> +	switch (mmio) {
> +	case MMIO_READ:
> +		/* Zero-extend for 32-bit operation */
> +		extend_size = size == 4 ? sizeof(*reg) : 0;
> +		break;
> +	case MMIO_READ_ZERO_EXTEND:
> +		/* Zero extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		break;
> +	case MMIO_READ_SIGN_EXTEND:
> +		/* Sign extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		if (size == 1 && val & BIT(7))
> +			extend_val = 0xFF;
> +		else if (size > 1 && val & BIT(15))
> +			extend_val = 0xFF;
> +		break;
> +	default:
> +		/* All other cases have to be covered by the first switch() */
> +		BUG();
> +	}
> +
> +	if (extend_size)
> +		memset(reg, extend_val, extend_size);
> +	memcpy(reg, &val, size);
> +	return true;
> +}
> +
>  void tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -247,6 +359,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  		return write_msr(regs);
>  	case EXIT_REASON_CPUID:
>  		return handle_cpuid(regs);
> +	case EXIT_REASON_EPT_VIOLATION:
> +		return handle_mmio(regs, ve);
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>  		return false;

